Let’s talk about search: Some lessons from building Lantern

Eric Hoyt — Wed, 14 Aug 2013 18:32:09 +0000

This week, Lantern reached its first wide public.

Lantern is a search and visualization platform for the Media History Digital Library (MHDL), an open access digitization initiative that I lead with David Pierce. The project was in development for two years, and teams from the MHDL and UW-Madison Department of Communication Arts collaborated to bring version 1.0 online toward the end of July. After a whole bunch of testing, we decided that the platform could indeed withstand the scrutiny of the blogosphere. It’s been a pleasure to see that we were right. We’re grateful for supportive posts from Indiewire and David Bordwell and web traffic surpassing anything we’ve experienced before.

I will leave it in the capable and eloquent hands of David Bordwell to explain what the searchability of the MHDL’s 800,000 pages of books and magazines offers to film and broadcasting historians. In this Antenna post, I want to touch more broadly on how the search process works. I will address visualization more fully in another post or essay.

We run searches online all the time. Most of us are inclined to focus on the end results rather than the algorithms and design choices that take us there. Cultural studies scholars such as Alexander Halavais have offered critical commentary on search engines, but it wasn’t until I began developing Lantern in 2011 that I bothered to peek under the hood of a search engine for myself. Here are five lessons I learned about search that I hope will prove useful to you too the next time you search Lantern or see a query box online.

1. The collection of content you are searching matters a lot.

It would have been great if the first time Carl Hagenmaier, Wendy Hagenmaier, and I sat down to add full-text search capability to the MHDL’s collections we had been 100% successful. Instead, it took a two-year journey of starts, stops, and reboots to get there. But in other ways, it’s a really good thing that we initially failed. If we had been successful in the Fall of 2011, users would have only been able to search a roughly 100,000 page collection composed primarily of The Film Daily, Photoplay, and Business Screen. Don’t get me wrong, those are great publications. And we now have many more volumes of Photoplay and The Film Daily than we did back then. But over the last two years, our collections have boomed in breadth and diversity along with size and depth. Thanks to our partnerships with the Library of Congress Packard Campus, Museum of Modern Art Library, Niles Essanay Silent Film Museum, Domitor, and others, we have added a tremendous number of magazines, broadcasting, early cinema journals, and books. In 2011, a search for “Jimmy Stewart” would have probably resulted in some hits from the fan magazine Photoplay (our Film Daily volumes at that time didn’t go past 1930). Today, the Lantern query “Jimmy Stewart” yields 407 matching page hits. Take a look at the top 10 results ranked by relevancy. Sure enough, 5 of the top 10 results come from Photoplay. But there are also matching pages from Radio and TV Mirror, Independent Exhibitors Film Bulletin, and International Projectionist, all sources where a James Stewart biographer probably would not think to look. And who would guess that International Projectionist would refer to the star with the casual “Jimmy”? These sorts of discoveries are already possible within Lantern, and as the content collection further expands, there will only be more of them.

2. Always remember, you are searching an index, not the actual content.

This point is an important caveat to the first point. Content matters, but it is only discoverable through an index, which is itself dependent upon the available data and metadata. A search index is a lot like the index at the back of a book — it stores information in a special way that helps you find what you are looking for quickly. A search engine index, like the open source Solr index that Lantern uses, takes a document and blows it apart into lots of small bits so that a computer can quickly search it. Solr comes loaded with the search algorithms that do most of the mathematical heavy lifting. But as developers, we still had to decide exactly what metadata to capture and how to index it. In my “Working Theory” essay co-written with Carl and Wendy, I’ve described how MARC records offered insufficient metadata for the search experience our users wanted. In this post, what I want to emphasize is that if something isn’t in the index, or if the index doesn’t play nicely with the search algorithms, then you won’t have a happy search experience. Lesson #3 should make this point clearer.
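To make the book-index comparison concrete, here is a minimal sketch of an inverted index in Python. It is a toy illustration, not how Solr or Lantern is actually implemented: the function names, the page ids, and the sample sentences (apart from the Bette Davis snippet quoted later in this post) are made up for the example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            # Strip surrounding punctuation so "picture." and "picture" match.
            index[word.strip('.,;:!?"()')].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = [w.lower() for w in query.split()]
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical page ids and text, invented for illustration.
pages = {
    "photoplay_p12": "Jimmy Stewart signs for a new picture.",
    "projectionist_p3": "Jimmy Stewart visits the booth.",
    "filmdaily_p7": "Bette Davis, stage actress, has been signed.",
}
index = build_index(pages)
matches = search(index, "Jimmy Stewart")
missing = search(index, "Laemmle")
```

The query never touches the page text itself. It only looks words up in the index, which is why a word that was never indexed (here, "Laemmle") returns nothing no matter what the scanned page actually says.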

3. Search algorithms are designed for breadth and scale, so don’t ask them to search in depth.

Open source search algorithms are better at searching 2 million short documents, each containing 500 words of text, than at searching 500 very long documents containing 200,000 words each. I learned this lesson the hard way. At the Media History Digital Library, we scan magazines that have been collected and bound together into volumes. So in our early experiments with Lantern, we turned every volume into a discrete XML file with metadata fields for title, creator, date, etc., plus the metadata field “body” where we pasted all the text from the scanned work. Big mistake. Some of the “body” fields had half a million words! After indexing these XML documents, our search speed was dreadfully slow and, worse yet, the results were inaccurate or only partially accurate. In some cases, the search algorithms would find a few hits within a particular work and then time out without searching the full document. The solution — beautifully scripted in Python by Andy Myers — was to turn every page inside a volume into its own XML document, then index all 800,000 MHDL pages as unique documents. This is the only way we can deliver the fast, accurate search results that you want. But we also recognize that it risks de-contextualizing the single page from the larger work. We believe the “Read in Context” option and the catalog descriptions offer partial answers to this challenge of preserving context, and we’re working on developing additional methods too.
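The page-splitting step described above can be sketched roughly as follows. This is not Andy Myers's script, which is not reproduced in this post. It is a minimal illustration that assumes a volume arrives as a dictionary with a list of page texts, and the field names (id, title, date, page, body) and the sample volume are hypothetical.

```python
import xml.etree.ElementTree as ET

def volume_to_page_docs(volume):
    """Turn one bound volume into a list of <doc> elements, one per page."""
    docs = []
    for number, text in enumerate(volume["pages"], start=1):
        doc = ET.Element("doc")
        fields = {
            # Each page gets its own id, but keeps the volume's metadata.
            "id": f"{volume['id']}_{number:04d}",
            "title": volume["title"],
            "date": volume["date"],
            "page": str(number),
            "body": text,
        }
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
        docs.append(doc)
    return docs

# A made-up three-page volume, for illustration only.
volume = {
    "id": "filmdaily1930",
    "title": "The Film Daily",
    "date": "1930",
    "pages": ["Text of page one.", "Text of page two.", "Text of page three."],
}
docs = volume_to_page_docs(volume)
first_page_xml = ET.tostring(docs[0], encoding="unicode")
```

Copying the volume-level metadata onto every page document is one way to keep some context attached to each page, at the cost of repeating the same title and date many thousands of times.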

4. Good OCR matters for searchability, but OCR isn’t the whole story.

You don’t need OCR (optical character recognition) to search a blog or a .docx Word file. Those textual works were born digital; a computer can clearly see whether that was an “a” or “o” that the author typed. In contrast, Moving Picture World, Radio Mirror, and the MHDL’s other books and magazines were born in the print age. In order to make them machine readable, we need to run optical character recognition — a process that occurs on the Internet Archive’s servers using ABBYY FineReader. ABBYY has to make a lot of guesses about particular words and characters. We tend to scan works that are in good condition at a high resolution, and this leads to ABBYY making better guesses and the MHDL having high quality OCR. Nevertheless, the OCR isn’t perfect, and the imperfections are immediately visible in a snippet like this one from a 1930 volume of Film Daily: “Bette Davis, stage actress, has been signed by Carl Taemmle. Jr.” The snippet should say “Carl Laemmle, Jr.” That is the Universal executive listed on the page, and I wish our database model enabled users to log in and fix these blemishes (hopefully, we’ll get to this point in 2014). But — you may have guessed there was a but coming — our search algorithms use some probabilistic guessing and “stemming,” which splinters individual words and allows your query to search for related words (for instance, returning “reissue” and “reissuing” for a “reissue” query). The aggressiveness of stemming and probabilistic word guessing (aka “fuzziness”) is something that developers can boost or turn down. I’m still trying to flavor Lantern’s stew just right. The big takeaway point, though, is that you’ll quickly notice the OCR quality, but other hidden processes are also shaping your results.
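For readers who want to see what stemming and fuzziness look like in practice, here is a deliberately crude sketch in Python. It is not the stemmer or the fuzzy matcher that Solr uses. The suffix list and the 0.8 similarity threshold are assumptions picked for the example, and the similarity measure comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher

def crude_stem(word):
    """Strip one common English suffix. A toy stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def fuzzy_match(a, b, threshold=0.8):
    """True if two words are similar enough to count as the same term."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Stemming: related word forms collapse to the same stem.
same_stem = crude_stem("reissue") == crude_stem("reissuing")

# Fuzziness: the OCR misreading of "Laemmle" is still close to the real name.
ocr_hit = fuzzy_match("Laemmle", "Taemmle")
unrelated = fuzzy_match("Laemmle", "Davis")
```

Lowering the threshold makes matching more forgiving of OCR errors but also lets in more false hits, which is the dial described above as something developers can boost or turn down.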

5. The search experience has become increasingly visual.

As my colleague Jeremy Morris pointed out to me during one of our food cart lunches outside the UW Library, the search experience has become highly visual. Googling a restaurant now renders a map within the results page. ProQuest queries now return icons that display the format of the work — article, book, etc. — but not an image of the actual work. I’d like to think Lantern’s results view one-ups ProQuest. We display a full-color thumbnail of the matching page in the results view, not simply an icon. The thumbnail communicates a tremendous amount of information very efficiently. You quickly get a sense about whether the page is an advertisement or news story, whether it comes from a glossy fan magazine or a trade paper published in broadsheet layout. Even before you read the highlighted text snippet, you get some impression of the page and source. The thumbnails also help compensate for our metadata’s lack of granularity. We haven’t had the resources to generate metadata on the level of individual magazine issues, pages, or articles (it’s here that ProQuest one-ups us). By exposing the thumbnail page image, though, we let you visually glean some essential information from the source. Plus, the thumbnails showcase one of the strengths of the MHDL collection: the colorful, photo-rich, and graphically interesting nature of the historic magazines.

Ok, now it’s your turn to think algorithmically. When you search for a movie star and sort by relevancy, why is it that the most visually rich pages — often featuring a large photo — tend to rank the highest?

The answer is that those pages tend to have relatively few words. If there are only eight words on a portrait page from The New Movie Magazine and two of them are “Joan Crawford,” then her name makes up a far larger share of that page’s words than it does on a page from Variety that is jam packed with over 1,000 words of text, including a story announcing Joan Crawford’s next picture.
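The arithmetic behind that answer can be worked through in a few lines of Python. This is a simplified sketch: Solr's real relevancy scoring is more involved than a plain share of words, and both sample pages are invented for the example (the 1,000-word page is mostly filler and is assumed to mention the name once).

```python
def query_share(page_words, query_words):
    """Fraction of the words on a page that belong to the query."""
    if not page_words:
        return 0.0
    query = {word.lower() for word in query_words}
    hits = sum(1 for word in page_words if word.lower() in query)
    return hits / len(page_words)

query = ["Joan", "Crawford"]

# Invented eight-word caption for a full-page portrait.
portrait_page = "Joan Crawford in her newest gown by Adrian".split()

# Invented 1,000-word trade paper page that names the star once.
variety_page = ["filler"] * 998 + ["Joan", "Crawford"]

portrait_share = query_share(portrait_page, query)  # 2 of 8 words
variety_share = query_share(variety_page, query)    # 2 of 1,000 words
```

On these made-up pages the portrait's share is 2/8 and the trade page's share is 2/1,000, so a ranking that leans on this ratio would put the portrait far ahead even though the trade page carries the actual news story.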

Should I tweak the relevancy algorithm so that image-heavy pages aren’t listed so high? Should I ascribe greater relevancy to certain canonical sources, like Photoplay and Variety, rather than magazines outside the canon, like New Movie and Hollywood Filmograph? Or should we weight things the other way around — try to nudge users toward under-utilized sources? I would be curious to know what Antenna readers and Lantern users think.

There are advantages and disadvantages no matter what you choose. The best approach, as I see it, may just be to let the ranking algorithm run as is and use forums like this one to make its workings more transparent.

eric hoyt – Antenna
