Let’s talk about search: Some lessons from building Lantern

August 14, 2013
By Eric Hoyt | 11 Comments

This week, Lantern reached its first wide public.

Lantern is a search and visualization platform for the Media History Digital Library (MHDL), an open access digitization initiative that I lead with David Pierce. The project was in development for two years, and teams from the MHDL and UW-Madison Department of Communication Arts collaborated to bring version 1.0 online toward the end of July. After a whole bunch of testing, we decided that the platform could indeed withstand the scrutiny of the blogosphere. It’s been a pleasure to see that we were right. We’re grateful for supportive posts from Indiewire and David Bordwell and web traffic surpassing anything we’ve experienced before.

I will leave it in the capable and eloquent hands of David Bordwell to explain what the searchability of the MHDL’s 800,000 pages of books and magazines offers to film and broadcasting historians. In this Antenna post, I want to touch more broadly on how the search process works. I will address visualization more fully in another post or essay.

We run searches online all the time. Most of us are inclined to focus on the end results rather than the algorithms and design choices that take us there. Cultural studies scholars such as Alexander Halavais have offered critical commentary on search engines, but it wasn’t until I began developing Lantern in 2011 that I bothered to peek under the hood of a search engine for myself. Here are five lessons I learned about search that I hope will prove useful to you, too, the next time you search Lantern or see a query box online.

1. The collection of content you are searching matters a lot.

It would have been great if the first time Carl Hagenmaier, Wendy Hagenmaier, and I sat down to add fulltext search capability to the MHDL’s collections we had been 100% successful. Instead, it took a two-year journey of starts, stops, and reboots to get there. But in other ways, it’s a really good thing that we initially failed. If we had been successful in the fall of 2011, users would have only been able to search a roughly 100,000 page collection composed primarily of The Film Daily, Photoplay, and Business Screen. Don’t get me wrong, those are great publications. And we now have many more volumes of Photoplay and The Film Daily than we did back then. But over the last two years, our collections have boomed in breadth and diversity along with size and depth. Thanks to our partnerships with the Library of Congress Packard Campus, Museum of Modern Art Library, Niles Essanay Silent Film Museum, Domitor, and others, we have added a tremendous number of magazines, broadcasting and early cinema journals, and books. In 2011, a search for “Jimmy Stewart” would have probably resulted in some hits from the fan magazine Photoplay (our Film Daily volumes at that time didn’t go past 1930). Today, the Lantern query “Jimmy Stewart” yields 407 matching page hits. Take a look at the top 10 results ranked by relevancy. Sure enough, 5 of the top 10 results come from Photoplay. But there are also matching pages from Radio and TV Mirror, Independent Exhibitors Film Bulletin, and International Projectionist — all sources where a James Stewart biographer probably would not think to look. And who would guess that International Projectionist would refer to the star with the casual “Jimmy”? These sorts of discoveries are already possible within Lantern, and as the content collection further expands, there will only be more of them.

2. Always remember, you are searching an index, not the actual content.

This point is an important caveat to the first point. Content matters, but it is only discoverable through an index, which is itself dependent upon the available data and metadata. A search index is a lot like the index at the back of a book — it stores information in a special way that helps you find what you are looking for quickly. A search engine index, like the open source Solr index that Lantern uses, takes a document and blows it apart into lots of small bits so that a computer can quickly search it. Solr comes loaded with the search algorithms that do most of the mathematical heavy lifting. But as developers, we still had to decide exactly what metadata to capture and how to index it. In my “Working Theory” essay co-written with Carl and Wendy, I’ve described how MARC records offered insufficient metadata for the search experience our users wanted. In this post, what I want to emphasize is that if something isn’t in the index, and if the index doesn’t play nicely with the search algorithms, then you won’t have a happy search experience. Lesson #3 should make this point clearer.
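But first, for the technically curious, here is a minimal sketch of what adding one page-level document to a Solr index can look like from Python, using the pysolr client (the URL, the identifier, and the field names are my illustrative assumptions, not Lantern’s actual schema):

    import pysolr

    # Connect to a running Solr instance (URL is illustrative).
    solr = pysolr.Solr('http://localhost:8983/solr/mhdl', timeout=10)

    # One document per scanned page. If a piece of metadata never makes it
    # into a field here, no query will ever surface it: you search the index,
    # not the page itself.
    solr.add([{
        'id': 'filmdaily1930_0042',   # hypothetical page-level identifier
        'title': 'The Film Daily',
        'date': '1930',
        'body': 'Bette Davis, stage actress, has been signed by ...',
    }])
    solr.commit()  # make sure the new document is visible to searches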

3. Search algorithms are designed for breadth and scale, so don’t ask them to search in depth.

Open source search algorithms are better at searching 2 million short documents, each containing 500 words of text, than at searching 500 very long documents containing 200,000 words each. I learned this lesson the hard way. At the Media History Digital Library, we scan magazines that have been collected and bound together into volumes. So in our early experiments with Lantern, we turned every volume into a discrete XML file with metadata fields for title, creator, date, etc., plus the metadata field “body” where we pasted all the text from the scanned work. Big mistake. Some of the “body” fields had half a million words! After indexing these XML documents, our search speed was dreadfully slow and, worse yet, the results were inaccurate or only partially accurate. In some cases, the search algorithms would find a few hits within a particular work and then time out without searching the full document. The solution — beautifully scripted in Python by Andy Myers — was to turn every page inside a volume into its own XML document, then index all 800,000 MHDL pages as unique documents. This is the only way we can deliver the fast, accurate search results that you want. But we also recognize that it risks de-contextualizing the single page from the larger work. We believe the “Read in Context” option and the catalog descriptions offer partial answers to this challenge of preserving context, and we’re working on developing additional methods too.
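Andy’s actual script is more sophisticated than this, but the core page-splitting move can be sketched in a few lines of Python (the file layout and field names here are my illustrative assumptions, not our production code):

    import xml.etree.ElementTree as ET

    def split_volume(volume_id, metadata, page_texts):
        """Turn one scanned volume into one small XML document per page,
        copying the volume-level metadata onto every page so that no page
        loses its bibliographic context when indexed on its own."""
        for n, text in enumerate(page_texts):
            doc = ET.Element('doc')
            # Page IDs follow a volume_pagenumber pattern,
            # e.g. photoplayvolume552chic_0257
            page_id = '%s_%04d' % (volume_id, n)
            ET.SubElement(doc, 'field', name='id').text = page_id
            for key, value in metadata.items():  # title, creator, date, ...
                ET.SubElement(doc, 'field', name=key).text = value
            # The body field now holds one page (roughly 500 words),
            # not one volume (up to half a million words).
            ET.SubElement(doc, 'field', name='body').text = text
            ET.ElementTree(doc).write(page_id + '.xml', encoding='utf-8')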

4. Good OCR matters for searchability, but OCR isn’t the whole story.

You don’t need OCR (optical character recognition) to search a blog or a Word .docx file. Those textual works were born digital; a computer can clearly see whether that was an “a” or an “o” that the author typed. In contrast, Moving Picture World, Radio Mirror, and the MHDL’s other books and magazines were born in the print age. In order to make them machine readable, we need to run optical character recognition — a process that occurs on the Internet Archive’s servers using Abbyy Fine Reader. Abbyy has to make a lot of guesses about particular words and characters. We tend to scan works that are in good condition at a high resolution, which leads to Abbyy making better guesses and the MHDL having high quality OCR. Nevertheless, the OCR isn’t perfect, and the imperfections are immediately visible in a snippet like this one from a 1930 volume of Film Daily: “Bette Davis, stage actress, has been signed by Carl Taemmle. Jr.” The snippet should say “Carl Laemmle, Jr.” That is the Universal executive listed on the page, and I wish our database model enabled users to log in and fix these blemishes (hopefully, we’ll get to this point in 2014). But — you may have guessed there was a but coming — our search algorithms use some probabilistic guessing and “stemming,” which splinters individual words and allows your query to search for related words (for instance, returning “reissue” and “reissuing” for a “reissue” query). The aggressiveness of stemming and probabilistic word guessing (aka “fuzziness”) is something that developers can boost or turn down. I’m still trying to flavor Lantern’s stew just right. The big takeaway, though, is that while you’ll quickly notice the OCR quality, there are other hidden processes shaping your results.
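If you’re curious to see stemming in action, a few lines of Python will demonstrate it. I’m using NLTK’s Porter stemmer as a stand-in here; Solr’s own stemmers differ in the details:

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    print(stemmer.stem('reissue'))    # 'reissu'
    print(stemmer.stem('reissuing'))  # 'reissu' -- both collapse to one index term

Fuzziness is a separate mechanism: Lucene’s query syntax accepts an edit distance, so a query like Laemmle~1 could also match the OCR error “Taemmle,” which is one character off.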

5. The search experience has become increasingly visual.

As my colleague Jeremy Morris pointed out to me during one of our food cart lunches outside the UW Library, the search experience has become highly visual. Googling a restaurant now renders a map within the results page. Proquest queries now return icons that display the format of the work — article, book, etc. — but not an image of the actual work. I’d like to think Lantern’s results view one-ups Proquest. We display a full-color thumbnail of the matching page in the results view, not simply an icon. The thumbnail communicates a tremendous amount of information very efficiently. You quickly get a sense of whether the page is an advertisement or a news story, whether it comes from a glossy fan magazine or a trade paper published in broadsheet layout. Even before you read the highlighted text snippet, you get some impression of the page and source. The thumbnails also help compensate for our metadata’s lack of granularity. We haven’t had the resources to generate metadata on the level of individual magazine issues, pages, or articles (it’s here that Proquest one-ups us). By exposing the thumbnail page image, though, you can visually glean some essential information from the source. Plus, the thumbnails showcase one of the strengths of the MHDL collection: the colorful, photo-rich, and graphically interesting nature of the historic magazines.

Ok, now it’s your turn to think algorithmically. When you search for a movie star and sort by relevancy, why is it that the most visually rich pages — often featuring a large photo — tend to rank the highest?

The answer is that those pages tend to have relatively few words. If there are only eight words on a portrait page from The New Movie Magazine and two of them are “Joan Crawford,” then her name accounts for a far higher share of the words on that page than on a page from Variety that is jam-packed with over 1,000 words of text, including a story announcing Joan Crawford’s next picture.
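The underlying arithmetic is simple enough to sketch. This toy calculation ignores the other factors in Solr’s actual scoring formula, but it captures the effect:

    # Term frequency as a share of page length (a simplification; Lucene's
    # scoring also normalizes for document length in a similar spirit).
    portrait_page = 2 / 8.0     # "Joan Crawford" is 2 of 8 words    -> 0.25
    variety_page  = 2 / 1000.0  # "Joan Crawford" is 2 of 1000 words -> 0.002

    print(portrait_page / variety_page)  # 125.0 -- the sparse page wins easily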

Should I tweak the relevancy algorithm so that image-heavy pages aren’t listed so high? Should I ascribe greater relevancy to certain canonical sources, like Photoplay and Variety, rather than magazines outside the canon, like New Movie and Hollywood Filmograph? Or should we weight things the other way around — try to nudge users toward under-utilized sources? I would be curious to know what Antenna readers and Lantern users think.

There are advantages and disadvantages no matter what you choose. The best approach, as I see it, may just be to let the ranking algorithm run as is and use forums like this one to make its workings more transparent.
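For what it’s worth, if I ever did decide to experiment with weighting sources, Solr supports it at query time. Here is a hypothetical sketch (nothing like this runs in Lantern today, and the field names are illustrative):

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/mhdl')  # illustrative URL

    # A 'bq' (boost query) raises the score of matches from one source
    # without filtering the others out; here Photoplay pages count double.
    results = solr.search('"Jimmy Stewart"', **{
        'defType': 'edismax',
        'qf': 'body',
        'bq': 'title:photoplay^2',
    })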


11 Responses to “Let’s talk about search: Some lessons from building Lantern”

  1. Derek Long on August 14, 2013 at 5:32 PM

    Great post, Eric, with some valuable tips, context, and caveats. I would especially echo points 2 and 3.

    I would also add that the early approach to XML indexing, despite its adverse impact on the speed of Lantern (and the long time it took), was actually somewhat useful insofar as it helped train us to think about context from the get-go. Grouping individual volumes together in single XML documents likely helped to ensure the consistency of the metadata when it was eventually split into individual pages. It’s true that from a technical standpoint we had to use a page-level approach, but the more commonsensical volume-level technique we were using earlier helped embed XML context into all of those split pages organically and consistently. As part of the practical process of starting the indexing and of thinking through the organization of the metadata (which is what makes an index like this work), I’m not sure you could have started any other way (at least not without some seriously acrobatic scripting).

    Finally, as regards point 5, I would add that Lantern’s “Filter” drop-down menu on the left side of the interface can be used as a rough guide to how much a particular source has been used, at least in major film studies journals. I’d encourage any readers to check out that feature if they’d like to venture off the beaten path. The “List” feature also gives you more in-depth information about JSTOR citations and historical circulation. That data should hopefully be even richer by 2014.

  2. Eric Hoyt on August 15, 2013 at 7:03 PM

    Derek, these are all excellent points. Thanks for bringing them into the discussion.

    You are right about the XML and metadata process. Coding metadata with XML is sort of like marking up a poem using the TEI schema. It’s time consuming and labor intensive, but it ensures a certain level of quality. Better still, the process forces you to reflect on what you are doing and, oftentimes, reevaluate your entire understanding of both the underlying work and the approach to the mark-up.

    Because generating metadata, structuring data, and creating an index are so time consuming, however, the humanities really need to do a better job of valuing this form of labor and the contribution it makes. I was glad that Christa Williford and Charles Henry raised this point in their CLIR report “One Culture”: http://www.clir.org/pubs/reports/pub151 Hopefully, more people are coming around to understanding this.

    Thanks for reminding users to engage with the interactive magazine gallery / data visualization. I’ll be writing more about that soon myself.

    Eric

  3. […] One of the web’s best resources for buffs of film and radio history, the 800,000 scanned pages of books and periodicals (fan magazines such as Photoplay and Motion Picture; industry news sources including Variety and Motion Picture Daily; even specialist journals with sexy titles like Projection Engineering and Exhibitors Trade Review) in the Media History Digital Library, has just become far easier to explore thanks to Lantern, a search engine developed by MHDL co-director Eric Hoyt. Happy news Hoyt’s fellow Badger David Bordwell passed along, along with some tips on using the interface from Hoyt himself. […]

  4. Jeremy Butler on August 17, 2013 at 7:17 AM

    This is an utterly brilliant resource, Eric. Thank you so much for it and for this blog post that pulls back the curtain and allows us to understand better how Lantern’s search engine was conceived and built. I’m sure I’ll be able to draw many, many resources from it for my film/TV history classes.

    As I work with it more, I should be better able to respond to your questions about the search algorithm, but I can say that for pedagogical purposes the extra weight to image-heavy pages is useful; but then I’m often looking for images to help illuminate lectures. For those doing more hardcore research, a text-centric algorithm might be preferable.

    And here’s a tiny comment about the UI:

    In the Chrome browser, on Windows, the opening of tabs by clicking on links does not function quite as I’d expect. When I control-click, a new tab opens, but not in the background, as it should. Instead, it takes the focus away from the search-results page. And a middle-button click doesn’t open a tab, as it should, but, instead, takes me to that page. A right-button click, followed by choosing “open link in a new tab”, works correctly.

    As I said, a small thing, but I suspect many users have gotten used to being able to open tabs in the background from a search-results page.

    I just tested Lantern on Microsoft Internet Explorer 10 and the tabs open correctly, but there’s a bigger issue: the thumbnails are way too big. And that screws up the page layout on both the search-results page and the details page.

    Oh, the joys of designing for myriad browsers!

    Thanks again for creating such a fabulous service. I look forward to watching it evolve and grow!

    • Eric Hoyt on August 18, 2013 at 2:59 PM

      Hi Jeremy,

      Thank you for the positive feedback about Lantern and my Antenna post. I’m very glad to hear you will be able to put this to use in your teaching. And I like your point about how different users may prefer image-heavy vs. text-centric page results (or even the same user running searches for different purposes — locating examples for teaching vs. evidence for a research article).

      I also appreciate hearing about the aspects of the interface that aren’t working so well. I tested Lantern in Firefox, Chrome, and Safari, though always on a Mac. I now realize that was shortsighted. I will give things a spin next week on a Windows machine and using IE. I think/hope that writing a few more CSS rules will provide the belt & suspenders that makes the formatting work across PCs and Macs in all modern browsers.

      Eric

  5. Jeremy Butler on August 17, 2013 at 7:28 AM

    PS Just ran into an issue with “read in context.” On this Photoplay search result

    http://lantern.mediahist.org/catalog/photoplayvolume552chic_0257

    the caption refers to Lana Turner as being “far left,” but she’s not actually in the image as it’s a two-page spread. I clicked on “read in context” to see the conjoining page and was taken to the wrong page in the original document.

    • Eric Hoyt on August 18, 2013 at 3:24 PM

      Hi Jeremy,

      Thanks for flagging this error. In the spirit of the initial Antenna post, I will try to explain what is happening (though, as you’ll see, there is something going on in this Photoplay volume that even I don’t understand).

      1. We’ve indexed every page from every magazine volume as an XML document. This page-level document’s unique ID maps to the JPEG2000 page image on the Internet Archive’s servers.

      2. Lantern uses the unique ID (in this case, photoplayvolume552chic_0257) to quickly pull that page image from the Internet Archive and display it within Lantern’s results page and Downloading/More Options page.

      3. When you click “Read in Context,” you open a stream of the JPEG2000 images within the Internet Archive’s BookReader. Ideally, this opens to the exact page you want. However, due to the way the BookReader works and some problematic pagination metadata on our end, we sometimes open you one page too far. As a user, if you don’t see what you are looking for immediately, just hit the left arrow key once and it should get you to the right page.
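      To make steps 1-3 concrete, here is a simplified sketch of how a Lantern page ID maps to a BookReader link. The offset stands in for our pagination metadata, and a wrong offset is exactly what sends you off-target:

          def bookreader_url(page_id, offset=0):
              """Map a Lantern page ID to an Internet Archive BookReader link.
              Simplified: 'offset' stands in for our pagination metadata; when
              it is wrong, 'Read in Context' opens on the wrong page."""
              volume, page = page_id.rsplit('_', 1)
              leaf = int(page) - offset
              return 'http://archive.org/stream/%s#page/n%d/mode/2up' % (volume, leaf)

          # bookreader_url('photoplayvolume552chic_0257') gives
          # http://archive.org/stream/photoplayvolume552chic#page/n257/mode/2up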

      In this case, though, that page spread of Lana Turner from Photoplay comes a whole 16 pages behind where you are dropped: http://archive.org/stream/photoplayvolume552chic#page/n233/mode/2up

      This is not a good user experience, to say the least.

      The problem for this Photoplay volume may have to do with the default page that the BookReader assumes is page one. I may be able to fix this in our metadata at the Internet Archive.

      I think the solution to the bigger problem (Read in Context not delivering you 100% of the time to the right page) will only come when we hack open the BookReader’s code and customize our own version of the BookReader. This is high on my priority list for the next year. And thanks to the Internet Archive’s open source license for the BookReader, it is completely possible.

      Eric

      P.S. Well, this one is a bit of a head-scratcher: not all of Lantern’s pagination metadata is accurate.

      • Jeremy on August 19, 2013 at 12:20 PM

        Thanks for the additional info. I’m sure, as these bugs get squashed, that Lantern will become more and more useful!

        Regards,

        Jeremy

  6. Jeremy Butler on September 6, 2013 at 12:44 PM

    It would appear that Lantern does not like foreign characters. I just tried a search for “Méliès” and Lantern responded with the following.

    Oh wait, it would appear that there’s a more systematic problem with Lantern right now. I tried “Melies” and got the same response. Every search I try right now (Friday, 12:43 CDT, 9/6) is bouncing.

    ActiveRecord::StatementInvalid in CatalogController#index

    SQLite3::SQLException: cannot rollback - no transaction is active: rollback transaction

    Request

    Parameters:

    {"utf8"=>"✓",
    "q"=>"Méliès"}

    • Eric Hoyt on September 6, 2013 at 3:40 PM

      Hi Jeremy,

      Thanks for flagging this error. I got quite a few messages from folks today encountering the same issue.

      I’m happy to say that the error has been resolved. My department’s computer media and server specialist, Pete Sengstock, deserves all the credit for properly diagnosing and fixing the glitch. We’ll be making more improvements in the coming months too.

      You can now resume searching away to your heart’s delight. As you’ll see, Lantern holds no grudge against foreign names. A search for “Méliès” returns 57 hits. And if you want to cast a much wider net, the search “melies” returns 2,680 results:

      http://lantern.mediahist.org/?utf8=%E2%9C%93&utf8=%E2%9C%93&q=melies
