Are You Searching Digitized Texts ...

by Craig Stark

8 November 2015

And Coming Up Empty-Handed?

Printer Friendly Version

Say what you will about the effect the so-called Digital Revolution has had on the print book marketplace, I for one have found myself accessing digitized books and other publications often of late - for research, mainly - and am thankful that I can. Though far from complete, at least some of the reference materials that booksellers once had to darken doors of libraries to access (or spend big money to acquire) are now freely available in various digitized formats at the Hathi Trust Digital Library, the Internet Archive, etc., not to mention numerous institutions. Better yet, unlike print content, these publications are readily searchable.

Or are they?

In recent weeks, as I've worked on an upcoming BookThink report, I've spent several hours daily digging through some of these resources, and, though I'd had issues with digitized content in the past, there's nothing like living every day with something that brings things into bold relief that previously were only marginally so. Of these, one of the more frustrating things I came up against was that a significant portion of the digitized texts I was working with were ... junk. Resolution was poor, in some cases text was unintelligible, even to the human eye - and good luck searching them. In some cases I just wasn't able to.

Earlier this week, while attempting to search several volumes of the US Copyright Office's Catalog of Copyright Entries, things ground to a halt. For those of you who don't know, this is an important resource for identifying official publication dates of, among other things, books, also for compiling lists of publisher's First Editions. I had been using it for both, the focus being on a specific publisher. Starting with the 1899 edition, I had been progressing steadily through the decades until, without warning, I hit a major snag. When I attempted to search two specific years in the 1940's, nothing. No results whatsoever. When I paged down through the text, it was obvious why. Many of the scanned pages were borderline illegible. Seemingly, I had the choice of moving on and leaving a gaping hole in my report or slogging through hundreds and hundreds of pages, manually hunting for what I needed, assuming I could even make out what it was - and who knows how many days that would've burned?

I attempted several workarounds. Both files were available for download in PDF format, and when I opened them in Adobe Acrobat Pro, I tried two things - optimization, which is supposed to improve clarity, then the OCR option to convert them to (hopefully) searchable text. The optimization/conversion took place soon enough, and text appeared - but again, searches returned absolutely nothing, even searches for parts of words. Then, as a last resort, I tried saving the files in DOC format, but when I opened them in Word, almost all the pages were blank.

Fortunately, my wife had a suggestion - an application on her laptop called PDF Converter Enterprise Pro (latest version 8.2 at the time of this writing). Developed by Nuance (also makers of Dragon Naturally Speaking and Omnipage Pro), this is a feature-rich tool that, for one thing, enables you to back-convert PDF files into other formats - and recognize recognizable text in the process. It took several hours to do its thing on my files, but the result was a DOC version of each that was to my great relief searchable.

Not everything was perfect, but there wasn't anything that I couldn't work with, even things that, at first glance, looked less than viable. Here is an especially egregious example an origiinal PDF entry - a William Harris Hardy book titled No Compromise With Principle:

Note that, while pretty much a blur, everything can be puzzled out; it's just not searchable. For comparison, here is the converted, searchable DOC entry:

Yes, a lot of this is more or less garbage - and some funny garbage at that - but note here that Hardy's name remains intact, and it was the work of a moment to search for it, then make a parallel comparison to the PDF entry to make sense of everything. And - I should stress that, in most cases, the DOC versions were clear enough on their own that it wasn't necessary to make the PDF comparison. Here is a more representative DOC example:

So, at a hundred bucks or so, the application isn't cheap, but it turned out to be a massive time saver for me, and it effectively brings what otherwise would be many unsearchable digitized texts back into play.

          < to previous article       to previous feature article >

Questions or comments?
Contact the editor, Craig Stark


Comment Comment Comment Comment Comment Comment Comment Comment Comment