Feedback on OCRing tabular data
Last week, we asked for feedback on your experience of OCRing tabular data.
Christy Henshaw, working on the JISC-funded digitisation of the Medical Officers of Health reports, has summarised the responses received so far:
I recently posted a request to digital library mailing lists, asking the community to share their experiences and knowledge about encoding historical tabular data. It was also posted on this blog. Since then, I have received many useful emails, documents and links from others with experience in encoding tabular data from historic documents. A big thank you to everyone who got in touch. Here is a summary of what I’ve learned from these responses.
The original records. You need to know your content in order to make sense of it in a digital format, as layout is all-important. Tables come in many configurations, and their layout must be assessed before digitisation to make sure the data are transferred correctly.
For example, you may need to split single columns with two sets of data in each cell into two columns, or include dashes that were printed in the original to indicate missing information. If you know your tables, you can identify natural checksums that allow you to test that the data add up correctly, or make sense when loaded into statistical software such as SPSS. If possible, it is useful to make a note of printer errors (although I imagine these will come to light during QA, when something looks incorrect in the digitised version but turns out to have been wrong in the original). I imagine you could at least try to get a sense of how good the editing was and whether to anticipate errors.
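As a rough illustration of the checks described above, here is a minimal Python sketch. It assumes the rekeyed cells arrive as text, that a printed dash marks missing information, and that a paired cell separates its two values with a slash. The function names and the separator are hypothetical, chosen only for this example, and are not taken from any of the responses.

```python
def parse_cell(text):
    """Convert a rekeyed cell to an int; a printed dash means 'no data'."""
    text = text.strip()
    if text in {"-", "–", ""}:
        return None
    # Assumes commas are thousands separators, e.g. "1,203".
    return int(text.replace(",", ""))


def split_paired_cell(text, separator="/"):
    """Split a cell holding two values (e.g. '12/7') into two parsed values."""
    left, right = text.split(separator, 1)
    return parse_cell(left), parse_cell(right)


def row_adds_up(cells, printed_total):
    """Natural checksum: do the non-missing cells sum to the printed total?"""
    values = [parse_cell(cell) for cell in cells]
    return sum(v for v in values if v is not None) == parse_cell(printed_total)
```

A row that fails the check is not necessarily a rekeying mistake. It may be a printer error in the original, so it is probably best treated as a prompt to look at the page image rather than as an automatic correction.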
OCR v. rekeying. Optical character recognition (OCR), which programmatically decodes text from images, rarely – if ever – works for tabular data. My impression is that OCR engines may accurately pick up the words and numbers in the table, but are not configured to reproduce the layouts. As already stated, layout is key! Therefore, rekeying is almost universally done. Some have OCRed first and then corrected the tables by hand, but in most cases the advice is not to waste time OCRing first. In our case, if we decide to rekey every table in the reports (a single report can contain anywhere from 0 to 150 tables), we might as well rekey the whole report and dispense with OCR completely.
Output formats. During rekeying the text can be marked up in XML or HTML, flexible data formats that can be made available as is and/or converted to other formats. Tables in these documents can be marked up and, I assume, extracted for reuse as raw data.
For a good example of what historical tables can look like in HTML, see Statistics New Zealand’s digitised year books. HTML table mark-up isn’t complicated, as long as you set the formatting rules for translating a printed table into an electronic one; see www.w3.org/wiki/HTML_tables.
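To give a sense of how a marked-up table might be extracted for reuse as raw data, here is a small sketch using only Python's standard library. The sample table and its figures are invented for illustration, and the class and function names are hypothetical. It also assumes a simple table with no nested tables and no colspan or rowspan attributes, which real historical tables will often need.

```python
import csv
import io
from html.parser import HTMLParser

# Invented sample; the figures are not taken from any real report.
SAMPLE = """
<table>
  <tr><th>Disease</th><th>Males</th><th>Females</th></tr>
  <tr><td>Scarlet fever</td><td>12</td><td>-</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collect the text of each th/td cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table's cells as CSV text, one line per table row."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

Calling table_to_csv(SAMPLE) gives two CSV lines, with the printed dash carried through unchanged so that missing information stays visible in the raw data.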
Searching the data. Searching within tables is useful if the tables are very large. Users can then drill down to the specific areas of the table they are interested in. I can see this could be useful for our reports, where we have tables showing instances of notifiable diseases across different sectors of the population, for example. See www.tandf.co.uk/journals/titles/01615440.asp for a paper on metadata for a statistical database (we don’t subscribe to it so I haven’t been able to read it).
We could look into constructing queries based on the full-text data in our Library Catalogue that somehow merge the word search with a structural search. This search would of course result in a list of catalogue records in the normal way, and you would have to delve into each individual record to get to the data. A dedicated database would allow much greater access, and is something to consider.
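Purely as a thought experiment, a dedicated database might store one record per table cell, so that a word search and a structural search can be combined in a single query. The sketch below uses Python's built-in sqlite3 module. The schema, the column names and the demo rows are all invented for illustration and do not describe any existing system of ours.

```python
import sqlite3

# Invented demo data: (report, table title, row label, column label, value).
DEMO_CELLS = [
    ("Report A", "Notifiable diseases", "Scarlet fever", "Males", 12),
    ("Report A", "Notifiable diseases", "Scarlet fever", "Females", 9),
    ("Report B", "Notifiable diseases", "Scarlet fever", "Males", 7),
    ("Report B", "Notifiable diseases", "Diphtheria", "Males", 3),
]


def build_demo_db():
    """Create an in-memory database holding one row per table cell."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cells ("
        " report TEXT, table_title TEXT, row_label TEXT,"
        " column_label TEXT, value INTEGER)"
    )
    conn.executemany("INSERT INTO cells VALUES (?, ?, ?, ?, ?)", DEMO_CELLS)
    return conn


def find_cells(conn, word, column_label):
    """Word search on the row label plus a structural filter on the column."""
    query = (
        "SELECT report, row_label, value FROM cells"
        " WHERE row_label LIKE ? AND column_label = ?"
        " ORDER BY report"
    )
    return conn.execute(query, (f"%{word}%", column_label)).fetchall()
```

With the demo data, find_cells(build_demo_db(), "scarlet", "Males") returns the scarlet fever figures for males from both reports directly, rather than a list of catalogue records to open one by one.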
There may be other ideas or opinions out there, or I may have misunderstood something – please feel free to comment on this blog post!
Thank you, Christy.