Do you have experience in OCRing tabular data?

If you have experience in dealing with OCR and tabular data, one of the current JISC-funded mass digitisation projects, the Medical Officer of Health reports, led by the Wellcome Library, would like to hear from you.

Christy Henshaw, from the Wellcome Digital Library:

For our Medical Officer of Health project, we will be digitising health reports that contain a lot of information in tables (as well as charts and graphs). We plan to OCR the reports for full-text indexing, but realise that OCR’ing tabular data isn’t going to be easy, and that double- or triple-rekeying may be necessary.

I would love to hear from anyone who has had any experience with OCR’ing or rekeying tabular data (tables with both text and numbers, including merged cells both horizontally and vertically, text printed on a vertical plane, etc.).

We plan not only to get the tabular data into a state that can be searched (the text elements, at least), but also to provide the data as CSV or Excel for downloading (as well as visible on the page images themselves). If anyone has ever provided such data from digitised content before, I’d be really interested to hear about your experiences on that too.

Many thanks!

You can post comments to this blog or contact Christy, c.henshaw AT wellcome.ac.uk, and we will then summarise them in a new post.
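For readers unfamiliar with double-rekeying, the idea is that two operators key the same table independently and only the cells where they disagree are sent for review. A minimal sketch of that comparison step follows, assuming each keying arrives as CSV text; the function names and the sample figures are made up for illustration and are not taken from the Medical Officer of Health reports or from any project tooling.

```python
import csv
import io


def read_grid(csv_text):
    """Parse CSV text into a list of rows, each a list of cell strings."""
    return [row for row in csv.reader(io.StringIO(csv_text))]


def keying_disagreements(first, second):
    """Return (row, column, first_value, second_value) for each differing cell.

    Missing rows or cells are treated as empty strings, so a keying that
    drops a cell is reported as a disagreement rather than raising an error.
    """
    diffs = []
    for r in range(max(len(first), len(second))):
        row_a = first[r] if r < len(first) else []
        row_b = second[r] if r < len(second) else []
        for c in range(max(len(row_a), len(row_b))):
            a = row_a[c].strip() if c < len(row_a) else ""
            b = row_b[c].strip() if c < len(row_b) else ""
            if a != b:
                diffs.append((r, c, a, b))
    return diffs


# Hypothetical sample data: two independent keyings of the same small table.
keying_a = "District,Births,Deaths\nNorth,412,198\nSouth,387,201\n"
keying_b = "District,Births,Deaths\nNorth,412,193\nSouth,387,201\n"

for row, col, a, b in keying_disagreements(read_grid(keying_a), read_grid(keying_b)):
    print(f"row {row}, column {col}: {a!r} vs {b!r}")
```

With a third keying, the same comparison could be run pairwise and a cell accepted when two of the three agree. Merged cells and vertically printed text would still need a convention agreed with the keyers before a cell-by-cell comparison like this is meaningful.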

4 replies on “Do you have experience in OCRing tabular data?”

Dear Christy

Since 2000, my team and I have built up extensive experience and expertise in OCRing tables of any kind into MS Word or Excel format.

I live in Nottingham, UK. You can contact me at my email address for a detailed discussion.

Thanks

Dear Christy

Statistics New Zealand has recently completed a project to digitise our entire run of New Zealand Official Yearbooks from 1893-2008, as well as early census data from 1860-1916. I have written a conference paper on the subject of digitising tabular data. The paper will be available from this URL: http://www.vala.org.au/vala2012-proceedings/vala2012-session-11-stent

I am happy to discuss our issues with digitising tabular data and our solution with you.

Claire

Hi Christy
Not OCR, but we did a big project digitising historical Australian census reports, including many complex tables. We had them keyed in and marked up on the fly in India, using the DocBook XML schema. See above website.
Table recognition is a challenge. The markup captures the syntax or structure of the tables, but only partly captures the semantics (rows are easy but columns are not). This needs something extra like a DDI markup or generalised table language.
Len Smith
