Do you have experience in OCRing tabular data?

If you have experience in dealing with OCR and tabular data, one of the current JISC-funded mass digitisation projects, the Medical Officer of Health reports, led by the Wellcome Library, would like to hear from you.

Christy Henshaw, from the Wellcome Digital Library:

For our Medical Officer of Health project, we will be digitising health reports that contain a lot of information in tables (as well as charts and graphs). We plan to OCR the reports for full-text indexing, but realise that OCR’ing tabular data isn’t going to be easy, and that double- or triple-rekeying may be necessary.

I would love to hear from anyone who has had any experience with OCR’ing or rekeying tabular data (tables with both text and numbers, including merged cells both horizontally and vertically, text printed on a vertical plane, etc.).

Not only do we plan to get the tabular data into a state that can be searched (the text elements, at least), but to provide the data as CSV or Excel for downloading (as well as visible on the page images themselves). If anyone has ever provided such data from digitised content before, I’d be really interested to hear about your experiences on that too.

Many thanks!

You can post comments to this blog or contact Christy, c.henshaw AT wellcome.ac.uk, and we will then summarise them in a new post.
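As a rough illustration of the CSV question raised above, here is a minimal Python sketch of one way a table with merged cells could be flattened into rows for CSV download. It is only a sketch under stated assumptions, not the project's method: the `Cell` structure, the `expand_to_grid` and `grid_to_csv` helpers, and the sample values are all invented for illustration. It assumes the OCR or rekeying step has already recorded each cell's position and span.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical record for one keyed or OCR'd table cell."""
    row: int          # zero-based row of the cell's top-left corner
    col: int          # zero-based column of the cell's top-left corner
    text: str
    rowspan: int = 1  # number of rows a vertically merged cell covers
    colspan: int = 1  # number of columns a horizontally merged cell covers


def expand_to_grid(cells, fill_merged=True):
    """Lay cells out on a rectangular grid.

    With fill_merged=True the text of a merged cell is repeated in every
    position it covers; otherwise only its top-left position is filled.
    """
    n_rows = max(c.row + c.rowspan for c in cells)
    n_cols = max(c.col + c.colspan for c in cells)
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                is_origin = r == c.row and k == c.col
                if fill_merged or is_origin:
                    grid[r][k] = c.text
    return grid


def grid_to_csv(grid):
    """Serialise the grid as CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


# Invented sample table: a two-row header with one vertical and one
# horizontal merge, followed by a single data row.
cells = [
    Cell(0, 0, "District", rowspan=2),
    Cell(0, 1, "Deaths", colspan=2),
    Cell(1, 1, "Male"),
    Cell(1, 2, "Female"),
    Cell(2, 0, "North"),
    Cell(2, 1, "12"),
    Cell(2, 2, "9"),
]

csv_text = grid_to_csv(expand_to_grid(cells))
print(csv_text)
```

Repeating the merged text in every covered position keeps each CSV row self-describing, at the cost of losing the information that the cells were merged in print. Passing `fill_merged=False` keeps only the top-left value, which stays closer to the page image.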

4 replies on “Do you have experience in OCRing tabular data?”

Dear Christy

Since 2000, my team and I have built up extensive experience in OCRing tables of all kinds into MS Word or Excel format.

I live in Nottingham, UK. You can contact me at my email address for a detailed discussion.

Thanks

Dear Christy

Statistics New Zealand has recently completed a project to digitise our entire run of New Zealand Official Yearbooks from 1893 to 2008, as well as early census data from 1860 to 1916. I have written a conference paper on the subject of digitising tabular data. The paper will be available from this URL: http://www.vala.org.au/vala2012-proceedings/vala2012-session-11-stent

I am happy to discuss our issues with digitising tabular data and our solution with you.

Claire

Hi Christy
Not OCR, but we did a big project digitising historical Australian census reports, including many complex tables. We had them keyed in and marked up on the fly in India, using the DocBook XML schema. See above website.
Table recognition is a challenge. The markup captures the syntax or structure of the tables, but only partly captures the semantics (rows are easy but columns are not). This needs something extra like a DDI markup or generalised table language.
Len Smith
