Conference 2007: Workshop: Overcoming OCR challenges

Digital capture and conversion of text – overcoming the Optical Character Recognition (OCR) challenges

Paul Ell (moderator), Director, Centre for Data Digitisation and Analysis, Queen’s University Belfast
Aly Conteh, Head of Digitisation, British Library
Julian Ball, Project Manager, University of Southampton
Martin Locock, National Library of Wales

Aly Conteh

The British Library perspective – we have digitised 3m pages of newspapers from the 17th, 18th and 19th centuries and are just about to digitise our fourth century – we are also digitising 25m pages of 19th-century material in conjunction with Microsoft, doing about 1m pages a month.

Challenges we’ve seen:

OCR technology tends to be tuned for modern printed materials. There have been some initiatives working with Fraktur fonts, but that deals with only a small part of the problem. There does not seem to be a comprehensive set of tools that allows us to work with these historic materials.

We understand the issues very well, but how do we stimulate research into the problems and the resultant tool development? We have thought about running a kind of OCR challenge, and that approach does work in our space. We have also thought about a proposal under the EU-funded FP7 programme to help us deal with this, and have put one in with other organisations; we hope that in the next few days we will hear that we have been successful. Another problem is the character versus word accuracy issue – what level of accuracy do we want? 99.9% is not possible without manual intervention. We need to make the word the unit of currency, and if we are only getting 50% then we are losing a great deal. The National Library of the Netherlands, with their newspaper digitisation, were seeing 70% character accuracy, which may be down at the 50% level if you look at word accuracy.
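
As a rough illustration of why word accuracy sits so far below character accuracy, here is a minimal sketch (my addition, not from the talk) that assumes character errors are independent and words average five characters; in real OCR output errors cluster on damaged regions, so actual word accuracy usually lands somewhere between this estimate and the character figure.

    # Rough estimate of word-level accuracy from character-level accuracy,
    # assuming character errors are independent and evenly spread (a simplification).
    def estimated_word_accuracy(char_accuracy: float, avg_word_length: float = 5.0) -> float:
        """A word only counts as correct if every character in it is correct."""
        return char_accuracy ** avg_word_length

    for char_acc in (0.999, 0.99, 0.98, 0.70):
        print(f"{char_acc:.1%} characters -> roughly {estimated_word_accuracy(char_acc):.0%} words")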

There is a lack of methodology and benchmarks for measuring the effectiveness of OCR. To measure it we have to physically count the errors.
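
Where a hand-keyed ground-truth transcription exists for a sample page, the counting can at least be automated with edit distance; a minimal sketch of that idea (my illustration, not a benchmark mentioned in the session):

    # Character and word error rates of OCR output against a manually keyed transcription.
    def levenshtein(a, b) -> int:
        """Standard dynamic-programming edit distance between two sequences."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            curr = [i]
            for j, y in enumerate(b, 1):
                curr.append(min(prev[j] + 1,              # deletion
                                curr[j - 1] + 1,          # insertion
                                prev[j - 1] + (x != y)))  # substitution
            prev = curr
        return prev[-1]

    def error_rates(ocr_text: str, truth_text: str):
        cer = levenshtein(ocr_text, truth_text) / max(len(truth_text), 1)
        truth_words = truth_text.split()
        wer = levenshtein(ocr_text.split(), truth_words) / max(len(truth_words), 1)
        return cer, wer

    cer, wer = error_rates("Parliarnentary Papcrs, 1780", "Parliamentary Papers, 1780")
    print(f"character error rate {cer:.1%}, word error rate {wer:.1%}")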

Julian Ball

Just finished digitisation of 18th-century parliamentary texts. Showed an example of the DigiBook robot, which was used for large folio volumes. Capture of the material included diagrams and maps. Applying OCR gives the place names and allows searching. Found 1,200 colour images tucked away in the resources. Can also make images available for the visually impaired, highlighting boundaries in a similar way to Braille.

Used ABBYY 8 and the older version, which copes with different fonts. Introduced OCR and built it into the workflow, so we can point it at directories and it chugs through by itself. When you start using OCR you need a lexicon to start translating some of the material. Also used OCR and mark-up within the text to automatically produce a table of contents.
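
The "point it at a directory and let it chug through" pattern looks roughly like the sketch below. The project used ABBYY, whose SDK is not shown here; the open-source Tesseract engine (via pytesseract) stands in purely for illustration, and the directory names are hypothetical.

    # Minimal unattended batch OCR over a directory of page images.
    from pathlib import Path

    import pytesseract            # stand-in OCR engine for this sketch
    from PIL import Image

    def ocr_directory(image_dir: str, out_dir: str, lang: str = "eng") -> None:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for image_path in sorted(Path(image_dir).glob("*.tif")):
            text = pytesseract.image_to_string(Image.open(image_path), lang=lang)
            (out / (image_path.stem + ".txt")).write_text(text, encoding="utf-8")

    # ocr_directory("scans/volume01", "ocr/volume01")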

Martin Locock

  • Welsh orthography
  • Digraphs and diacritics
  • Scanning for OCR
  • Books from the Past
  • Welsh Journals Online

Welsh orthography changes over time. Accents are not incidental characters but are quite common and if we don’t read them correctly we will devalue the resource.

How we capture the images: over 300dpi is necessary, and use greyscale rather than bi-tonal, otherwise the text blobs together.
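
A minimal sketch of that capture rule using Pillow – keep the image greyscale rather than thresholding to bi-tonal, and flag anything under 300dpi; the filenames are hypothetical:

    # Keep captures greyscale for OCR and warn when the resolution is too low.
    from PIL import Image

    def prepare_for_ocr(src_path: str, out_path: str, min_dpi: int = 300) -> None:
        img = Image.open(src_path)
        dpi = int(img.info.get("dpi", (0, 0))[0])
        if dpi and dpi < min_dpi:
            print(f"warning: {src_path} is {dpi}dpi, below the recommended {min_dpi}dpi")
        # "L" = 8-bit greyscale; avoid converting to "1" (bi-tonal), which blobs the text
        img.convert("L").save(out_path, dpi=(dpi or min_dpi, dpi or min_dpi))

    # prepare_for_ocr("raw/page001.tif", "grey/page001.tif")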

How we run it through OCR: the Books from the Past project was deliberately done as a pilot to look at the issues of digitising text from a range of periods. The intention was to provide the scanned image and clean TEI text to accompany the scans. We required the OCR contractor to identify diacritics, but when the text was cleaned by hand it was discovered that most of them were wrong. To measure accuracy we need to compare the output against a Welsh dictionary.
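
A crude version of that dictionary comparison (my sketch; the wordlist path is hypothetical) just counts how many OCR tokens appear in a Welsh wordlist – bearing in mind it overcounts, since a misread word can still happen to be a valid Welsh word:

    # Rough accuracy proxy: the share of OCR tokens found in a Welsh wordlist.
    import re

    def dictionary_hit_rate(ocr_text: str, wordlist_path: str) -> float:
        with open(wordlist_path, encoding="utf-8") as f:
            lexicon = {line.strip().lower() for line in f if line.strip()}
        tokens = re.findall(r"\w+", ocr_text.lower())
        return sum(token in lexicon for token in tokens) / len(tokens) if tokens else 0.0

    # rate = dictionary_hit_rate(page_text, "welsh_wordlist.txt")
    # print(f"{rate:.1%} of OCR tokens found in the wordlist")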

Is it essential? There is a question of whether we can just ignore it – at the time of Books from the Past we decided no, we could not. Accents change the meaning, so we cannot ignore them, but in future we could perhaps handle them in different ways. There are strict rules about the alternatives, so we may be able to enhance search capabilities without harming the dataset. Stop words – we also need to think about Welsh stop words (the equivalent of the English 'the').

Welsh Journals Online is a different project with a different approach. It deals with 20th-century texts, and the assumption is that most people will be looking at the scanned page. About 40% of the content is Welsh, so we need to get it right. We may be able to differentiate searching behaviour according to a Welsh or English interface, and can apply special rules to Welsh-language content. We have to work out how to deal with accents. We may decide to normalise to the unaccented form and work around it, or we could have a go at identifying accents where possible – although we will be doing no manual editing of the TEI, so we would have to accept mistakes. It is more a search issue than anything else, as we will be presenting the scanned form. One solution may be silent fuzzy searching (lookup tables) or 'did you mean…'.
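
A minimal sketch of the silent fuzzy-search idea (my illustration, not the Welsh Journals Online implementation): fold the accents off both the indexed text and the query, so a search typed without diacritics still hits the accented form. Because accents change meaning, the folding is applied only at search time rather than to the stored text.

    # Accent-folding for "silent" fuzzy search over Welsh text.
    import unicodedata

    def fold_accents(text: str) -> str:
        """Strip combining marks (circumflex, acute, grave, diaeresis) from the text."""
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    def matches(query: str, page_text: str) -> bool:
        """True if the query occurs in the page once both are accent-folded."""
        return fold_accents(query).lower() in fold_accents(page_text).lower()

    print(fold_accents("gwŷr"))              # -> "gwyr"
    print(matches("ty", "Tŷ'r Cyffredin"))   # -> True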

Discussion

Paul Ell: we can set the software so that the operator has to key in uncommon characters

JB: we just use plain ASCII text; it's all done automatically

PE: a lot of what we’ve done is based on statistical data and things really do have to be right. people quickly find the errors and it really destroys their faith in the material. so we have to do lots of testing

AC: do we present the OCR? on a complex thing like a newspaper it's just a block of text. I think it's a good thing to do for those who understand it, but what do you do with all the people who don't understand what OCR is and why it looks so terrible?

PE: I like the get-out clause of saying that this is the trade-off of digitising x million pages…

JB: we’re 98% character correct

PE: so there are quite a few word errors in that…

JB: my feeling is that it’s early days with OCR and admit that and be aware of those problems

PE: 98.99% accuracy asked for by Histpop and they kept something back. they did do some rekeying.
