Digital capture and conversion of text – overcoming the Optical Character Recognition (OCR) challenges
Paul Ell (moderator), Director, Centre for Data Digitisation and Analysis, Queen’s University Belfast
Aly Conteh, Head of Digitisation, British Library
Julian Ball, Project Manager, University of Southampton
Martin Locock, National Library of Wales
Aly Conteh
The British Library perspective – we have digitised 3m pages of newspapers from the 17th, 18th and 19th centuries and are just about to digitise our fourth century. We are also digitising 25m pages of 19th-century material in conjunction with Microsoft, doing about 1m pages a month.
Challenges we’ve seen:
OCR technology tends to be tuned for modern printed materials. There have been some initiatives working with Fraktur fonts, but that deals with only a small part of the problem. There does not seem to be a comprehensive set of tools that allows us to work with these historic materials.
We understand the issues very well, but how do we stimulate research into the problems and the resulting tool development? We have thought about running a kind of OCR challenge, and that sort of approach does work in our space. We have also put in a proposal under the EU-funded FP7 programme with other organisations to help us deal with this, and we hope to hear in the next few days that we have been successful. Another problem is character versus word accuracy – what level of accuracy do we want? 99.9% is not possible without manual intervention. We need to make the word, not the character, the unit of currency: if we are only getting 50% of words right then we are losing a great deal. The National Library of the Netherlands, in their newspaper digitisation, were seeing 70% character accuracy, which may be down at the 50% level if you look at word accuracy.
There is a lack of methodology and benchmarks for measuring the effectiveness of OCR; at the moment, to measure it we have to physically count the errors.
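Once a hand-keyed ground-truth transcript exists for a sample page, that counting can be automated. A minimal sketch of measuring character and word accuracy against such a transcript, using Python's standard library; the file names are illustrative, and the matching-based measure is only a rough proxy for formally defined error rates:

```python
import difflib

def accuracy(truth, ocr):
    """Rough accuracy: proportion of ground-truth units matched in the OCR output."""
    matcher = difflib.SequenceMatcher(None, truth, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth) if truth else 1.0

truth = open("page_groundtruth.txt", encoding="utf-8").read()
ocr = open("page_ocr.txt", encoding="utf-8").read()

char_acc = accuracy(truth, ocr)                  # character as the unit of currency
word_acc = accuracy(truth.split(), ocr.split())  # word as the unit of currency

print(f"character accuracy: {char_acc:.1%}")
print(f"word accuracy:      {word_acc:.1%}")
```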
Julian Ball
Just finished digitisation of 18th-century parliamentary texts. Showed an example of the DigiBook robot, which was used for large folio volumes. Capture of material included diagrams and maps; applying OCR gives the place names and allows searching. Found 1,200 colour images tucked away in the resources. Images can also be made available for the visually impaired, with boundaries highlighted in a Braille-like way.
Used ABBYY 8, an older version, which copes with different fonts. Introduced OCR and built it into the workflow so we can point it at directories and it chugs through by itself. When you start using OCR you need a lexicon to start translating some of the material. Also used OCR and markup within the text to automatically produce a table of contents.
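The ABBYY integration itself is not described in the notes, but the "point it at directories and let it chug through" pattern looks roughly like the following; this sketch uses the open-source Tesseract engine via pytesseract purely as a stand-in, and the directory layout, file extension and language setting are all assumptions:

```python
from pathlib import Path

from PIL import Image
import pytesseract

IN_DIR = Path("scans")       # directory of page images (assumed layout)
OUT_DIR = Path("ocr_text")
OUT_DIR.mkdir(exist_ok=True)

for image_path in sorted(IN_DIR.glob("*.tif")):
    # OCR one page image; a period-appropriate lexicon/language pack would be configured here
    text = pytesseract.image_to_string(Image.open(image_path), lang="eng")
    out_path = OUT_DIR / (image_path.stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
    print(f"{image_path.name} -> {out_path.name}")
```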
Martin Locock
- Welsh orthography
- Digraphs and diacritics
- Scanning for OCR
- Books from the Past
- Welsh Journals Online
Welsh orthography changes over time. Accents are not incidental characters but are quite common and if we don’t read them correctly we will devalue the resource.
How we capture images: at least 300dpi is necessary, and greyscale rather than bi-tonal, otherwise the text blobs together.
How we run it through the OCR: the Books from the Past project was deliberately done as a pilot to look at the issues of digitising text from a range of periods. The intention was to provide the scanned image and clean TEI text to accompany the scans. We required the OCR contractor to identify diacritics, but when the output was cleaned by hand it was discovered that most were wrong. To measure accuracy we need to compare it against a Welsh dictionary.
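One way that dictionary comparison might be automated is to score the proportion of OCR tokens found in a Welsh wordlist. A minimal sketch, assuming a plain-text wordlist is available; the dictionary hit-rate is only a proxy for accuracy, since a misrecognised word can still be a valid Welsh word:

```python
import re

# A plain-text Welsh wordlist, one word per line (assumed to exist)
with open("welsh_wordlist.txt", encoding="utf-8") as f:
    dictionary = {line.strip().lower() for line in f if line.strip()}

with open("page_ocr.txt", encoding="utf-8") as f:
    ocr_text = f.read()

# Keep letters, including Welsh diacritics (ŵ, ŷ, â, ...), when tokenising
tokens = re.findall(r"[^\W\d_]+", ocr_text)

hits = sum(1 for t in tokens if t.lower() in dictionary)
rate = hits / len(tokens) if tokens else 0.0
print(f"{hits}/{len(tokens)} tokens found in the dictionary ({rate:.1%})")
```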
Is it essential? The question is whether we can just ignore it. At the time of Books from the Past we decided no, we couldn't, because accents change the meaning; but in future we could perhaps handle it in different ways. There are strict rules about the alternatives, so we may be able to enhance search capabilities without harming the dataset. Stop words – we also need to think about Welsh stop words (the equivalent of English 'the').
Welsh Journals Online is a different project with a different approach. It deals with 20th-century texts, and the assumption is that most people will be looking at the scanned page. About 40% of the content is Welsh, so we need to get it right. We may be able to differentiate searching behaviour according to the Welsh or English interface and apply special rules to Welsh-language content. We have to work out how to deal with accents: we may decide to normalise to the unaccented form and work round it, or we could have a go at identifying them where possible, although there will be no manual editing of the TEI so we would have to accept mistakes. It is more a search issue than anything else, as we will be presenting the scanned form. One solution may be silent fuzzy searching (lookup tables) or 'did you mean…'.
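A sketch of the "normalise to the unaccented form" idea, applied to the search index only so that the displayed page image and TEI stay untouched. Welsh circumflexes and other accents decompose cleanly under Unicode NFD, so stripping combining marks gives a folded lookup key; the function name and the examples are illustrative:

```python
import unicodedata

def search_key(text: str) -> str:
    """Fold diacritics (ŵ, ŷ, â, ô, é, ...) to base letters for search indexing."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

# The accented original and an unaccented query map to the same key, so a search
# for "tan" also finds "tân" (fire) - silent fuzzy matching, at the cost of
# conflating words that the accent distinguishes.
print(search_key("Tŷ Gwyn"))  # -> "ty gwyn"
print(search_key("tân"))      # -> "tan"
```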
Discussion
Paul Ell: We can set the software so that the operator has to key in uncommon characters.
JB: We just use plain ASCII text; it's all done automatically.
PE: A lot of what we've done is based on statistical data, and things really do have to be right. People quickly find the errors and it really destroys their faith in the material, so we have to do lots of testing.
AC: Do we present the OCR? On something as complex as a newspaper it's just a block of text. I think it's a good thing to do for those who understand it, but what do you do with all the people who don't understand what OCR is and why it looks so terrible?
PE: I like the get-out clause of saying that this is the trade-off of digitising x million pages…
JB: We're 98% character correct.
PE: So there are quite a few word errors in that…
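A rough back-of-envelope illustration of that gap, assuming character errors fall independently (optimistic, since in practice they cluster): at 98% character accuracy a five-letter word is fully correct with probability of about 0.98^5 ≈ 0.90, so roughly one word in ten can be expected to contain at least one error.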
JB: My feeling is that it's early days with OCR; we should admit that and be aware of those problems.
PE: Histpop asked for 98.99% accuracy, and they kept something back. They did do some rekeying.