
The challenges of “useful” OCR

The National Archives’ digitisation project, British Governance in the 20th Century – Cabinet Papers, 1914-1975, has been grappling with the question of what makes OCR “useful”. It might be stating the obvious, but OCR is only as useful as the search results it produces.

[Image: War Cabinet paper]

If OCRd text consistently misspells the key words that are most relevant for retrieving certain documents, then searches on those key words will not always bring up the appropriate documents, and the results will be less accurate.

For the National Archives, it was not enough to define acceptable OCR performance in purely quantitative terms, e.g. by requiring that accuracy should not fall below 88%. If the remaining 12% of inaccurate text includes the key words that users are likely to search by, discovery of that document is impeded. For example, if the word “submarine” is particularly relevant to the subject of a document and is consistently misspelt by the OCR software, the document is less likely to be discovered than if a less relevant word had been misspelt. So even meeting an established minimum level of performance (e.g. 88%) does not necessarily mean that search results will be accurate or useful.
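The point can be illustrated with a small sketch in Python. The sentence, the two OCR outputs and the function names below are invented for illustration and are not taken from the project; accuracy is measured here as a simple word-by-word match, which is only one of several ways it could be calculated.

```python
def word_accuracy(truth: str, ocr: str) -> float:
    """Fraction of word positions where the OCR output matches the ground truth."""
    truth_words = truth.lower().split()
    ocr_words = ocr.lower().split()
    matches = sum(t == o for t, o in zip(truth_words, ocr_words))
    return matches / len(truth_words)


def is_found(query: str, ocr: str) -> bool:
    """True if an exact-match search for `query` would hit this OCR text."""
    return query.lower() in ocr.lower().split()


# Invented example text: one error in each OCR output, but in different words.
TRUTH = "the submarine was sighted off the coast before dawn"
OCR_A = "the submarine was sighted off the coast before dawm"   # minor word wrong
OCR_B = "the subrnarine was sighted off the coast before dawn"  # key word wrong

for label, ocr in (("A", OCR_A), ("B", OCR_B)):
    accuracy = word_accuracy(TRUTH, ocr)
    found = is_found("submarine", ocr)
    print(f"OCR {label}: accuracy {accuracy:.0%}, found by 'submarine': {found}")
```

Both outputs score the same accuracy, yet only the first would be returned by a search for “submarine”.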

The National Archives are also adopting a more qualitative approach to run alongside the quantitative one described above. They are concentrating on identifying the most relevant and frequently misspelt “key” words across all of the OCRd documents. They are then planning to run a global “search and replace” to reinstate the correctly spelt words.

Although this will have only a marginal effect on the overall accuracy ratings, it will increase the usefulness of the OCR to the end user.
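A global “search and replace” of this kind might look something like the following Python sketch. The list of misspellings is hypothetical, as are the names `CORRECTIONS` and `correct_keywords`; the project's actual tooling and word list are not described here.

```python
import re

# Hypothetical table of frequent OCR misspellings and their corrections.
CORRECTIONS = {
    "subrnarine": "submarine",
    "subraarine": "submarine",
    "cabinct": "cabinet",
}

# One pattern matching any known misspelling as a whole word, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in CORRECTIONS) + r")\b",
    re.IGNORECASE,
)


def correct_keywords(text: str) -> str:
    """Replace known misspellings, keeping an initial capital where present."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        fixed = CORRECTIONS[word.lower()]
        return fixed.capitalize() if word[0].isupper() else fixed

    return PATTERN.sub(fix, text)


print(correct_keywords("The Cabinct discussed subrnarine losses."))
```

Only words in the table are changed, so the overall accuracy figure barely moves, but searches on the corrected key words now reach the documents.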

2 replies on “The challenges of “useful” OCR”

Thanks Steve – we are using some proprietary software on the originals. I’ll take a look at these though, sounds interesting – thanks for your help.
