Archive for OCR

Final IMPACT Conference on OCR, October 2011, London

The final conference of the IMPACT project will take place on 24-25 October 2011 at the British Library in London, with the title:

“Digitisation & OCR: Better, faster, cheaper. Solutions of the IMPACT Centre of Competence and future challenges”

The IMPACT Project (Improving Access to Text) started on 1 January 2008 with the aim of significantly improving the accessibility of historical printed text. It planned to do this by pushing innovation in optical character recognition (OCR) and language technology for historical document processing and retrieval. IMPACT also aimed to remove the barriers that stand in the way of the mass digitisation of European cultural heritage by sharing expertise and building capacity in digitisation across Europe.

In 2011 the IMPACT Centre of Competence will be launched, with the aim of making digitisation of historical printed text in Europe better, faster and cheaper, and of providing tools, services and facilities to further advance the state of the art in this field.

Please join us for this two-day conference to find out more about the project outputs and how you can use them in your own digitisation initiatives. Further details of the conference programme, hosted by the British Library, will become available later in the year. Registration is now possible at the following rates:

* Early bird rate of £100 (max. 100 tickets), available until 1 July 2011
* Normal rate of £120, available from 2 July to 21 October

For bookings, please visit: http://purchase.tickets.com/buy/TicketPurchase?agency=BRITISHLIB&organ_val=25385 and click October.

It is also still possible to give us input on the topics and speakers you would like to see at this conference through the IMPACT LinkedIn discussion: Final conference 2011 at http://tinyurl.com/5sjk789.

This information, along with a brand new KB video on IMPACT, is also available from the IMPACT website: http://www.impact-project.eu/home/


Challenging our understanding of Digitisation


One of the sessions planned for the forthcoming Developer Happiness Days will explore a DIY digitisation workflow:

The session takes you from the act of scanning images and objects, through learning how to process and edit them with software like OCRopus, Blender and OpenCV, to storing and manipulating them online and, finally, printing their digital forms out, mashed together with comments, citations, automatic QR codes and even other digital objects!

While this session is not intended to showcase the results one would expect from large-scale institutional and heritage digitisation projects, it might just force a reconsideration of digitisation practices and trigger some interesting questions and dialogue.

So, if this confrontation with digitisation sounds interesting, project members from JISC digitisation and eContent projects have an opportunity to attend this session.

Spaces will be limited, so please contact me directly if you wish to register your interest: b.showers@jisc.ac.uk.

To find out a little more about this session, you can read Ben O’Steen’s blog and his ideas for “The Secret Life of the Book” session at the event.

Further information about the Dev8d programme is available on the Developer Happiness website.

#dev8d


OCR for the mass digitisation of textual materials

A workshop was held at the University of Bath on 24 September 2009, looking at some of the current issues in using Optical Character Recognition for digitisation, organised in the context of the EU IMPACT project.

Videos, slideshows, notes and questions from the day are now all available from the workshop webpages.


Workshop: Optical Character Recognition (OCR) for the mass digitisation of textual materials: Improving Access to Text

24 September 2009 – UKOLN, University of Bath
http://www.ukoln.ac.uk/events/ocr-2009/


FREE one-day workshop for

* Collection holders in HE and Cultural Heritage organisations
* Users of digitised content for teaching, learning and research

This workshop is funded by the Joint Information Systems Committee (JISC) as
part of a series of workshops & seminars on Achievements & Challenges in
Digitisation & e-Content.

The workshop will provide an opportunity for participants to learn about the
current state-of-the-art in the digitisation of historical texts, to look at
improvements in digitisation techniques currently being explored in research
projects such as the EU-funded IMPACT project, and to explore how Optical
Character Recognition (OCR) is used in practical digitisation contexts and
workflows.

There will also be an opportunity for participants to investigate
the opportunities and challenges of the scholarly use of what is an
ever-increasing range of digitised content, supporting new interdisciplinary
ways of exploring cultural and social history, philology, and the history of
ideas.


European Conference on OCR and Mass Digitisation

From the IMPACT project, a European Union project which is aiming to create a centre of excellence for the digitisation of textual cultural heritage.

Introduction

On 6 and 7 April 2009 the IMPACT project will organise a conference on OCR in mass digitisation projects. This conference will focus on exchanging views with other researchers and suppliers in the OCR field, as well as presenting some preliminary results from the first year of the IMPACT project.

Tentative programme

Monday 6 April 2009: New advances in OCR technology, such as collaborative correction and adaptive OCR techniques, a possible way forward for future large-scale digitisation programmes.

Tuesday 7 April 2009: Current and future challenges facing OCR technology, such as image enhancement and linguistic issues that come up when digitising historical text material.

Both days will feature key speakers from outside of the project, in addition to experts from the IMPACT consortium (to be announced in the near future).

Each day’s programme will last from 10.00 – 18.00, with a conference dinner on the first day.

Practical information

The venue will be the Koninklijke Bibliotheek (KB – National Library of the Netherlands) in The Hague. There is a maximum of 150 participants. Registration is now possible at an early bird fee of € 95. After 1 January 2009, the regular fee will be € 110. This fee includes coffee breaks, lunches and a conference dinner on Monday 6 April.


The challenges of “useful” OCR

The National Archives’ digitisation project, British Governance in the 20th century – Cabinet Papers, 1914-1975, has been grappling with issues of “useful” OCR. It might be stating the obvious, but accurate OCR is only as useful as the search results it produces.


If OCR’d text consistently misspells key words that are particularly relevant for retrieving certain documents, then searches on those key words will not always bring up the appropriate documents, and the results will lack accuracy.

For the National Archives, it was not enough to establish a range of acceptable OCR performance levels from a purely quantitative point of view, e.g. that OCR accuracy should not fall below 88%. If the remaining 12% of inaccurate text includes key words that users are likely to search by, discovery of the document is impeded. For example, if the word “submarine” is particularly relevant to the subject of a document and is consistently misspelt by the OCR software, the likelihood of discovering that document is lower than if another, less relevant word had been misspelt. So even meeting an established minimum percentage (e.g. 88%) does not necessarily mean that search results will be accurate or useful.
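The point can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the sentence and both OCR outputs are invented, `char_accuracy` uses the similarity ratio from Python's `difflib` as a rough stand-in for a proper OCR accuracy measure, and `keyword_found` is a naive exact-match lookup rather than the project's actual search.

```python
from difflib import SequenceMatcher


def char_accuracy(truth: str, ocr: str) -> float:
    """Approximate character-level agreement (0.0 to 1.0) between two strings."""
    return SequenceMatcher(None, truth, ocr).ratio()


def keyword_found(keyword: str, ocr: str) -> bool:
    """Naive exact-match lookup, standing in for a simple full-text search."""
    return keyword.lower() in ocr.lower().split()


# Invented example sentence and OCR outputs (not taken from the Cabinet Papers).
truth = "the submarine was sighted off the coast at dawn"
ocr_a = "the subrnarine was sighted off the coast at dawn"  # error inside the key word
ocr_b = "tbe submarine was sighted off the coast at dawn"   # error in a common word

for label, ocr in (("A", ocr_a), ("B", ocr_b)):
    print(label, round(char_accuracy(truth, ocr), 3), keyword_found("submarine", ocr))
```

Both outputs score well above 88% on this similarity measure, yet a search for “submarine” only finds output B, because the single error in output A falls inside the key word.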

The National Archives are also adopting a more qualitative approach to run alongside the quantitative one described above. They are concentrating on identifying the most relevant and frequently misspelt “key” words across all of the OCRd documents. They are then planning to run a global “search and replace” to reinstate the correctly spelt words.

Although this will have only a marginal effect on the overall accuracy ratings, it will increase the usefulness of the OCR to the end user.
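A minimal sketch of such a key-word correction pass is shown below, in Python. It is not the National Archives' actual tooling: the key-word list, the sample texts, the function names `find_variants` and `apply_corrections`, and the 0.8 similarity cutoff are all hypothetical, and in practice the candidate misspellings would need human review before any global replacement is run.

```python
import re
from collections import Counter
from difflib import get_close_matches

KEY_WORDS = ["submarine", "cabinet"]  # hypothetical list of relevant key words


def find_variants(texts, key_words, cutoff=0.8):
    """Count tokens that closely resemble, but do not equal, a key word."""
    variants = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in key_words:
                continue
            match = get_close_matches(token, key_words, n=1, cutoff=cutoff)
            if match:
                variants[(token, match[0])] += 1
    return variants


def apply_corrections(text, corrections):
    """Replace each whole-word misspelling with its correction (case-sensitive)."""
    if not corrections:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, corrections)) + r")\b")
    return pattern.sub(lambda m: corrections[m.group(1)], text)


# Invented, lower-cased sample OCR output.
docs = [
    "the subrnarine was sighted at dawn",
    "the cabinct discussed the subrnarine report",
]

variants = find_variants(docs, KEY_WORDS)
# Assumption: every detected variant is accepted; a real workflow would review them.
corrections = {variant: key for (variant, key) in variants}
fixed = [apply_corrections(doc, corrections) for doc in docs]
print(variants.most_common())
print(fixed)
```

Counting the variants first gives a frequency-ranked list, which matches the idea of concentrating on the most frequently misspelt key words before replacing them across the whole collection.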
