The challenges of “useful” OCR

The National Archives’ digitisation project, British Governance in the 20th century – Cabinet Papers, 1914-1975, has been grappling with the question of what makes OCR “useful”. It might be stating the obvious, but OCR is only as useful as the search results it produces.

War Cabinet paper

If OCR’d text consistently misspells key words that are particularly relevant for retrieving certain documents, then searches on those key words will not always return the appropriate documents, and the results will be less accurate.

For the National Archives, it was not enough to set an acceptable range of OCR performance in purely quantitative terms, e.g. that OCR accuracy should not fall below 88%. If the remaining 12% of inaccurate text includes the key words that users are most likely to search by, discovery of the document is impeded or made less likely. For example, if the word “submarine” is particularly relevant to the subject of a document and is consistently misspelt by the OCR software, the document is less likely to be found than if a less relevant word had been misspelt. So even meeting an established minimum level of performance (e.g. 88%) does not necessarily mean that search results will be accurate or useful.
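The gap between an overall accuracy figure and usefulness for search can be illustrated with a small sketch. This is not the National Archives’ method: the functions `word_accuracy` and `keyword_recall`, the sample sentences and the misspellings are all hypothetical, and the comparison assumes the OCR output lines up word for word with the correct text.

```python
def word_accuracy(truth: str, ocr: str) -> float:
    """Share of words in the correct text that the OCR reproduced exactly.

    Simplifying assumption: both texts have the same number of words,
    so they can be compared position by position.
    """
    truth_words = truth.lower().split()
    ocr_words = ocr.lower().split()
    matches = sum(t == o for t, o in zip(truth_words, ocr_words))
    return matches / len(truth_words)


def keyword_recall(keywords: list[str], ocr: str) -> float:
    """Share of the chosen key words that appear correctly spelt in the OCR text."""
    ocr_words = set(ocr.lower().split())
    found = sum(k.lower() in ocr_words for k in keywords)
    return found / len(keywords)


# Hypothetical sample text and two hypothetical OCR outputs.
truth = "the submarine was sighted off the coast at dawn by the patrol"
ocr_a = "the subrnarine was sighted off the coast at dawn by the patrol"  # key word wrong
ocr_b = "the submarine was sighted off tbe coast at dawn by the patrol"   # minor word wrong

keywords = ["submarine"]

for label, text in (("A", ocr_a), ("B", ocr_b)):
    print(label, round(word_accuracy(truth, text), 3), keyword_recall(keywords, text))
```

Both outputs contain exactly one wrong word and so score the same overall accuracy, but only output B can be found by a search for “submarine”.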

The National Archives are also adopting a more qualitative approach to run alongside the quantitative one described above. They are concentrating on identifying the most relevant and frequently misspelt “key” words across all of the OCRd documents. They are then planning to run a global “search and replace” to reinstate the correctly spelt words.

Although this will have only a marginal effect on the overall accuracy ratings, it will increase the usefulness of the OCR to the end user.
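A global “search and replace” of this kind might look something like the following minimal sketch. The `CORRECTIONS` table and the function names are invented for illustration; in practice the list of misspellings would come from reviewing the OCR’d documents, and the project’s own tooling is not described here.

```python
import re

# Hypothetical table of frequent OCR misspellings and their corrections.
# Keys must be lower case, because matches are looked up in lower case.
CORRECTIONS = {
    "subrnarine": "submarine",
    "cabinct": "cabinet",
}


def build_pattern(corrections: dict[str, str]) -> re.Pattern:
    """Compile one pattern that matches any listed misspelling as a whole word."""
    alternatives = sorted(corrections, key=len, reverse=True)
    joined = "|".join(re.escape(word) for word in alternatives)
    return re.compile(r"\b(" + joined + r")\b", re.IGNORECASE)


def correct_keywords(text: str, corrections: dict[str, str]) -> str:
    """Replace each listed misspelling, keeping an initial capital if present."""
    pattern = build_pattern(corrections)

    def fix(match: re.Match) -> str:
        word = match.group(0)
        replacement = corrections[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(fix, text)


sample = "Subrnarine activity reported to the War Cabinct."
print(correct_keywords(sample, CORRECTIONS))
```

Matching on whole words only is a deliberate choice in this sketch: it avoids altering longer words that happen to contain a listed misspelling.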

Reports from JISC Metadata Consultancy

As part of its Digitisation Programme, the JISC appointed consultant Hervé L’Hours to assist the 16 projects in defining their metadata requirements. In particular, Hervé looked at issues relating to technical / preservation metadata and how these were being built into project workflows.

The bulk of this work took place between June and November 2007.

Hervé’s work involved a number of strands, from which various materials are being made publicly available

Hervé was also asked to provide assistance to each of the projects, reviewing their metadata policies and offering guidance where required. For some projects with more experience or reasonably straightforward workflows this was a ‘light touch’; for others, especially those importing complex legacy metadata, more involved help was given.

Digitise a book in 15 minutes!

JISC recently met with representatives of Qidenus Technologies, who are prototyping new robotic book scanning technologies.

QiScan RBSpro scanner

QiScan RBSpro is a fully automated robotic scanner that uses a robotic rubber “finger”, and no suction technologies, to turn the pages of a book. The “finger” senses the type of paper and the machine sets the right angle for handling the paper. The scanner has been successfully tested with 15th and 16th century books.

Key advantages of this new scanner, Qidenus say, are a more efficient workflow and lower labour costs, as one operator can work on up to five machines at the same time. Capture and post-processing activities, such as OCR, are very speedy, and the scanner is said to produce a digitised and searchable book in 15 minutes!

To see for yourself, you can attend the event at the Bayerische Staatsbibliothek München on 18-20 June 2008, where Qidenus will be demonstrating their new products alongside their competitors.

Librarians on the way out?

The JISC and BL-commissioned Google Generation report highlights a number of key points that will have an effect on current and future digitisation projects.

It’s worth reading.

You Spin Me Round – Record Players Exhibition

As well as digitising several thousand sound files, the British Library Archival Sound Recordings project has made multiple digital images of record and music players from its artefacts collections.

Screenshot of Bing ‘Pigmyphone’ toy gramophone, 1920s from Archival Sound Recordings website

This includes gramophones from the 1890s right up to Sony cassette decks from the 1970s.

The players have been photographed from multiple angles, allowing for the objects themselves to be rotated by the user.

‘Read all about it’

19th Century Newspapers

The JISC-funded 19th Century Newspapers digitisation project was highlighted in today’s Guardian as part of a growing number of online newspaper archives which constitute an invaluable resource for historians and researchers.

Stephen Hoare commented:

“The digitisation of the British Library’s 19th-century newspaper collection – the most comprehensive archive ever to go online – was launched in November 2007 after three years of preparation and scanning. The archive covers billions of words and its two million computer-readable pages are a historian’s treasure trove. It represents 48 titles such as the Morning Chronicle, the Graphic, the Examiner and a cluster of Chartist publications.”

Read the full article on The Guardian web site.

Developing International Collaboration for Digitisation: the JISC – National Endowment for Humanities perspective

Hosted by King’s College London. Monday 21st January, 5.30pm – 6.45pm (Room 2B08, Strand Campus)

With presentations and commentary from

Chaired by Sarah Porter, Head of Development, JISC

In celebration of their transatlantic digitisation collaboration grants, JISC (Joint Information Systems Committee) and the NEH (National Endowment for Humanities) are hosting an evening panel session looking at issues related to international digitisation. The evening will draw on the experiences of projects in the area and will also involve discussion to inform future directions.

The event is open to all. The evening will be followed by a wine reception for all attendees.

JISC and the NEH are grateful to the Centre for Computing in the Humanities at King’s College London for hosting the event.

Location: http://www.kcl.ac.uk/about/campuses/strand-det.html
