Jisc recently joined forces with the Centre for the History of Science, Technology and Medicine to run one of our Live Labs (see this previous post about our Cardiff lab) at the University of Manchester on 11 May. Some 20 academics, students and convenors assembled to explore the UK Medical Heritage Library (UKMHL) which sits on Jisc’s Historical Texts website. We were keen to look at a variety of ways of working with text contained in such resources.
After a brief introduction from the Centre’s Professor Pratik Chakrabarti, I set out what we would explore on the day. Most importantly, I suggested that the lab was conceived as a feedback loop, rather than as a workshop in the classic sense. We were looking for feedback on the resources we had built. Equally, we wanted to understand how participants might use these resources in the future, and what kind of support they might need to make the best use of them. We were also there to offer opportunities to explore content as text, as image, and through its relationships to other content.
Owen Stephens, who would lead the day, then took people through his approach, saying that he wanted to look at the existing interfaces but also to explore less visible ones such as the Historical Texts API. Owen had been involved in the development of both Historical Texts and the UKMHL, so he was very well placed to help people explore these resources. We also had some highly knowledgeable members of the Historical Texts team on hand, which made the day even more informative. People seemed very impressed by the resources, even though some were encountering them for the first time.
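For those curious about what working with an API like this involves, the sketch below shows the general shape of a keyword search request. To be clear, the base URL, endpoint and parameter names here are invented for illustration; the real Historical Texts API will differ, so consult its documentation before writing anything against it.

```python
from urllib.parse import urlencode

# HYPOTHETICAL endpoint and parameter names, for illustration only --
# the real Historical Texts API may use different ones.
BASE_URL = "https://historicaltexts.jisc.ac.uk/api/search"

def build_search_url(query, collection="ukmhl", page=1):
    """Build a keyword-search URL against one collection, one page at a time."""
    params = {"q": query, "collection": collection, "page": page}
    return BASE_URL + "?" + urlencode(params)

# A script could now fetch this URL, parse the results, and increment
# `page` to walk the whole result set programmatically.
print(build_search_url("cholera"))
```

The attraction of an API over the web interface is exactly this kind of programmatic access: a researcher can page through an entire result set and feed it into their own tools.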
After a short comfort break, Pratik shared his views on how the visualisation tools we have built on UKMHL can inform explorations of the History of Medicine. Pratik was a member of the advisory group for the UKMHL project and made significant contributions to the development of the tools. They work in combination with Historical Texts’ Elasticsearch functions to give readers new insights into content that might remain hidden if they used the classic search functions on their own. Pratik suggested that the tools can provide good ways into what the archive contains, but that they also provide biased views, as they rely on various categorisations of the content, and the determinants for these are somewhat arbitrary.
One attendee made the point that resources like UKMHL can lead readers to think that the dataset is comprehensive or exhaustive, that it contains everything, when in fact it is a subset. These tools only offer a starting point and their limitations should be understood by those using them [paraphrased]. There is certainly a danger of our being overly technologically deterministic in the provision of these resources. I think the critical thing to remember is the relationships between things. Martin Poulter, in a previous lab, was able to demonstrate how Wikimedia tools can help to link various kinds of knowledge relating to entities, but making relationships apparent is a great challenge for electronic knowledge systems and relies on people working with tools.
After lunch, Dr Elizabeth Toon, a Lecturer at the Centre, presented on Text Mining the History of Medicine, a project which used text data from the British Medical Journal, under licence from Medline, and from the Wellcome Library’s (Jisc-funded) London’s Pulse: Medical Officer of Health reports 1848–1972. The project was run in conjunction with the National Centre for Text Mining (NaCTeM).
She spoke about her own learning experiences with text mining tools, explaining what it was like to start from a position of having little initial knowledge of the tools and techniques. She showed some texts that had been subjected to Natural Language Processing (NLP) to allow searching in new ways.
Elizabeth made a critical point about the limits of text mining techniques. She said that it was the academics who were the arbiters of how the system works, and that it requires a lot of human input to ‘train’ the system to recognise entities. In academia it works well to support complex searching trained on a particular corpus, but one has to know what the parameters are for it to be effective. [paraphrased]
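To make that point concrete, here is a deliberately simplified sketch of what ‘human input’ can mean in practice: a researcher-curated gazetteer driving entity recognition. This is not NaCTeM’s actual pipeline, and the terms and labels below are invented examples; real systems use trained statistical models, but they depend on the same kind of human-supplied knowledge, and they inherit its gaps.

```python
import re

# Toy, human-curated gazetteer (invented examples, not NaCTeM's data).
# Anything missing from this dictionary is simply invisible to the tagger --
# which is exactly why so much human input is needed.
gazetteer = {
    "cholera": "DISEASE",
    "typhoid": "DISEASE",
    "quinine": "DRUG",
}

def tag_entities(text):
    """Return (term, label, position) for each gazetteer term found in text."""
    hits = []
    for term, label in gazetteer.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text.lower()):
            hits.append((term, label, m.start()))
    return sorted(hits, key=lambda h: h[2])

print(tag_entities("An outbreak of cholera was treated, unwisely, with quinine."))
```

A search built on this kind of tagging will find every document mentioning a DISEASE, but only the diseases someone thought to list: knowing those parameters is what makes the results interpretable.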
Elizabeth highlighted one potential pitfall: when you are doing this work, your searches may not behave in the way that you think. You must never take the data for granted, and you should read all the documentation before setting out to use these techniques to support your research.
Elizabeth thinks that Google Ngrams is a good tool, but the dataset (Google Books) is not consistent and is only a subset of all available texts, so again care needs to be taken when interrogating these resources.
Dr Toon’s final comments were that text mining and topic modelling allow researchers to encounter existing resources in new ways, and can support new research questions and reinvigorate old ones. Throughout the rest of the afternoon her experiences were most informative to those encountering books as data for the first time. This is what makes labs so interesting: the interactions between people, some looking at content for the very first time, and others with lots of experience of working with text as data bringing their knowledge to bear.
Owen then took us through OpenRefine and also explored other tools (see the previous labs post for links). People enjoyed loading texts into the tools to see what they could discover. Data looks different when you open it up in Voyant Tools, and some found that data they had expected was missing, but this in itself was interesting.
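For readers who have not tried this, the term-frequency view that tools like Voyant present graphically boils down to something quite simple. The sketch below, using an invented sample text, shows the basic mechanics of treating text as data: tokenise, drop common function words, count.

```python
from collections import Counter
import re

# A small stopword list -- real tools ship much longer, configurable lists,
# and the choice of stopwords itself shapes what you "discover".
STOPWORDS = {"the", "of", "a", "and", "in", "to", "was", "as"}

def top_terms(text, n=5):
    """Tokenise, drop stopwords, and return the n most common terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

sample = "The fever spread through the town; fever cases rose as the town council met."
print(top_terms(sample))
```

Even this toy version makes the earlier caveats visible: if the OCR mangled a word, or the stopword list swallowed it, it simply never appears in the counts.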
A participant thought it would be a good idea to allow in-viewer user corrections of the Optical Character Recognition (OCR) text, which appears in one of the slide-out windows. He also thought that the Elasticsearch functionality was providing a pretty sophisticated search. We have been improving this over time and are looking to make further improvements, so feedback is very much appreciated.
One person was excited by new things being revealed when the texts were displayed in Voyant Tools. This shows that people can suddenly see new possibilities when data is presented in new ways, making it vital to provide multiple entry points. This was something that also came up in a report we commissioned in partnership with ProQuest on our separate provisions of Early English Books Online (EEBO), one of the collections on Historical Texts. People use interfaces for a diversity of purposes, and interfaces provide particular views upon a corpus. This is their nature. They reflect the limitations of decision-making processes and of the underlying technologies, but these renderings also open up unexpected vistas. This has been an abiding theme of our labs.
Another attendee suggested that the UKMHL content was not very diverse in terms of its spread of UK locations, and that it seemed to be focussed on particular metropolitan areas. There was some discussion as to why this might be. Was it a condition of the project, a selection decision, the nature of the content? The discussion then started to focus on digitisation and the kinds of decisions made during content selection. This is an old theme of Jisc’s digitisation programmes. Collections accrue in libraries over time. They come from diverse sources and are collected for all kinds of reasons. When library collections are selected for digitisation, and librarians start to apply decisions about what can be digitised and what should be prioritised, those curatorial decisions have an impact on what we have to explore when encountering the resources. UKMHL is made up of 10 collections drawn from university libraries and professional associations, so there are biases all along the way.
A participant explained that she is researching medical archaeology. She said that, after hearing Owen’s presentation, the lab had provided her with a deeper understanding of why we had made certain design decisions. She liked the possibility of introducing crowdsourcing tools (a possibility we are exploring), which could lead to improved image metadata and, in turn, improve the accuracy of the image wall.
Finally, there were some observations about how using visualisation tools is quite different to using text tools such as Voyant, and that it is important to select the right tools for the job and to have clear research questions. The same thing lies at the root of all academic work: one needs to decide upon a research question before selecting the right kind of tools. It doesn’t really matter whether the tools are electronic or traditional (card catalogues and archive boxes). Looking at large aggregations of content as data does offer new approaches to questions, and providers of interfaces need to give researchers sufficient aid to pick effectively from a range of options, from the microscopic (close reading) to the macroscopic (distant reading). The transition to electronic research tools in the Humanities has been slow, but more scholars are starting to see the benefits of mixed methods, both in discovery and in process.
An abiding memory of this lab is the closing comments made by one participant, who said that she wanted to be able to cross-reference a resource like UKMHL with other sources of texts, e.g. Google Books, the HathiTrust Digital Library and the Internet Archive. This is a matter of some complexity, because Google search does not do that and, to my knowledge, nothing else does either; something to think about!
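Part of what makes this hard is that the ‘same’ book appears under slightly different titles in different catalogues. A hedged sketch of one small corner of the problem, fuzzy-matching titles between two catalogues using invented records and Python’s standard library, might look like this; real cross-referencing would also need to reconcile authors, dates and editions.

```python
from difflib import SequenceMatcher

def similar(a, b):
    """Rough similarity ratio between two title strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Invented example records from two hypothetical catalogues.
ukmhl_titles = ["A Treatise on Fevers", "Notes on Sanitary Reform"]
other_titles = ["Treatise on fevers, A", "History of Surgery"]

# Pair up titles whose similarity clears an (arbitrarily chosen) threshold.
matches = [(u, o) for u in ukmhl_titles for o in other_titles
           if similar(u, o) > 0.6]
print(matches)
```

Even this toy shows why no one offers the feature yet: every catalogue normalises titles differently, and a single threshold will always let some false matches in and keep some true ones out.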
This post draws on extensive notes taken at the event by Paul Flieshman, Historical Texts and Journal Archives Support Officer.