Why are we only talking about ChatGPT?
This post is part of a series about approaches academic libraries in the UK might take in response to the emergence of AI. The purpose of the series is to signpost relevant initiatives and to encourage debate about the way ahead. As ChatGPT is not the only large language model, this post looks at one being exploited by the National Library of Sweden, and at some means of helping library patrons make sense of data.
Bidirectional Encoder Representations from Transformers (BERT) and the library
The National Library of Sweden, or Kungliga biblioteket (KB), has recently used its collections to “train its own model for the Swedish language: KB-BERT1”. Like the early models underlying ChatGPT, KB-BERT is a self-supervised model built on the neural network architecture known as the Transformer.
BERT uses transfer learning in a two-step process. First, the model is exposed to a large dataset, which it makes sense of on its own through a mechanism called self-attention: rather like human attention, the model learns which features of the input to focus on. During this pre-training it learns to predict words that have been masked out of a sequence, based on the probabilities suggested by the surrounding context.

In the second stage, a smaller set of task-specific training data is introduced to fine-tune the model, focusing it on a required task and optimising its performance.
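The self-attention step described above can be sketched numerically. This toy example in plain NumPy (it is not the KB-BERT code, and the matrices are randomly initialised rather than learned) shows the core computation: each token's vector is compared with every other token's, and the resulting attention weights decide how much each token "attends to" the rest of the sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (a toy sketch)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of each token to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1: a distribution over tokens
    return weights @ V, weights               # contextualised vectors + attention weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                       # e.g. a four-token sentence
X = rng.normal(size=(seq_len, d_model))       # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (4, 8): one contextualised vector per token
```

In a real BERT model many such attention heads are stacked in layers, and the weight matrices are learned during pre-training rather than sampled at random.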
KB decided it needed its own Swedish-language model and used its extensive holdings of Swedish textual material, including newspapers, to train it. Ultimately the idea is that “the model could be applied to improving access to collections for researchers, by (i) providing an automated form of classification, (ii) enhancing the searchability, and (iii) improving the OCR cohesion of digital collections2”.
So, is this the way forward: for libraries to train their own models? It may well be part of the answer but, following the argument made in the preceding post, perhaps a simpler, less resource-intensive starting point would be to find mechanisms for releasing collections as data.
When we talk of Collections as Data, we do not just mean digitised material drawn from physical holdings. The data in question may be large aggregations of information about physical or digital holdings. Releasing such metadata could be a great starting point for researchers, enabling them to use their own tools to undertake work on, for example, collection characteristics, collection uses, or bibliographic and cataloguing history.
For such Collections as Data to be released, the metadata needs to be comprehensive and well formed. The resulting dataset would itself need to be described, so that the person accessing it has a full record of when, where and how the data was created and what it contains. Equally, they need to know how it can be used and how it has been used by others. There are already initiatives underway to address some of these needs.
Datasheets for datasets
The Datasheets for Datasets initiative, from Microsoft Research, is about developing detailed packaging information for collections. Even a metadata collection would have a meta-metadata description in the form of a data sheet. For creators it is about, “the process of creating, distributing, and maintaining a dataset, including any underlying assumptions, potential risks or harms, and implications of use3”.
Dataset consumers want to avoid misusing the data and to be sure they are getting the right data for their particular purpose: “transparency on the part of dataset creators is necessary for dataset consumers to be sufficiently well informed that they can select appropriate datasets for their chosen tasks and avoid unintentional misuse4”.
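To make the idea concrete, a datasheet for a metadata collection might look something like the following. The field names and values here are purely illustrative, loosely paraphrasing some of the questions the Datasheets for Datasets initiative poses; they are not an official schema, and the collection described is hypothetical.

```python
import json

# Illustrative sketch only: fields loosely based on the kinds of questions a
# datasheet asks (motivation, composition, collection process, recommended
# uses, limitations, maintenance). Not an official Datasheets schema.
datasheet = {
    "motivation": "Support research into the cataloguing history of a newspaper collection",
    "composition": {
        "instances": "bibliographic metadata records",
        "record_count": "see accompanying export manifest",
        "contains_personal_data": False,
    },
    "collection_process": "Exported from the library management system; dates in manifest",
    "recommended_uses": ["collection analysis", "bibliographic research"],
    "known_limitations": ["OCR errors in older titles", "incomplete holdings data"],
    "maintenance": {"maintainer": "the releasing library", "update_cadence": "annual"},
}

# A datasheet in this form can be published alongside the dataset as JSON.
print(json.dumps(datasheet, indent=2))
```

Publishing such a description alongside the dataset gives researchers the provenance and usage information discussed above in a machine-readable form.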
“Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset’s lifecycle for responsible AI development”5. Data Cards are being developed by Google as one answer to the issues of transparency and traceability of data sources, and the report provides case studies and examples of their use.
Libraries taking control
Providing data about the collections they hold has historically been central to the academic library’s mission. If libraries are to maintain their stake in ensuring the integrity of information sources, they need to create datasets, not least because Arts and Humanities and Social Science scholars are increasingly demanding access to data.
Some libraries, like KB, are taking control of AI by developing their own models, while the Datasheets for Datasets and Data Cards initiatives address the challenges of transparency and accountability that the emergence of AI presents. Academic libraries in the UK can build on these endeavours and take ownership of their own data as a strategy for addressing the many-layered issues which AI raises. In doing so they will build new services for academic research.
If you are interested in these issues, you might like to tune into our podcast mini-series, ‘Is AI for me? Perspectives from the humanities’. This week’s episode is with James Baker of the University of Southampton. The mini-series is part of our Research Talks series.
HAFFENDEN, Chris et al. Making and Using AI in the Library: Creating a BERT Model at the National Library of Sweden. College & Research Libraries, v. 84, n. 1, p. 30, jan. 2023. ISSN 2150-6701. Available at: https://crl.acrl.org/index.php/crl/article/view/25748/33686. Date accessed: 05 June 2023. doi: https://doi.org/10.5860/crl.84.1.30
GEBRU, Timnit et al. Datasheets for Datasets. Communications of the ACM, v. 64, n. 12, dec. 2021. arXiv:1803.09010 [cs.DB]. doi: https://doi.org/10.48550/arXiv.1803.09010
PUSHKARNA, Mahima et al. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22). arXiv:2204.01075 [cs.HC]. doi: https://doi.org/10.48550/arXiv.2204.01075