As reported in a previous post, a couple of weeks ago we were pleased to be joined by 180 people for our Getting Your Collections AI Ready webinar. The focus of the session was on how academic libraries can get their collections online in forms consumable by people and machines.
Though AI was in the title and is very topical at the moment, we were really more focused on the data that such technologies might consume.
An ongoing debate concerns the lack of transparency when big players in the AI game train their models. We posit that libraries can gain an influential footing in the provision of transparent, machine-ready collections by doing what they have long done: making information available and describing it so that it can be found, used and preserved.
To do this they can prepare their collections as a full dataset of all the images and text in a collection, or simply provide metadata about a full collection. Traditionally, collections are provided as individual images, text and metadata in a digital library, but we can also provide collections in ways that allow researchers to work with them more easily using machines.
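As a rough illustration of the "metadata about a full collection" option, the sketch below writes a single CSV manifest with one row per item. The item records, field names and file names are all invented for the example; a real collection would draw them from its own catalogue.

```python
import csv
import io

# Hypothetical item records; the field names are assumptions for illustration only.
ITEMS = [
    {"id": "item-001", "title": "Letter, 1843",
     "image_file": "item-001.jpg", "text_file": "item-001.txt"},
    {"id": "item-002", "title": "Map of the harbour",
     "image_file": "item-002.jpg"},  # no transcription available for this item
]

FIELDS = ["id", "title", "image_file", "text_file"]


def build_manifest(items, fields=FIELDS):
    """Return a CSV manifest (as a string) with one row per collection item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({name: item.get(name, "") for name in fields})
    return buffer.getvalue()


manifest = build_manifest(ITEMS)
print(manifest)
```

A researcher could load a file like this in one step and then decide which images or texts to fetch, rather than browsing item by item.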
What do we mean by a training dataset?
It is a term bandied about a lot. The recent Critical Field Guide for Working with Machine Learning Datasets* says, “You may choose a dataset based on what it contains, how it is formatted, or other needs. For example, computer vision datasets include thousands of IMAGE or VIDEO files, while natural language processing datasets contain millions of bytes of TEXT.”
These data come in many forms, but all data shape the resulting outputs of an AI process. The guide says, “DATA are values assigned to any ‘thing’, and the term can be applied to almost anything. Numbers, of course, can be data; but so can emails, a collection of scanned manuscripts, the steps you walked to the train, the pose of a dancer, or the breach of a whale. How you think about the information is what makes it data.”
Datasets determine outputs of AI processes
We know that the words we type into ChatGPT influence the output. We can be careful with our prompts or our pasted input text but, most importantly, it is all the textual data from across the web, already processed through the model, that shapes the returned outputs. The outputs also depend on the shape of the model that has been used, how it has been developed, and the kinds of weights and biases built into it. What is less well known is that there are two types of learning in current systems: learning during training, and learning during tasks, or in-context learning. The system learns from all the data that has been processed previously, and this allows it to make inferences and address generalised problems (but beware of hallucinations).
I will come back to models in the next post, which will share some more information about the guide and the Knowing Machines project. The project is a valuable and well-constructed resource if you want to understand datasets and the application of machine learning in ways related to information-providing activities.
What we learned about data from the webinar
During our webinar, Ines Byrne of the National Library of Scotland showed examples of how their Data Foundry has provided simple renderings of data from digitised collections (PDF of Ines Byrne’s slides). Ines demonstrated that we could start with small sets and that the information we gather about them does not have to be over-elaborate. In terms of describing the data provided on the Foundry, she spoke about the use of a simplified version of the datasheets for datasets model.
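To give a feel for how light such a description can be, here is a minimal sketch of a simplified datasheet as a small Python record rendered to plain text. This is not the Data Foundry's actual template: the field names are assumptions loosely echoing headings from the datasheets for datasets model, and the example values are placeholders.

```python
from dataclasses import dataclass, asdict


@dataclass
class SimpleDatasheet:
    """A deliberately small description of one dataset (hypothetical fields)."""
    title: str
    motivation: str          # why the dataset was created
    composition: str         # what it contains
    collection_process: str  # how the content was gathered or digitised
    known_limitations: str   # gaps, errors or biases a user should know about
    licence: str


def render(sheet):
    """Return the datasheet as plain text, one 'Heading: value' line per field."""
    lines = []
    for name, value in asdict(sheet).items():
        heading = name.replace("_", " ").capitalize()
        lines.append(f"{heading}: {value}")
    return "\n".join(lines)


# Placeholder values, invented purely for illustration.
example = SimpleDatasheet(
    title="Example digitised pamphlets",
    motivation="Make a small collection available for computational research.",
    composition="Page images and plain-text transcriptions.",
    collection_process="Scanned in-house; text produced by OCR, uncorrected.",
    known_limitations="OCR errors are likely, especially in older typefaces.",
    licence="To be confirmed by the publishing library.",
)
print(render(example))
```

Six short fields are enough to tell a user what they are getting and what to watch out for, which fits the point that descriptions need not be over-elaborate.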
One of the great dangers Ines identified is the desire for perfectionism. She also showed that consumers of data can teach the library about the data it has published, and that uses can be unexpected, such as the use of data for art installations or for doing interesting things with maps or volumes of the Encyclopedia Britannica. Whoever uses a dataset will need to process it in some way to meet their own desired outcomes.
The datasets on the Foundry are provided via a WordPress site, which again demonstrates that we can take straightforward approaches to delivering machine-ready collections.
Jodie Double of the University of Leeds described the journey the library has been on to get its collections ready for new forms of research (PDF of Jodie Double’s slides). It is most interesting that a significant research library is at the beginning of the machine-ready journey. They have been carefully weighing up the risks and benefits, and now, with their newly established innovation lab in place, we should soon start to see more datasets being prepared.
You can watch a recording of the first webinar and we have provided a transcript for ease of access.
This post forms part of our series on AI and library collections. Over the next few posts we will take a closer look at the field guide to start to elucidate some of the concepts we need to consider when developing datasets.