Should universities build their own LLMs for academic research?

Taking bold steps

A recent THE article by Sorin Adam Matei proposes that academia could invest in building its own LLMs to consume data from verifiable sources.

I’m wondering if the UK university community thinks this is something we should invest in, so that we can feel confident that this valuable technology fulfils sector requirements.

Governments as investors

The UK government does not doubt the value of generative AI. As its recent Generative AI Framework puts it, “Generative AI has the potential to unlock significant productivity benefits.” But we are aware of significant unease in many universities about the deployment of AI in its current state and, so far, there seems to be no workable solution.

UKRI delivered £900m of co-investment in 21/22 and Research England launched a £100m infrastructure investment scheme, so I wonder whether it would be of strategic benefit to allocate some funding over a number of years to build our own LLMs. As Dr Matei points out, developing our own LLMs would of course cost a fair amount, given that OpenAI’s product cost $100m, but that is not a huge amount of money in the scheme of things.

Initiatives already under way

Taking the technology back into academia, where much of it originated anyway, would answer some of the questions posed by A Critical Field Guide for Working with Machine Learning Datasets, a resource I highlighted in a previous blog post. It says:

“How does mishandling datasets contribute to harm? Like any messy, multifaceted material, datasets must be treated with care. Taking time to see the broader implications of making and using datasets can save you time, create projects that are easier to explain, and help you build stronger relationships with the communities your datasets impact.”

The guide goes on to detail many considerations researchers must take into account when working with machine ready collections, but would taking an even bolder step, by having control of the algorithms, lead to much greater benefits? Certainly, the Dutch Government has decided this is the way forward, as the Dutch education cooperative Surf’s report on efforts to build a Dutch language LLM demonstrates.

Taking control

If we were to take control of both the data and the LLMs, we could greatly reduce the effort it takes for researchers to work with data: they would have more confidence in the technologies, and they would be integral to developing requirements for the LLM as it is built. Remember that the algorithm is itself shaped by the dataset. This is often forgotten. The LLM changes as it consumes data. That is why developers don’t know, any more than you or I, what their technology might produce. This will remain true even if we build our own, but we would have better control of the inputs, by knowing the data, its origins and its biases, and by knowing the machinery which consumes the data. This should lead to better understood outputs and more confidence in results.

So, what do we think? Would it be a good idea for universities and/or their funders to invest to build better LLMs controlled by the academic community?

And if we want to have historical collections in a machine ready form so that we have adequate control, we will probably also need to invest in converting non-digital sources to digital. Universities would also need to get behind this and invest in giving better machine ready access to their collections.

Join us

In the meantime, we are talking a lot about machine and research ready collections. Why not join us on 8 February, when I will be talking with Jane Gallagher, Ian Gifford of the University of Manchester Library and Helena Byrne from the British Library about all that? Follow the link to join our webinar to discuss AI ready library collections: the practicalities.

This post forms part of a longish series, focusing on matters to do with AI and machine/research ready collections, which you can access via the tag below.

By Peter Findlay

Subject Matter Expert, Digital Scholarship, Content and Discovery, Jisc

Working with Jisc's Higher Education members to improve access to their special collections in the age of data-centric arts, humanities and social science research.

