
Should universities build their own LLMs for academic research?

Taking bold steps

A recent Times Higher Education (THE) article by Sorin Adam Matei proposes that academia could invest in building its own LLMs to consume data from verifiable sources.

I’m wondering whether the UK university community thinks this is something we should invest in, so that we can be confident this valuable technology fulfils sector requirements.

Governments as investors

The UK government does not doubt the value of generative AI, as it states in its recent Generative AI Framework: “Generative AI has the potential to unlock significant productivity benefits.” But we are aware of significant unease in many universities about the deployment of AI in its current state, and so far there seems to be no workable solution.

UKRI delivered £900m of co-investment in 2021/22, and Research England launched a £100 million infrastructure investment scheme, so I wonder whether it would be a strategic benefit to allocate some funding over a number of years to build our own LLMs. As Dr Matei points out, developing our own LLMs would of course cost a fair amount, given that OpenAI’s product cost $100m, but that is not a huge sum in the scheme of things.

Initiatives already under way

Taking the technology back into academia, where much of it originated anyway, would answer some of the questions posed by A Critical Field Guide for Working with Machine Learning Datasets, a resource I highlighted in a previous blog post. It says:

“How does mishandling datasets contribute to harm? Like any messy, multifaceted material, datasets must be treated with care. Taking time to see the broader implications of making and using datasets can save you time, create projects that are easier to explain, and help you build stronger relationships with the communities your datasets impact.”

The guide goes on to detail many considerations researchers must bear in mind when working with machine-ready collections, but would taking an even bolder step, by having control of the algorithms themselves, lead to much greater benefits? Certainly, the Dutch government has decided this is the way forward, as the Dutch education cooperative SURF’s report on its efforts to build a Dutch-language LLM demonstrates.

Taking control

If we were to take control of both the data and the LLMs, we could greatly reduce the effort it takes for researchers to work with data: they would have more confidence in the technologies, and they would be integral to developing requirements for the LLM as it is built. It is often forgotten that the algorithm is shaped by its dataset; an LLM changes as it consumes data, which is why developers know no better than you or I what their technology might produce. This will remain true even if we build our own, but we would have far better control of the inputs, knowing the data, its origins and its biases, and knowing the machinery that consumes the data. That should lead to better-understood outputs and greater confidence in results.
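To make “knowing the data” a little more concrete, here is a minimal sketch, in Python, of how an academically controlled corpus could carry its origins and known biases alongside every passage before anything reaches a training pipeline. The record structure, the field names and the example DOI are hypothetical illustrations, not an existing Jisc or sector schema.

```python
# A minimal sketch (hypothetical names throughout) of a training corpus
# in which every record carries its provenance, so the "inputs" to an
# academically controlled LLM stay auditable.
from dataclasses import dataclass, asdict
import json

@dataclass
class ProvenancedRecord:
    text: str            # the passage the LLM will consume
    source: str          # verifiable origin, e.g. a repository DOI
    licence: str         # reuse terms, checked before training
    known_biases: list   # curator notes on sampling or coverage gaps

corpus = [
    ProvenancedRecord(
        text="Extract from a digitised 19th-century pamphlet...",
        source="doi:10.1234/example-collection.42",  # hypothetical DOI
        licence="CC-BY-4.0",
        known_biases=["English-language only", "urban print culture"],
    ),
]

# Only provenance-complete records are released to the training pipeline,
# so outputs can, in principle, be traced back to known inputs.
train_ready = [asdict(r) for r in corpus if r.source and r.licence]
print(json.dumps(train_ready, indent=2))
```

The point of the design is simply that provenance travels with the text, so curators and researchers can audit what the model has consumed.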

So, what do we think? Would it be a good idea for universities and/or their funders to invest in building better LLMs controlled by the academic community?

And if we want historical collections in a machine-ready form, so that we have adequate control, we will probably also need to invest in converting non-digital sources to digital. Universities would need to get behind this and invest in giving better machine-ready access to their collections.
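As a rough illustration of what converting a non-digital source involves at the smallest scale, here is a sketch that OCRs a scanned page and wraps the result in a machine-ready record. It assumes the open-source Tesseract engine via the pytesseract and Pillow libraries; the file path, collection identifier and record fields are hypothetical.

```python
# A minimal sketch of one digitisation step: OCR a scanned page and keep
# minimal provenance with the extracted text. Assumes Tesseract is
# installed, with pytesseract and Pillow as the Python bindings.
import json
from PIL import Image
import pytesseract

def digitise_page(image_path: str, collection_id: str) -> dict:
    """OCR a scanned page and wrap the text with minimal provenance."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return {
        "collection": collection_id,
        "source_image": image_path,  # keep a link back to the scan
        "text": text,
        "ocr_engine": "tesseract",   # record how the text was produced
    }

# Hypothetical path and collection name, for illustration only.
record = digitise_page("scans/pamphlet_p001.png", "example-archive")
print(json.dumps(record, indent=2))
```

Real digitisation programmes involve far more than this (image capture standards, quality assurance, rights clearance), but even a toy example shows that the investment is in metadata and provenance as much as in scanning.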

Join us

In the meantime, we are talking a lot about machine- and research-ready collections. Why not join us on 8 February, when I will be talking with Jane Gallagher, Ian Gifford of the University of Manchester Library and Helena Byrne from the British Library about all of this? Follow the link to join our webinar, AI-ready library collections: the practicalities.

This post forms part of a longer series on matters to do with AI and machine-/research-ready collections, which you can access via the tag below.

By Peter Findlay

Subject Matter Expert, Digital Scholarship, Content and Discovery, Jisc

Working with Jisc's Higher Education members in support of digital scholarship and digital library strategy in the age of data-centric arts, humanities and social science research.
