Outcomes of task and finish group investigations into the preparation of datasets for Artificial Intelligence (AI)
We recently established a short-form task and finish group, made up of senior academics and librarians, to explore the question: are universities sufficiently prepared to support the development of datasets and computational methods for data-driven research in the Arts & Humanities?
The group concluded that many are not sufficiently prepared and that Jisc can help to coordinate action.
Do we need to think about AI in the humanities?
We examined issues of Artificial Intelligence (AI) and its subdomain Machine Learning (ML) in relation to humanities data. Many Arts and Humanities researchers do not see their work as being concerned with data; rather, they talk about sources: primary, secondary and tertiary.
Some members of the group were clear that there are specific considerations when we examine AI and ML in the context of the Arts and Humanities, whilst others felt it important to recognise that data literacy and awareness of these technologies needs to increase across all disciplines, especially as libraries and archives are now more likely to provide functional services rather than subject specific ones. All were clear that our focus should be on groups we termed the ‘digitally obliged’ (those who need to provide services eg libraries, archives and research support teams) in their role of supporting the ‘digitally curious’ (those who want to know but currently may not). We’ll initially focus on the Arts and Humanities but with the intention of providing support across all disciplines through our primary audience, the digitally obliged.
The group came to focus on knowledge about data and how such knowledge determines any subsequent computational processes such as AI and ML. It was acknowledged that the Humanities has not really worked out what AI is for in relation to its disciplines. There was much discussion about AI as a term and whether we should use it but, in the end, we decided that it has much currency at the moment, so we will. The real need is for fundamental information for improved decision making about these technologies.
The recent Shaping Data and Software Policy in the Arts and Humanities Research Community report (Software Sustainability Institute 2022, AHRC commissioned) shows that many researchers do work with data and that there are growing skills and knowledge in handling data and in the use of technologies to aid research. However, it also states that:
“The use of digital tools and associated knowledge and skills – Overall, this is low (at least compared to other sectors). Pockets of innovation and engagement exist in and beyond the digital humanities, and among researchers at all career stages. There is scope for a substantial investment in signposting, supporting and promoting training opportunities, providing a creative environment for skills development that mainstreams digital skills alongside others.”
We went on to talk about what kind of support services there are in the universities for data use and for AI application deployment. Librarians, archivists and research support teams, all digitally obliged, seem central to supporting researchers, yet there can be difficulty in engendering cross-disciplinarity, not least because of communication issues between various interest groups (eg humanities researchers have quite different concerns to the computer scientist and the same goes for Research Software Engineers (RSEs) and the research support team).
Some of this has to do with the language and terminology used by various interest groups. The term AI means different things to different people. ML is perhaps the dominant form of AI, but its complexity varies significantly: it ranges from straightforward statistical analysis (what many humanists need), through commercial uses such as a recommendation function in a streaming service, right up to highly evolved models such as GPT-3 (ChatGPT’s deep learning model).
Issues of terminology also affect senior decision-makers. For example, there is a question for senior managers when a proposal is made for funding or for a research grant: is AI an appropriate or necessary technology for this proposal? As another example, they may be confronted with a term like vector and not know what that means in the context of Computer Science. These challenges can get in the way of good decision-making and the affordances of cross-disciplinarity.
We even discussed whether we should allude to AI at all, but it was felt that this is certainly a well-used term, with various strategies in evidence from the UK government and from UK research bodies. The group agreed that it was important to give people the means of making the right decision when they are considering AI technologies for their research; more about this below.
Libraries and archives
We are aware that there are libraries and archives that already offer support for data in computational methods or at least want to offer such support. Some of the kinds of service on offer, or planned to be, might be:
- providing their own collections as data
- finding data for users in a complex data search landscape
- licensing data collections from publishers
- explaining how data can and cannot be collected and used (eg copyright)
- managing derived data and preserving it
- promoting data sharing
This list was provided by Andrew Cox. See Cox A The impact of AI, machine learning, automation and robotics on the information professions: A report for CILIP (CILIP 2021).
A librarian in the group wanted more help with providing data support to library customers, but the issue is knowing where to start. It was felt that Jisc has a role in developing information for libraries and archives so that they can improve their provision of critical services to cross-disciplinary teams and improve the delivery of support services to researchers wanting to use collections as data.
Outcomes of the conversations
We invited the group to a couple of meetings and a workshop, where we asked them to define audiences, decide on what Jisc’s focus should be, and to develop some basic ideas for innovation. Their conclusions were:
Information – we should focus on providing fundamental information which is tailored to the context in which people are operating. People may not be thinking about data but are thinking about sources. Help them to think about how they can transform sources into data, not as a classic data management process but as Collections as Data (Padilla et al 2016-18).
Community use cases – we should develop focused use cases and case studies by arranging interactions with communities of interest and by maximising connections across relevant networks. This also chimes with reports from OCLC (Padilla 2019), Library of Congress (Cordell 2022) and University of Washington (Lee 2022).
The audience – our primary audience should be the ‘digitally obliged’. It is noted that the digitally obliged may also be digitally curious.
Language & terminology – we need to be mindful of the language we use. For example, the digitally obliged need to interact with specialists (eg in computer science departments) in language the specialists are familiar with. We must ensure we use familiar terminologies when communicating with and between various interest groups.
Jisc’s USP – bringing people together by convening, brokering and providing forms of advice and guidance. A proposition might be to enable small groups of researchers and librarians/archivists to undertake work and then document the lessons learned which we can then disseminate, perhaps with a 3rd party partner.
Is AI for me? – this was one of the ideas from the workshop. It is about developing decision-making information resources based on well-defined use cases. This idea will help us shape a programme of work to develop the right resources for our audience.
Next steps – we will develop a programme of work around the concept Is AI for Me? At a high level, this is to develop community driven use cases specific to the functional needs of librarians and archivists (the digitally obliged). This will allow us to find out more about who the digitally obliged are and to develop cases documenting their requirements. Most likely we will work with partners to disseminate what we learn.
Prior to the group’s formation, we conducted a horizon scan which showed more than 30 organisations exploring the use of AI in a tertiary education context. Some, such as The Alan Turing Institute, have undertaken significant projects such as Living with Machines, yet there is still a chasm between the day-to-day activities of the researcher and the potential for using the outputs of a deep learning algorithm in their research. How do we bridge this gap? This is becoming a burning issue because AI is going to have a huge impact on all our lives, not least in the education space. We have already seen how ChatGPT has taken the world by storm, but this is only the start of developments that will most likely revolutionise how information is formed and consumed.
In parallel to the activities of the task and finish group, our National Centre for AI in Tertiary Education is exploring chatbots and digital assistants, adaptive learning platforms and predictive analytics. It is testing with our members whether artificial intelligence has the potential to help educators better understand and meet the needs of their learners. The centre’s Explore AI site is particularly useful for testing out applications of AI to gain insights into how they work.
Similarly, our Archives Hub team has been exploring the use of Machine Learning to support improved discovery of archives and their collections, with a particular focus on image applications and archival discovery methods. The team’s posts on this work are strongly recommended reading, as they explore some significant issues that arise when applying these technologies to archival content.
We would like to thank the members of our task and finish group, listed below, for helping us to explore these topics and for helping us to develop ideas we can take forward.
- Andrew Cox, Senior Lecturer, Information School, University of Sheffield (https://orcid.org/0000-0002-2587-245X).
- Alex Fenlon, Head of Copyright and Licensing in Library Services, the University of Birmingham
- Paul Gooding, Senior Lecturer in Information Studies, School of Humanities, University of Glasgow (https://orcid.org/0000-0003-1044-509X). Paul is UK Co-Investigator for the AEOLIAN network.
- Leif Isaksen, Professor in Digital Humanities, University of Exeter and Turing Fellow at The Alan Turing Institute (https://orcid.org/0000-0003-4027-1764).
- Katherine McDonough, Senior Research Associate, The Alan Turing Institute (https://orcid.org/0000-0001-7506-1025). Katie works on the Living with Machines project.
- Paola Marchionni, Head of product, content and discovery, Jisc.
- Jenny Mitcham, Head of Good Practice and Standards, Digital Preservation Coalition (DPC).
- Oonagh Murphy, Lecturer in Arts Management, Goldsmiths, University of London (https://orcid.org/0000-0002-5095-8861). Oonagh has authored AI: A Museum Planning Toolkit (The Museums + AI Network, 2020).
- Michael Pidd, Director of the Digital Humanities Institute, University of Sheffield. Mike is currently Co-Investigator of the AHRC-funded Scoping Future Data Services for the Arts and Humanities, testing the establishment of an active and interactive national data centre for arts and humanities data.
- Leontien Talboom, Ph.D. candidate UCL, Web Archivist & Technical Analyst, University of Cambridge (UL). Leontien has collaboratively authored First steps to a guide for computational access to digital repositories (DPC 2022).
- Jane Winters, Professor of Digital Humanities & Director of the Digital Humanities Research Hub (https://orcid.org/0000-0001-5502-5887). Jane has recently authored ‘Web archives and the problem of access: prototyping a researcher dashboard for the UK Government Web Archive’, in Archives, Access and Artificial Intelligence: Working with Born Digital and Digitized Archival Collections (Bielefeld: Bielefeld University Press, forthcoming, 2021).
- Special thanks also go to library colleagues at the University of Leeds and the University of Sussex and to Jisc colleagues Stephen Brooks (co-organiser) and Neil Grindley for their contributions to the workshop held in November 2022.