JDCC09: Looking into the future: The impact of user generated content

What is the impact of user generated content on research and scholarship?

All three speakers made it clear that the impact can be massive – but only when the user generated content is sourced and employed intelligently, imaginatively and respectfully.

Key points made in the session:

‘Many hands make light work’
Users are collaborators
Prepare for your project to evolve away from your control in order to have long-term sustainability

Galaxy Zoo, Arfon Smith, University of Oxford

Arfon Smith kicked off the session by explaining how integral user generated content is to the Galaxy Zoo project.

Galaxy Zoo is an enormous store of thousands of images of galaxies. The problem is that while computers are excellent at taking photographs of galaxies, they are terrible at distinguishing the different types of galaxy.

Humans, however, are rather good at seeing the difference between a spiral galaxy and an elliptical galaxy. And to the delight of the Oxford team, thousands of people from around the world enjoy spending their evenings (or Friday afternoons at work, according to the web stats) doing just that.

Galaxy Zoo launched in 2007. The team were hoping to be able to classify 4 million images in 2 years. But the project really captured the imagination of the public (thanks in part to a story on the BBC) and within 48 hours the system was receiving 70,000 classifications an hour. Which, incidentally, crashed the server.

The Oxford team have done their utmost to embrace this ‘citizen science’ as they call their brand of user generated content. To take part, users – or ‘collaborators’, as Smith was careful to call them – simply complete a quick tutorial teaching them to tell apart different kinds of galaxies. If a user can label 8 of the 15 pictures correctly, he or she can take part.

Allowing people to take part who get almost half of their galaxies wrong sounds like a risky strategy, but Galaxy Zoo avoids the problem through the sheer number of users. Each galaxy is checked an average of 7 times, which produces a far more accurate result than individual professional astronomers could achieve.
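To see why repeated checking helps, here is a rough Python sketch of majority voting. It is not Galaxy Zoo's actual method: it assumes a simple two-way choice (spiral or elliptical), votes that are independent of each other, and the same accuracy for every volunteer. The function names and the example accuracies are made up for illustration.

```python
from collections import Counter
from math import comb


def majority_label(votes):
    """Return the most common label among the votes cast for one galaxy."""
    return Counter(votes).most_common(1)[0][0]


def majority_correct(p, n):
    """Probability that more than half of n independent votes are correct,
    when each vote is correct with probability p (n is assumed to be odd)."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


if __name__ == "__main__":
    # Hypothetical votes for a single image
    votes = ["spiral", "spiral", "elliptical", "spiral", "elliptical", "spiral", "spiral"]
    print(majority_label(votes))

    # Assumed single-vote accuracies, including the 8-out-of-15 tutorial threshold
    for p in (8 / 15, 0.7, 0.8, 0.9):
        print(f"single vote {p:.2f} -> majority of 7 {majority_correct(p, 7):.3f}")
```

Under these assumptions the gain is small when volunteers are barely better than chance and large once they are reasonably reliable, which suggests most collaborators do considerably better than the tutorial's pass mark.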

So citizen science is great news for the scientists, but what could the general public possibly get from it? Do they think it’s a game? Do they like looking at pretty pictures of galaxies? Do they enjoy taking part in such a massive project? The answer to all these questions is yes, but according to a survey of the users, by far the greatest reason for taking part is simply that people like helping with real science.

Arfon emphasised the importance of respecting the collaborators generating content for a project such as Galaxy Zoo. ‘Zooites’, as the users call themselves, are given credit on academic papers, they see the impact of their contributions, and they are given the freedom to develop their own lines of interest. For example, Galaxy Zoo users can discuss what they find in the website forum, and this has led to the discovery of completely unexpected objects. A Dutch school teacher saw a strange green cloud in one photograph, and after much zooite-led discussion and research, the cloud was identified as a rare superheated gas cloud which had been heated by a black hole. This is the sort of object that a computer examining pictures of galaxies would never have noticed.

Galaxy Zoo is now in its second version of the website, which allows collaborators to give more detail, such as the number of arms they can see on a spiral galaxy. It now has 250,000 users, and 35 million classifications.

Old Bailey project, Sharon Howard, University of Sheffield

The Old Bailey project is the digitisation of printed trial reports of the Old Bailey in London during the 18th and 19th centuries. It is the world’s largest historical resource of the lives of ordinary people from that period.

The aim of the project is to reconstruct the lives of ordinary Londoners. A key element of this is nominal record linking – going through the 2-3 million names in the records and working out which records refer to the same people.

The main problem experienced by the University of Sheffield team has been in digitising the poor-quality manuscripts. OCR is cheap but error-prone, and ‘double-keying’ – where two people transcribe a text independently and a computer compares the two versions – is more accurate but also more expensive.
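The comparison step in double-keying can be sketched in a few lines of Python. This is only an illustration using the standard library's `difflib`, not the tooling the Old Bailey team used; the function name and the sample sentences are invented, not taken from the trial reports.

```python
from difflib import SequenceMatcher


def keying_disagreements(first, second):
    """Return (words_from_first, words_from_second) pairs wherever two
    independent transcriptions of the same text differ."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Invented sample text typed by two hypothetical transcribers
    typist_one = "John Smyth was indicted for stealing a silver watch"
    typist_two = "John Smith was indicted for stealing a silver watch"
    print(keying_disagreements(typist_one, typist_two))
```

Only the flagged differences need a human to check against the original page, which is where the accuracy comes from. The extra expense is the cost of typing everything twice.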

The Old Bailey team’s solution is to invite the public to post notifications of errors and additional information to a wiki. In particular, the team want help in correcting spelling errors in names, place names and any other key information that users are likely to use as search terms. The bulk of users are family historians with specialist skills in tracing individuals.

User generated content presents more problems for the Old Bailey project than for Galaxy Zoo, because there are more variables in terms of the quality of the factual content, the way it is written and the way that content is structured. In addition, the Old Bailey team identified a need to distinguish user-generated content from content from a more authoritative source.

For Sharon, the most practical way to sustain the resource is to encourage the more enthusiastic users to become administrators who can help clean up and standardise other users’ efforts.

As with all user-generated resources, there is also the issue of how to encourage users to keep submitting content. Sharon gave the example of the Australian Newspapers Digitisation Project, where feedback from users is that correcting the digitised text “is addictive”. Users instantly see the result of their work, which is satisfying for them.

Sharon finished by pointing out that ultimately, encouraging user generated content means handing over partial control of your project to users with little expertise. There will be errors, trivial information and untidiness, and it’s up to the project team to decide how much of that they can tolerate.

Great War Archive, Kate Lindsay, University of Oxford

The Great War Archive was an offshoot of an archive of First World War poetry manuscripts.

The archive contains 6,000 digital images of primary source materials and 500 multimedia objects, plus supporting educational materials. It’s a fascinating archive, sourced from around the world, which includes poems stained by the blood and mud of the trenches. As well as being a valuable resource, it’s extremely well managed and documented – the materials have been catalogued by trained First World War experts and quality assured twice.

But while the First World War Poetry Digital Archive was compiled by librarians and academics, the Great War Archive was built from the generosity of the general public.

The University of Oxford team asked the public to dig out their artifacts from the Great War, take pictures and then upload those digital records to the online archive. The project was unusual in that there were no experts independently verifying facts – individuals were simply trusted to record their own information, as well as any stories or memories relating to the materials. The academics added information, but they deleted nothing.

The project appealed to family historians, genealogists, military collectors and the elderly, who still had memories of stories told by their parents. But in this last – and arguably most important – group lay a significant problem: the elderly didn’t tend to have the necessary skills or equipment to add to the online archive.

To overcome this problem, the Great War project team went on a roadshow, visiting libraries across the UK and inviting the public to come to them directly with any materials: photographs, medals, death pennies (the coins sent to the families of soldiers who had died in battle), even a matchbox with a message inside flung from a train by a homesick soldier.

Over 6500 items were collected in 4 months, and after the project ended a Flickr group was created which continues to collect digitised images of items – and which currently has almost 2000 items.

The Great War archive has been a fantastic resource for academics, individuals and also teachers, who have used it to find local items to generate interest in students.

Kate pointed out that while the archive has undoubtedly been a resounding success, the current funding models don’t necessarily support user generated content. When the funding runs out the project can’t be supported, and communities which have built up must find an alternative method of developing online – in this case through Flickr.

But in general the positive impact of user generated content vastly outweighs the negatives. Previously untapped knowledge and unreleased materials are made available in the public domain, opening up new avenues of research and teaching and preserving histories. These are histories that might otherwise be lost forever – with every generation the emotional ties to the materials weaken, and in fact the project team received some materials from dustmen who had found discarded objects in bins and skips.

Kate agreed with Arfon that we should be calling those involved in the projects collaborators not users, and that we need to have trust in non-academic communities.

Q & A

Joy Palmer, Mimas manager, Library and Archival services: “Was there a community which already existed, which could be mobilised?”

Arfon Smith: “Brian May blogged about Galaxy Zoo, and you’ll find that a lot of our users are Brian May fans! We were also lucky to be picked up so early in the press.”

Nick Poole, Collections Trust: “Are these projects core to all the organisations you represent, or are they early-stage projects?”

Kate Lindsay: “Oxford doesn’t have a central power which tells us what to do, but this project has raised awareness of alternative models, and demonstrated the potential of user generated content. Money is definitely a big issue. The poetry project cost £40 per image. The Great War project cost £3.50 per image, and that would have been halved if we didn’t have to digitise all the stuff we got through the post. It was definitely a big impact for a relatively small amount of money.”

Arfon Smith: “We struggle with the server cost. It costs us $1500 a month, and it’s difficult to find funds for that. It requires a different funding strategy.”

Sharon Howard: “I can’t speak for the University of Sheffield, our department tends to work quite independently.”

Delegate: “Do school teachers worry that the facts aren’t entirely verifiable?”

Kate Lindsay: “Yes it’s a problem. In our archive the stories are conceptual, rather than a record of facts. Our steering group were very keen to separate the poetry archive from the Great War archive for this reason, so it’s not cross-searchable. We believe it’s all about changing the mindset – experts are littered everywhere in the general public.”

Sharon Howard: “With our project a lot of it is subjective anyway, which causes problems in how we tag, for example.”

Sarah Fahmy, Strategic Content Alliance: “Was sustainability a driver for the projects? It clearly cost an awful lot of money in marketing – was it sustainable? Could you make cost savings?”

Kate Lindsay: “You have to ensure you do targeted marketing. We sent posters and flyers to city libraries. Key areas of the press – the Mirror, the Daily Mail – also brought in a lot of people. We were shifting the cost from digitisation to marketing, but not on the same scale – the cost went from £40 to £3.50 per item.”

2 replies on “JDCC09: Looking into the future: The impact of user generated content”

Well, it seems I need to do some clarification here. The Old Bailey Proceedings project did not use cheap OCR and we are not inviting the public to help correct the text via a simple content editing tool (we merely have a wiki where they can post notifications of errors, among other things), which is “addictive”. That is the Australian Newspapers Digitisation Project. Apparently I was not explicit enough that I was using it as an example and it had nothing to do with us, even though I put the full title of the project and its URL on a slide, and my practical example was a news article on Wimbledon in the 1950s. I wasn’t very happy with my presentation and it wasn’t the most well-organised I’ve ever put together, but I had no idea it was quite that confusing. I’m very sorry about that.

Apologies for the misunderstanding Sharon, and thank you for posting a clarification here. I have now updated the post.
