Crowdsourcing and Variant Digital Editions – some troubles ahead

Projects like UCL’s Transcribe Bentham and New York Public Library’s What’s on the Menu? have done groundbreaking work in engaging the public to transcribe their manuscript collections.

Crowdsourcing allows rapid, and it seems high-quality, creation of transcribed data from original documents. Transcribe Bentham has so far created 1,330 transcribed versions, and only a handful have been rejected for a lack of quality. Previously, such scholarly transcription would have taken considerable time and effort, spanning many years.

With notable successes like these, crowdsourcing is becoming more familiar as an academic tool. But for certain datasets, particularly ones of considerable academic importance, it could bring problems, because crowdsourcing has the ability to create multiple editions of the same text.

For example, the much-lauded Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO) are now beginning to appear on many different digital platforms.

ProQuest currently hold a licence that allows users to search over the entire EEBO corpus, while Gale-Cengage own the rights to ECCO.

Meanwhile, JISC Collections are planning to release a platform entitled JISC Historic Books, which makes licensed versions of EEBO and ECCO available to UK Higher Education users.

And finally, the Universities of Michigan and Oxford are heading the Text Creation Partnership (TCP), which is methodically working its way through releasing full-text versions of EEBO, ECCO and other resources. These versions are available online, and are also being harvested out to sites like 18th Century Connect.

So this gives us four entry points into ECCO – and it’s not inconceivable that there could be more in the future.

What’s more, there have been some initial discussions about introducing crowdsourcing techniques to some of these licensed versions, allowing permitted users to transcribe and interpret the original historical documents. But of course this crowdsourcing would happen on different platforms with different communities, who may interpret and transcribe the documents in different ways. This could lead to the tricky problem of different digital versions of the corpus. Rather than there being one EEBO, several EEBOs would exist.

But this is part of a larger problem. If there are multiple versions of the original content, then which one is the one you use? In fact it’s not only about the content. Which platform works quickest? Which gives the most ‘accurate’ search results? Which one provides enhanced tools for analysis? Which gives the best results for your particular area of research? Where do you send your students? Which one do you cite?

Most importantly, which one do you trust? And why?

In ‘traditional scholarship’, different editions of original documents would be published at, for example, 50-year intervals, and it would be part of the scholarly workflow to review and criticise such editions. The complexity and proliferation of digital resources radically changes this – not only are there more digital resources, but the knowledge and skills needed to critically analyse a resource are considerably broader.

At the moment, there are no immediate solutions for these challenges. But it’s clear that the potential of the Internet continues to fracture existing practices of scholarship. Despite the care, attention, and research intelligence that has gone into creating EEBO, ECCO and their various platforms, the potential for academics, funders and publishers to push forward and develop new digital ideas means that the notion of the Internet as a place where traditional scholarly practices can simply be repeated continues to disintegrate.

(Thanks to Ben Showers for reading over this)


7 thoughts on “Crowdsourcing and Variant Digital Editions – some troubles ahead”

  1. Ben Brumfield

    We discussed this briefly on Twitter yesterday, but I’d like to elaborate a bit on my comments there.

    benwbrum Reading @alastairdunning’s post connecting crowdsourcing to variant editions: bit.ly/raVuzo Feel like Wikipedia solved this years ago.

    benwbrum If you don’t publish (i.e. copy) a “final” edition of a crowdsourced transcription, you won’t have variant “final” versions.

    benwbrum The wiki model allows linking to a particular version of an article. I expanded this to the whole work: link

    alastairdunning But does that work with multiple providers offering restricted access to the same corpus sitting on different platforms?

    alastairdunning ie, Wikipedia can trace variants cause it’s all on the same platform; but there are multiple copies of EEBO in different places

    benwbrum I’d argue the problem is the multiple platforms, not the crowdsourcing.

    alastairdunning Yes, you’re right. Tho crowdsourcing considerably amplifies the problem as the versions are likely to diverge more quickly

    benwbrum You’re assuming multiple platforms for both reading and editing the text? That could happen, akin to a code fork.

    benwbrum Also, why would a crowd sourced edition be restricted? I don’t think that model would work.

    I’d like to explore this a bit more. I think that variant editions are less likely in a crowdsourced project than in a traditional edition, but efforts to treat crowdsourced editions in a traditional manner can indeed result in the situation you warn against.

    When we’re talking about crowdsourced editions, we’re usually talking about user-generated content that is produced in collaboration with an editor or community manager. Without exception, this requires some significant technical infrastructure: a wiki platform for transcribing free-form text, or an even more specialized tool for transcribing structured data like census records or menus. For most projects, the resulting edition is hosted on that same platform: the Bentham wiki which displays the transcriptions for scholars to read and analyze is the same tool that volunteers use to create the transcriptions. This kind of monolithic platform does not lend itself to the kind of divergence you describe: copies of the edition are always dated as soon as they are separated from the production platform, and making a full copy of the production platform requires a major rift among the editors and volunteer community. These kinds of rifts can happen (in my world of software development, the equivalent phenomenon is a code fork), but they’re very rare.

    But what about projects which don’t run on a monolithic platform? There are a few transcription projects in which editing is done via a wiki (Scripto) or webform (UIowa) but the transcriptions are posted to a content management system. There is indeed potential for the “published” version on the CMS to drift from the “working” version on the editing platform, but in my opinion the problem lies not in crowdsourcing, but in the attempt to impose a traditional publishing model onto a participatory project by inserting editorial review in the wrong place:

    Imagine a correspondence transcription project in which volunteers make their edits on a wiki but the transcriptions are hosted on a CMS. One model I’ve seen often involves editors taking the transcriptions from the wiki system, reviewing and editing them, then publishing the final versions on the CMS. This is a tempting workflow: it makes sense to most of us both because the writer/editor/reader roles are clearly defined and because the act of copying the transcription to the CMS seems analogous to publishing a text. Unfortunately, this model fosters divergence between the “published” edition and the working copy as volunteers continue to make changes to the transcriptions on the wiki, sometimes ignoring changes made by the reviewer, sometimes correcting text regardless of whether a letter has been pushed to the CMS. The alternative model has reviewers make their edits within the wiki system itself, with content pushed to the CMS automatically. In this model, the wiki is the system-of-record; the working copy is the official version. Since the CMS simply reflects the production platform, it does not diverge from it. The difficulty lies in abandoning the idea of a final version.
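
    The contrast between the two workflows can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Wiki` and `CMS` classes, the `on_save` hook and the sample letter text are all hypothetical, and they do not correspond to Scripto, the Bentham wiki or any real platform's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Wiki:
    """Hypothetical editing platform that keeps every revision of each page."""
    revisions: Dict[str, List[str]] = field(default_factory=dict)
    on_save: List[Callable[[str, str], None]] = field(default_factory=list)

    def save(self, page: str, text: str) -> int:
        """Store a new revision, notify subscribers, return the revision number."""
        history = self.revisions.setdefault(page, [])
        history.append(text)
        for hook in self.on_save:
            hook(page, text)
        return len(history)

    def current(self, page: str) -> str:
        """Return the latest revision of a page."""
        return self.revisions[page][-1]


@dataclass
class CMS:
    """Hypothetical read-only publishing site."""
    pages: Dict[str, str] = field(default_factory=dict)

    def publish(self, page: str, text: str) -> None:
        self.pages[page] = text


def manual_copy_model() -> bool:
    """Editor corrects the text off-wiki and publishes it; volunteers keep editing."""
    wiki, cms = Wiki(), CMS()
    wiki.save("letter-1", "Dear Sir, I recieved your letter")
    # The reviewer's correction lives only in the CMS copy.
    cms.publish("letter-1", "Dear Sir, I received your letter")
    # A volunteer extends the wiki text without seeing that correction.
    wiki.save("letter-1", "Dear Sir, I recieved your letter of the 4th")
    return cms.pages["letter-1"] == wiki.current("letter-1")


def system_of_record_model() -> bool:
    """Reviewer edits inside the wiki; every save is mirrored to the CMS."""
    wiki, cms = Wiki(), CMS()
    wiki.on_save.append(cms.publish)  # the CMS simply reflects the wiki
    wiki.save("letter-1", "Dear Sir, I recieved your letter")
    wiki.save("letter-1", "Dear Sir, I received your letter")  # reviewer's fix
    wiki.save("letter-1", "Dear Sir, I received your letter of the 4th")
    return cms.pages["letter-1"] == wiki.current("letter-1")
```

    In this sketch `manual_copy_model()` ends with the CMS and the wiki holding different texts, while `system_of_record_model()` keeps them identical, because the only route to the CMS is a save on the wiki. The revision history in `Wiki.revisions` is what would let a reader cite a particular version rather than a "final" one.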

    It’s not at all clear to me how EEBO or ECCO are examples of crowdsourcing, rather than traditional restricted-access databases created and distributed through traditional means, so I’m not sure that they’re good examples.

  2. Pingback: Our thoughts on “Crowdsourcing and Variant Digital Editions —some troubles ahead” (1/2) « TCP News & Views

  3. Pingback: Our thoughts on “Crowdsourcing and Variant Digital Editions —some troubles ahead” (2/2) « TCP News & Views
