JDCC09: Developing Content: Digging Into Data

This session focused on ways in which data can be mined and used more effectively. Innovations discussed included annotation of content, free-tagging and geo-parsing, along with the need to focus on content and to take technical debates into the realm of content provision where necessary.

Nikki Rogers spoke about the Semantic Tools for Screen Arts Research Project (STARS), funded by JISC. It built on the PARIP project at the University of Bristol, funded by the AHRC. The intention was to offer software, findings and a demonstrator. The project is aimed at screen arts practitioners and researchers; it looks at the relationships between things and how resources connect to each other, and allows other people to add knowledge.

The project was interested in visualisation (making things intuitive), video annotation (using recordings and allowing people to annotate sub-sections through free-tagging, linking and so on), annotating connections and using social software, and workflow. She demonstrated the project's annotation capability by logging on to the website via OpenID: once you have annotated a term, your annotation comes up whenever anyone searches for that term.
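As a rough illustration of the annotation idea described above, here is a minimal sketch in Python. It is not the STARS implementation (which the talk described as a Java web application); the `Annotation` class, its field names and the `search` function are all hypothetical and exist only to show how a free-tagged note on a sub-section of a recording might be stored and found again.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A free-text note on a sub-section of a recording (hypothetical model)."""
    video_id: str
    start_s: float  # start of the annotated segment, in seconds
    end_s: float    # end of the annotated segment, in seconds
    author: str     # assumed here to be an OpenID identifier
    note: str
    tags: set[str] = field(default_factory=set)  # free tags, no fixed vocabulary


def search(annotations: list[Annotation], term: str) -> list[Annotation]:
    """Return annotations whose note contains the term or whose tags include it."""
    needle = term.lower()
    return [
        a for a in annotations
        if needle in a.note.lower() or needle in {t.lower() for t in a.tags}
    ]


# Illustrative data only; the identifiers and URLs are made up.
annotations = [
    Annotation("video-1", 12.0, 45.5, "https://example.org/openid/alice",
               "Rehearsal footage", {"rehearsal", "improvisation"}),
    Annotation("video-2", 0.0, 30.0, "https://example.org/openid/bob",
               "Opening titles", {"titles"}),
]

for hit in search(annotations, "Improvisation"):
    print(hit.video_id, hit.start_s, hit.end_s, hit.author)
```

Storing the start and end times with each annotation is what lets a search result point at a specific part of a recording rather than at the whole video.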

The project has been running for over 18 months, and during that time there has been more work on annotation, for example from JISC’s MACFOB project. The STARS team have worked with the computer science team at the University of Bristol to run data through toolkits that track screen changes. There are various inter-video searches available too. Technologies used in the STARS project included a Java web application with support for QuickTime and Flash. The project team’s wishlist includes image annotation, auto-metadata extraction and 3D visualisation.

James Reid from Edina introduced the concept of “geo-spatial”. Geography is pervasive – 80 per cent of all content has some form of geographical reference. It is the “where” dimension of a search. The Horizon Report flags six different technologies that should be watched for the next 12 months – one of those this year is “geo-everything”.

There are some JISC tools to encourage geo-enabling: GeoCrossWalk, a gazetteer database sourced from a range of Ordnance Survey datasets, and GeoParser, a language recognition service that can search resources for placenames, providing a way to turn implicitly georeferenced resources into explicitly georeferenced ones. The GeoParser can also identify ambiguous placenames. Using these tools under GeoDigRef, three enriched data sets emerged from three digitisation projects: HistPop, BOPCRIS and BL Archival Sound archives.
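To make the gazetteer-plus-parser idea concrete, here is a toy sketch in Python. It does not use or reproduce the GeoCrossWalk or GeoParser services; the `GAZETTEER` dictionary, its approximate coordinates and the `geoparse` function are assumptions made purely for illustration.

```python
import re

# Toy gazetteer: placename -> candidate (latitude, longitude) pairs.
# Coordinates are approximate and illustrative, not taken from any real dataset.
GAZETTEER = {
    "Edinburgh": [(55.95, -3.19)],
    "Bristol": [(51.45, -2.59)],
    "Newport": [(51.59, -3.00), (50.70, -1.29)],  # two different towns share the name
}


def geoparse(text):
    """Return (placename, candidates, is_ambiguous) for each gazetteer name in text."""
    results = []
    # Very crude candidate detection: single capitalised words only.
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        name = match.group()
        if name in GAZETTEER:
            candidates = GAZETTEER[name]
            results.append((name, candidates, len(candidates) > 1))
    return results


for name, candidates, ambiguous in geoparse("The census covered Bristol and Newport."):
    print(name, candidates, "ambiguous" if ambiguous else "unambiguous")
```

A name with more than one gazetteer entry is flagged as ambiguous here; choosing between the candidates would need context from the surrounding text, which this sketch does not attempt.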

It is impossible to have one generic system that will perform well on all types of textual data. Many person names and other entities also contain place names (Francis Chichester, University of Edinburgh, East India Company and so on). Different collections have different properties: some have more person names than place names, others vice versa. There is also an issue with IPR: Creative Commons data sources can be used, but these may be less comprehensive. 100 per cent accuracy cannot be guaranteed; the highest degree of accuracy is around 96 per cent.
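The problem of place names hiding inside other entities can be shown with a small Python sketch. This is not how the GeoParser works; the `PLACES` set, the `KNOWN_ENTITIES` list and both functions are hypothetical, and masking a fixed list of entities is only one simple heuristic among many.

```python
import re

# Hypothetical placename list and list of known non-place entities.
PLACES = {"Chichester", "Edinburgh", "India"}
KNOWN_ENTITIES = ["Francis Chichester", "University of Edinburgh", "East India Company"]


def naive_places(text):
    """Flag every capitalised word that appears in the placename list."""
    return [word for word in re.findall(r"[A-Z][a-z]+", text) if word in PLACES]


def filtered_places(text):
    """Blank out known person and organisation names before looking for places."""
    for entity in KNOWN_ENTITIES:
        text = text.replace(entity, " ")
    return naive_places(text)


sentence = "Francis Chichester sailed from Plymouth; the East India Company traded from India."
print(naive_places(sentence))     # includes the surname and the company name
print(filtered_places(sentence))  # keeps only the standalone mention of India
```

The filter only helps for entities that are already on the list, which is one reason a single generic configuration is unlikely to suit collections with very different mixes of person and place names.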

Works in progress currently include the Stormont papers and News Film Online. There are commercial alternatives via MetaCarta and Yahoo!, but these could prove expensive.

Mike Ellis from Eduserv discussed how content relates to technology. He argued that machine-readable data is key and that it is important to deliver information. The importance of this has been neither made clear nor well communicated.

Content is still king, and always will be. Machine-readable data is a content concern, not a technical one. Re-use is not just good, it’s essential. Content development is cheaper. Things are becoming more visual, and people are making user experiences better. Content should be taken to users, via widgets, feeds and so on. This does not have to be complicated. Content cannot be hidden; if it’s on a page, it will be found.
