Notes from crowdsourcing, tagging, collective cataloging project

Ian McDermott’s original proposal:

I’d like to propose a General Discussion/Working Session hybrid about the D. James Dee Photo Archive, approx. 250,000 transparencies, slides, and negatives documenting contemporary art in NYC (particularly Soho galleries) from the late 1970s – present. Artstor acquired the archive this summer and is in the process of figuring out how to digitize it and, more importantly, catalog it. The collection isn’t cataloged and the slides aren’t labeled so any effort to effectively describe it will be a collective effort. I’m curious to hear what people think about crowd sourcing, tagging, and any other ideas. The BBC’s Your Paintings project is one example of a successful tagging project but what about extensive crowd sourced cataloging, how much metadata is needed before images are released, is it best to open the cataloging to everyone or a select group?

Existing projects and resources:

A common theme as people introduce themselves is wanting to get *good* tags rather than just any tags at all, possibly by using controlled vocabularies.

Ian asks whether people know of available tools to use; there are problems with using vendors, and there are other problems with “rolling your own” platform. One participant recounts Artsy’s experience using Mechanical Turk: it took a developer a couple of hours to sync the database with Amazon’s, and thereafter it cost about 1 cent per image, even with about 5 people tagging each one. There are concerns, though, about labor ethics and image rights.
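To put the Mechanical Turk figure in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the rate reported for Artsy (about 1 cent per image, covering about 5 taggers) would carry over to the roughly 250,000 items in the Dee archive; nobody in the session claimed that it would. The function names are hypothetical.

```python
def estimate_total_cost_usd(num_images: int, cents_per_image: int = 1) -> float:
    """Total cost in US dollars, given an all-in cost per image in cents.

    Integer cents are used so the arithmetic stays exact.
    """
    return (num_images * cents_per_image) / 100


def cents_per_judgment(cents_per_image: int = 1, taggers_per_image: int = 5) -> float:
    """Average cost in cents of one person's tagging of one image."""
    return cents_per_image / taggers_per_image


# Assumption: the Artsy rate applied to an archive of about 250,000 images.
total = estimate_total_cost_usd(250_000)
print(f"Estimated total: ${total:,.2f}")
print(f"Per individual judgment: {cents_per_judgment():.1f} cents")
```

The very low per-judgment figure is part of why the labor ethics concern comes up; the sketch only covers the arithmetic, not developer time or image-rights clearance.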

The Carnegie Museum of Art had a Teenie Harris project to get people to identify who’s in the photos.

John Resig brings up a case where crowdsourced work that had accumulated over the course of years was replaced in an afternoon by an advanced “computer vision” technique that helped identify things in photos. General point: before you turn to crowdsourcing, talk to computer scientists to make sure that there isn’t a computational technique that would do the job.

Participant wonders what information would be most needed: gallery, creator, year, people, etc.

Amanda brings up LibraryThing’s Legacy Libraries and suggests having a “barn-raising”: an event to engage the community as well as to get some items tagged or cataloged. Ian agrees that such an event can be a terrific jumpstart. Participant raises the issue of how you reach people who “aren’t on the Internet all the time.”

John Resig also raises a concern about just expecting people to do all the work: it is important to “chunk” the work so that it’s doable. At the same time, there are many people who do care passionately about particular items or topics.

Participant raises the topic of errors in crowdsourcing. Ian mentions that many projects will only accept data once it has been verified by multiple people. Participant brings up the example of the Steve.Museum, where the curation had to happen after all the tagging was done.

John Resig notes that it often takes thousands of examples to train computer software, so unless your set has many thousands of items, in some cases you might as well do the work manually yourself, or crowdsource it.
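Two of the ideas raised in the discussion, “chunking” the work into doable pieces and accepting data only once multiple people agree, can be sketched in a few lines. This is a minimal illustration, not a description of how any project mentioned in the session works; the function names, the batch size, and the agreement threshold of 2 are all assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def chunk(items: List[str], size: int) -> Iterator[List[str]]:
    """Split a list of item IDs into batches small enough for one sitting."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def verified_tags(submissions: Iterable[Iterable[str]], min_agree: int = 2) -> List[str]:
    """Keep only the tags that at least `min_agree` volunteers supplied.

    Each element of `submissions` is one volunteer's tags for the same image.
    Tags are lowercased and stripped so that "Soho" and "soho " count as one,
    and each volunteer is counted at most once per tag.
    """
    counts: Counter = Counter()
    for tags in submissions:
        counts.update({t.strip().lower() for t in tags if t.strip()})
    return sorted(tag for tag, n in counts.items() if n >= min_agree)


# Hypothetical example: three volunteers tag the same slide.
submissions = [
    ["Soho", "sculpture"],
    ["soho", "installation"],
    ["sculpture ", "SoHo"],
]
print(verified_tags(submissions))            # ['sculpture', 'soho']
print(list(chunk(["a", "b", "c", "d", "e"], 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Lowercasing is only a crude stand-in for a controlled vocabulary; a real project would likely map tags to preferred terms before counting agreement.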

Participant brings up search by image — how does it work? John is going to talk about some of that in the next session.