Friday, October 2, 2015

On Open Data and Numismatic Typologies

edit (2 October 2015, 4PM): I want to make it clear that we have been collaborating with numerous members of the Coins and Medals departments for several years now on a few digital projects, including building a close relationship with the Portable Antiquities Scheme. Data usage concerns have been expressed by a small handful of individuals at the British Museum and are not, as far as I can tell, driven by the Trustees of the British Museum.
 

Can the British Museum make their data available with a Creative Commons license, but then restrict how the data are used?


The short answer is yes.

But the long answer in this case is a bit more complicated. The British Museum has authorized the reuse of their data and images under a CC BY-NC-SA 4.0 license, meaning that anyone has the right to use these data for non-commercial purposes as long as the BM is attributed and the creative works derived from these data and images are likewise freely and openly shared. ANS collaborative projects have always adhered to these requirements. For OCRE, CRRO, and PELLA, we have extracted data from the British Museum SPARQL endpoint and transformed these data into the Nomisma ontology. The full list of datasets is available at http://nomisma.org/datasets, and so one may download the entire BM RDF data dump at once or extract any associated data via the Nomisma SPARQL endpoint. Individual coins are also attributed to their collection throughout the various interfaces in our digital type corpus projects.

As the British Museum license currently stands, we (or anyone) have the right to use these images and data in this manner, without the need to ask the BM for permission to do so.

Only if the BM changed their license to the more restrictive ND (No Derivatives) would they be able to exert absolute control over the reuse of their data. This would mean that the public could only download a dump of their CIDOC-CRM RDF in N-Quads. It would not even be permissible to transform these data into RDF/XML for XSLT processing. One could not match the places in their thesaurus to Pleiades URIs and transform the CIDOC CRM into the Open Annotation model used for the Pelagios project. One could not generate CSV out of the data to load into Open Refine or Google Fusion Tables for visualization, or to analyze with R. Of course, a CC ND license would obliterate any potential for reuse of British Museum data, and this is certainly why they have not sought to place this draconian license on their data.

What does this have to do with typologies?


All of the numismatic data in the British Museum SPARQL endpoint are open, and nearly every individual specimen contains at least one reference to a coin type number. By poking around the BM data, I was able to figure out that the reference URI containing 'GC30' as a short title refers to Price's The Coinage in the Name of Alexander the Great and Philip Arrhidaeus. I developed a simple SPARQL query that allowed me to extract a list of nearly 3,000 coins from the British Museum that contained Price references. One could extend this query to gather a list of unique Price references rather than objects, and therefore anyone would be able to generate a significant portion of the typologies from the Price catalog. Now, this catalog would not be complete because Price derived some of his typologies from other collections, such as the American Numismatic Society. The BM endpoint also does not contain a full account of all Alexander coins in the BM collection.
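The step from a list of objects to a list of unique type references is simple enough to sketch. The Python below is only an illustration under stated assumptions: it expects (coin URI, reference string) pairs already pulled from a SPARQL response, and the example.org URIs and reference numbers are made-up placeholders, not real BM records or Price numbers.

```python
import re


def _natural_key(ref):
    # Split "3a" into ["", 3, "a"] so that 3 sorts before 12.
    parts = re.split(r"(\d+)", ref)
    return [int(p) if p.isdigit() else p for p in parts]


def unique_type_references(rows):
    """Collapse (coin URI, reference) pairs into a sorted list of unique references."""
    seen = set()
    for _coin_uri, reference in rows:
        ref = " ".join(reference.split())  # normalise stray whitespace
        if ref:
            seen.add(ref)
    return sorted(seen, key=_natural_key)


# Placeholder rows (hypothetical URIs and numbers, for illustration only).
SAMPLE_ROWS = [
    ("http://example.org/coin/1", "12"),
    ("http://example.org/coin/2", "3a"),
    ("http://example.org/coin/3", " 12 "),
    ("http://example.org/coin/4", "3"),
]

if __name__ == "__main__":
    print(unique_type_references(SAMPLE_ROWS))
```

Several coins sharing one reference collapse to a single entry, which is the difference between a list of specimens and a list of types.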

However, these typologies from Price can be derived from descriptions of individual specimens, and the BM CC BY-NC-SA 4.0 license still applies. This raises the question: can the British Museum exert copyright control over typologies published in print when these same typologies can be freely and openly derived from its own collection database?

In fact, it would be possible to derive other typologies not under British Museum copyright by the same mechanisms. The same goes for the ANS database, which is freely and openly available with an Open Database License. Can the British Museum and ANS even include type numbers within their public databases if it is possible to derive typological data that might be under copyright of another publisher? In the United States, data aren't even copyrightable. And the use of reference numbers in databases, specifically, falls within the realm of Fair Use. If we begin to debate whether or not type numbers may even be referenced on the Web, the only real loser in this debate is the general public.

Proof of Concept: Seleucid Coinage, an American Numismatic Society publication


The URI for Houghton and Lorber's Seleucid Coins: A Comprehensive Catalogue Part I Volume I is http://collection.britishmuseum.org/id/bibliography/6336.

Poking around at the CIDOC CRM structure of the coins associated with SC references, I constructed a SPARQL query that would extract most of the typological data from the endpoint. The result is a bit messy, as SPARQL XML responses tend to be when they express triples row by row, so I took this XML response and wrote some basic XSLT to convert it into CSV that better reflects individual typologies.
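For readers who would rather not write XSLT, the same conversion (SPARQL XML response into one CSV row per type) can be sketched in Python with the standard library. This is a stand-in for the XSLT described above, not a copy of it. The namespace is the one defined by the W3C SPARQL Query Results XML Format; the variable names `type` and `denomination` and the sample values are placeholders I have assumed, not the bindings of the actual query.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
SPARQL_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Placeholder response; variable names and values are assumptions.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="type"/><variable name="denomination"/></head>
  <results>
    <result>
      <binding name="type"><literal>T1</literal></binding>
      <binding name="denomination"><literal>tetradrachm</literal></binding>
    </result>
    <result>
      <binding name="type"><literal>T1</literal></binding>
      <binding name="denomination"><literal>Tetradrachm</literal></binding>
    </result>
    <result>
      <binding name="type"><literal>T2</literal></binding>
    </result>
  </results>
</sparql>"""


def sparql_xml_to_rows(xml_text):
    """Return (column names, list of row dicts) from a SPARQL XML response."""
    root = ET.fromstring(xml_text)
    columns = [v.get("name") for v in root.findall("sr:head/sr:variable", SPARQL_NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", SPARQL_NS):
        row = dict.fromkeys(columns, "")  # unbound variables stay empty
        for binding in result.findall("sr:binding", SPARQL_NS):
            value = binding[0]  # the single uri, literal, or bnode child
            row[binding.get("name")] = (value.text or "").strip()
        rows.append(row)
    return columns, rows


def merge_by_type(columns, rows, key):
    """Fold repeated rows into one row per key, joining distinct values with '|'."""
    merged = {}
    for row in rows:
        slot = merged.setdefault(row[key], {c: [] for c in columns})
        for c in columns:
            if row[c] and row[c] not in slot[c]:
                slot[c].append(row[c])
    return [{c: "|".join(v) for c, v in slot.items()} for slot in merged.values()]


def to_csv(columns, rows):
    """Serialise the row dicts as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    cols, raw = sparql_xml_to_rows(SAMPLE)
    print(to_csv(cols, merge_by_type(cols, raw, "type")))
```

Joining conflicting values with a pipe, rather than picking one, keeps the mess visible so that it can be cleaned up by hand afterwards.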

There are only 11 Seleucid coins in the BM system with Houghton and Lorber 2002 references, but I was able to generate a CSV file for all of the typological data for the SC types. The metadata are strings, but one could easily drop this CSV into Google Spreadsheets to clean up. If we were dealing with a typological dataset that consisted of thousands of types, it could be cleaned up in Open Refine in, probably, less than an hour to fully link all concepts to Nomisma URIs.
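The linking step itself amounts to matching normalized strings against a lookup table, which is roughly what a reconciliation pass in Open Refine does. A minimal Python sketch follows, with loud caveats: the table is illustrative, its entries follow the http://nomisma.org/id/ URI pattern but should be verified against Nomisma before use, and the input labels are invented.

```python
# Illustrative lookup table (an assumption, not an export from Nomisma).
# Keys are normalised labels; values follow the nomisma.org/id/ pattern
# and should be checked against Nomisma itself before being relied on.
LOOKUP = {
    "silver": "http://nomisma.org/id/ar",
    "gold": "http://nomisma.org/id/av",
    "tetradrachm": "http://nomisma.org/id/tetradrachm",
}


def normalise(label):
    """Lowercase the label and collapse whitespace so near-duplicates match."""
    return " ".join(label.casefold().split())


def reconcile(labels, lookup):
    """Return (label -> URI for matches, list of labels left unmatched)."""
    linked, unmatched = {}, []
    for label in labels:
        uri = lookup.get(normalise(label))
        if uri:
            linked[label] = uri
        else:
            unmatched.append(label)
    return linked, unmatched


if __name__ == "__main__":
    linked, unmatched = reconcile(["Silver", " tetradrachm ", "bronze?"], LOOKUP)
    print(linked)
    print(unmatched)
```

Anything left in the unmatched list is what still needs a human eye, which is where most of that hour would go.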

I have written a number of PHP scripts (e.g., this one) to transform CSV into NUDS, and so one of these scripts could be adapted to transform the Nomisma-linked CSV into NUDS for direct publication in Numishare. It is possible to go from BM SPARQL queries to a fully functional digital type corpus like OCRE in about a day's worth of work.

So basically, what I have done here is use the BM SPARQL endpoint to extract open data that comprise typologies that have been published by the ANS and are under ANS copyright. I mean, who cares, right?
