Friday, November 6, 2015

Using XForms to transform Google Spreadsheets into RDF

With XML Amsterdam underway and a lot of XForms action being reported from the conference, it occurred to me that I haven't written any blog posts in the last few months about technical advances in Nomisma or other projects (like the Digital Library application, which features an XForms interface for MODS that accepts a PDF upload, sends the PDF into Solr for full-text indexing, and dynamically generates EPUB files from TEI). The ETDPub application will eventually get a full technical write-up in a journal, probably code4lib.

VoID CRUD

On Nomisma, I worked on a few new features in the backend that have greatly reduced my workload. First, I finally implemented a system that uses the VoID metadata RDF for data dumps contributed to the Nomisma SPARQL endpoint, which facilitates the aggregation of coins for OCRE, CRRO, and our other large projects. The VoID RDF is validated to ensure it contains the required title, description, license, etc., and if valid, the data dump is ingested into the SPARQL endpoint. The dump can be refreshed with the click of a button, or a dataset can be removed from the triplestore entirely by passing in a SPARQL/Update query:

PREFIX nmo:     <http://nomisma.org/ontology#>
PREFIX void:    <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
DELETE {?s ?p ?o} WHERE {
  { ?object void:inDataset <DATASET> ;
      nmo:hasObverse ?s . ?s ?p ?o }
  UNION { ?object void:inDataset <DATASET> ;
      nmo:hasReverse ?s . ?s ?p ?o }
  UNION { ?object void:inDataset <DATASET> ;
      dcterms:tableOfContents ?s . ?s ?p ?o }
  UNION { ?s void:inDataset <DATASET> . ?s ?p ?o }
}


This is a fairly simple workflow, but it isn't yet complete: it only accommodates RDF/XML at the moment (I need to expand it to Turtle and JSON-LD), and while the VoID metadata is validated, the data dumps themselves are not. Still, it saves me a lot of time: I can simply click a button in the user interface to re-ingest a dump when new matches are made between coins in that dump and new types published in OCRE--or simply refresh the OCRE dump when we publish new types. Previously, I had to shut down the triplestore for a few minutes, delete the data directory, and then manually upload each RDF dump into the triplestore via the command line.
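
For illustration, a minimal VoID description with the fields named above might look something like the following sketch (the dataset URI, license, and dump location are hypothetical; the authoritative requirements are defined by Nomisma itself):

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:void="http://rdfs.org/ns/void#">
    <!-- hypothetical dataset URI and values, for illustration only -->
    <void:Dataset rdf:about="http://example.org/dataset/coins">
        <dcterms:title>Example Coin Collection</dcterms:title>
        <dcterms:description>RDF dump of coins contributed to the Nomisma SPARQL endpoint.</dcterms:description>
        <dcterms:license rdf:resource="http://opendatacommons.org/licenses/odbl/"/>
        <void:dataDump rdf:resource="http://example.org/dataset/coins.rdf"/>
    </void:Dataset>
</rdf:RDF>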

Google Spreadsheets to RDF


The other significant advancement has greatly reduced my workload with respect to the batch publication of new concepts (as RDF) into Nomisma. I would occasionally receive spreadsheets of data to upload into Nomisma, each of which required me to author a one-off PHP script to transform CSV into RDF, and there were invariably validation problems in the original data.

I spent 1-2 weeks developing an XForms application that could read a Google Spreadsheet (published as an Atom feed) in order to validate the data and import it as RDF. One begins with a spreadsheet like this.
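
For context: the Google Spreadsheets API of the time published a worksheet as an Atom "list feed," with one entry per row and each column exposed as an element in the gsx: extended namespace--roughly like the sketch below (the column names and values are hypothetical):

<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gsx="http://schemas.google.com/spreadsheets/2006/extended">
    <title>example-nomisma-ids</title>
    <!-- one entry per spreadsheet row; column headings become gsx: element names -->
    <entry>
        <gsx:id>antioch</gsx:id>
        <gsx:preflabel>Antioch</gsx:preflabel>
        <gsx:definition>Mint of ancient Antioch on the Orontes.</gsx:definition>
        <gsx:lat>36.2</gsx:lat>
        <gsx:long>36.16</gsx:long>
    </entry>
</feed>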

The user will be presented with an interface like the one below:


The user may map the spreadsheet headings to allowable RDF properties. There are some basic requirements: there must be a Nomisma ID that conforms to xs:anyURI, exactly one English preferred label, and one English definition, and there may be no duplicate languages among the preferred labels or definitions. If uploading mint IDs, there must be both a latitude and a longitude. That sort of thing. The full list of allowable properties and more specific instructions is at https://github.com/nomisma/framework/wiki/Import-Update-IDs.
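
These sorts of requirements are naturally expressed as XForms binds. A hedged sketch (the instance structure and element names are illustrative, not taken from the Nomisma source):

<xforms:model xmlns:xforms="http://www.w3.org/2002/xforms"
              xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- hypothetical instance holding one mapped spreadsheet row -->
    <xforms:instance id="row">
        <row xmlns="">
            <id/>
            <prefLabel lang="en"/>
            <definition lang="en"/>
            <lat/>
            <long/>
        </row>
    </xforms:instance>
    <!-- the Nomisma ID must be a URI -->
    <xforms:bind nodeset="instance('row')/id" type="xs:anyURI" required="true()"/>
    <!-- one non-empty English preferred label and definition -->
    <xforms:bind nodeset="instance('row')/prefLabel[@lang = 'en']"
        required="true()" constraint="string-length(.) &gt; 0"/>
    <xforms:bind nodeset="instance('row')/definition[@lang = 'en']"
        required="true()" constraint="string-length(.) &gt; 0"/>
    <!-- a mint must carry both coordinates or neither -->
    <xforms:bind nodeset="instance('row')/lat"
        constraint="(. = '' and ../long = '') or (. != '' and ../long != '')"/>
</xforms:model>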

After selecting a valid data mapping (XForms bindings), the user may proceed to the next screen, which validates each row in the spreadsheet to ensure the data values conform to other bindings. For example, there may be no blank values for English preferred labels, and skos:exactMatch, skos:closeMatch, skos:broader, and the like must be URIs matching the regular expression ^https?:// (checked with the XPath matches() function). If everything is valid, the XForms engine transforms the Atom XML into the appropriate RDF/XML model, saves it to the filesystem (for versioning in Github), posts it to the SPARQL endpoint, and then transforms the RDF/XML into an XML document for indexing into the Solr search index.
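
A hedged sketch of that final chain of submissions (the endpoint URLs, file path, and instance names are all hypothetical; the real wiring lives in import.xhtml, linked at the end of this post):

<xforms:model xmlns:xforms="http://www.w3.org/2002/xforms"
              xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <!-- hypothetical instances: the generated RDF/XML and the derived Solr document -->
    <xforms:instance id="rdf"><rdf:RDF/></xforms:instance>
    <xforms:instance id="solr-doc"><add xmlns=""><doc/></add></xforms:instance>
    <!-- write the RDF to the filesystem for versioning in Github -->
    <xforms:submission id="save-rdf" ref="instance('rdf')"
        resource="file:///path/to/nomisma-data/antioch.rdf"
        method="put" replace="none"/>
    <!-- POST the RDF into the SPARQL endpoint (graph store protocol) -->
    <xforms:submission id="post-endpoint" ref="instance('rdf')"
        resource="http://localhost:8080/fuseki/nomisma/data"
        method="post" mediatype="application/rdf+xml" replace="none"/>
    <!-- POST the derived document to Solr for indexing -->
    <xforms:submission id="index-solr" ref="instance('solr-doc')"
        resource="http://localhost:8983/solr/nomisma/update?commit=true"
        method="post" mediatype="text/xml" replace="none"/>
</xforms:model>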

There's a neat additional feature that executes when there's a skos:closeMatch or skos:exactMatch with a Wikipedia or DBpedia URL. An XForms submission queries Wikidata based on the article title to extract labels in other languages (essentially facilitating multilingual interfaces by mapping the alternate languages into skos:prefLabel in the RDF), as well as matching concepts in other vocabulary systems, like VIAF, the Getty AAT/TGN/ULAN, Geonames, etc. At the end of this process, we can produce some pretty sophisticated RDF that links people to URIs in other systems and models their relationship to a political entity or dynasty with the org ontology, e.g., http://nomisma.org/id/muhammad_ahmad_al-mahdi.
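
A hedged sketch of that lookup (wbgetentities and its parameters are the real MediaWiki API; the submission wiring and the example article title are illustrative):

<!-- fetch the Wikidata entity behind an English Wikipedia article title;
     the response carries labels in many languages plus external identifiers -->
<xforms:submission id="wikidata-lookup"
    xmlns:xforms="http://www.w3.org/2002/xforms"
    resource="https://www.wikidata.org/w/api.php?action=wbgetentities&amp;sites=enwiki&amp;titles=Antioch&amp;props=labels|claims&amp;format=xml"
    method="get" serialization="none"
    replace="instance" instance="wikidata-response"/>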

We have created nearly 1,000 new Nomisma concepts this summer through this new spreadsheet import mechanism--a great return on two weeks' labor, freeing me from having to write data processing scripts and pushing the responsibility for creating and updating IDs to the numismatic subject specialists.

This import mechanism is, of course, open source: https://github.com/nomisma/framework/blob/master/xforms/import.xhtml
