Tuesday, October 31, 2017

Nomisma launches OpenRefine reconciliation service

I recently received a spreadsheet of Classical and Archaic Greek coin hoard data, atomized and parsed from the textual descriptions in the Inventory of Greek Coin Hoards (IGCH). I loaded the spreadsheet into OpenRefine to break it down further: separating the Authority column into separate columns for mints, regions, and rulers; parsing uncertain attributions (looking for question marks); separating the numeric count of coins from denomination abbreviations; and so on. The Authority column itself contained mostly mints, all of which are represented by concepts defined on Nomisma.org. The easiest way to reconcile this list would be to run it against an OpenRefine reconciliation API, which did not exist, so I built one between Friday and Monday.

The new service is now listed among the Nomisma APIs, available at http://nomisma.org/apis/reconcile.

The API does not yet support every optional service, but it supports the ones most useful for normalizing data to Nomisma concepts. Here's what it does:

  • The main reconciliation service, returning the basic service JSON when there are no query parameters. Both the 'query' and 'queries' HTTP parameters are supported. These query parameters are parsed into one or more Solr queries to yield a response.
  • The Preview API, for displaying a little HTML popup when hovering the mouse over a reconciled candidate (a simple serialization from the concept RDF/XML into HTML).
  • The Entity Suggest API, which allows a user to type in a new term, yielding an autosuggest response (also a Solr response serialized into JSON). The suggest flyout is also supported: a serialization of the concept RDF/XML into a tiny JSON snippet displayed when hovering over the autosuggested term.
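
To make the 'queries' parameter concrete, here is a minimal client-side sketch in Python (not part of the Nomisma code base). It assumes the batch format of the standard OpenRefine reconciliation API, in which 'queries' is a JSON object keyed by arbitrary query ids. The function name, the "nmo:Mint" type value, and the limit of 3 are illustrative assumptions; the service's own default JSON lists the types it actually accepts.

```python
import json
from urllib.parse import urlencode

# Endpoint as given in this post.
RECONCILE_URL = "http://nomisma.org/apis/reconcile"


def build_reconcile_url(terms, type_uri=None, limit=3):
    """Build a GET URL that carries a batch of queries in the 'queries' parameter.

    Hypothetical helper: one query object per term, keyed q0, q1, ...
    """
    queries = {}
    for i, term in enumerate(terms):
        query = {"query": term, "limit": limit}
        if type_uri is not None:
            # Assumed type identifier, for illustration only.
            query["type"] = type_uri
        queries[f"q{i}"] = query
    # urlencode percent-encodes the JSON so it is safe inside a query string.
    return RECONCILE_URL + "?" + urlencode({"queries": json.dumps(queries)})


url = build_reconcile_url(["Athens", "Corinth"], type_uri="nmo:Mint")
print(url)
```

Requesting the bare endpoint with no parameters should return the basic service JSON instead, as described in the first bullet above.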

These are the most vital services, and they enabled me to normalize about 3,500 of the 4,000 lines of the CSV to existing Nomisma mints, regions, and rulers. About half of the remaining 500 are multiple mints with the same place name that need to be verified on a case-by-case basis (by checking the IGCH or Coin Hoards record). After that, the only non-reconciled values remaining are rulers that do not have Nomisma IDs yet. We can extract a facet list of these values, generate the Nomisma IDs, and then reconcile the list.

I spent about two days working on these APIs, and then did almost 90% of the matching in five minutes.

Under the Hood

Parsing the OpenRefine query JSON into Solr searches

This service has been implemented on Nomisma.org, which has just been migrated onto our new dedicated server, running Orbeon Forms 2017.1. This new version of Orbeon is essential because it supports new JSON processing features in XSLT that were previously unavailable in the older version of Orbeon running on the old Nomisma cloud server.

Since Orbeon does not yet include Saxon HE 9.8 (and Saxon's JSON parsing features according to the XSLT 3.0 spec), I have used Orbeon's internal xxf:json-to-xml() function in order to parse the JSON passed in by the OpenRefine query/queries parameters.

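<xsl:variable name="q">
 <xsl:variable name="query" as="node()*">
  <!-- body omitted in this excerpt: the query/queries JSON parsed with xxf:json-to-xml() -->
 </xsl:variable>

 <!-- compile the q parameter -->
 <xsl:apply-templates select="$query/json"/>
</xsl:variable>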

The query/queries parameters, now parsed as XML according to the XForms 2.0 specification, are passed through XSLT stylesheets to form a Solr query containing the query keyword and optional types (RDF classes, listed among the default types in the service's default reconcile JSON response).

The OpenRefine query:


Becomes the Solr params:


The query or queries are then passed through the Orbeon URL generator, and each Solr XML response is passed through another XSLT transformation that generates an XML metamodel representation of JSON. This metamodel then passes through another small set of XSLT templates into JSON proper, which is serialized as application/json and returned by the server.
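
The translation step described in this section can be sketched outside of XSLT. The following Python is only an illustration of the idea, not the stylesheet that runs on Nomisma.org: the Solr field names ("prefLabel", "type"), the default row count, and the function names are all assumptions rather than the actual Nomisma schema.

```python
import json


def escape_solr_phrase(text):
    """Escape backslashes and double quotes so text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_solr_params(query, text_field="prefLabel", type_field="type"):
    """Translate one reconciliation query object into Solr request parameters.

    text_field and type_field are placeholder field names (assumptions).
    """
    clauses = [f'{text_field}:"{escape_solr_phrase(query["query"])}"']
    types = query.get("type")
    if types:
        # The type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        alternatives = " OR ".join(f'"{escape_solr_phrase(t)}"' for t in types)
        clauses.append(f"{type_field}:({alternatives})")
    return {
        "q": " AND ".join(clauses),
        "rows": query.get("limit", 10),  # default of 10 is an assumption
        "wt": "xml",  # the post describes processing Solr's XML response
    }


def to_solr_batch(queries_json):
    """Parse the 'queries' parameter and return Solr parameters keyed by query id."""
    return {key: to_solr_params(q) for key, q in json.loads(queries_json).items()}


batch = to_solr_batch('{"q0": {"query": "Athens", "type": "nmo:Mint", "limit": 3}}')
print(batch["q0"]["q"])
```

Keeping the query ids (q0, q1, ...) as keys matters, because the JSON returned to OpenRefine has to pair each result list with the id of the query that produced it.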

Enabling POST as an Orbeon service in the Page Flow Controller

Although I was successfully constructing Solr queries and receiving the correct JSON response from the reconciliation API (with GET in my browser or via curl), OpenRefine generated errors when I attempted to add the new reconciliation service. Checking the Orbeon logs, I noticed a lot of 403 errors. As it turns out, OpenRefine POSTs to the main reconciliation service as multipart/form-data instead of using GET. This was an easy fix: changing the reconciliation service from a 'page' to a 'service' in the Orbeon Page Flow Controller and enabling the POST @public-method, according to the Authorization documentation.

Next Steps

There are two other optional suggest APIs that may be implemented (for types and properties), but I am not sure how useful these are compared to the services I have already implemented.

The next logical step is to build a reconciliation service directly into Numishare itself, activating it for coin type corpora like OCRE, CRRO, and PELLA. This would allow users to reconcile RIC, RRC, and Price numbers to URIs in these online type corpora. The property suggestion APIs will be especially useful here, as they will enable other columns in the source spreadsheet (particularly Roman emperors when normalizing against OCRE) to be used as additional Solr query parameters ('authority_text:Augustus'). Therefore, columns containing only an RIC number, with no additional page number or authority information, can be normalized, as databases will [hopefully] always include the emperor.
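
As a rough sketch of how that might work (nothing like this exists in Numishare yet), the 'properties' array that OpenRefine attaches to a query could be turned into extra Solr clauses. The Python below assumes the standard reconciliation query format, where each property carries a 'pid' and a value 'v'. Mapping a pid to a Solr field by appending '_text' is purely an assumption modeled on the 'authority_text:Augustus' example, and the function name is hypothetical.

```python
def property_clauses(query, field_suffix="_text"):
    """Turn the optional 'properties' of a reconciliation query into Solr clauses.

    Hypothetical mapping: Solr field name = property id + field_suffix.
    """
    clauses = []
    for prop in query.get("properties", []):
        # Escape backslashes and quotes so the value is safe as a quoted phrase.
        value = str(prop["v"]).replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{prop["pid"]}{field_suffix}:"{value}"')
    return clauses


# A row holding only an RIC number, plus the emperor from another column.
query = {"query": "RIC 1", "properties": [{"pid": "authority", "v": "Augustus"}]}
print(property_clauses(query))
```

These clauses would then be joined to the main keyword query, narrowing an otherwise ambiguous RIC number to the types issued under a single emperor.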

This will save a tremendous amount of time, removing the need for the scripts that I have written to parse large datasets harvested from the British Museum, Cambridge, Harvard, and many other Nomisma contributors. With an OCRE reconciliation service, I will be able to normalize RIC numbers much more efficiently without additional scripting, or, better yet, punt the reconciliation off to museum curators.

Perhaps this is the answer to getting the massive number of Roman imperial coins from the Portable Antiquities Scheme into OCRE.
