Tuesday, November 7, 2017

Numishare supports OpenRefine reconciliation APIs for OCRE, PELLA, and CRRO

After building a reconciliation service for Nomisma concepts, I began working on applying the same methodologies to creating an OpenRefine reconciliation API for coin type corpora projects published in the Numishare platform. The API has been extended to support suggestions for properties. These properties are facet/string (exact match) or text (keyword anywhere) fields for mints, rulers, denominations, etc. that have been indexed into Apache Solr. It may be possible to extend this property list to dates, legends, or other indexed fields.

Property suggestion API is derived from available Solr facet fields

Test Case: University of Graz Roman imperial coins

I received a spreadsheet of about 2,000 Roman imperial coins with RIC numbers and emperors from Elisabeth Steiner at the University of Graz. I performed some cleanup of the RIC numbers and normalized the emperor list to English preferred labels via the Nomisma OpenRefine reconciliation service (more details below). About half of the coins normalized to OCRE IDs on the first pass (which took 45 seconds). The majority of non-matches fell into two categories: RIC numbers that OCRE had split into separate URIs due to differences in denomination, and coins from RIC 6-8, where the numbering restarts based on mint rather than ruler. To address these issues, I obtained an updated spreadsheet that contained columns for mint and denomination.

Filter for uncertain attributions, 'od.'

My workflow was as follows:

Main reconciliation service + property search

To generate the most accurate response, it was necessary to eliminate anything that could be construed as uncertain, and to remove both subtype letters/numbers (since most volumes do not yet have published subtype URIs) and RIC volume numbers, which are inconsistent in the source data. The main reconciliation API queries the Solr "title_text" field, which holds the title of the coin type. In CRRO and PELLA, it's simply "RRC 100/2" or "Price 100," so typically you would not need additional search properties to normalize a spreadsheet against RRC or Price numbers. In OCRE, it's a bit more complex, and it helps to understand how the titles have been structured. The title consists of the RIC volume number (according to our standard description), the ruler(s) according to the RIC section title, and the RIC number. The exceptions are RIC 6-8, where the section is based on the mint name (standard English preferred label from Nomisma) instead of authority.
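As a rough illustration of how a title query combines with property filters, here is a minimal sketch of a query batch in the shape used by the OpenRefine reconciliation protocol (a "query" string plus a list of "pid"/"v" pairs). The property IDs "authority_facet" and "mint_facet" are hypothetical placeholders; the actual IDs should be taken from the API documentation of each project.

```python
import json

def build_query(ric_number, properties=None):
    """Build one reconciliation query: a title search plus optional property filters."""
    query = {"query": ric_number}
    if properties:
        # Each property is sent as a {"pid": ..., "v": ...} pair.
        query["properties"] = [
            {"pid": pid, "v": value} for pid, value in properties.items()
        ]
    return query

# Hypothetical property IDs; check the service's property list for the real ones.
queries = {
    "q0": build_query("207", {"authority_facet": "Augustus", "mint_facet": "Lugdunum"}),
}

# The batch is serialized to JSON before being sent to the service.
payload = json.dumps(queries)
```

For CRRO or PELLA, where the title alone is usually sufficient, `build_query("RRC 100/2")` with no properties would be the typical case.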

Because RIC volume number descriptions in museum or archaeological database fields may vary, it is important to isolate the RIC number alone and use additional property columns to disambiguate it. For Graz, I took the following steps:

Initial cleanup and Nomisma reconciliation
  1. Filter out uncertain attributions '-', '?', 'od.', '='
  2. Remove subtype numbers or letters (capital letters), since most volumes in OCRE do not go to the subtype level (only RIC 9 and Valerian/Gallienus)
  3. Remove RIC volume numbers in order to isolate the RIC number alone as best as possible
  4. Reconcile mint, denomination, and ruler columns to Nomisma IDs. Add new columns that contain only the English preferred label, to be used as exact matches against facet fields via OpenRefine properties
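The first three cleanup steps above could be sketched outside OpenRefine roughly as follows. This is only an approximation under assumptions: the raw reference format ("RIC", an optional Roman-numeral volume, then the number) is a guess at what such a spreadsheet column might contain, and the sketch only handles trailing subtype letters, not subtype numbers.

```python
import re

# Markers of uncertain attributions, taken from step 1.
UNCERTAIN_MARKERS = ("-", "?", "od.", "=")

# Assumed format: "RIC", optional Roman-numeral volume with optional part ("IV/1").
VOLUME_PREFIX = re.compile(r"^\s*RIC\s+(?:[IVX]+(?:[/.,]\s*\d+)?\s+)?", re.IGNORECASE)

# Trailing subtype letters directly after the number, e.g. "123a" or "45 A".
TRAILING_SUBTYPE = re.compile(r"(?<=\d)\s*[A-Za-z]+$")

def is_uncertain(reference):
    """Return True if the raw reference contains any uncertainty marker."""
    return any(marker in reference for marker in UNCERTAIN_MARKERS)

def isolate_ric_number(reference):
    """Strip the 'RIC' + volume prefix and any trailing subtype letters."""
    number = VOLUME_PREFIX.sub("", reference.strip())
    return TRAILING_SUBTYPE.sub("", number).strip()

# Keep only confidently attributed rows, then reduce each to its RIC number.
rows = ["RIC II 123a", "RIC II 45 od. 46", "RIC IV/1 45 A"]
cleaned = [isolate_ric_number(r) for r in rows if not is_uncertain(r)]
```

Rows flagged by `is_uncertain` are set aside rather than guessed at, which mirrors the filtering step rather than trying to repair ambiguous references.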

OCRE reconciliation
  1. I ran reconciliation on RIC number + emperor first. This matched the first 50% of records in the spreadsheet. The majority of outstanding non-matches fell into two categories: RIC 6-8, where the numbering restarts based on mint, and earlier RIC volumes (mainly 2-3), where we split ID numbers into two or more distinct URIs for different denominations to which the editors had applied the same RIC number.
  2. Filtering out the records that had already been successfully matched, I re-ran reconciliation with both ruler AND mint as additional properties. This matched nearly another 400 rows.
  3. Among the 500 remaining, I re-ran with ruler, mint, and denomination. After some automatic matching followed by manual clean-up (primarily to pick parent types rather than subtypes, or to choose the correct authority between Septimius Severus, Caracalla, and Geta), 300 rows remained. These were the more difficult to reconcile RIC 6-8 volumes. Many identifiers ended in a letter for a subtype that had not been published. I replaced any remaining trailing letters with an * character, which results in a wildcard search of the OCRE title field. I expected this to yield further candidate matches that could be checked against the source records.
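The trailing-letter replacement from the last step can be sketched as a small helper. This is an illustration only, assuming the identifier has already been reduced to a number with an optional letter suffix; in practice the same substitution could be done with a transform inside OpenRefine.

```python
import re

def to_wildcard(ric_number):
    """Replace trailing subtype letters after the digits with a '*' wildcard."""
    return re.sub(r"(?<=\d)\s*[A-Za-z]+$", "*", ric_number.strip())

# Identifiers without a letter suffix pass through unchanged.
examples = [to_wildcard(n) for n in ["123a", "45 B", "207"]]
```

Because a wildcard search can return several candidate types, the results still need to be checked against the source records.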

Manual lookups with the Entity Suggest + Preview APIs

The entity suggestion API in OpenRefine does not include the properties that had already been designated for the main service. Instead, this API searches the fulltext index field for any keywords available within the coin type metadata, and not just the title. As a result, you may need to include the ruler, mint, denomination, and the RIC number, using the * and ? characters to designate wildcards according to the Lucene query syntax. You can even include portions of the legend, which have been indexed into the fulltext field for an OCRE record.
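A keyword string for such a manual lookup might be assembled as in the sketch below. The helper name is hypothetical, and quoting multi-word terms as Lucene phrases is a choice made here for illustration, not something the service is known to require. Wildcard terms are left unquoted, since wildcards do not apply inside a quoted phrase.

```python
def build_keyword_query(*terms):
    """Join search terms into one keyword string, quoting multi-word terms as phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty cells
        # Multi-word values such as "Septimius Severus" become quoted phrases.
        parts.append(f'"{term}"' if " " in term else term)
    return " ".join(parts)

# Ruler, mint, denomination, and a wildcarded RIC number in one query string.
query = build_keyword_query("Septimius Severus", "Rome", "denarius", "26*")
```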

In order to aid in the identification of coins with ambiguous OCRE references, the Preview API popup includes a reasonable amount of metadata: date, mint, denomination, portrait, and obverse/reverse legend and type description. The Preview API will also execute a SPARQL query for a single object that has an image in the Nomisma.org endpoint associated with that coin type.

Preview API with SPARQL-derived associated image

I am hoping that this API will make it much easier for archaeological and museum databases to associate their objects with OCRE URIs, thus paving the way to a greater integration of these objects into OCRE and the broader Ancient World Linked Data cloud.

Updates to API pages in OCRE, CRRO, and PELLA

The API pages in each project have been updated to include some basic documentation on how to interact with the OpenRefine reconciliation service and available properties. The services are available at /apis/reconcile, e.g., http://numismatics.org/ocre/apis/reconcile.
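For anyone calling the service outside OpenRefine, a request URL could be assembled along these lines. This is a sketch under assumptions: it follows the general reconciliation protocol convention of passing a JSON batch in a "queries" parameter, and it has not been verified that the Numishare endpoint accepts the batch as a GET parameter. No request is actually sent here.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

RECONCILE_URL = "http://numismatics.org/ocre/apis/reconcile"

def reconcile_url(queries, base_url=RECONCILE_URL):
    """Encode a batch of reconciliation queries into a request URL."""
    return base_url + "?" + urlencode({"queries": json.dumps(queries)})

def decode_queries(url):
    """Recover the query batch from a URL, useful for checking what was encoded."""
    return json.loads(parse_qs(urlsplit(url).query)["queries"][0])

url = reconcile_url({"q0": {"query": "207"}})
```

Swapping "ocre" for "crro" or "pella" in the base URL would target the other projects, following the /apis/reconcile pattern described above.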
