numishare: Recommendations for numismatic spreadsheet standardization

Over the years, we have considerably refined the way in which we organize our spreadsheets for processing into NUDS XML files and upload into the Numishare platform. Our workflow started with Online Coins of the Roman Empire, where numerous interns worked over the course of four years (the final three funded by the NEH) to produce dozens of spreadsheets (typically one per emperor) encompassing more than 40,000 types.

Many of the primary typological categories, such as denomination, mint, and authority, contained Nomisma URIs, and textual categories, e.g. legend and type description, were columns of free text. These spreadsheets (Excel files) were exported into CSV and processed through a PHP script that I wrote to transform each row into a NUDS document, and then this batch of files would be uploaded with the eXist-db XML database client into the appropriate Numishare collection. After this, I would manually edit the code in the Admin panel in Numishare to index the most relevant batch of RIC IDs into Solr for the public-facing browse and search interfaces (so as not to reindex an entire collection of 40,000 types when a new or updated spreadsheet might only contains several hundred items).

With the publication of PELLA in 2015, we implemented a key->pair stylesheet that enabled us to connect obverse and reverse type description codes to each unique description, with columns for English, French, and German translations. The OCRE PHP script was modified to accommodate this new model. Subsequent type corpora have been published for Ptolemaic and Seleucid coinage, each with a slight variation of yet another PHP script. Furthermore, with partners in the Netherlands, Switzerland, England, and Italy deploying their own Numishare collections for type corpora and/or collections of physical specimens, the wide range of slightly different spreadsheet models require an ever-diverse set of scripts that need to be manually maintained. It has long been a goal of mine to implement a standardized spreadsheet import into Numishare itself, modeled on the XForms-based validation and transformation of Google Sheet's Atom XML API implemented several years ago in Nomisma.org.

Mapping Google Sheets columns to NUDS elements

Finally, after about a month of development and testing, a Google Sheets-based spreadsheet import is functional in Numishare. It is primarily focused on type corpora at the moment, as not all of the physical and administrative descriptors have been implemented for mapping spreadsheet column headings of numismatic objects.

Some things remain the same:

Typological categories must map to Nomisma URIs
References for physical objects can be a coin type URI of some sort, a plain literal, or a combination of a type series and type number separated by a | character. The type series must be a literal or a Nomisma URI for a type series, but I am to enable support for Zenon bibliographic URIs
Parent IDs (skos:broader) and deprecation-related IDs (dcterms:replaces or dcterms:isReplacedBy) must be contained in the spreadsheet.
A question mark can trail a Nomisma URI to denote uncertainty. This is parsed in the XForms engine to insert the appropriate uncertainty URI into the NUDS XML.
Columns for symbols/monograms located at certain positions on the obverse and reverse can be mapped to the positions listed in the Numishare config.

Structured XML produced from a spreadsheet

Other types of information requirements must be met in order for the spreadsheet to validate, which means that certain data must be explicit and not automatically inserted by a script. For example, each NUDS XML document requires a title. This title was typically generated in the PHP script by some concatenation of a human-readable string with the type number parsed from an ID column. Similarly, all coin types and all physical specimens NOT linked to a coin type URI must have an Object Type URI in the spreadsheet, even if that URI for all objects is nm:coin.

All of this normalization can occur in a pre-processing phase in OpenRefine: automatic generation of titles through regular expressions, reconciliation of typological columns to Nomisma URIs through Nomisma's OpenRefine API, etc.

This new spreadsheet import also requires type descriptions to be present in the typological spreadsheet, which means rethinking the way in which descriptions are connected to the main typology spreadsheet. Instead of a separate stylesheet spreadsheet of key->pair combinations between codes and translations, this stylesheet is incorporated as a second sheet in the typological spreadsheet. It is therefore possible to create a VLOOKUP formula between the unique type description code in the typology sheet and the corresponding column in the description stylesheet (see https://docs.google.com/spreadsheets/d/e/2PACX-1vQoyHYDyh79oJuoW9m2g9BNbnysyVWjl13KQNEyTF5dgXswQwgekXMvIDTAH3onwN35c1P9eXeJAD4w/pubhtml). Therefore, the type descriptions can still be maintained with the ease of making one change to a description in Sheet #2, and the change will immediately propagate into the Atom feed.

VLOOKUP to control type descriptions

I have applied the same logic for concordances. A single concordance sheet can be maintained and propagated across multiple relevant type corpora.

See for example the Svoronos 1904 corpus of Ptolemaic coinage: https://docs.google.com/spreadsheets/d/e/2PACX-1vSSxfdRUvq_PZOlvt3Od1T1gu29wOSQub6DwqQviq1TMRs2gDCWRA4u0i0cqHaHWchJ9Zt3pq03pc0t/pubhtml

This contains a partial concordance between Svoronos numbers and the types from Catharine Lorber's Coins of the Ptolemaic Empire vol I, part I (gold and silver from Ptolemy I - IV as published in Ptolemaic Coins Online).

By eliminating the intermediary scripting and XML upload/indexing process, scholars will be able to use OpenRefine to prepare their data without much technical intervention and publish their type or specimen data into Numishare without significant IT overhead. This alone will save me quite a lot of time: a month of development up front to save at least the same amount of time per year in redundant scripting and OpenRefine data cleaning.

After a spreadsheet is uploaded, it will be indexed directly into Solr, if the types are active (not deprecated by newer URIs) and the indexing option has been enabled.

Full documentation of the spreadsheet upload is forthcoming.

Pages

Wednesday, August 14, 2019

Recommendations for numismatic spreadsheet standardization

No comments:

Post a Comment