Friday, April 27, 2018

Improving OCRE OpenRefine reconciliation with regex

I have made a slight update to improve the matching of OCRE coin types through the Numishare type-based OpenRefine reconciliation API. The reconciliation API queries the "title", which is indexed as a text field in Solr. As detailed in a previous blog post, it functions most accurately when you reduce your reconciliation column down to the RIC number and use authority/mint/denomination as an additional property.

Until now, this approach would miss many potential attributions to numbered subtypes that were never given parent type URIs in OCRE. The Hadrianic types offer some examples: the British Museum has assigned the type number '14', but OCRE has no Hadrian 14, only 14a and 14c. The API update appends the following regex to the title field Solr search: '(\(?[a-zA-z]\)?)?', resulting in the query "title_text:/14(\(?[a-zA-z]\)?)?/". This looks for a single optional letter, lower-case or upper-case, that may itself be enclosed in parentheses.
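To make the effect of the suffix concrete, here is a minimal Python sketch. It is not the Numishare code: the helper names build_title_query and matches_token are hypothetical, and it assumes that Solr/Lucene regex queries match against a whole indexed term, which re.fullmatch imitates here. The pattern is copied exactly as quoted above; note that the range A-z, as written, also covers the handful of ASCII punctuation characters that sit between 'Z' and 'a'.

```python
import re

# Suffix exactly as quoted in the post: one optional letter, optionally
# wrapped in parentheses. As written, the range A-z also includes the
# ASCII characters [ \ ] ^ _ ` that fall between 'Z' and 'a'.
SUFFIX = r"(\(?[a-zA-z]\)?)?"


def build_title_query(ric_number: str) -> str:
    """Build the Solr query string in the form shown in the post.

    Hypothetical helper for illustration only.
    """
    return f"title_text:/{ric_number}{SUFFIX}/"


def matches_token(ric_number: str, token: str) -> bool:
    """Check a single title token against the number plus suffix.

    re.fullmatch stands in for whole-term regex matching (an assumption
    about how the indexed title terms are compared).
    """
    pattern = re.escape(ric_number) + SUFFIX
    return re.fullmatch(pattern, token) is not None


if __name__ == "__main__":
    print(build_title_query("14"))
    for token in ["14", "14a", "14c", "14(a)", "14C", "140", "14ab"]:
        print(token, matches_token("14", token))
```

Under these assumptions, a bare '14' from the British Museum data would now surface both 14a and 14c as candidates, while unrelated numbers such as 140 stay excluded.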



When running the API against more than 2,000 coins of Hadrian from Rome from the British Museum, about 500 had a 100% automatic match, and another 1,500 yielded two or more potential matches. Before this regex tweak, a significant portion of the 1,500 coins that didn't automatically match had no suggestions at all, so each one required a manual lookup through the "Search for Match" function, typing the type into its autosuggest box.
