Monday, January 26, 2026

Improving searchability using Natural Language Processing and Wikidata

The Python Natural Language Toolkit (NLTK) and FastAPI form the foundation for major new improvements in the searchability of various numismatic projects, beginning with American Numismatic Society type corpora such as OCRE and Hellenistic Royal Coinages. These improvements currently apply to all online type corpora published by the ANS, and records in the ANS collections database, MANTIS, are gradually being reindexed to improve the searchability of large swaths of the ANS's collection of Greek and Roman coins that have been linked to these coin type URIs. This work began in the fall and was put into production over the course of December 2025.

The Problem: Words that lack meaning

There are 273 results in Online Coins of the Roman Empire when searching for the term "serpent" and 589 when searching for "snake". Searching in MANTIS and other Numishare-based platforms is based quite literally on the explicit entry of words within the obverse and reverse type descriptions of coins and types. Such descriptions are subject to regional idiosyncrasies of terminology. There is nothing inherent within the search index software upon which Numishare is built, Apache Solr, to suggest that a serpent and a snake are two words for the same type of animal, and the use of these two words is not only inconsistent across Roman Imperial Coinage volumes, but also within the same section of the same volume (e.g., for Galba in RIC I). The same problem plays out in MANTIS, where there are five objects featuring the word "puma," one "cougar," and two with the very rarely used "catamount." Some residents of the United States refer to these as "mountain lions," although there are no results for this exact phrase in MANTIS. If a researcher is interested in all of the depictions of reptiles, birds, cats, architecture, etc. on coins and medals, there is simply no way to capture all possible results without the introduction of a semantic layer on the descriptive text.

 

A search for cougar in MANTIS, with one result instead of eight

The Solution: NLP + Wikidata

The first step in equating "serpent" with "snake," so that either term returns the same results, is to use the Python Natural Language Toolkit (NLTK) to parse textual descriptions of coins into their constituent parts of speech.

As an example, we'll use this type description: "Asclepius, nude, standing front, head left, leaning on small staff with serpent coils." After tokenizing the description into `words`, we'll analyze their parts of speech:

tagged_words = nltk.pos_tag(words)

[('Asclepius', 'NNP'), 
('nude', 'NN'), 
('standing', 'VBG'), 
('head', 'NN'),
('leaning', 'VBG'), 
('on', 'IN'), 
('small', 'JJ'), 
('staff', 'NN'),
('with', 'IN'), 
('serpent', 'NN'), 
('coils', 'NNS')]

Although there is potential in interpreting types of movement and action on coins (verbs), our focus in this phase is on the nouns, whether proper or not. The nouns are essentially concepts, like 'Asclepius,' 'staff,' and 'serpent.'
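As a minimal sketch of that noun filter, the snippet below takes the tagged output shown above and keeps only the tokens whose Penn Treebank tag begins with NN (NN, NNS, NNP, NNPS). The function name `extract_nouns` is hypothetical, and the production code may filter differently.

```python
# Tagged output copied from the example above.
tagged_words = [
    ("Asclepius", "NNP"),
    ("nude", "NN"),
    ("standing", "VBG"),
    ("head", "NN"),
    ("leaning", "VBG"),
    ("on", "IN"),
    ("small", "JJ"),
    ("staff", "NN"),
    ("with", "IN"),
    ("serpent", "NN"),
    ("coils", "NNS"),
]


def extract_nouns(tagged):
    """Return tokens whose tag marks a noun, common or proper (NN*)."""
    return [word for word, tag in tagged if tag.startswith("NN")]


nouns = extract_nouns(tagged_words)
print(nouns)
# ['Asclepius', 'nude', 'head', 'staff', 'serpent', 'coils']
```

The tagger's labels are taken as-is here, so a word like "nude" passes through as a noun; deciding which of these candidates are real concepts is left to the reconciliation step.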

There is nothing inherent within NLTK to suggest the semantic meaning of these words, so we need to introduce the intermediary step of reconciling them to their related concepts in Wikidata.org. This was done relatively easily in OpenRefine. It also has the added benefit, essential for a scholarly dataset, of introducing human vetting (by me) to control the quality of the output. These semantic assertions should not be left to an LLM without human verification: LLM results are not consistent or dependable, which is one of several reasons why we cannot rely on an LLM to improve the searchability of the data.

Reconciling terms in OpenRefine
 

Following the reconciliation of the few thousand concepts that appear throughout Hellenistic Royal Coinages, OCRE, and Coinage of the Roman Republic Online (CRRO), the lookup table between search term, Wikidata URI, and preferred label was loaded into SQLite. I used FastAPI in Python to create a simple lookup service that converts pre-processed type descriptions into related concepts in JSON, which are then integrated into the NUDS records in Numishare. Once the concept keywords are in the NUDS record, Numishare uses them to index preferred labels, alternative labels, and labels of hierarchical parent concepts into Solr for search.

FastAPI response that uses NLP to look up terms in SQLite table after reconciliation
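The lookup behind such a response can be sketched with the standard library alone. Everything here is an assumption made for illustration: the table name `concepts`, its three columns, the function name `lookup_concepts`, and the URIs, which are obvious placeholders and not real Wikidata identifiers. In the deployed service a function like this would sit behind a FastAPI route; that wrapper is omitted so the sketch runs without extra dependencies.

```python
import json
import sqlite3

# In-memory stand-in for the reconciled lookup table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concepts (term TEXT PRIMARY KEY, uri TEXT NOT NULL, label TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO concepts VALUES (?, ?, ?)",
    [
        # Placeholder URIs, NOT real Wikidata QIDs.
        ("serpent", "http://www.wikidata.org/entity/Q-PLACEHOLDER-SNAKE", "snake"),
        ("snake", "http://www.wikidata.org/entity/Q-PLACEHOLDER-SNAKE", "snake"),
        ("staff", "http://www.wikidata.org/entity/Q-PLACEHOLDER-STAFF", "staff"),
    ],
)


def lookup_concepts(conn, nouns):
    """Map nouns to reconciled concepts, skipping unknown terms and duplicate URIs."""
    concepts = []
    seen_uris = set()
    for noun in nouns:
        row = conn.execute(
            "SELECT uri, label FROM concepts WHERE term = ?", (noun.lower(),)
        ).fetchone()
        if row is not None and row[0] not in seen_uris:
            seen_uris.add(row[0])
            concepts.append({"term": noun, "uri": row[0], "label": row[1]})
    return concepts


result = lookup_concepts(conn, ["Asclepius", "staff", "serpent", "coils"])
print(json.dumps(result, indent=2))
```

Because "serpent" and "snake" share one URI and one preferred label in the table, either word in a description resolves to the same concept.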

To make indexing as efficient as possible, the alternative and hierarchical labels are also extracted from Wikidata via SPARQL in an intermediary process, stored in SQLite, and served via FastAPI. Wikidata limits the number of requests allowed in a given period of time, so a collection as large as OCRE could not be indexed against the live service without pausing frequently to stay within Wikidata's API policy.

Wikidata SPARQL query extracting labels and hierarchical labels for concepts
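The query in the figure is not reproduced in the text, so the following is only a sketch of what such a query might look like. It assumes that alternative labels come from `skos:altLabel` and that hierarchical parents are reached through Wikidata's "subclass of" property (P279); the production query may rely on other properties, such as "parent taxon" (P171) for animals. The function name `build_label_query` is hypothetical, and the snippet only builds the query string without sending it.

```python
import re


def build_label_query(qid: str, lang: str = "en") -> str:
    """Build a SPARQL query for alternative and ancestor labels of one concept."""
    # Validate inputs so that nothing unexpected is spliced into the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata QID: {qid!r}")
    if not re.fullmatch(r"[a-z]{2,3}", lang):
        raise ValueError(f"not a simple language code: {lang!r}")
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?kind ?label WHERE {{
  {{
    wd:{qid} skos:altLabel ?label .
    BIND("alternative" AS ?kind)
  }} UNION {{
    wd:{qid} wdt:P279+ ?parent .   # assumption: hierarchy via "subclass of"
    ?parent rdfs:label ?label .
    BIND("hierarchical" AS ?kind)
  }}
  FILTER(LANG(?label) = "{lang}")
}}
""".strip()


# "Q1" only stands in for any reconciled QID from the lookup table.
query = build_label_query("Q1")
print(query)
```

Running a query like this once per reconciled concept and caching the rows in SQLite is what keeps the later indexing pass from hitting Wikidata's request limits.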

Technical Workflow Summary

The technical workflow is thus summarized as follows:

  1. Parse descriptions, extract iconographic concepts (Python: NLTK)
  2. Reconcile concepts to a controlled vocabulary system, Wikidata.org (OpenRefine)
  3. Extract alternative labels and hierarchical labels from Wikidata (Python/SPARQL)
    1. "snake" = "serpent"
    2. "snake" is a type of "reptile" which is a type of "animal" etc.
  4. Create a fast lookup mechanism for integration into the web (Python: FastAPI)
  5. Index terms for improved search (Numishare) 

This workflow was repeated for each ANS type corpus following the development of the prototype on the relatively small and well-curated Coinage of the Roman Republic Online.
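To make steps 3 and 5 concrete, here is a small sketch of how one preferred label might expand into the full set of terms sent to the search index, using the "snake" example from the list above. The dictionaries and the function name `index_terms` are hypothetical, and the hierarchy is simplified to a single parent per concept, whereas a Wikidata concept can have several.

```python
# Hypothetical label data, following the "snake" example in the workflow.
ALT_LABELS = {"snake": ["serpent"]}
PARENTS = {"snake": "reptile", "reptile": "animal"}


def index_terms(label):
    """Return the preferred label, its alternative labels, then its ancestors."""
    terms = [label] + ALT_LABELS.get(label, [])
    seen = {label}
    parent = PARENTS.get(label)
    # Walk up the hierarchy; the `seen` set guards against accidental cycles.
    while parent is not None and parent not in seen:
        terms.append(parent)
        seen.add(parent)
        parent = PARENTS.get(parent)
    return terms


print(index_terms("snake"))
# ['snake', 'serpent', 'reptile', 'animal']
```

Indexing all four terms for a type described with "serpent" is what allows a search for "snake," "reptile," or "animal" to return it.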

Results

Following the implementation of NLP into the indexing workflow, there are now more than 700 results for "serpent" or "snake" in OCRE. Either term yields the same results, as expected. There are more than 4,500 results for "architecture" since altars, gates, arches, and temples are all classified as architecture in Wikidata. This is a dramatic improvement in searchability, though it is not yet perfect (and perfection is not really achievable). Some limitations are listed below:

  • Applies to typology databases (currently), not MANTIS
  • Depends upon detailed textual descriptions, not images
  • Wikidata labels and hierarchical structure may be inconsistent or incomplete
  • An LLM may be more intuitive, but the cost overhead is prohibitive, particularly in real time

Although the results of this work do not apply to all of MANTIS yet, the connection of tens of thousands of coins in the ANS collection to coin type URIs makes it possible to extract the iconographic concepts from the NUDS data in OCRE and other projects for indexing in MANTIS under the same process. After updating the indexing code in Numishare and reindexing all of the ANS's coins from CRRO, the number of results for "architecture" in MANTIS grew from 123 to 350, since more than 200 Roman Republican coins in the ANS collection link to the 29 types featuring some kind of architecture. The original 123 results in MANTIS included the word "architecture" explicitly in the type description or elsewhere in the record.

Search of HRC for "animals", including eagles, elephants, horses, etc.
 

Once the ANS coins connected to type URIs have been indexed, we will apply NLP to descriptions department by department, although the results may take some time to vet because of limited staffing in other departments.

Nevertheless, this is an enormous enhancement in the searchability of our numismatic collections, and the code is reusable by other projects, whether they are numismatic in nature or not. The instructions for deploying the services and processing data have not yet been fully documented in the GitHub repository.

On January 23, I presented an ANS Longtable about this work. The slides have been uploaded to Zenodo, and the recording should be available on YouTube in the near future.