Cleaning Messy Data

In early 2017, I was hired by the Essex-Kent Mennonite Historical Association to recommend and implement a solution to upgrade their collection management system from a home-brewed Microsoft Access database to something easier to work with and maintain. I researched several possibilities appropriate for small museums and they settled on PastPerfect, a museum-specific software tool.

A screenshot of OpenRefine open with many records displayedI exported the Access database into a spreadsheet, which I then imported into OpenRefine. OpenRefine is “a free, open source, power tool for working with messy data.”  Data wrangling with OpenRefine relies on building up facets and filters. It is perfect for this kind of work.

I started by sorting almost 4000 records into PastPerfect’s four distinct catalogs (Object, Archive, Library, Photos). After that came assigning object names to all items using Nomenclature 4 (PastPerfect’s lexicon). By combining facets filtered on catalog, object name, the headings imported from MS Access, and others as needed, I was able to create a clean spreadsheet for the Conversion Specialist to import into PastPerfect.

It was no surprise to find that over the years, different museum volunteers had collected different data about the collection and that data had been entered in different ways. I had to correct and standardize any errors and omissions at the record level, and define clear procedures for future acquisitions. I continue to provide training for new volunteers to work with PastPerfect.