The Data Cleanliness Project

New common-sense-based software from MIT is set to make data cleaning a walk in the park – just make sure to wear a mask, though.

Ask almost anyone working with data what they consider the most tedious aspect of their work, and they will probably concur: data cleaning. Whilst data may be the driving force behind industries worldwide, in its raw form it is almost always ‘dirty’, i.e. laden with missing values, typographical errors, duplicates and other inconsistencies. In fact, according to research from Figure Eight and Anaconda, data cleaning can take up to a quarter of a data worker’s time – making it one of the most dreaded aspects of the global data profession.

Whilst automating the process can prove beneficial, it is a challenging task: different datasets require different, personalised cleaning based mostly on common-sense judgement calls about real-world objects. In this regard, a recent development from the Massachusetts Institute of Technology (MIT) – a system called PClean – may just prove to be the missing piece.

The Common-Sense Cleaner

Having already automated processes such as modelling time-series databases and 3D perception via inverse graphics, MIT researchers have consistently aimed to automate and simplify AI application development by writing a series of domain-specific probabilistic programming languages as part of their Probabilistic Computing Project. The new PClean system, the latest rollout of that project, is set to simplify the arduous data cleaning process by providing generic common-sense models that users can customise to specific datasets and error types.

The PClean system utilises a systematic, knowledge-based approach to carry out the data cleaning process after users have encoded background knowledge about the database and the kinds of issues that may appear. Consider this excerpt from the MIT blog:

“Take, for instance, the problem of cleaning state names in a database of apartment listings. What if someone said they lived in Beverly Hills but left the state column empty? Though there is a well-known Beverly Hills in California, there’s also one in Florida, Missouri, and Texas … and there’s a neighborhood of Baltimore known as Beverly Hills. How can you know in which the person lives? This is where PClean’s expressive scripting language comes in. Users can give PClean background knowledge about the domain and about how data might be corrupted. PClean combines this knowledge via common-sense probabilistic reasoning to come up with the answer. For example, given additional knowledge about typical rents, PClean infers the correct Beverly Hills is in California because of the high cost of rent where the respondent lives.”
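The excerpt above describes combining a prior over candidate locations with background knowledge about typical rents. As an illustration only – this is not PClean’s actual scripting language, and the rent figures and standard deviation below are invented for the sketch – the same style of common-sense inference can be expressed as a tiny Bayesian calculation in Python:

```python
import math

# Hypothetical candidate "Beverly Hills" locations with uniform priors
# and made-up typical monthly rents (illustrative values, not real data).
CANDIDATES = {
    "CA": {"prior": 0.25, "typical_rent": 4500.0},
    "FL": {"prior": 0.25, "typical_rent": 1400.0},
    "MO": {"prior": 0.25, "typical_rent": 900.0},
    "TX": {"prior": 0.25, "typical_rent": 1300.0},
}

def infer_state(observed_rent, rent_sd=800.0):
    """Posterior over states: prior times a Gaussian likelihood of the
    observed rent around each state's typical rent."""
    scores = {}
    for state, info in CANDIDATES.items():
        z = (observed_rent - info["typical_rent"]) / rent_sd
        scores[state] = info["prior"] * math.exp(-0.5 * z * z)
    total = sum(scores.values())
    return {state: score / total for state, score in scores.items()}

# A high observed rent makes the Californian Beverly Hills far more likely.
posterior = infer_state(observed_rent=4200.0)
best_guess = max(posterior, key=posterior.get)
```

The point of the sketch is the division of labour the article describes: the user supplies domain knowledge (candidate entities and what rents look like in each), and generic probabilistic machinery turns an ambiguous record into a ranked, explainable answer.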

According to researcher Alex Lew, the lead author on the paper, PClean can help close the gap between humans and computers by enlisting help from computers in much the way humans enlist help from each other. Usually, a computer executes logical, step-by-step commands without any real context for the task; through PClean, it is given context and a semblance of the common-sense reasoning humans apply.

P for PClean, P for Privacy

Although the PClean system is a far easier and cheaper way of joining messy, inconsistent records into clean databases than the massive software investments firms currently rely on, it also carries several privacy risks, such as the potential to de-anonymise records using incomplete information from a few public sources. Unfortunately, according to the researchers, these privacy concerns persist no matter how fairly a database is cleaned.

“We ultimately need much stronger data, AI, and privacy regulation, to mitigate these kinds of harms,” writes Vikash K. Mansinghka, the principal research scientist in MIT’s Department of Brain and Cognitive Sciences. Lew adds, “As compared to machine-learning approaches to data cleaning, PClean might allow for finer-grained regulatory control. For example, PClean can tell us not only that it merged two records as referring to the same person, but also why it did so — and I can come to my own judgment about whether I agree. I can even tell PClean only to consider certain reasons for merging two entries.”

In spite of these privacy drawbacks, PClean is still considered a socially beneficial application, having already found uses in journalism and humanitarian work, such as anticorruption monitoring and the consolidation of donor records submitted to state election boards. The hope remains that PClean can free up data scientists’ time for more important aspects of analysis whilst also mitigating most privacy issues.
