Data Cleaning, a Vital "Editor" in Data Analysis

Imagine reading a novel whose chapters are out of order, partly missing, written in different languages, or full of grammatical errors. These inconsistencies and errors make it hard to read the novel and extract information from it efficiently. That is why editors, who proofread and compile books, are so important in the publishing industry. Likewise, data cleaning, which plays an essential role in preparing consistent, accurate, and useful data, works as an editor in the world of data analysis.

Generally speaking, one important job of this editor is “proofreading”. Original data is usually “dirty”: it contains inaccurate, irrelevant, redundant, or missing information. For example, when people fill out a questionnaire, they may skip questions they do not want to answer, or miss some by accident. In this case, the data collected from the questionnaire is incomplete. In another scenario, a person may answer the same questionnaire twice in different locations, so duplicate information is collected. Both scenarios are common during data collection and integration, and can lead to wrong conclusions in the analysis. The editor, data cleaning, identifies and corrects the “dirty” parts of the data, and it also helps exclude inaccurate and irrelevant information (data verification).
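
To make the two questionnaire scenarios concrete, here is a minimal Python sketch. The respondent IDs, field names, and values are all made up for illustration, and it assumes that a repeated `respondent` ID means the same person answered twice. It only sorts rows into groups; what to do with incomplete or duplicate rows afterwards is a separate decision.

```python
# Hypothetical questionnaire responses; None marks a skipped question.
responses = [
    {"respondent": "A01", "age": 34, "city": "Boston"},
    {"respondent": "A02", "age": None, "city": "Austin"},
    {"respondent": "A01", "age": 34, "city": "Chicago"},  # same person, second location
    {"respondent": "A03", "age": 27, "city": None},
]


def split_responses(rows):
    """Sort rows into complete, incomplete, and duplicate groups.

    Assumption: the first response per respondent ID is the one to keep.
    """
    seen_ids = set()
    complete, incomplete, duplicates = [], [], []
    for row in rows:
        if row["respondent"] in seen_ids:
            duplicates.append(row)
            continue
        seen_ids.add(row["respondent"])
        if any(value is None for value in row.values()):
            incomplete.append(row)
        else:
            complete.append(row)
    return complete, incomplete, duplicates


complete, incomplete, duplicates = split_responses(responses)
print(len(complete), "complete,", len(incomplete), "incomplete,", len(duplicates), "duplicate")
```

Keeping the incomplete and duplicate rows in their own lists, rather than silently dropping them, leaves room to review them before deciding whether to fill in, correct, or discard them.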

The other important job is standardization: data cleaning also improves the consistency and organization of data. In this era of globalization and the big data revolution, data comes from different countries and sources, so it may be recorded in different languages, units, and formats. Data cleaning includes translation, which makes data more readable for people around the world. Furthermore, since people have different habits when recording data, formats and units must be made consistent when data from different sources is integrated. Otherwise, the inconsistency will cause trouble in the analysis phase. For instance, a set of dates can be recorded as “9/01/2017”, “2017-09-16”, and “September 5th, 2017”. Although all of them represent dates, computers read them as different types of data (number and text). Until they are converted to a single format, the data cannot be sorted by date correctly.
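
The date example can be sketched in Python using only the standard library. This is an illustration, not a general-purpose date parser: the list of accepted formats is an assumption, and “9/01/2017” is read as month/day/year, which is a guess because the text does not say which convention applies.

```python
import re
from datetime import datetime

# Assumed input formats; "%m/%d/%Y" treats "9/01/2017" as September 1.
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d, %Y")


def parse_date(text):
    """Convert a date string in one of the assumed formats to a date object."""
    # Drop ordinal suffixes so "5th" becomes "5".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


raw_dates = ["9/01/2017", "2017-09-16", "September 5th, 2017"]

# Once every value is a real date, sorting works as expected.
sorted_dates = sorted(parse_date(d) for d in raw_dates)
iso_dates = [d.isoformat() for d in sorted_dates]
print(iso_dates)
```

Raising an error for unrecognized strings, instead of guessing, keeps unexpected formats visible so they can be handled deliberately.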

The work of this editor is vital. If we think of data analysis as constructing a building, data cleaning is the foundation. Good-quality data is the prerequisite for useful analysis outcomes. If the data is incorrect from the beginning, it is hard to imagine that a model built on it can be trusted. Data cleaning helps ensure that time and money are not wasted because of poor data preparation. In chronological order, the responsibilities of data cleaning are summarized in the figure below.

As the figure shows, the magic that data cleaning performs is complex. This expensive and time-consuming work creates many jobs for data analysts. If you are interested, more information about data cleaning approaches and challenges can be found here.

Written on March 3, 2014