How Graphs Help You With Data Cleaning
Garbage data is garbage analysis out — Tableau
Anyone whose work involves research and data knows that data cleaning is complicated, yet it’s one of the most important and time-consuming steps in data analysis. As the saying goes, good analysis can only come from good data.
Let’s discuss this further, but first, let’s define data cleaning to make sure we are on the same page.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
When we analyze a dataset, we often combine multiple data sources. So it’s common for the data to be duplicated, mislabeled, missing, or corrupted, you name it.
Enough with the definition. Now let me tell you a story about how a graph saved me during data cleaning.
When I first enrolled as a graduate student, I had a hard time figuring out what was wrong with my dataset. I had removed the duplicated data and there were no more missing values, but the analysis results were still rather strange.
I remember that at the time I was building a simple economic model to analyze something related to GDP. I had no economics background, so the various forms of GDP (real GDP, nominal GDP, production-based GDP, and consumption-based GDP) were not part of my “common knowledge”. As you can guess, I ran into trouble using real GDP data.
I had quarterly real GDP data from 2000 to 2020, but the datasets had different base years. The first dataset covered 2000–2014 with 2000 as its base year, and the second covered 2015–2016 with 2010 as its base year.
I had a few datasets, and I made the mistake of combining two of them without first converting them to a common base year. Real GDP values expressed in different base-year prices are not directly comparable, so the combined series was inconsistent. To make it worse, I didn’t even realize it (I hope it’s a common beginner’s mistake). LOL.
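If you run into the same problem, here is a minimal sketch of one common way to put two series on the same base before combining them: rescale the older series by a linking factor taken from a quarter that appears in both. This is not what I did back then. The function names and the numbers are made up for illustration, and the sketch assumes you actually have one overlapping quarter published in both base years, which may not be true for your data.

```python
# Hypothetical sketch: splice two real GDP series that use different base years.
# Assumes one quarter (overlap_period) is available in BOTH series.

def rebase(old_series, new_series, overlap_period):
    """Rescale old_series into the price base of new_series."""
    # Linking factor: ratio of the same quarter measured in the two bases.
    factor = new_series[overlap_period] / old_series[overlap_period]
    return {period: value * factor for period, value in old_series.items()}


def splice(old_series, new_series, overlap_period):
    """Combine both series on the new base; new_series wins where both exist."""
    combined = rebase(old_series, new_series, overlap_period)
    combined.update(new_series)
    # "YYYYQn" labels sort chronologically as plain strings.
    return dict(sorted(combined.items()))


# Made-up numbers, for illustration only (not real GDP figures).
gdp_base_2000 = {"2014Q3": 100.0, "2014Q4": 102.0}
gdp_base_2010 = {"2014Q4": 153.0, "2015Q1": 156.0}

spliced = splice(gdp_base_2000, gdp_base_2010, "2014Q4")
```

Rescaling by a single factor keeps the growth rates of the older series unchanged, which is usually what you want, but it is only an approximation of a proper rebasing done by a statistics office.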
Looking at those numbers manually just gave me a headache. So one of my lecturers told me to make a graph of my small dataset, and to do it every time I clean data. Do you want to see the result?
In this case, without calculating the mean or the variance, we can see from the graph that the data looks strange, especially from 2013 to 2014.
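If you want to try this yourself, a graph like this can be sketched in a few lines of Python. The example below assumes matplotlib is installed and uses made-up numbers. The `find_level_breaks` helper and its 20% threshold are my own additions for marking suspicious jumps on the chart, so treat them as an arbitrary starting point and not a rule.

```python
def find_level_breaks(values, threshold=0.2):
    """Return positions where the value changes by more than `threshold`
    (as a fraction) relative to the previous observation."""
    breaks = []
    for i in range(1, len(values)):
        previous = values[i - 1]
        if previous == 0:
            continue  # relative change is undefined
        if abs(values[i] - previous) / abs(previous) > threshold:
            breaks.append(i)
    return breaks


def plot_series(periods, values, path="gdp_check.png"):
    """Save a line chart of the series, with suspicious jumps marked."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 4))
    positions = list(range(len(values)))
    ax.plot(positions, values, marker="o")
    for i in find_level_breaks(values):
        ax.axvline(i, linestyle="--", color="red")  # mark each jump
    ax.set_xticks(positions)
    ax.set_xticklabels(periods, rotation=45)
    ax.set_xlabel("Quarter")
    ax.set_ylabel("Real GDP")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)


# Made-up numbers, for illustration only (not real GDP figures).
periods = ["2014Q2", "2014Q3", "2014Q4", "2015Q1", "2015Q2"]
values = [100.0, 101.0, 102.0, 153.0, 155.0]
```

Calling `plot_series(periods, values)` writes the chart to `gdp_check.png`. With these sample numbers the jump into 2015Q1 gets a dashed red line, which is the kind of break a change of base year would produce.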
As you know, a graph is a pictorial representation of data and values in an organized manner. Because a graph shows the overall structure of the data at a glance, you will notice right away if there’s something strange in your dataset.
Furthermore, this method can also reveal outliers, but I suppose it is not the best way to find missing values. It is pretty convenient for those who have no background in statistics but still have to do data cleaning.
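Since a graph is not great at showing missing values, a direct check is usually easier. Here is a small sketch, assuming a quarterly series stored as a dictionary with labels like "2014Q3" and `None` for empty cells. The format and function names are hypothetical, so adapt them to however your data is stored.

```python
def expected_quarters(first, last):
    """List every quarter label from `first` to `last`, inclusive."""
    year, quarter = int(first[:4]), int(first[5])
    end = (int(last[:4]), int(last[5]))
    labels = []
    while (year, quarter) <= end:
        labels.append(f"{year}Q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return labels


def find_missing(series):
    """Return (absent, empty): quarters with no row at all,
    and quarters whose row exists but holds no value."""
    periods = sorted(series)
    absent = [p for p in expected_quarters(periods[0], periods[-1])
              if p not in series]
    empty = [p for p in periods if series[p] is None]
    return absent, empty


# Made-up numbers, for illustration only.
sample = {"2014Q1": 1.0, "2014Q2": None, "2014Q4": 3.0, "2015Q1": 4.0}
absent, empty = find_missing(sample)
```

Here `absent` catches a quarter that was dropped entirely, while `empty` catches a quarter that is present but blank. A line chart can easily hide both.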
Do you have another method that works for you and is efficient enough to use? Please share your experiences and thoughts here.