Week 6: Data Cleaning
Cycle of Data Processing
- Data Collection
- Data Cleaning
- Exploratory Data Analysis
  - Statistical Analysis
  - Data Visualisation: charts built from the data
  - Patterns, Trends
- Data transformation
- Modelling
- Data Presentation
- Formulate Key Questions
When you are given a data set, the first step is Data Cleaning.
Exploratory Data Analysis cannot be done if the data has not been cleaned.
Clean data
- there are no missing values (blanks in the data set), and
- the data are in the correct data types (e.g. Admin Number should not be interpreted as a number)
Steps
Step 1: Check for any errors in the data types
- You may need to use the String to Number node if a column is supposed to be a number. Ensure each column has the correct data type.
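As a rough analogue of Step 1 outside KNIME, the sketch below converts one column from text to numbers in plain Python. The column names `admin_no` and `score`, the sample values, and the helper `to_number` are all made up for illustration; this is not how the String to Number node is implemented.

```python
# Toy records as they might arrive from a file reader: every field is text.
# (Hypothetical columns and values, for illustration only.)
rows = [
    {"admin_no": "0012345", "score": "78"},
    {"admin_no": "0012346", "score": "91"},
]


def to_number(rows, column):
    """Return copies of the rows with `column` converted from str to float."""
    return [{**row, column: float(row[column])} for row in rows]


cleaned = to_number(rows, "score")

# score is now numeric, so arithmetic on it is meaningful.
mean_score = sum(row["score"] for row in cleaned) / len(cleaned)

# admin_no is deliberately left as text: it is an identifier, and converting
# it to a number would drop the leading zeros (int("0012345") == 12345).
```

Only the column that is genuinely numeric is converted; identifiers such as an admin number stay as text.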
Step 2: Use the Statistics node to check where the missing values are
- Check the number under "# Missing Values"
- Note down which variable has the highest number of missing values and decide whether to use the Column Filter node to remove it
- If any columns are to be removed (filtered out), this must be done before using the Missing Value node. If you remove a column, you lose one variable for analysis
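The idea behind Step 2 can be sketched in plain Python: count the missing values per column, find the column with the most, and drop it before handling the rest. The data, the use of `None` to represent a blank, and the helpers `count_missing` and `drop_column` are assumptions for illustration, not part of KNIME.

```python
# Toy data set; None stands for a blank cell (made-up values).
rows = [
    {"name": "Ali", "age": 19, "phone": None},
    {"name": "Bea", "age": None, "phone": None},
    {"name": "Chen", "age": 21, "phone": "9123"},
    {"name": "Dev", "age": 20, "phone": None},
]


def count_missing(rows):
    """Count blank (None) cells per column, like a '# Missing Values' figure."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


def drop_column(rows, column):
    """Return copies of the rows without `column` (one less variable)."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


missing = count_missing(rows)
worst = max(missing, key=missing.get)  # column with the most blanks
filtered = drop_column(rows, worst)    # done before any row-level handling
```

Here `phone` is blank in three of four rows, so it is the candidate for removal. The cost is that `phone` is no longer available as a variable for analysis.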
Step 3: Use the Missing Value node
- Decide whether to remove rows containing missing values OR replace the missing values with other values
- Take note of the connectors: the Missing Value node should be connected to the Reader node
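The "remove rows" choice in Step 3 can be illustrated with a small Python sketch. The data and the helper `drop_incomplete_rows` are hypothetical, and `None` is assumed to represent a blank cell.

```python
# Toy data set; None stands for a blank cell (made-up values).
rows = [
    {"name": "Ali", "age": 19},
    {"name": "Bea", "age": None},
    {"name": "Chen", "age": 21},
]


def drop_incomplete_rows(rows):
    """Keep only the rows in which every cell has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = drop_incomplete_rows(rows)

# Each removed row is one less observation.
rows_lost = len(rows) - len(complete)
```

Removing rows is simple, but each removed row shrinks the data set, which is why replacing the missing values is the alternative to consider.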
Missing values are either removed or replaced
- Discard/Remove
  - removing data reduces the size of the data set
  - remove a column if it has a lot of missing data; removing a column means one less variable
  - removing a row means one less observation
- Impute/Replace
  - replace the missing values with the median value or the mode value
  - may use a past observation
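The replacement options can be sketched in plain Python using the standard `statistics` module. The sample values and the helpers `fill_with` and `fill_with_previous` are made up for illustration. Reading "past observation" as "the most recent earlier value in the same column" is one possible interpretation, not something these notes confirm.

```python
from statistics import median, mode


def fill_with(values, replacement):
    """Replace every missing (None) value with one fixed replacement."""
    return [replacement if v is None else v for v in values]


def fill_with_previous(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None if no earlier value exists
        else:
            last = v
        filled.append(v)
    return filled


# Toy columns; None stands for a blank cell (made-up values).
ages = [19, None, 21, 20, None]
grades = ["B", "A", None, "B"]

known_ages = [v for v in ages if v is not None]
known_grades = [g for g in grades if g is not None]

ages_median = fill_with(ages, median(known_ages))    # numeric column
grades_mode = fill_with(grades, mode(known_grades))  # categorical column
ages_previous = fill_with_previous(ages)             # carry last value forward
```

The median suits numeric columns, while the mode also works for categories. In every case the number of observations stays the same, unlike when rows are removed.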