Week 6: Data Cleaning

Cycle of Data Processing

  1. Data Collection
  1. Data Cleaning
  1. Exploratory Data Analysis

Statistical Analysis

Data Visualisation

Charts built from the data

Patterns, Trends

  1. Data transformation
  1. Modelling
  1. Data Presentation
  1. Formulate Key Questions

When you are given a data set, the first step is Data Cleaning.

Exploratory Data Analysis cannot be done until the data has been cleaned.

Clean data means:

  1. there are no missing values (blanks in the data set), and
  1. the data are in the correct data types (e.g. Admin Number should not be interpreted as a number)
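
The two conditions above can also be checked outside KNIME. The following is a minimal plain-Python sketch, not part of the KNIME workflow; the sample rows, column names, and helper functions are all hypothetical, and a blank cell is assumed to be represented as `None`.

```python
# Hypothetical sample rows; None marks a blank cell in the data set.
rows = [
    {"admin_number": "1901234", "age": 19, "score": 71.5},
    {"admin_number": "1905678", "age": None, "score": 64.0},
    {"admin_number": "1909012", "age": 21, "score": None},
]

def has_missing(rows):
    """Return True if any cell in any row is blank (None)."""
    return any(value is None for row in rows for value in row.values())

def column_types(rows, column):
    """Return the set of type names seen in one column, ignoring blanks."""
    return {type(row[column]).__name__ for row in rows if row[column] is not None}

print(has_missing(rows))                   # True: the data is not clean yet
print(column_types(rows, "admin_number"))  # {'str'}: an identifier, kept as text
print(column_types(rows, "age"))           # {'int'}
```

Here the Admin Number is stored as text on purpose: it is a label, so taking its mean or median would be meaningless.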

Steps

Step 1: Check for any error in the data type

You may need to use the String to Number node if a column that is supposed to be a number was read as a string. Ensure every column has the correct data type.
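
As a rough analogue of what a string-to-number conversion does, here is a minimal plain-Python sketch. It is not how the KNIME node is implemented; the function name `to_number` and the sample values are hypothetical, and treating unparseable text as a missing value is an assumption made for illustration.

```python
def to_number(text):
    """Convert a string cell to a float; return None if blank or not numeric."""
    if text is None:
        return None
    try:
        return float(text.strip())
    except ValueError:
        # Assumption: text that cannot be parsed is treated as a missing value.
        return None

raw_scores = ["71.5", " 64 ", "", "n/a"]
scores = [to_number(s) for s in raw_scores]
print(scores)  # [71.5, 64.0, None, None]
```

Note that a failed conversion creates new missing values, which is one reason to fix data types before counting missing values.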

Use the Statistics node to check where the missing values are.

Step 2: Check the number under "# Missing Values"

Note down which variable has the highest number of missing values and decide whether to use the Column Filter node to remove it.
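
The per-column count of missing values can be sketched in plain Python as follows. This is only an illustration of the idea, not the Statistics node itself; the rows, column names, and `count_missing` helper are hypothetical.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 19, "score": None, "remarks": None},
    {"age": None, "score": 64.0, "remarks": None},
    {"age": 21, "score": 58.0, "remarks": None},
]

def count_missing(rows):
    """Return a dict mapping each column name to its number of missing values."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

missing = count_missing(rows)
worst = max(missing, key=missing.get)  # column with the most missing values

print(missing)  # {'age': 1, 'score': 1, 'remarks': 3}
print(worst)    # remarks
```

A column such as `remarks`, which is missing in every row, is a natural candidate for removal.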

If any columns are to be removed (filtered out), this must be done before using the Missing Value node. Removing a column means losing one variable for analysis.
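
One way to see why the order matters is the plain-Python sketch below. It is an illustration under assumed data, not the KNIME nodes themselves; the rows and the helpers `drop_columns` and `drop_incomplete_rows` are hypothetical.

```python
# Hypothetical sample rows; the "remarks" column is missing in every row.
rows = [
    {"age": 19, "score": None, "remarks": None},
    {"age": None, "score": 64.0, "remarks": None},
    {"age": 21, "score": 58.0, "remarks": None},
]

def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

def drop_incomplete_rows(rows):
    """Keep only the rows that have no missing value in any column."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Removing rows first: every row is blank in "remarks", so nothing survives.
rows_first = drop_incomplete_rows(rows)

# Removing the mostly-empty column first keeps the complete observation.
columns_first = drop_incomplete_rows(drop_columns(rows, {"remarks"}))

print(len(rows_first))  # 0
print(columns_first)    # [{'age': 21, 'score': 58.0}]
```

Filtering the column first costs one variable but saves an observation that would otherwise be discarded.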

Step 3: Use Missing Value node

Decide whether to remove the rows containing missing values OR replace the missing values with other values.

Take note of the connectors: the Missing Value node should be connected to the Reader node.

Missing values are either removed or replaced.

Removing rows reduces the size of the data set.

Remove a column if it has a lot of missing data.

Replace the missing values with the median or the mode.
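
Replacing with the median or mode can be sketched in plain Python using the standard `statistics` module. This is a minimal illustration, not the Missing Value node; the `impute` helper and the sample columns are hypothetical, and using the median for numbers and the mode for categories is an assumption about typical usage.

```python
from statistics import median, mode

def impute(values, fill):
    """Return a copy of values with each None replaced by fill."""
    return [fill if v is None else v for v in values]

# Numeric column: fill with the median of the observed values.
ages = [19, None, 21, 35, None]
observed_ages = [v for v in ages if v is not None]
ages_filled = impute(ages, median(observed_ages))

# Categorical column: fill with the most frequent observed value (mode).
courses = ["IT", None, "Business", "IT"]
observed_courses = [v for v in courses if v is not None]
courses_filled = impute(courses, mode(observed_courses))

print(ages_filled)     # [19, 21, 21, 35, 21]
print(courses_filled)  # ['IT', 'IT', 'Business', 'IT']
```

Unlike removal, this keeps every row, so the data set stays the same size.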

Discard means remove.

Impute means replace.

Removing a column means one less variable.

Removing a row means one less observation.

A past observation may also be used to replace a missing value.
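
Using a past observation can be read as carrying the previous observed value forward. That reading is an assumption, since the notes do not spell out the method. A minimal plain-Python sketch, with a hypothetical `fill_with_previous` helper and sample readings:

```python
def fill_with_previous(values):
    """Replace each None with the most recent earlier non-missing value.

    Leading None values stay None because there is no past observation yet.
    """
    filled = []
    last_seen = None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled

readings = [None, 10.0, None, None, 12.5, None]
print(fill_with_previous(readings))
# [None, 10.0, 10.0, 10.0, 12.5, 12.5]
```

This only makes sense when the rows are in a meaningful order, such as measurements over time.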