Week 11 - Data Cleaning/Mining
Boxplots
Basics
with outliers
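A boxplot usually draws points as outliers when they fall beyond the whisker limits. The sketch below assumes the common 1.5 x IQR convention (the notes do not state which rule is used); the function name and the sample values are hypothetical.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Anything outside the fences would be drawn as an individual point.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [12, 14, 14, 15, 16, 17, 18, 19, 21, 58]  # hypothetical sample
lower, upper, outliers = boxplot_outliers(data)
print(lower, upper, outliers)
```

Other quartile conventions shift the fences slightly, so borderline points may be classified differently by different plotting tools.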
Data Types
Numeric
(numbers)
Discrete distribution
integer numbers
eg number of employees,
number of words in a text
Continuous distribution
continuous numeric data
real numbers (floating point)
eg income, revenue,
temp, height
Categorical
(qualitative data)
eg true/false, hair colour, agreement.
Binary
like nominal, but only 2 values
eg, answers to true/false questions,
result of med test (pos/neg)
gender (mostly)
Ordinal
values have meaningful order
can be
ranked
but magnitude between
successive values is unknown
eg size = small, medium, large
grades, army ranks
Nominal
states, names of things
eg hair colour (chosen from a fixed set of options),
marital status, occupation, post codes
Data cleaning
general tips
only keep data
useful to answer question
use data visualisation to identify outliers:
boxplots, scatterplots
avoid redundant data
eg redundant columns
Noisy data
(contains errors
or outliers)
outlier detection
clustering groups data
into similar sets
used to detect data
that does not belong to
any cluster, then remove it
statistical analysis
eg z-scores
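As a minimal sketch of the statistical approach, a z-score measures how many standard deviations a value lies from the mean. The threshold of 3 is a common rule of thumb, not something the notes specify, and the readings are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / stdev) > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 50]  # hypothetical
flagged = zscore_outliers(readings)
print(flagged)
```

On small samples a single extreme value inflates the standard deviation, so a lower threshold may be needed to flag anything at all.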
regression
data smoothed by
fitting it to a
regression function
eg, smooth values measured
with an uncalibrated device,
using related data from other,
well-calibrated sensors.
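The regression idea can be sketched with an ordinary least-squares line: fit the uncalibrated readings against a well-calibrated reference, then replace each reading with its fitted value. Both data series and the function name are hypothetical, and a straight line is only one possible choice of regression function.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

reference = [20.0, 21.0, 22.0, 23.0, 24.0]  # hypothetical well-calibrated sensor
noisy = [20.4, 20.7, 22.5, 22.8, 24.1]      # hypothetical uncalibrated device

slope, intercept = fit_line(reference, noisy)
# Smoothed values lie exactly on the fitted line.
smoothed = [intercept + slope * x for x in reference]
print(slope, intercept, smoothed)
```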
binning
partition into
equal-frequency bins
represent with mean,
median or boundary (min/max)
sort data first
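A minimal sketch of the binning steps: sort, split into equal-frequency bins, then replace each value with its bin mean or its nearest bin boundary. The function names and the sample prices are illustrative, and the code assumes there are at least as many values as bins.

```python
import statistics

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (near) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_mean(bins):
    """Replace every value with the mean of its bin."""
    return [[statistics.mean(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    """Replace every value with the closer of its bin's min or max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [21, 4, 28, 15, 8, 34, 21, 25, 24]  # illustrative values
bins = equal_frequency_bins(prices, 3)
by_mean = smooth_by_mean(bins)
by_boundary = smooth_by_boundary(bins)
print(bins, by_mean, by_boundary)
```

Smoothing by median works the same way with `statistics.median` in place of `statistics.mean`.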
Missing data
simple approach:
reject whole datapoint
(but this decreases the quantity of data)
replace using
aggregate functions
eg average value
replace using
probabilistic estimates
from a larger distribution
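The three options above can be sketched on a single column. The column, the use of `None` as the missing marker, and the choice of a normal distribution fitted to the observed values are all assumptions made for illustration.

```python
import random
import statistics

ages = [34, None, 29, 41, None, 38]  # hypothetical column; None marks missing

observed = [a for a in ages if a is not None]

# 1. Simple approach: reject datapoints with a missing value.
dropped = list(observed)

# 2. Replace with an aggregate, here the mean of the observed values.
mean_age = statistics.mean(observed)
filled_mean = [mean_age if a is None else a for a in ages]

# 3. Replace with a probabilistic estimate: draw from a normal distribution
#    fitted to the observed values (an assumed model, not the only option).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
mu, sigma = statistics.mean(observed), statistics.stdev(observed)
filled_sampled = [rng.gauss(mu, sigma) if a is None else a for a in ages]

print(dropped, filled_mean, filled_sampled)
```

Mean replacement keeps the column mean unchanged but shrinks its variance, whereas sampling preserves more of the spread at the cost of adding randomness.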
Data transformations:
normalisation
z-score normalisation
decimal scaling
min-max transformation
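The three normalisation methods listed above can be sketched as follows, using their usual textbook definitions. The function names and sample incomes are hypothetical, and the code assumes the values are not all identical (otherwise the denominators are zero).

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """v' = (v - min) / (max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """v' = (v - mean) / standard deviation"""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

def decimal_scaling(values):
    """v' = v / 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [200, 400, 600, 800, 1000]  # hypothetical values
scaled = min_max(incomes)
standardised = z_score(incomes)
decimal = decimal_scaling(incomes)
print(scaled, standardised, decimal)
```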
Data mining
predictive
tasks:
classification, regression,
deviation detection
Classification:
building predictive models
binary:
assign 1 of 2
classes/labels
multi-class:
more than 2 target
classes/labels
descriptive
tasks:
clustering, association rule discovery,
sequential pattern discovery