Pluralsight: Exploratory Data Analysis
Transforming & Cleaning Data
Renaming Variables
Data Type Conversion
Encoding Values
merging data sets
converting units
handling missing data
handling anomalous data
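The cleaning steps listed above can be sketched on a tiny data set. This is an illustration in plain Python rather than the course's R code; the records, the column names, the `RENAMES` and `RATING_CODES` mappings, and the `clean` helper are all hypothetical.

```python
# Hypothetical raw records; names and values are invented for illustration.
raw = [
    {"Ttl": "Movie A", "RT": "90", "Rated": "G", "gross_usd": "1500000"},
    {"Ttl": "Movie B", "RT": "", "Rated": "PG", "gross_usd": "250000"},
]

RENAMES = {"Ttl": "title", "RT": "runtime_min", "Rated": "rating"}
RATING_CODES = {"G": 0, "PG": 1}  # assumed integer encoding of the categories


def clean(record):
    # Renaming variables: swap cryptic names for readable ones.
    row = {RENAMES.get(key, key): value for key, value in record.items()}
    # Data type conversion; a missing runtime becomes an explicit None.
    row["runtime_min"] = int(row["runtime_min"]) if row["runtime_min"] else None
    # Encoding values: map category labels to integer codes.
    row["rating"] = RATING_CODES[row["rating"]]
    # Converting units: dollars to millions of dollars.
    row["gross_musd"] = float(row.pop("gross_usd")) / 1_000_000
    return row


cleaned = [clean(record) for record in raw]
```

Marking the missing runtime as `None` only flags it; whether to drop, impute, or keep such rows depends on the analysis.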
Loading Data into R
File Sources
File Based Data
CSV
Tab delimited
Excel files
Web Based Data
XML
JSON
HTML
Databases
SQL Server
Oracle
MySQL
Statistical Data Files
SAS
SPSS
Stata
Aspects of Clean Data
each data table contains only a single type of observation, e.g. sales from a pizza shop or movies released in theaters
1 column for each variable, i.e. each value that varies across observations
column names are readable
observations in rows
no missing values in any of the observations
each row is uniquely identified
data is correctly recorded, with no errors or mistakes
everything properly encoded
internally consistent, e.g. all values in a column are in the same units
Descriptive Statistics
Trying to find:
Location
Spread
interdependence of the data
Referred to as summary statistics
they summarize the shape and feel of the data
Types of Analysis
Univariate
Qualitative
the analysis of a single categorical variable
items that might be of interest
percentage
mode
Frequency
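A minimal sketch of the three qualitative summaries above (frequency, percentage, mode), written in plain Python rather than R. The `orders` list is made-up data in the spirit of the pizza shop example.

```python
from collections import Counter

# Hypothetical categorical observations: one entry per item sold.
orders = ["pizza", "salad", "pizza", "pizza", "salad", "drink"]

# Frequency: how many times each category occurs.
frequency = Counter(orders)

# Percentage: each category's share of all observations.
total = sum(frequency.values())
percentage = {item: 100 * count / total for item, count in frequency.items()}

# Mode: the most frequent category.
mode = frequency.most_common(1)[0][0]
```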
Quantitative
the analysis of a single numeric variable
items that might be of interest
location (measures of central tendency):
mean
median
mode
dispersion (measures of spread):
minimum
range
quartiles
maximum
variance
standard deviation
the shape of the data
the skewness
measure of the asymmetry of the distribution of values
kurtosis
a measure of how heavy the tails of the distribution are, often described as how sharply peaked or flat it is
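The location, dispersion, and shape measures above can be computed as follows. This is a sketch using Python's standard `statistics` module instead of R; the `values` list is invented, and `central_moment` is a hypothetical helper. It uses population formulas (dividing by n). The sample versions, which divide by n - 1, are `statistics.variance` and `statistics.stdev`.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical numeric observations

# Location (central tendency)
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Dispersion (spread)
minimum, maximum = min(values), max(values)
value_range = maximum - minimum
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
variance = statistics.pvariance(values)         # population variance
std_dev = statistics.pstdev(values)             # population standard deviation


def central_moment(data, k):
    """Average of the k-th power of deviations from the mean."""
    center = sum(data) / len(data)
    return sum((x - center) ** k for x in data) / len(data)


# Shape: moment-based skewness and (non-excess) kurtosis.
skewness = central_moment(values, 3) / central_moment(values, 2) ** 1.5
kurtosis = central_moment(values, 4) / central_moment(values, 2) ** 2
```

A positive skewness here reflects the longer right tail of the sample. Other tools may use different quartile conventions or report excess kurtosis (kurtosis minus 3), so figures can differ slightly between packages.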
Bivariate
Quantitative
the analysis of 2 numeric variables
Interested in
the relationship between 2 numeric variables
Covariance
the degree to which the 2 variables vary with one another
correlation coefficient
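A short sketch of sample covariance and the Pearson correlation coefficient, in plain Python rather than R. The `runtime` and `rating` lists are hypothetical, and the two functions are illustrative helpers, not a library API.

```python
from math import sqrt


def covariance(xs, ys):
    """Sample covariance: average product of paired deviations (n - 1 divisor)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range -1 to 1."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical paired observations for the same four movies.
runtime = [90, 100, 110, 120]
rating = [5.0, 6.0, 6.5, 8.0]

cov = covariance(runtime, rating)
r = correlation(runtime, rating)
```

Covariance depends on the units of both variables, while the correlation coefficient is unit-free, which makes it easier to compare across pairs of variables.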
Qualitative
the analysis of 2 categorical variables
interested in
joint frequency of the observations, as seen on a contingency table
joint percentages
marginal frequency
the totals of the rows and columns of the contingency table
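Joint and marginal frequencies for two categorical variables can be tallied like this. A minimal sketch in plain Python, not R; the `movies` pairs (rating, genre) are made up.

```python
from collections import Counter

# Hypothetical observations: (rating, genre) for each movie.
movies = [
    ("G", "Action"),
    ("G", "Comedy"),
    ("PG", "Action"),
    ("PG", "Action"),
    ("R", "Comedy"),
]

# Joint frequency: count of each (rating, genre) combination,
# i.e. the cells of the contingency table.
joint = Counter(movies)

# Joint percentages: each cell as a share of all observations.
total = len(movies)
joint_pct = {pair: 100 * count / total for pair, count in joint.items()}

# Marginal frequencies: the row and column totals of the table.
rating_totals = Counter(rating for rating, _ in movies)
genre_totals = Counter(genre for _, genre in movies)
```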
Qualitative & Quantitative
Guidance
Domain Knowledge
understand the context around the data
Clean Data
Understand Biases
Cognitive
statistical
contextual
Make the analysis
reproducible
Understand Implications
Know your limitations
Data Visualization
Qualitative Univariate
Bar Chart
displays the frequency of a categorical variable, e.g. how many pizzas vs. how many salads were sold at a pizza shop
Bars can be either vertical or horizontal; both communicate the same information
Pie Chart
visualizes the proportion of each part in the whole
more effective if the number of categories is small
not recommended when there are many categories or when similar-sized slices must be compared precisely
Quantitative Univariate
Usually most interested in: location, shape and spread of data
Dot Plot
displays all the possible values/observations of the variable
Box Plot
provides info about outliers
displays the min, Q1, Q2 (median), Q3, and max, where min and max exclude any outliers; outliers are values more than 1.5 times the IQR above Q3 or below Q1, and are drawn as separate points
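The 1.5 x IQR rule behind a box plot can be sketched as follows, in plain Python rather than R. `box_plot_summary` is a hypothetical helper and `values` is invented data. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles`.

```python
import statistics


def box_plot_summary(values):
    """Return quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]  # hypothetical data with one extreme value
summary = box_plot_summary(values)
```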
Histogram
x axis shows the range of values of the variable, divided into bins, and y axis shows the frequency of observations in each bin
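The counting behind a histogram can be sketched in a few lines of plain Python (not R). The `histogram` helper, the bin width, and the `runtimes` list are all assumptions made for illustration.

```python
def histogram(values, bin_width):
    """Count observations per bin; each key is the lower edge of a bin."""
    counts = {}
    for value in values:
        bin_start = (value // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))


runtimes = [88, 92, 95, 101, 104, 118, 121]  # hypothetical runtimes in minutes
bins = histogram(runtimes, bin_width=10)
```

The choice of bin width changes how the shape looks, so it is worth trying more than one.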
Density Plot
shows the shape of the values, similar to a histogram. the x axis covers the possible values of the variable while the y axis shows the probability density at each value
shows a smooth representation of the data as a function
Qualitative Bivariate
Spine Plot
width of tiles represents the proportion of observations on the x axis
the height of tiles represents the proportion of observations on the y axis
use the area of the tiles to judge the number of observations at the intersection of the two categorical variables, e.g. how many movies are both rated G and action films
Mosaic Plot
Quantitative Bivariate
Scatterplot
Line Graph
used if x axis is Time
Referred to as a time series
Qualitative and Quantitative Bivariate
Bar Chart
Multiple Box Plots
Beyond
Other Types of Analysis
1) Inferential
using a sample to make generalizations about a larger population
2) Predictive
uses current or historical data to make predictions about future or unknown values
3) Causal
4) Mechanistic
creating a mathematical model that captures the exact changes in all variables as one variable is modified
R's Advanced Programming Features
Procedural Programming
Functional Programming
Object Oriented Programming
Supports Mixed Programming
Outbound Calls
C
C++
Fortran
Inbound Calls
Python
C#
Java
Distributed Programming
R Advanced Statistical Analysis
Statistical Modeling
models the relationship between variables with a mathematical equation
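As one example of such an equation, a straight line y = intercept + slope * x can be fitted by ordinary least squares. This is a sketch in plain Python rather than R; `fit_line` is a hypothetical helper and the data points are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
```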
Cluster Analysis
organizes a set of data into groups with similar properties
Dimensionality Reduction
Analysis of Variance (ANOVA)