Pluralsight: Exploratory Data Analysis
Transforming & Cleaning Data
Renaming Variables
Data Type Conversion
Encoding Values
Merging Data Sets
Converting Units
Handling Missing Data
Handling Anomalous Data
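A few of the steps above (renaming, type conversion, encoding, unit conversion, missing data) can be sketched on a tiny data set. This is only an illustration in plain Python rather than R; the records, the field names (`ttl`, `runtime_min`, `rating`) and the `clean` helper are all made up for the example.

```python
# Hypothetical raw records, roughly as they might arrive from a CSV file
raw = [
    {"ttl": "Movie A", "runtime_min": "120", "rating": "g"},
    {"ttl": "Movie B", "runtime_min": "", "rating": "PG"},
    {"ttl": "Movie C", "runtime_min": "95", "rating": "pg"},
]


def clean(record):
    """Return a cleaned copy of one raw record (hypothetical helper)."""
    runtime = record["runtime_min"]
    return {
        # Renaming: "ttl" becomes the readable name "title"
        "title": record["ttl"],
        # Type conversion (text to number) and unit conversion (minutes to hours);
        # an empty string is treated as a missing value and stored as None
        "runtime_hours": int(runtime) / 60 if runtime else None,
        # Encoding: store every rating in one consistent form
        "rating": record["rating"].upper(),
    }


cleaned = [clean(r) for r in raw]

# One simple way to handle missing data: keep only the complete observations
complete = [r for r in cleaned if r["runtime_hours"] is not None]

for row in cleaned:
    print(row)
```

Dropping incomplete rows is only one option; whether to drop, fill in, or flag missing values depends on the data and the question being asked.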
Loading Data into R
File Sources
File Based Data
CSV
Tab delimited
Excel files
Web Based Data
XML
JSON
HTML
Databases
SQL Server
Oracle
MySQL
Statistical Data Files
SAS
SPSS
Stata
Aspects of Clean Data
each data table contains only a single type of observation, e.g. sales from a pizza shop or movies released in theaters
1 column for each variable, i.e. each value that varies across observations
column names are readable
observations in rows
no missing values in any of the observations
each row is uniquely identified
data is correctly recorded, with no errors or mistakes
everything is properly encoded
internally consistent, e.g. all values in a column are in the same units
Descriptive Statistics
Trying to find:
Location
Spread
interdependence of the data
Referred to as summary statistics
they summarize the shape and feel of the data
Types of Analysis
Univariate
Qualitative
Quantitative
Bivariate
Quantitative
Qualitative
Qualitative & Quantitative
Univariate, qualitative: the analysis of a single categorical variable
items that might be of interest
percentage
mode
Frequency
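The three items above are quick to compute by hand. A minimal sketch, using Python's standard library rather than R purely to show the arithmetic; the `orders` list is made-up sample data.

```python
from collections import Counter

# Hypothetical sample: one categorical variable (item ordered at a pizza shop)
orders = ["pizza", "salad", "pizza", "pizza", "soda", "salad"]

# Frequency: how many times each category occurs
frequency = Counter(orders)

# Percentage: each category's share of all observations
percentage = {item: 100 * count / len(orders) for item, count in frequency.items()}

# Mode: the most frequent category
mode = frequency.most_common(1)[0][0]

print(frequency)
print(percentage)
print(mode)
```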
Univariate, quantitative: the analysis of a single numeric variable
items that might be of interest
location (measures of central tendency): mean, median, mode
dispersion (measures of spread): minimum, maximum, range, quartiles, variance, standard deviation
the shape of the data: skewness, kurtosis
skewness: a measure of the asymmetry of the distribution of values
kurtosis: a measure of how sharply peaked or flat the distribution is
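The location, spread and shape measures listed above can be sketched as follows. This uses Python's `statistics` module rather than R, on a made-up sample. Skewness and kurtosis are computed here from simple divide-by-n moments, which is one of several conventions in use, so other tools may report slightly different values.

```python
import statistics

# Hypothetical sample of one numeric variable
values = [2, 4, 4, 5, 7, 9, 11]
n = len(values)

# Location (central tendency)
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Dispersion (spread)
minimum, maximum = min(values), max(values)
value_range = maximum - minimum
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
variance = statistics.variance(values)          # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)              # square root of the sample variance

# Shape, from the central moments of the sample (dividing by n)
m2 = sum((x - mean) ** 2 for x in values) / n
m3 = sum((x - mean) ** 3 for x in values) / n
m4 = sum((x - mean) ** 4 for x in values) / n
skewness = m3 / m2 ** 1.5   # 0 for a symmetric sample, positive for a long right tail
kurtosis = m4 / m2 ** 2     # about 3 for a normal distribution under this convention

print(mean, median, mode)
print(minimum, maximum, value_range, (q1, q2, q3), variance, std_dev)
print(skewness, kurtosis)
```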
Bivariate, qualitative: the analysis of 2 categorical variables
items that might be of interest
joint frequency of the observations, as seen in a contingency table
joint percentages
marginal frequency: the row and column totals of the contingency table
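A contingency table is just a count of every pair of categories, with the marginal frequencies as its row and column totals. A minimal sketch in plain Python (not R), using made-up (rating, genre) pairs:

```python
from collections import Counter

# Hypothetical observations: one (rating, genre) pair per movie
movies = [
    ("G", "action"), ("G", "comedy"), ("PG", "action"),
    ("PG", "action"), ("R", "comedy"), ("G", "action"),
]
total = len(movies)

# Joint frequency: count of each (rating, genre) combination
joint_frequency = Counter(movies)

# Joint percentage: each combination's share of all observations
joint_percentage = {pair: 100 * count / total for pair, count in joint_frequency.items()}

# Marginal frequencies: totals per rating (rows) and per genre (columns)
row_totals = Counter(rating for rating, _ in movies)
column_totals = Counter(genre for _, genre in movies)

# Print the contingency table with its margins
genres = sorted(column_totals)
print("rating", *genres, "total")
for rating in sorted(row_totals):
    cells = [joint_frequency[(rating, genre)] for genre in genres]
    print(rating, *cells, row_totals[rating])
print("total", *[column_totals[genre] for genre in genres], total)
```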
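Bivariate, quantitative: the analysis of 2 numeric variables
items that might be of interest
the relationship between the 2 numeric variables
covariance: the degree to which the 2 variables vary with one another
correlation coefficient: the covariance rescaled to fall between -1 and 1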
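To make the link between the two measures concrete, the sketch below computes the sample covariance and the Pearson correlation coefficient by hand. It is plain Python rather than R, the `x` and `y` samples are made up, and the function names are only illustrative.

```python
import math

# Hypothetical paired observations of two numeric variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def covariance(a, b):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def correlation(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


cov_xy = covariance(x, y)
r_xy = correlation(x, y)

print(cov_xy)
print(r_xy)
```

Covariance depends on the units of both variables, while the correlation coefficient is unit-free, which makes it easier to compare across pairs of variables.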
Guidance
Domain Knowledge
Clean Data
understand the context around the data
Understand Biases
Cognitive
statistical
contextual
Make the analysis reproducible
Understand Implications
Know your limitations
Data Visualization
Qualitative Univariate
Bar Chart
displays the frequency of a categorical variable, e.g. how many pizzas were sold vs how many salads were sold at a pizza shop
Can use either vertical or horizontal bars; both communicate the same information
Pie Chart
visualizes the proportion of each part in the whole
more effective if the number of categories is small
not recommended for some things
Quantitative Univariate
Usually most interested in: location, shape and spread of data
Dot Plot
displays all the possible values/observations of the variable
Box Plot
provides info about outliers
displays the min, Q1, Q2 (median), Q3 and max excluding any outliers, plus the outliers themselves: any values more than 1.5 times the IQR above Q3 or below Q1
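The 1.5 times IQR rule for outliers can be worked through on a small sample. This is a sketch in Python rather than R, with made-up values. Quartile conventions differ between tools, so the exact fences may not match what a given plotting function draws.

```python
import statistics

# Hypothetical sample with one unusually large value
values = [2, 4, 4, 5, 7, 9, 11, 30]

# Quartiles (the "inclusive" method treats the sample min and max as the extremes)
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything beyond these is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers end at the smallest and largest values that are not outliers
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, iqr)
print(lower_fence, upper_fence)
print(outliers, whisker_low, whisker_high)
```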
Histogram
x axis shows the range of values of the variable, grouped into bins, and y axis shows the frequency of values in each bin
Density Plot
shows the shape of the values, similar to a histogram, as a smooth function of the data
the x axis covers the possible values of the variable, while the y axis shows the probability density at each value
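The counts behind a histogram come from splitting the range of the variable into bins and counting the values in each. A minimal sketch in plain Python (not R), assuming equal-width bins and a made-up sample; the choice of 3 bins is arbitrary.

```python
# Hypothetical sample of one numeric variable
values = [2, 4, 4, 5, 7, 9, 11]
bin_count = 3  # arbitrary choice for the example

low, high = min(values), max(values)
width = (high - low) / bin_count

# Count how many values fall into each equal-width bin
counts = [0] * bin_count
for v in values:
    # The maximum value would land one past the last bin, so clamp it
    index = min(int((v - low) / width), bin_count - 1)
    counts[index] += 1

# Text version of the histogram: one row per bin, one mark per observation
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:5.1f} to {start + width:5.1f} | {'#' * count}")
```

A density plot replaces these bin counts with a smooth curve, so its shape does not depend on where the bin edges happen to fall.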
Qualitative Bivariate
Spine Plot
Mosaic Plot
width of tiles represents the proportion of observations on the x axis
the height of tiles represents the proportion of observations on the y axis
the area of each tile represents the number of observations at the intersection of the two categorical variables, e.g. how many movies are both rated G and action films
Quantitative Bivariate
Scatterplot
Line Graph
used when the x axis is time
the result is referred to as a time series
Qualitative and Quantitative Bivariate
Bar Chart
Multiple Box Plots
Beyond
Other Types of Analysis
1) Inferential
using a sample to make generalizations about a larger population
2) Predictive
uses current or historical data to make predictions about future or unknown values
3) Causal
4) Mechanistic
creating a mathematical model that captures the exact changes in all variables as one variable is modified
R's Advanced Programming Features
Procedural Programming
Functional Programming
Object Oriented Programming
Supports Mixed Programming
Outbound Calls
Inbound Calls
C
C++
Fortran
Python
C#
Java
Distributed Programming
R Advanced Statistical Analysis
Statistical Modeling
models the relationship between variables with a mathematical equation
Cluster Analysis
organizes a set of data into groups with similar properties
Dimensionality Reduction
Analysis of Variance (ANOVA)