Pluralsight: Exploratory Data Analysis

Transforming & Cleaning Data

Renaming Variables

Data Type Conversion

Encoding Values

Merging Data Sets

Converting Units

Handling Missing Data

Handling Anomalous Data
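The cleaning steps listed above can be sketched in base R roughly as follows. This is a minimal illustration, not the course's own code: the `sales` and `prices` data frames, their column names, and the 5000 g plausibility threshold are all made up for the example.

```r
# Hypothetical pizza-shop data; every name and value here is an assumption.
sales <- data.frame(
  ID      = c(1, 2, 3, 4),
  itm     = c("pizza", "salad", "pizza", "pizza"),
  wt_oz   = c("12", "8", NA, "400"),
  dine_in = c("Y", "N", "Y", "N"),
  stringsAsFactors = FALSE
)
prices <- data.frame(item = c("pizza", "salad"), price = c(10, 6))

# Renaming variables: give the columns readable names
names(sales) <- c("id", "item", "weight_oz", "dine_in")

# Data type conversion: character -> numeric
sales$weight_oz <- as.numeric(sales$weight_oz)

# Encoding values: "Y"/"N" codes -> logical TRUE/FALSE
sales$dine_in <- sales$dine_in == "Y"

# Merging data sets on the shared "item" column
sales <- merge(sales, prices, by = "item")

# Converting units: ounces -> grams
sales$weight_g <- sales$weight_oz * 28.3495

# Handling anomalous data: treat implausibly heavy items as missing
sales$weight_g[which(sales$weight_g > 5000)] <- NA

# Handling missing data: one option is to drop incomplete rows
clean <- sales[complete.cases(sales), ]
```

Dropping rows is only one way to handle missing values; imputing or keeping the `NA`s may be more appropriate depending on the analysis.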

Loading Data into R

File Sources

File Based Data

CSV

Tab delimited

Excel files

Web Based Data

XML

JSON

HTML

Databases

SQL Server

Oracle

MySQL

Statistical Data Files

SAS

SPSS

Stata
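As a small sketch of the file-based case, base R can read CSV and tab-delimited data with `read.csv()` and `read.delim()`. The inline text below stands in for real files so the example runs on its own; the movie values are made up. The other sources in the list usually need add-on packages (for example readxl, jsonlite, xml2, DBI, or haven), which are not shown here.

```r
# Inline text stands in for files on disk; the contents are hypothetical.
csv_text <- "title,rating,runtime\nUp,G,96\nAlien,R,117"
tsv_text <- "title\trating\truntime\nUp\tG\t96\nAlien\tR\t117"

# CSV: comma-separated values with a header row
movies_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Tab delimited: same data, separated by tabs
movies_tsv <- read.delim(text = tsv_text, stringsAsFactors = FALSE)

# With a real file, pass a path instead of text, e.g.
# read.csv("movies.csv")   # "movies.csv" is a hypothetical file name

# Inspect the column types that R inferred
str(movies_csv)
```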

Aspects of Clean Data

each data table contains only a single type of observation, e.g. sales from a pizza shop or movies released in theaters

one column for each variable, i.e. a value that varies across observations

column names are readable

observations in rows

no missing values in any of the observations

each row is uniquely identified

data is correctly recorded, with no errors or mistakes

everything is properly encoded

internally consistent, e.g. all values in a column are in the same units
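Some of these properties can be checked mechanically. The sketch below uses a made-up `movies` table with deliberate problems; the 1 to 600 minute range used as a plausibility check is an assumption for the example, not a rule from the course.

```r
# Hypothetical movie table with a duplicated id and a missing runtime.
movies <- data.frame(
  id      = c(1, 2, 2, 4),
  title   = c("Up", "Alien", "Alien", "Heat"),
  runtime = c(96, 117, 117, NA)
)

# No missing values: count NAs per column
na_per_column <- colSums(is.na(movies))

# Each row is uniquely identified: list rows whose id repeats an earlier one
duplicate_rows <- movies[duplicated(movies$id), ]

# Correctly recorded: flag runtimes outside an assumed plausible range
bad_runtime <- !is.na(movies$runtime) &
  (movies$runtime < 1 | movies$runtime > 600)

na_per_column
duplicate_rows
```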

Descriptive Statistics

Trying to find:

Location

Spread

interdependence of the data

Referred to as summary statistics

they summarize the shape and feel of the data

Types of Analysis

Univariate

Qualitative

Quantitative

Bivariate

Quantitative

Qualitative

Qualitative & Quantitative

the analysis of a single categorical variable

items that might be of interest

percentage

mode

Frequency
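For a single categorical variable, the three items above can be computed in base R along these lines. The `orders` vector is made up for illustration.

```r
# Hypothetical orders from a pizza shop.
orders <- c("pizza", "salad", "pizza", "pizza", "salad", "soda")

# Frequency: count of each category
freq <- table(orders)

# Percentage: share of each category
pct <- prop.table(freq) * 100

# Mode: the most frequent category (first one wins on ties)
mode_value <- names(freq)[which.max(freq)]

freq
pct
mode_value
```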

the analysis of a single numeric variable

items that might be of interest

location: measures of central tendency

dispersion: measures of spread

the shape of the data

mean

median

mode

minimum

range

quartiles

maximum

variance

standard deviation

skewness

a measure of the asymmetry of the distribution of values

kurtosis

a measure of how sharply peaked or flat the distribution is
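The location, spread, and shape measures above map onto base R roughly as shown below. Base R has no built-in skewness or kurtosis function, so this sketch computes simple moment-based versions by hand; other definitions (for example with sample-size corrections) exist and give slightly different numbers. The data vector is made up.

```r
x <- c(2, 4, 4, 5, 7, 9, 11)  # hypothetical numeric observations

# Location
x_mean   <- mean(x)
x_median <- median(x)
x_mode   <- as.numeric(names(which.max(table(x))))

# Spread
x_min   <- min(x)
x_max   <- max(x)
x_range <- x_max - x_min
x_quart <- quantile(x, c(0.25, 0.5, 0.75))
x_var   <- var(x)   # sample variance (divides by n - 1)
x_sd    <- sd(x)

# Shape: moment-based skewness and kurtosis (one of several definitions)
n  <- length(x)
m2 <- sum((x - x_mean)^2) / n
skewness <- (sum((x - x_mean)^3) / n) / m2^1.5
kurtosis <- (sum((x - x_mean)^4) / n) / m2^2
```

Under this definition a normal distribution has a kurtosis of 3, and a positive skewness indicates a longer right tail.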

the analysis of 2 categorical variables

Interested in

joint frequency of the observations

as seen on a contingency table

joint percentages

marginal frequency

the measure of totals in columns and rows in the contingency table
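A contingency table with its joint and marginal frequencies can be built in base R as sketched here. The `rating` and `genre` vectors are made up for the example.

```r
# Hypothetical movies: one rating and one genre per movie.
rating <- c("G", "G", "PG", "R", "R", "R")
genre  <- c("action", "comedy", "comedy", "action", "action", "comedy")

# Joint frequency: the contingency table
joint <- table(rating, genre)

# Joint percentages: each cell as a share of all observations
joint_pct <- prop.table(joint) * 100

# Marginal frequencies: row and column totals
row_totals <- margin.table(joint, 1)
col_totals <- margin.table(joint, 2)

# The same table with totals appended as a "Sum" row and column
with_margins <- addmargins(joint)
with_margins
```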

the analysis of 2 numeric variables

Interested in

Relationship between 2 numeric variables

Covariance

the degree to which the 2 variables vary with one another

correlation coefficient
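Both measures are available directly in base R. The sketch below also shows how they relate: the (Pearson) correlation coefficient is the covariance rescaled by the two standard deviations, which puts it between -1 and 1. The `runtime` and `budget` values are made up.

```r
# Hypothetical paired observations for five movies.
runtime <- c(90, 100, 110, 120, 130)
budget  <- c(10, 25, 20, 40, 45)

# Covariance: how the two variables vary together (unit-dependent)
cv <- cov(runtime, budget)

# Correlation coefficient: unit-free, between -1 and 1
r <- cor(runtime, budget)

# Correlation is covariance divided by both standard deviations
r_by_hand <- cv / (sd(runtime) * sd(budget))
```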

Guidance

Domain Knowledge

Clean Data

understand the context around the data

Understand Biases

Cognitive

statistical

contextual

Make the analysis reproducible

Understand Implications

Know your limitations

Data Visualization

Qualitative Univariate

Bar Chart

displays the frequency of a categorical variable, e.g. how many pizzas were sold vs how many salads were sold at a pizza shop

Can be either vertical or horizontal bars communicating the same information

Pie Chart

visualizes each part's proportion of the whole

more effective if the number of categories is small

not recommended when there are many categories or when slices are similar in size, since angles are hard to compare

Quantitative Univariate

Usually most interested in: location, shape and spread of data

Dot Plot

displays all the possible values/observations of the variable

Box Plot

provides info about outliers

displays the min, Q1, Q2 (median), Q3, and max excluding any outliers, plus the outliers themselves: any values more than 1.5 times the IQR above Q3 or below Q1

Histogram

x axis shows the range of values of the variable grouped into bins, and y axis shows the frequency of values in each bin

Density Plot

shows the shape of the values, similar to a histogram. the x axis covers the possible values of the variable while the y axis shows the estimated probability density at each value

shows a smooth representation of the data as a function
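The numbers behind these plots can be computed without drawing anything. The sketch below applies the 1.5 times IQR outlier rule and pulls out the histogram counts and density estimate for a made-up vector. It uses `quantile()` for the quartiles; R's own `boxplot()` uses hinges, which can differ slightly on small samples.

```r
x <- c(2, 4, 4, 5, 7, 9, 11, 40)  # hypothetical values with one extreme point

# Box plot: quartiles, IQR, and the outlier fences
q   <- quantile(x, c(0.25, 0.75))
iqr <- IQR(x)
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Histogram: bin boundaries and the frequency in each bin
h <- hist(x, plot = FALSE)
h$breaks
h$counts

# Density plot: a smooth estimate evaluated on a grid of x values
d <- density(x)

# To draw the plots themselves: boxplot(x); hist(x); plot(d)
```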

Qualitative Bivariate

Spine Plot

Mosaic Plot

width of tiles represents the proportion of observations on the x axis

the height of tiles represents the proportion of observations on the y axis

use the area of the tiles to determine the number of observations that are at the intersection of the two categorical variables, e.g. how many movies are both rated G and action films
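To make the width, height, and area relationship concrete, the tile dimensions can be derived from a contingency table as sketched below. The data are made up, and the sketch assumes the first variable (`rating`) is placed on the x axis; it illustrates the geometry rather than how any plotting function is implemented.

```r
# Hypothetical movies: rating on the x axis, genre on the y axis.
rating <- c("G", "G", "PG", "R", "R", "R")
genre  <- c("action", "comedy", "comedy", "action", "action", "comedy")
tab <- table(rating, genre)

# Tile width: share of all observations in each rating
tile_width <- prop.table(margin.table(tab, 1))

# Tile height: within each rating, the share in each genre
tile_height <- prop.table(tab, 1)

# Tile area = width * height = joint proportion of the two categories
tile_area <- tile_height * as.vector(tile_width)

# Area times the total count recovers the number of observations per tile
tile_count <- tile_area * sum(tab)

# To draw the plot itself: mosaicplot(tab)
```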

Quantitative Bivariate

Scatterplot

Line Graph

used if x axis is Time

Referred to as a time series

Qualitative and Quantitative Bivariate

Bar Chart

Multiple Box Plots

Beyond

Other Types of Analysis

1) Inferential

using a sample to make generalizations about a larger population

2) Predictive

uses current or historical data to make predictions about future or unknown values

3) Causal

4) Mechanistic

creating a mathematical model that captures the exact changes in all variables as one variable is modified
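As a rough illustration of the first two types, the sketch below uses a confidence interval for inferential analysis and a fitted linear model for predictive analysis. Both choices, and all of the numbers, are assumptions made for the example; each type of analysis covers many other techniques.

```r
# Inferential: generalize from a sample to the population mean
# (hypothetical sample of movie runtimes in minutes).
sample_runtime <- c(96, 117, 104, 110, 99, 121, 108)
ci <- t.test(sample_runtime)$conf.int  # 95% confidence interval for the mean

# Predictive: use historical data to predict an unseen value
# (hypothetical budget and gross figures).
history <- data.frame(
  budget = c(10, 20, 30, 40),
  gross  = c(25, 45, 65, 85)
)
fit <- lm(gross ~ budget, data = history)
predicted_gross <- predict(fit, newdata = data.frame(budget = 50))
```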

R's Advanced Programming Features

Procedural Programming

Functional Programming

Object Oriented Programming

Supports Mixed Programming

Outbound Calls

Inbound Calls

C

C++

Fortran

Python

C#

Java

Distributed Programming
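The three programming styles listed above can be mixed freely in one script. The sketch below computes a sum of squares procedurally and functionally, then defines a tiny S3 class as one example of R's object-oriented features. The `circle` class and `area` generic are hypothetical names invented for the illustration.

```r
x <- c(1, 2, 3, 4)

# Procedural: explicit loop that updates a variable step by step
total_loop <- 0
for (value in x) {
  total_loop <- total_loop + value^2
}

# Functional: functions passed as values, no explicit loop
total_fn <- Reduce(`+`, Map(function(v) v^2, x))

# Object oriented (S3): a class attribute plus a generic and its method
circle <- structure(list(r = 2), class = "circle")
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
circle_area <- area(circle)
```

S3 is only one of R's object systems; S4 and reference classes follow the same idea with stricter definitions.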

R's Advanced Statistical Analysis

Statistical Modeling

models the relationship between variables with a mathematical equation

Cluster Analysis

organizes a set of data into groups with similar properties

Dimensionality Reduction

Analysis of Variance (ANOVA)
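Each of the four techniques above has a counterpart in base R's stats package. The sketch below runs one representative function for each on a small made-up data set; linear regression, k-means, and principal component analysis are chosen here as common examples, not as the only options.

```r
set.seed(1)  # k-means uses random starting points

# Hypothetical data: two well-separated groups measured on x and y.
d <- data.frame(
  group = rep(c("a", "b"), each = 5),
  x = c(1.0, 1.2, 0.9, 1.1, 1.3, 5.0, 5.2, 4.9, 5.1, 5.3),
  y = c(2.1, 2.3, 1.8, 2.2, 2.7, 10.1, 10.3, 9.9, 10.2, 10.5)
)

# Statistical modeling: describe y as a linear function of x
fit <- lm(y ~ x, data = d)

# Cluster analysis: split the points into 2 groups by similarity
km <- kmeans(d[, c("x", "y")], centers = 2, nstart = 10)

# Dimensionality reduction: principal component analysis
pca <- prcomp(d[, c("x", "y")], scale. = TRUE)

# Analysis of variance: does mean y differ between the groups?
anova_fit <- aov(y ~ group, data = d)
summary(anova_fit)
```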