EDA
Exploratory Data Analysis

Introduction

EDA is an approach to analyzing data sets in order to summarize their main characteristics, often using visual methods.

EDA was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

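EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed.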

Tukey's EDA was related to robust statistics and nonparametric statistics. Tukey promoted the use of the five-number summary of numerical data: the minimum, the lower quartile, the median, the upper quartile, and the maximum.
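
As a minimal sketch of the five-number summary mentioned above, the following Python snippet uses only the standard library. The function name five_number_summary is hypothetical, and the quartile convention (the "inclusive" method of statistics.quantiles) is an assumption: several quartile definitions exist, including Tukey's own hinges, and they can give slightly different values on small samples.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumption: quartiles use the "inclusive" interpolation method;
    other conventions may differ slightly on small samples.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    # n=4 splits the data into quarters and returns the three cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```

The same five values are what a box plot draws, so this summary is a natural non-graphical counterpart to that plot.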

Typical Languages

R

SAS

Python

MATLAB

Objectives

support the selection of appropriate statistical tools and techniques

suggest hypotheses about the causes of observed phenomena

determine relationships among variables

maximize insight into a data set

detect outliers and anomalies

extract important variables

test underlying assumptions

Classification

non-graphical or graphical

univariate or multivariate (usually bivariate)

Typical graphical techniques

Box plot

Scatter plot

Histogram

Dimensionality reduction

Principal component analysis

Typical quantitative techniques

Median polish

Trimean

Ordination
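
Two of the quantitative techniques listed above, the trimean and median polish, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the max_iter and tol parameters, the stopping rule, and the "inclusive" quartile convention are all choices made here. Ordination is not covered.

```python
import statistics


def trimean(values):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4.

    Assumption: quartiles use the "inclusive" interpolation method.
    """
    q1, median, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return (q1 + 2 * median + q3) / 4


def median_polish(table, max_iter=10, tol=1e-9):
    """Fit table[i][j] = overall + row_eff[i] + col_eff[j] + resid[i][j]
    by alternately sweeping out row medians and column medians.
    """
    rows, cols = len(table), len(table[0])
    resid = [[float(x) for x in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols

    for _ in range(max_iter):
        # Row sweep: move each row's median from the residuals to its row effect.
        row_med = [statistics.median(row) for row in resid]
        for i in range(rows):
            row_eff[i] += row_med[i]
            for j in range(cols):
                resid[i][j] -= row_med[i]
        # Re-centre the column effects, moving their median to the overall term.
        shift = statistics.median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]

        # Column sweep: same idea, applied to columns.
        col_med = [
            statistics.median(resid[i][j] for i in range(rows))
            for j in range(cols)
        ]
        for j in range(cols):
            col_eff[j] += col_med[j]
            for i in range(rows):
                resid[i][j] -= col_med[j]
        # Re-centre the row effects in the same way.
        shift = statistics.median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]

        # Stop once a full pass no longer changes anything noticeably.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    return overall, row_eff, col_eff, resid


print(trimean([7, 1, 9, 3, 5, 2, 8, 4, 6]))
overall, row_eff, col_eff, resid = median_polish([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(overall, row_eff, col_eff)
```

Both techniques rely on medians and quartiles rather than means, which is why they fit the robust, nonparametric flavor of EDA: a single extreme value has little effect on the result.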

History

Francis Galton emphasized order statistics and quantiles.

Arthur Lyon Bowley used precursors of the stemplot and five-number summary.

Andrew Ehrenberg articulated a philosophy of data reduction.

John W. Tukey wrote the book Exploratory Data Analysis in 1977.