EXPLORATORY DATA ANALYSIS EDA:
Describing and exploring data
UNIVARIATE
Central Tendency
Mean
(average)
Median
(middle)
Mode
(most typical)
Dispersion
Variance
Standard Deviation
Range
Quartiles
IQR (Interquartile Range)
IQR = Q3 - Q1
Coefficient of Variation
CV = stdv / mean * 100
Expresses the stdv as a % of the mean
The greater the CV, the greater the variability in the data, irrespective of scale
Z Score
z = (x - mean) / stdv
z = 0: at the mean of the data
z > 0: above the mean
z < 0: below the mean
(distance from the mean measured in stdv units)
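The central tendency, dispersion, CV and z-score nodes above can be sketched with Python's standard `statistics` module. The sample values are made up for illustration, and the population formulas (`pstdev`, `pvariance`) are an assumption; use `stdev` and `variance` if the data is a sample.

```python
import statistics

# Made-up sample values, for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion (population formulas; an assumption for this sketch)
variance = statistics.pvariance(data)
stdv = statistics.pstdev(data)
value_range = max(data) - min(data)

# Coefficient of variation: stdv as a percentage of the mean
cv = stdv / mean * 100

# Z score of every observation: distance from the mean in stdv units
z_scores = [(x - mean) / stdv for x in data]

print(mean, median, mode, variance, stdv, value_range, cv)
print(z_scores)
```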
Five-Number Summary
Minimum
Q1
Q2 (median)
Q3
Maximum
Shape of the data
Skewness
Mean > Median: positive (right) skew
Mean < Median: negative (left) skew
Mean = Median: zero skewness, symmetrical
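The mean-versus-median comparison above can be sketched as a small helper. The function name `skew_direction` and the sample lists are hypothetical; the comparison is a rule of thumb, not a formal skewness coefficient.

```python
import statistics

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing mean and median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive (right)"
    if mean < median:
        return "negative (left)"
    return "zero (symmetrical)"

# Made-up samples, for illustration only
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
left_tailed = [1, 17, 18, 18, 19, 19, 20, 20]
symmetric = [1, 2, 3, 4, 5]

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
print(skew_direction(symmetric))
```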
Outliers
Boxplots mark as outliers any x > Q3 + 1.5 * IQR
or x < Q1 - 1.5 * IQR
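A minimal sketch of the five-number summary and the 1.5 * IQR outlier rule, using only the standard library. The data is made up, and the `method="inclusive"` quartile definition is an assumption; other tools may compute quartiles slightly differently.

```python
import statistics

# Made-up sample values with one obviously extreme point
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles (the "inclusive" method is an assumption for this sketch)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
five_number = (min(data), q1, q2, q3, max(data))

# Boxplot fences: anything outside them is flagged as an outlier
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number)
print(outliers)
```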
Visuals
Histograms
Boxplot
Univariate Analysis
summarize and find patterns in the data
(one variable / column)
Data Analysis
len(df)
df.describe()
df.shape
df.columns
Missing value check
df.isnull().sum()
df.dropna()
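The missing-value check above might look like this on a tiny DataFrame. The column names and values are made up for illustration; `isnull().sum()` counts missing entries per column and `dropna()` removes every row that contains one.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
df = pd.DataFrame({
    "Price": [100.0, np.nan, 250.0, 300.0],
    "Type": ["h", "u", None, "h"],
})

# Number of missing values per column
missing_per_column = df.isnull().sum()

# Drop every row that has at least one missing value
df_clean = df.dropna()

print(missing_per_column)
print(len(df), "rows before,", len(df_clean), "rows after dropna()")
```

Dropping rows is only one option; whether it is appropriate depends on how much data is missing and why.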
Distribution
Plotting
plt.hist(df['Price'], bins=n)
g = sns.displot(data=df, kind='hist', bins=n)
sns.violinplot(data=df)
sns.displot(data=df, kind='ecdf', legend=True)
sns.displot(data=df, kind='kde')
MULTIVARIATE
Covariance
how one variable varies with respect to variation of another variable
Positive: two variables in same direction
Negative: opposite direction
No causation implication
Correlation coefficient
strength of relationship of two vars
scale independent
range: -1 to 1
r = cov(x, y) / (stdv(x) * stdv(y))
closer to 1 or -1 : strong correlation
value closer to 0 : weak correlation
No causation implication
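The covariance and correlation formulas above can be worked through by hand in plain Python. The two lists are made up, and the sample (n - 1) versions of covariance and standard deviation are an assumption of this sketch.

```python
import statistics

# Made-up paired observations, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sample covariance: average product of deviations, divided by n - 1
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

# Correlation coefficient: covariance rescaled by both standard deviations
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(cov_xy)  # sign gives the direction of the relationship
print(r)       # always between -1 and 1, independent of scale
```

Rescaling `x` or `y` (for example, changing units) changes the covariance but leaves `r` unchanged, which is why the correlation coefficient is described as scale independent.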
Multivariate Analysis
to understand interactions between different fields in the dataset, i.e. to find interactions among more than two variables
Data Analysis
df.describe()
df.shape
df.columns
Plotting
sns.scatterplot(x=df['Price'], y=df['Distance'])
association between two variables
sns.pairplot(df)
for all the numerical columns.
degree of correlation between any two columns
sns.scatterplot(x=df['Distance'], y=df['Price'], hue=df['Type'], palette='Set2')
sns.heatmap(df.corr(numeric_only=True), annot=True)
Coding
df.corr(numeric_only=True)
correlations calculated for the numerical columns.
CATEGORICAL
Most of the ML models are designed to work on numeric data.
We need to convert categorical text data (labels) into numerical data for model building
One-Hot-Encoding
is used to create
dummy variables
that replace the categorical column with one feature per category, each represented as 1 or 0
CODING
df_dummies= pd.get_dummies(df, prefix='pfxname', columns=['CatColname'])
This function does One-Hot-Encoding on categorical text
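A small sketch of `pd.get_dummies` on made-up data. The `Regionname` column echoes the name used elsewhere in this map; the values and the `Region` prefix are assumptions for illustration.

```python
import pandas as pd

# Made-up data with one categorical text column
df = pd.DataFrame({
    "Regionname": ["North", "South", "North", "East"],
    "Price": [100, 200, 150, 120],
})

# One new column per category; the original text column is replaced
df_dummies = pd.get_dummies(df, prefix="Region", columns=["Regionname"])

print(df_dummies.columns.tolist())
print(df_dummies)
```

Depending on the pandas version, the dummy columns hold booleans or 0/1 integers; either way each row has exactly one "on" value among the new columns.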
scikit-learn Label Encoding
encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. A label that repeats always gets the same value it was assigned earlier.
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df_dummies['RegionId'] = labelencoder.fit_transform(df_dummies.Regionname)
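To make the idea behind label encoding concrete, here is a plain-Python sketch of the mapping. It is an illustration of the concept, not scikit-learn's implementation; the labels are made up, and codes are assigned in sorted order of the distinct labels, which matches how `LabelEncoder` orders its classes.

```python
# Made-up labels, for illustration only
labels = ["North", "South", "North", "East"]

# Distinct labels in sorted order define the codes 0 .. n_classes-1
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}

# Every occurrence of a label receives the same code
encoded = [mapping[label] for label in labels]

print(mapping)
print(encoded)
```

Note that the integer codes imply an ordering (East < North < South here) that the categories may not really have, which is one reason one-hot encoding is often preferred for nominal data.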
BIVARIATE
Numerical vs Numerical
Scatterplot
Line plot
Heatmap for correlation
Joint plot
Categorical vs. Numerical
Bar chart
Violin plot
Categorical box plot
Swarm plot
Two Categorical Variables
Bar chart
Grouped bar chart
Point plot
NORMALIZATION, TRANSFORMATION & PANDAS PROFILING
DATA PREPROCESSING
The phase that follows EDA and precedes modelling: it makes the data ready for downstream analysis