EXPLORATORY DATA ANALYSIS (EDA): Describing and exploring data,…

- - - - Coefficient of Variation
        CV = (stdv / mean) * 100
        The standard deviation expressed as a % of the mean
        The greater the CV, the greater the variability in the data, irrespective of scale
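
        A minimal sketch of the CV formula, using only the standard library.
        The height values are made up for illustration, and the sample
        standard deviation (n - 1 denominator) is an assumption; the notes
        do not say which variant is meant.

```python
import statistics

# Hypothetical sample: the same heights expressed in two different units.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

# Both values agree (about 9.3 %), because CV does not depend on the unit.
print(round(cv_cm, 2), round(cv_m, 2))
```

        Because the unit cancels out, CV can be used to compare the spread of
        columns measured on different scales.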
        
        Z Score
        
        z = (x - mean) / stdv
        z = 0: x is at the mean of the data
        z > 0: x is above the mean
        z < 0: x is below the mean
        (measured in units of stdv)
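
        The z-score formula can be sketched as follows. The values are
        hypothetical, and the population standard deviation is an assumption
        chosen to keep the numbers round; pandas' `std()` uses the sample
        version by default.

```python
import statistics

# Hypothetical data with mean 5 and population standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)
stdv = statistics.pstdev(values)  # population standard deviation (assumption)

# z = (x - mean) / stdv for every observation.
z_scores = [(x - mean) / stdv for x in values]

for x, z in zip(values, z_scores):
    position = "at" if z == 0 else ("above" if z > 0 else "below")
    print(f"x={x}: z={z:+.2f} ({position} the mean)")
```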
  - - - Outliers
        
        A boxplot marks x as an outlier when x > Q3 + 1.5 * IQR
        or x < Q1 - 1.5 * IQR, where IQR = Q3 - Q1
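
        The boxplot rule above can be sketched with the standard library.
        The data are invented, and the "inclusive" quantile method is an
        assumption; other quartile conventions give slightly different fences.

```python
import statistics

# Hypothetical data with one obviously extreme value.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# Quartiles; "inclusive" interpolates linearly between the data points.
q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = [x for x in values if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```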
  - - - Missing value check
        
        data.isnull()
        
        df.dropna()
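
        A small sketch of the missing-value check. The column names are
        borrowed from elsewhere in these notes; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
df = pd.DataFrame(
    {
        "Price": [100.0, np.nan, 250.0, 300.0],
        "Distance": [2.5, 3.0, np.nan, 4.1],
        "Type": ["h", "u", "h", None],
    }
)

# isnull() returns a boolean frame; summing it counts missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# dropna() returns a new frame without any row that contains a missing value.
df_complete = df.dropna()
print(len(df), "rows before,", len(df_complete), "rows after")
```

        Note that `dropna()` does not modify `df` in place; the result has to
        be assigned to a variable.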
        
        Distribution
        
        Plotting
        
        plt.hist(df, bins=n)
        
        g=sns.displot(data=data,kind='hist',bins=n)
        
        sns.violinplot(data=df)
        
        sns.displot(data,kind='ecdf',legend=True)
        
        sns.displot(data=data,kind='kde')
- - - - value closer to 1 or -1: strong correlation
        
        value closer to 0: weak correlation
        
        Correlation does not imply causation
  - - - Plotting
        
        sns.scatterplot(x=df['Price'], y=df['Distance'])
        shows the association between two variables
        
        sns.pairplot(df)
        plots every pair of numerical columns,
        showing the degree of correlation between any two columns
        
        sns.scatterplot(x=df['Distance'], y=df['Price'], hue=data1['Type'], palette='Set2')
        
        sns.heatmap(df.corr(numeric_only=True), annot=True)
        
        Coding
        
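        df.corr(numeric_only=True)
        correlations are calculated for the numerical columns only
        (pandas 2.0+ raises an error on non-numeric columns unless numeric_only=True is passed)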
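
        A minimal sketch of the correlation matrix on a hypothetical frame.
        The column names come from these notes and the values are invented.
        Selecting the numeric columns explicitly is an assumption made so the
        snippet behaves the same across pandas versions.

```python
import pandas as pd

# Hypothetical data: price falls linearly as distance grows.
df = pd.DataFrame(
    {
        "Distance": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Price": [500.0, 400.0, 300.0, 200.0, 100.0],
        "Type": ["h", "u", "h", "t", "u"],  # non-numeric, excluded below
    }
)

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr)

# A value close to -1 indicates a strong negative linear relationship.
r = corr.loc["Distance", "Price"]
```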
- - - - CODING
        
        df_dummies= pd.get_dummies(df, prefix='pfxname', columns=['CatColname'])
        
        This function performs one-hot encoding on categorical text columns
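
        A short sketch of what `pd.get_dummies` produces. `Regionname` is the
        column named elsewhere in these notes; the prefix `Region` and the
        values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one categorical text column.
df = pd.DataFrame(
    {
        "Regionname": ["North", "South", "North", "East"],
        "Price": [100, 200, 150, 120],
    }
)

# Replace the categorical column with one indicator column per category.
df_dummies = pd.get_dummies(df, prefix="Region", columns=["Regionname"])
print(df_dummies.columns.tolist())
print(df_dummies)
```

        Each row has exactly one indicator set among the new `Region_*`
        columns, and the original `Regionname` column is dropped.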
    - - from sklearn.preprocessing import LabelEncoder
        
        labelencoder = LabelEncoder()
        
        df_dummies['RegionId'] = labelencoder.fit_transform(df_dummies.Regionname)
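
        A sketch of label encoding on a hypothetical frame, assuming
        scikit-learn is installed. `LabelEncoder` assigns integer codes to the
        sorted unique values of the column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame; only the column name comes from the notes above.
df = pd.DataFrame({"Regionname": ["North", "South", "North", "East"]})

labelencoder = LabelEncoder()

# fit_transform learns the categories and returns one integer code per row.
df["RegionId"] = labelencoder.fit_transform(df["Regionname"])

# classes_ lists the categories in code order (index 0, 1, 2, ...).
classes = list(labelencoder.classes_)
print(classes)
print(df)
```

        Unlike one-hot encoding, the integer codes imply an order
        (East < North < South here) that may not be meaningful for the data.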