Data Cleaning, Data Operation - Coggle Diagram
Data Cleaning
Typo
Modification
Manual Mapping : dataframe['gender'].map({'m': 'male', 'fem.': 'female', ...})
Pattern Matching : re.sub(r"^m$", 'male', value, flags=re.IGNORECASE)
(value is the string to clean ; ^ and $ are anchors and must not be escaped)
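A minimal sketch of both typo fixes on a made-up 'gender' column (the sample values are assumptions). Note that it swaps map() for replace() in the mapping step, because map() turns values missing from the dictionary into NaN.

```python
import re
import pandas as pd

# Hypothetical sample data; only the column name 'gender' comes from the notes.
df = pd.DataFrame({'gender': ['m', 'M', 'fem.', 'male', 'female']})

# Pattern matching: a value that is exactly 'm' (any case) becomes 'male'.
df['gender'] = df['gender'].apply(
    lambda v: re.sub(r'^m$', 'male', v, flags=re.IGNORECASE)
)

# Manual mapping: replace() leaves values that are not in the dict unchanged.
df['gender'] = df['gender'].replace({'fem.': 'female'})

cleaned = df['gender'].tolist()
```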
Modification
Drop Data
Drop the whole column
Pandas : df.drop(['column'], axis=1)
(axis=0 >> index (rows) ; axis=1 >> columns ; default = 0)
Detection
Calculate the percentage of missing values
df_missing = df.isnull().sum().sum()
df_all = np.prod(df.shape)
percentage = df_missing / df_all * 100
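The calculation above can be sketched on a small made-up frame (the data is an assumption). The per-column breakdown at the end is an extra that the notes do not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 8 cells in total, 3 of them missing.
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', None, 'z', None],
})

df_missing = df.isnull().sum().sum()    # missing cells over the whole frame
df_all = np.prod(df.shape)              # rows * columns
percentage = df_missing / df_all * 100

# Extra: share of missing values per column, in percent.
per_column = df.isnull().mean() * 100
```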
Detection
Scatter Plot
Pandas (Matplotlib backend) : df.plot(kind='scatter', x='Sales', y='Buyers', rot=70)
Modification
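Pandas : df.drop_duplicates(subset='column', keep='first')
(keep = 'first', 'last' or False ; default = 'first' ; False drops every duplicated row)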
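A short sketch of the three keep options, assuming a made-up frame with a duplicated 'id' column (both column names are hypothetical).

```python
import pandas as pd

# Hypothetical data: ids 1 and 3 each appear twice.
df = pd.DataFrame({
    'id': [1, 1, 2, 3, 3],
    'value': ['a', 'b', 'c', 'd', 'e'],
})

first = df.drop_duplicates(subset='id', keep='first')   # keep first occurrence
last = df.drop_duplicates(subset='id', keep='last')     # keep last occurrence
unique_only = df.drop_duplicates(subset='id', keep=False)  # drop all duplicated ids
```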
Modification
Pandas : df.drop(columns='XXX')
Pandas : df.drop(columns=['column1', 'column2'])
Data Operation
Data Filtering
Grouping
Application : grouping + aggregation
df.groupby([grouping_column])[column_showing].agg(['mean', 'std', 'min', 'max'])
(string names avoid the FutureWarning that recent pandas raises for np.mean, np.std, ...)
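A minimal sketch of grouping plus aggregation, assuming a made-up frame with a 'team' grouping column and a 'score' value column (both names are hypothetical).

```python
import pandas as pd

# Hypothetical data: two teams with two scores each.
df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'score': [10, 20, 30, 50],
})

# One row per team, one column per aggregation.
summary = df.groupby('team')['score'].agg(['mean', 'std', 'min', 'max'])
```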
Crosstab
pd.crosstab(df['column1'], df['column2'], normalize=True)
(normalize : True / False, or 'index' / 'columns' / 'all')
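A small crosstab sketch on made-up data (the column names 'gender' and 'bought' are assumptions), showing raw counts and row-wise shares.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female'],
    'bought': ['yes', 'no', 'no', 'no'],
})

# Raw counts of each (gender, bought) pair.
counts = pd.crosstab(df['gender'], df['bought'])

# normalize='index' divides each row by its row total.
shares = pd.crosstab(df['gender'], df['bought'], normalize='index')
```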
Pivot table
df.pivot_table(values=['value_column'], index=['column_for_index'], aggfunc='mean')
(aggfunc : 'mean', 'median', ...)
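A pivot table sketch under the same idea, assuming a made-up 'region' index column and a 'sales' value column (both hypothetical).

```python
import pandas as pd

# Hypothetical data: two regions with two sales figures each.
df = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'sales': [100, 200, 300, 500],
})

# Mean of 'sales' for each region; the result is indexed by region.
table = df.pivot_table(values='sales', index='region', aggfunc='mean')
```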
Value comparison (>, <, !=)
Count value
(df['column'] > 5).astype('int').value_counts(normalize=True)
---- astype('int') turns the boolean values into 0 / 1 ; value_counts also works on the boolean Series directly, so the cast is optional
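A quick sketch of counting values above a threshold, using made-up numbers. It compares the cast and uncast versions and adds a plain count via sum(), which is not in the notes.

```python
import pandas as pd

# Hypothetical data: 2 of the 5 values are greater than 5.
df = pd.DataFrame({'column': [1, 6, 7, 3, 2]})

mask = df['column'] > 5                       # boolean Series

shares_bool = mask.value_counts(normalize=True)
shares_int = mask.astype('int').value_counts(normalize=True)

# True counts as 1, so sum() gives the number of matching rows.
n_above = int(mask.sum())
```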
Value Replacement
If Else Replacement
np.where(df['column_name']>x, answer1, answer2)
ex : np.where(df['Loan_amount'] > 325, 140, df['Loan_amount'])
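The Loan_amount example can be sketched end to end as follows. The sample amounts and the new column name 'Loan_capped' are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts.
df = pd.DataFrame({'Loan_amount': [100, 400, 325, 500]})

# Where the condition holds use 140, otherwise keep the original value.
df['Loan_capped'] = np.where(df['Loan_amount'] > 325, 140, df['Loan_amount'])

result = df['Loan_capped'].tolist()
```

The comparison is strict, so a value of exactly 325 is left unchanged.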
Split Value
String
Pandas : df[['new_column1', 'new_column2']] = df['column_tobesplit'].str.split(' ', expand=True)
ex : df[['First_name', 'Last_name']] = df['Name'].str.split(' ', expand=True)
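A sketch of the name split on made-up names. It adds n=1, which is an assumption beyond the notes: without it, a name with more than one space yields more than two columns and the two-column assignment fails.

```python
import pandas as pd

# Hypothetical names; the last one contains two spaces.
df = pd.DataFrame({'Name': ['Ada Lovelace', 'Alan Turing', 'Mary Ann Evans']})

# n=1 splits on the first space only, so exactly two columns come back.
df[['First_name', 'Last_name']] = df['Name'].str.split(' ', n=1, expand=True)

first_names = df['First_name'].tolist()
last_names = df['Last_name'].tolist()
```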
Date
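Pandas : df['new_column'] = df['origin_column'].dt.<attribute>
(attribute : year, month, day, dayofweek, ... ; the column must be datetime, convert with pd.to_datetime if needed)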
ex : df['Month'] = df.Date.dt.month
ex : df['Day of week'] = df.Date.dt.dayofweek
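A sketch of the date split, assuming the 'Date' column starts out as text (the sample dates are made up). The .dt accessor only works after the column is converted to datetime.

```python
import pandas as pd

# Hypothetical dates stored as strings.
df = pd.DataFrame({'Date': ['2024-01-15', '2024-02-29']})

# Convert to datetime first, otherwise .dt raises an AttributeError.
df['Date'] = pd.to_datetime(df['Date'])

df['Month'] = df['Date'].dt.month
df['Day of week'] = df['Date'].dt.dayofweek   # Monday = 0, Sunday = 6

months = df['Month'].tolist()
weekdays = df['Day of week'].tolist()
```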
Value Delete / Insert
Column
Drop
df.drop(['column_to_drop1', 'column_to_drop2'], axis=1, inplace=True)
Row
df.drop([rowdropped_num1, rowdropped_num2])
(the numbers are row index labels ; axis=0 is the default, so rows are dropped)
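A minimal sketch of dropping columns and rows on a made-up frame (column names are hypothetical). It returns new frames instead of using inplace=True, so the original stays available.

```python
import pandas as pd

# Hypothetical data with the default 0..2 row index.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Drop columns 'b' and 'c'; axis=1 targets columns.
without_cols = df.drop(['b', 'c'], axis=1)

# Drop the rows labelled 0 and 2; axis=0 is the default.
without_rows = df.drop([0, 2])
```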
Pandas : df.iloc[:, [0, 1, 3, 2, 4, 5]]
(reorders the columns ; the numbers are the original column positions)
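A sketch of the iloc call above on a made-up six-column frame (the column names a..f are assumptions). Swapping positions 2 and 3 in the list swaps the third and fourth columns.

```python
import pandas as pd

# Hypothetical frame with six columns named a..f.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

# Select all rows, and the columns in the listed positional order.
reordered = df.iloc[:, [0, 1, 3, 2, 4, 5]]

new_order = list(reordered.columns)
```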
Data Sorting
df.sort_values(by=['column1', 'column2'], ascending=[True, False])
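A sketch of the two-key sort on made-up data: column1 ascending, and within equal column1 values, column2 descending.

```python
import pandas as pd

# Hypothetical data with ties in column1.
df = pd.DataFrame({
    'column1': [2, 1, 2, 1],
    'column2': [10, 20, 30, 40],
})

sorted_df = df.sort_values(by=['column1', 'column2'], ascending=[True, False])

col1 = sorted_df['column1'].tolist()
col2 = sorted_df['column2'].tolist()
```

The original row labels are kept after sorting; chaining reset_index(drop=True) renumbers them if needed.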