Data Wrangling

Missing Completely at Random (MCAR): The probability of any datapoint being MCAR is the same for all data points — this type of missing data is mostly hypothetical

References:

Relational Databases for Data Analysis / Science | Codeacademy
Thinking in SQL vs Thinking in Python | Mode
Data Wrangling and Tidying | Codeacademy
Handling Missing Data | Codeacademy
Data Types and Quality | Codeacademy
Cleaning and Manipulating Data | Codeacademy (click the link to learn more about the relevant Python methods)

Types of Acquired Data

Structured data

Semi-structured data

Unstructured data

Using SQL / Python

📌 Goal

Relational databases are the primary means of storage for structured data. The relational model can be viewed as a database model that has multiple tables that each describe a particular entity of the database.

When we consider relational databases as a collection of tables, what we call a schema, we can visualize them with entity-relationship diagrams, which give us a chance to view the data within each table, and how each table relates to the others.

1) Review the data

View a few rows of data
View the various data types
View a quick summary / description of the data

Steps

2) Preliminary data cleaning

Remove duplicates
Rename columns, if necessary

3) Handle missing data

Why data might be missing:

Systematic causes
Privacy concerns
Information loss

Methods to check for missing data:

Verify that the data was uploaded correctly
Review small chunks of the data
Look at the statistics of the entire dataset

Types of missing data

Structurally Missing Data: We expect this data to be missing for some logical reason

Missing Completely at Random (MCAR): The probability of any datapoint being MCAR is the same for all data points — this type of missing data is mostly hypothetical

Missing at Random (MAR): The probability of any data point being MAR is the same within groups of the observed data — this is much more realistic than MCAR

Missing Not at Random (MNAR): There is some reason why the data is missing

Deletion:

Listwise deletion
Pairwise deletion
Drop variables

💡 Solution

Single imputation:

Last Observation Carried Forward (LOCF)
Next Observation Carried Backward (NOCB)
Baseline Observation Carried Forward (BOCF)

💡 Solution

Multiple imputation:

Use the IterativeImputer module within sklearn

4) Final review / Tidy data

Is the data:

Accurate?
Valid?
Representative?

Data inspection techniques:
(Reference: Exploratory Data Analysis (EDA) | Codeacademy)

.head()—first five rows
.describe()—numerical summaries
.info()
.isnull()—locate missing (null) values
.unique()—return unique values