Data Wrangling
References:
- Relational Databases for Data Analysis / Science | Codeacademy
- Thinking in SQL vs Thinking in Python | Mode
- Data Wrangling and Tidying | Codeacademy
- Handling Missing Data | Codeacademy
- Data Types and Quality | Codeacademy
- Cleaning and Manipulating Data | Codeacademy (click the link to learn more about the relevant Python methods)
Types of Acquired Data
Structured data
Semi-structured data
Unstructured data
Using SQL / Python
📌 Goal
Relational databases are the primary means of storage for structured data. The relational model can be viewed as a database model that has multiple tables that each describe a particular entity of the database.
When we consider relational databases as a collection of tables, what we call a schema, we can visualize them with entity-relationship diagrams, which give us a chance to view the data within each table, and how each table relates to the others.
1) Review the data
- View a few rows of data
- View the various data types
- View a quick summary / description of the data
Steps
2) Preliminary data cleaning
- Remove duplicates
- Rename columns, if necessary
3) Handle missing data
Why data might be missing:
- Systematic causes
- Privacy concerns
- Information loss
Methods to check for missing data:
- Verify that the data was uploaded correctly
- Review small chunks of the data
- Look at the statistics of the entire dataset
Types of missing data
Structurally Missing Data: We expect this data to be missing for some logical reason
Missing Completely at Random (MCAR): The probability of any datapoint being MCAR is the same for all data points — this type of missing data is mostly hypothetical
Missing at Random (MAR): The probability of any data point being MAR is the same within groups of the observed data — this is much more realistic than MCAR
Missing Not at Random (MNAR): There is some reason why the data is missing
Deletion:
- Listwise deletion
- Pairwise deletion
- Drop variables
💡 Solution
Single imputation:
- Last Observation Carried Forward (LOCF)
- Next Observation Carried Backward (NOCB)
- Baseline Observation Carried Forward (BOCF)
💡 Solution
💡 Solution
Multiple imputation:
Use the IterativeImputer module within sklearn
4) Final review / Tidy data
Is the data:
- Accurate?
- Valid?
- Representative?
Data inspection techniques:
(Reference: Exploratory Data Analysis (EDA) | Codeacademy)
- .head()—first five rows
- .describe()—numerical summaries
- .info()
- .isnull()—locate missing (null) values
- .unique()—return unique values