Data Wrangling

Types of Acquired Data

Structured data

Semi-structured data

Unstructured data

Using SQL / Python

📌 Goal

Relational databases are the primary means of storage for structured data. The relational model can be viewed as a database model that has multiple tables that each describe a particular entity of the database.


When we consider relational databases as a collection of tables, what we call a schema, we can visualize them with entity-relationship diagrams, which give us a chance to view the data within each table, and how each table relates to the others.

1) Review the data

  • View a few rows of data
  • View the various data types
  • View a quick summary / description of the data

Steps

2) Preliminary data cleaning

  • Remove duplicates
  • Rename columns, if necessary

3) Handle missing data

Why data might be missing:


  1. Systematic causes
  2. Privacy concerns
  3. Information loss

Methods to check for missing data:


  • Verify that the data was uploaded correctly
  • Review small chunks of the data
  • Look at the statistics of the entire dataset

Types of missing data

Structurally Missing Data: We expect this data to be missing for some logical reason

Missing Completely at Random (MCAR): The probability of any datapoint being MCAR is the same for all data points — this type of missing data is mostly hypothetical

Missing at Random (MAR): The probability of any data point being MAR is the same within groups of the observed data — this is much more realistic than MCAR

Missing Not at Random (MNAR): There is some reason why the data is missing

Deletion:


  • Listwise deletion
  • Pairwise deletion
  • Drop variables

💡 Solution

Single imputation:


  • Last Observation Carried Forward (LOCF)
  • Next Observation Carried Backward (NOCB)
  • Baseline Observation Carried Forward (BOCF)

💡 Solution

💡 Solution

Multiple imputation:


Use the IterativeImputer module within sklearn

4) Final review / Tidy data

Is the data:

  • Accurate?
  • Valid?
  • Representative?

Data inspection techniques:
(Reference: Exploratory Data Analysis (EDA) | Codeacademy)


  • .head()—first five rows
  • .describe()—numerical summaries
  • .info()
  • .isnull()—locate missing (null) values
  • .unique()—return unique values