Please enable JavaScript.
Coggle requires JavaScript to display documents.
Chapter 9: Data Integration (Intro (Union (Created based on the assumption…
Chapter 9: Data Integration
Intro
Obtaining more data will lead to better predictions from algorithms
Can also look for additional features of interest that we do not currently have
If so, it is almost always necessary to integrate data from other sources
Two methods: joins and unions
Join
Combines two data sets with shared identity value
Such as a customer_id
Creates one or more new rows
Union
Created based on the assumption there are multiple columns in common
Lines up columns containing similar information on top of one another, creating a new table
Ex: if a customer information is scattered about different databases/tables, a union will combine that information into one table
Joins
Two types
Inner join
Each row in the left table is combined horizontally with any row in the right table that has the same identity value
Most commonly used join
In the example: most databases will not accept an order unless an account is made/exists (most databases have data on each side, and will therefore accept an inner join)
Inner join example:
Customer table (left table)
CustomerID 1: Kai
CustomerID 2: Dan
Purchases table (right table)
CustomerID 1 found in two rows
Kai: Purchase product pants
Kai: Purchase product bouncing castle
CustomerID 3
Dan: Purchase product sunglasses
New Joined Table
Produces a header row (CustomerID) and two data rows (Name, Purchase)
Inner Join Practice
See handwritten notes
Keep combining tables based off of common column
Filter out/delete unnecessary columns once new table has been created
Outer Joins
Left outer join
Produces the same value as the inner join
ALSO adds any rows from the LEFT table that do not have corresponding rows in the right table
Ex: left table (name) & right table (purchase)
Customers who have never bought anything (null purchase) will still be included in the new table
May be used if we want to figure out why some customers placed and order and why some did not
As opposed to an inner join, which ONLY includes customers who have a name AND an order
If we want to collect as much
customer
(table 1) data as possible
Right outer join
Produces the same value as the inner join
ALSO adds any rows from the RIGHT table that do not have corresponding rows in the left table
Example: CustomerID purchases that do not have corresponding names
May be useful if we suspected specific products led to abandonment of shopping carts (since no name and products = anonymous users)
Full outer join
Produces the same outcome as an inner join + left outer join + right outer join
Example:
Kai, 1, sunglasses
Dan, 2, pants
Null, Null, sunglasses
Kai, 1, null
All information from both tables is combined
Unions
Combine top and bottom datasets
Generally used when combining datasets that have unique cases sharing the same or very similar columns
Different sources of external data
Different systems for tuition, course enrollments, assignments/grades, alumni relations
Having an aggregate of a given person’s information available in one consistent system is critical to accurate and high quality analytics/ML
Outer joins are great if there is a shared IDENTIFIER and overlaps between rows
If there are no overlaps, but simply additional data points, a union is more useful
A case for unions
Dataset containing TRAINING and TEST cases
Must integrate both sets so all modifications are shared between training and evaluation data
Dataset also requires imputation of ages
Imputation = replacing missing data with reasonable values
Do this by assigning the average age to every column where an age value is missing