Mindmap - Larsen Chp. 9 | Data Integration (If need to collect as much…
Mindmap - Larsen Chp. 9 | Data Integration
It is necessary to integrate data when looking for additional features of data that you do not have access too.
2 Types of data Integration: “joins” (to access
more features) and “unions” (to access more observations)
A “join” combines two datasets with a shared identity value, such as a customer
A union is based on the assumption that there are multiple columns in common
between A and C. A union lines up each column containing similar information on
top of another (zip code existing in A and C, for example), creating a new table,
There are 2 types of joins. Inner Join & Outer Join. Outer Join had 3 subtypes, left outer, right outer, and full outer join.
Inner Join - creates a new table including repeated data on a specific customers behaviors.
A Left Outer Join - does the same as an inner join except it adds additional rows to the table, including data, such as customers who never placed an order. Adds rows from the left table that don't have rows in the right table.
A Right Outer join - adds rows from right that dont have corresposnding rows from left. Adds additional rows such as combining orders, or near orders...like abandoned choppig carts
Full Outer Join - Adds left and right outer join. Combining all null and any possible data
INNER JOIN: More popular with carefully curated databases. Think “referential integrity." This is something designed to not accept an order unless a customer account exists for that order and forces the database and any system using it to adhere to rules of conduct like this
If need to collect as much customer data as possible. LEFT OUTER JOIN
Practice joining. Consider a "Customers Table". First looking at 2 different companies, the employee id of each transaction and the order id as well as the.
Step 1: Integrate tables with an inner join...to look more closely at each company,
Step 2: Look more closely at each employee (City, Last Name)
Step 3L Order details information, unique sales, order id, quantity, unit price etc..
Step 4: A final join
Unions are performed when we have datasets that contain unique sets of cases sharing the same or very similar columns
A Union could be very useful for tracking a consistent system of the entirety of a college’s
interactions. Consider a college may be tracking students in one system for tuition payments, another system for course enrollments, and a
third system for course assignments and grades. All of these different data sources need to be integrated and can be with a Union.
A FULL OUTER JOIN is useful: for connecting different lists if there are overlaps between rows and a shared identifier to allow for integration
A Union is much more useful: If there is no overlap
but simply additional data points a union is a more appropriate tool for the combining of lists that consist of different people
A case where UNION'S will always win (be the appropriate tool.) - Suppose you have a
dataset containing training and test cases for your machine learning. The training cases will be used to train the model(s), while the test set will allow someone else to rate the performance of the model(s)