Please enable JavaScript.
Coggle requires JavaScript to display documents.
Chapter 9: Data Integration (Unions (Example: upon first implementing a…
Chapter 9: Data Integration
Joins
Based on relational algebra
when integrate, remove any identical columns from resulting table
Two types most relevant for our purposes
Outer Join
3 Subtypes:
Right Outer Join
Produces the same result as an inner join, but also adds any rows from the right table that do not have corresponding rows in the left table
In the customer and order case, this type of join might be used to combine orders or near-orders, like the common problem of abandoned shopping carts
These shopping carts will be filled by an anonymous user and registered in our system as an incomplete order when the potential customer leaves the site without registering as a customer and paying for the products
This kind of data would be critical if we suspected that the specific products themselves led to abandonment of shopping carts
Full Outer Join
Productes the same result as an inner join + a left outer join + a right outer join
This means that any customer without orders is added to the resulting table, and any order not connected to a customer is also added to the resulting table
Left Outer Join
Most important when goal is to collect as much customer data as possible
Produces the same result as an inner join, but also adds any rows from the left table that do not have corresponding rows in the right table
Below the rows outlined in the above cell, we will also get rows containing customers who never placed an order. For these, the order-related columns will contain NULLs (sign that there is nothing there)
This join might be used if our goal is to find out why some customers place orders and some do not. An inner join would exclude all the customers that did not place an order
Inner Join
will use most commonly for our data integration process
when dealing with carefully curated databases
designed to not accept an order unless...
...a customer account exists for that order
Referential Integrity
forces the database and any system using it to adhere to rules of conduct
Each row in left table is combined horizontally with any row in the right table that has the same identity value
EX: if we join based on the identity CustomerID, which both the customers and the Orders tables contain
The customer information and order information for each customer is combined into a third table, containing repeated information on the customer for each order they have made
Unions
Allows us to combine 2 datasets
Perform when we have datasets that contain unique sets of cases sharing same, or very similar columns
Example: upon first implementing a Salesforce system
not uncommon for an org. to find that different parts of org. have been keeping separate lists of internal and external data
having aggregated of a given person's info available within a single system
critical to performing high-quality analytics and ML
full outer join useful connecting different lists
if
there are overlaps between rows and a shared identifier to allow for integration
No overlap, but additional data points, union more appropriate
when union is always appropriate: training and test datasets integration and with the imputation
replacing missing data in columns with reasonable values
EX: missing age, replace empty spots with average of age