Chapter 9. Data Integration (Union (A union is based on the assumption…
Chapter 9. Data Integration
A “join” combines two datasets with a shared identity value, such as a customer identifier.
After a join, one or more new rows are created
There are two types of joins that are most relevant for our purposes; one has three subtypes:
Each row in the left table is combined
horizontally with any row in
the right table that has the same identity
When dealing with carefully curated databases, we will use the inner join most commonly for our data integration process.
Left outer join
This join produces the same result as an inner join but also adds any rows from the left table that do not have corresponding rows in the right table
Right outer join
This join produces the same result as an inner join, but also adds any rows from the right table that do not have corresponding rows in the left table.
Full outer join
This join produces the same result as an
inner join + a left outer join + a right outer
A union is based on the assumption that there are multiple columns in common between A and C.
A union lines up each column containing similar information on top of another (zip code existing in A and C, for example), creating a new table, E.
Unions allow us to combine two datasets
Generally, we perform unions when we have datasets that contain unique sets of cases sharing the same or very similar columns.
The full outer join is great for connecting these different lists if there are overlaps between rows and a shared identifier to allow for integration. If there is no overlap but simply additional data points a union is a more appropriate tool for the combining of lists that consist of different people.