Chapter 9. Data Integration (9.1 Joins (There are two types of joins that…
Chapter 9. Data Integration
Access to additional and relevant data will lead to better predictions from algorithms until we reach the point where more observations (cases) are no longer helpful to detect the signal, the feature(s) or conditions that inform the target.
This chapter introduces this process of data integrating, starting with an introduction of two methods: "joins" (to acess more features) and "unions" (to access more observations)
A "join" combines two datasets with a shared identity value, such as a customer identifier.
After a join, one or more new rows are created, each containing customer information from A on the left and website behavior from B on the right.
A union is based on the assumption that there are multiple columns in common between A and C.
A union lines up each column containing similar information on top of another (zip code existing in A and C, for example), creating a new table, E.
For example, if a company has multiple customers, and their records are stored in various databases, a union will create one table containing all customers for that company.
Joins are complicated concepts for many and take quite a bit of practical experience to understand.
Let's use the language of "tables" here with the assumption that a table can be any rectangular matrix containing data in column form where each column begins with a descriptive name of the data in that column (i.e., SocialSecurityNumber)
There are two types of joins that are most relevant for our purposes; one has three subtypes:
Left outer join, right outer join, full outer join
When dealing with carefully curated databases, we will use the inner join most commonly for our data integration process.
Such databases will generally be designed to not accept an order unless a customer account exists for that order (something called "referential integrity" forces that database and any system using it to adhere to rules of conduct like this).
If we are engaged in a project with the goal being to collect as much customer data as possible, then the left outer join is most useful.
the inner join and the left outer join together will take care of 99% of related needs.
Joins are based on relational algebra, so, as is true for learning math, the best way to get a sense of how joins work is to practice performing them.
Unions allow us to combine two datasets, shown as top (T) and bottom (B).
Generally, we perform unions when we have datasets that contain unique sets of cases sharing the same or very similarly columns
For example, a college may be tracking students in one system for tuition payments, another system for course enrollments, and a third system for course assignments and grades. Once the students graduate and become alumni, they are tracked in yet another system.
The full outer join is great for connecting these different lists if there are overlaps between rows and a shared identifier to allow for integration.
If there is no overlap but simply additional data points a union is a more appropriate tool for the combining of lists that consist of different people.