Chapter 9: Data Integration
Important Information
Access to additional, relevant data generally leads to better predictions from algorithms, up to the point where more observations no longer help detect the signal (the features or conditions that inform the target).
Beyond that point, look for additional features of interest that we do not currently have. Doing so almost always requires integrating data from different sources.
Two Methods of Data Integration
Joins: access more features. A join combines two datasets that share an identity value.
Unions: access more observations. A union assumes the datasets have multiple columns in common and stacks each column containing similar information on top of the other.
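The contrast between the two methods can be sketched with a few rows of toy data. This is a minimal illustration in plain Python, not a reference implementation; the datasets and column names (`id`, `age`, `spend`) are hypothetical.

```python
# Hypothetical toy data: customers identified by a shared identity value "id".
demographics = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
purchases = [{"id": 1, "spend": 120.0}, {"id": 2, "spend": 75.5}]
more_demographics = [{"id": 3, "age": 29}]

# Join: match rows on the identity value and merge their columns.
# The result has the same cases but more features.
spend_by_id = {row["id"]: row for row in purchases}
joined = [
    {**row, **spend_by_id[row["id"]]}
    for row in demographics
    if row["id"] in spend_by_id
]

# Union: stack rows that share the same columns.
# The result has the same features but more observations.
unioned = demographics + more_demographics
```

Here `joined` gains the `spend` column for each customer, while `unioned` gains a third row with the same two columns.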
Joins
Two Types
Inner Join
Outer Join
Left outer join
Right outer join
Full outer join
Inner join: Each row in the left table is combined horizontally with any row in the right table that has the same identity value. Rows without a match in the other table are dropped.
Left outer join: This join produces the same result as an inner join but also adds any rows from the left table that do not have corresponding rows in the right table.
Right outer join: This join produces the same result as an inner join but also adds any rows from the right table that do not have corresponding rows in the left table.
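Full outer join: This join produces the rows of an inner join plus the unmatched rows from both the left and the right table, that is, the combined result of a left outer join and a right outer join.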
Most commonly used when dealing with a carefully curated database.
Most useful when the goal of the project is to collect as much data as possible.
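The four join types differ only in how they treat rows that have no match. The sketch below illustrates this in plain Python, under stated assumptions: the `join` helper, its `how` values, and the `customers` and `orders` tables are hypothetical names made up for the example, and missing values are filled with `None`.

```python
from collections import defaultdict


def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`.

    `how` is one of "inner", "left", "right", "full" (hypothetical names).
    """
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unknown join type: {how}")

    left_cols = {col for row in left for col in row}
    right_cols = {col for row in right for col in row}

    # Index the right table by identity value; a value may occur in several rows.
    right_index = defaultdict(list)
    for row in right:
        right_index[row[key]].append(row)

    result = []
    matched_keys = set()
    for lrow in left:
        matches = right_index.get(lrow[key], [])
        if matches:
            # Inner-join part: combine the left row with every matching right row.
            matched_keys.add(lrow[key])
            for rrow in matches:
                result.append({**lrow, **rrow})
        elif how in ("left", "full"):
            # Unmatched left row: keep it, padding the right columns with None.
            result.append({**dict.fromkeys(right_cols), **lrow})

    if how in ("right", "full"):
        for rrow in right:
            if rrow[key] not in matched_keys:
                # Unmatched right row: keep it, padding the left columns with None.
                result.append({**dict.fromkeys(left_cols), **rrow})
    return result


# Hypothetical tables: customer 1 has no order, order 4 has no customer.
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}, {"id": 3, "name": "Cy"}]
orders = [{"id": 2, "total": 40}, {"id": 3, "total": 15}, {"id": 4, "total": 99}]

for how in ("inner", "left", "right", "full"):
    print(how, len(join(customers, orders, "id", how)))
```

With this data the inner join keeps only ids 2 and 3, the left outer join also keeps customer 1, the right outer join also keeps order 4, and the full outer join keeps all four ids. Libraries such as pandas expose the same choice through the `how` argument of `DataFrame.merge`.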
Unions
We perform unions when we have datasets that contain unique sets of cases sharing the same or very similar columns.
Great for connecting lists if there are overlaps between rows and shared identifiers.
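A union of datasets with "very similar" columns usually needs the column names aligned before the rows are stacked. The following is a minimal sketch in plain Python; the `union` helper, its `rename` parameter, and the store datasets are hypothetical, and columns missing from a dataset are filled with `None`.

```python
def union(datasets, rename=None):
    """Stack rows from several lists of dict rows.

    `rename` maps alternative column names to a common name (hypothetical helper).
    """
    rename = rename or {}

    # Line up similar columns by giving them the same name, then stack the rows.
    stacked = []
    for rows in datasets:
        for row in rows:
            stacked.append({rename.get(col, col): value for col, value in row.items()})

    # Collect every column seen, in first-seen order.
    columns = []
    for row in stacked:
        for col in row:
            if col not in columns:
                columns.append(col)

    # Give every row the same shape, filling missing columns with None.
    return [{col: row.get(col) for col in columns} for row in stacked]


# Hypothetical datasets with unique cases and nearly identical columns.
store_a = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}]
store_b = [{"customer_id": 3, "city": "Tromso", "segment": "retail"}]

combined = union([store_a, store_b], rename={"customer_id": "id"})
```

The result has three rows with the columns `id`, `city`, and `segment`; the two rows from `store_a` carry `None` for `segment` because that dataset never recorded it.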