Chapter 9: Data Integration
Important Information
Access to additional, relevant data leads to better predictions from algorithms until we reach the point where more observations no longer help detect the signal, that is, the features or conditions that inform the target.
At that point, look for additional features of interest that we do not currently have, which will invariably require integrating data from different sources.
Two Methods of Data Integration
Joins
Access more features
Combines two datasets that share an identity value
Unions
Access more observations
Based on the assumption that the datasets have multiple columns in common. A union stacks the datasets, lining up each column on top of the column that contains similar information.
Joins
Two Types
Inner Join
Each row in the left table is combined horizontally with any row in the right table that has the same identity value.
Most commonly used when dealing with carefully curated databases
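The inner join described above can be sketched with pandas. This is a minimal illustration, not a prescribed method: the tables, the column names, and the choice of `customer_id` as the shared identity value are all hypothetical.

```python
import pandas as pd

# Hypothetical example tables; "customer_id" is the assumed shared identity column.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
orders = pd.DataFrame({
    "customer_id": [2, 3, 3, 4],
    "amount": [20.0, 35.0, 15.0, 50.0],
})

# Inner join: keep only rows whose customer_id appears in both tables.
# A left row is repeated once for every matching right row.
inner = customers.merge(orders, on="customer_id", how="inner")
print(inner)
```

Customer 1 (no orders) and the order for customer 4 (no customer record) are both dropped, while customer 3 appears twice because two orders match.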
Outer Join
Right outer join
This join produces the same result as an inner join, but also adds any rows from the right table that do not have corresponding rows in the left table.
Full outer join
This join produces the same result as an inner join, plus the unmatched rows that a left outer join and a right outer join would each add.
Great for connecting lists if there are overlaps between rows and shared identifiers
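As a rough sketch of a full outer join, the pandas example below connects two partially overlapping lists. The lists, the `email` identifier, and the other column names are hypothetical; `indicator=True` is an optional `merge` argument that records where each row came from.

```python
import pandas as pd

# Two hypothetical contact lists that overlap on one shared identifier.
list_a = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "phone": ["111", "222"],
})
list_b = pd.DataFrame({
    "email": ["b@example.com", "c@example.com"],
    "city": ["Oslo", "Lima"],
})

# Full outer join: matched rows plus the unmatched rows from both sides.
# The "_merge" column marks each row as left_only, right_only, or both.
full = list_a.merge(list_b, on="email", how="outer", indicator=True)
print(full)
```

Rows found in only one list are kept, with missing values filling the columns that the other list would have supplied.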
Left outer join
This join produces the same result as an inner join, but also adds any rows from the left table that do not have corresponding rows in the right table.
Most useful when the goal of a project is to collect as much data as possible
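The left and right outer joins mirror each other, which a small pandas sketch can make concrete. The tables and the `id`, `age`, and `income` columns are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical tables sharing the identity column "id".
left = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 27]})
right = pd.DataFrame({"id": [2, 3, 4], "income": [48000, 61000, 39000]})

# Left outer join: every left row is kept; id 1 has no match, so income is missing.
left_join = left.merge(right, on="id", how="left")

# Right outer join: every right row is kept; id 4 has no match, so age is missing.
right_join = left.merge(right, on="id", how="right")

print(left_join)
print(right_join)
```

Swapping the two tables and switching `how` between `"left"` and `"right"` yields the same rows, so the choice mostly comes down to which table is treated as the primary one.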
Unions
We perform unions when we have datasets that contain distinct sets of cases (rows) sharing the same or very similar columns.
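A union can be sketched in pandas with `pd.concat`, which stacks tables by lining up columns with the same name. The two survey tables and their columns are hypothetical.

```python
import pandas as pd

# Two hypothetical datasets with different cases but the same columns.
survey_2022 = pd.DataFrame({"respondent": ["r1", "r2"], "score": [7, 9]})
survey_2023 = pd.DataFrame({"respondent": ["r3", "r4"], "score": [6, 8]})

# Union: stack the rows, aligning columns by name.
# ignore_index=True renumbers the rows 0..n-1 in the combined table.
combined = pd.concat([survey_2022, survey_2023], ignore_index=True)
print(combined)
```

Note that `pd.concat` keeps duplicate rows; if the datasets might overlap, `combined.drop_duplicates()` is one way to remove repeats. Columns present in only one dataset would be filled with missing values in the rows from the other.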