Chapter 9: Data Integration
Access to additional, relevant data = always better
Joins
More features
Combine datasets based on a shared identifier
Login information, name, address
New rows created with customer information
Unions
More observations
Common columns
Combine customers based on similarity of zip code, company, etc.
Joins (9.1)
Inner
Produces header row and subsequent information
Use for curated databases
Most databases have "referential integrity"
Only accept purchase with customer ID
Left outer
Does the same as inner but includes all customers even if there is no subsequent information
Use to collect as much information about customer database as possible
Right outer
Contains everything in the inner join and adds information that has no matching customer ID
Full outer
Combines customers without information, information without customers, and everything in the inner join
Joins are relational algebra
Example: join customer info, employee info and product info to get a detailed data table
Now we can use machine learning
Adding data horizontally
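The four join types above can be sketched with pandas, assuming pandas is the tool in use (the notes do not name one). The `customers` and `purchases` tables, their column names, and their values are hypothetical, made up only to show which rows each join keeps.

```python
import pandas as pd

# Hypothetical data: customers 1-3 exist; purchases reference customers 1, 1, 2 and 4.
# Customer 3 has no purchases, and purchase 13 points at a customer ID that is not on file.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cho"],
})
purchases = pd.DataFrame({
    "purchase_id": [10, 11, 12, 13],
    "customer_id": [1, 1, 2, 4],
    "amount": [20.0, 35.0, 15.0, 50.0],
})

# Inner: only customers with matching purchases
inner = customers.merge(purchases, on="customer_id", how="inner")
# Left outer: inner rows plus customers with no purchases
left = customers.merge(purchases, on="customer_id", how="left")
# Right outer: inner rows plus purchases with no matching customer
right = customers.merge(purchases, on="customer_id", how="right")
# Full outer: inner rows plus unmatched rows from both sides
full = customers.merge(purchases, on="customer_id", how="outer")

row_counts = {"inner": len(inner), "left": len(left), "right": len(right), "full": len(full)}
print(row_counts)
```

In a database with referential integrity, a row like purchase 13 could not exist, so the inner and right outer joins would return the same rows.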
Unions
Append one dataset to the bottom of another
Example: Colleges have different databases of people for different things, combine to see the entirety of interaction with the university
Appropriate if no overlaps
Same person/ID = problem
Always use for training and test datasets
Undesirable to develop them separately
Likelihood of making a mistake is larger
Imputation is more accurate (average of the whole dataset)
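A minimal sketch of the union workflow described above, again assuming pandas. The `train` and `test` tables, the `income` column, and the `source` marker column are hypothetical names introduced for illustration. The sketch stacks the two sets, checks for overlapping IDs, imputes missing values with the mean of the whole, and then splits the sets apart again.

```python
import pandas as pd

# Hypothetical training and test sets with the same columns
train = pd.DataFrame({"id": [1, 2, 3], "income": [40.0, None, 60.0]})
test = pd.DataFrame({"id": [4, 5], "income": [80.0, None]})

# Union: stack the rows, tagging each row with its origin so the sets can be separated later
combined = pd.concat(
    [train.assign(source="train"), test.assign(source="test")],
    ignore_index=True,
)

# Overlap check: the same ID appearing more than once signals a problem
overlap_ids = combined.loc[combined["id"].duplicated(keep=False), "id"].unique()

# Impute missing income with the average over the whole combined dataset
overall_mean = combined["income"].mean()
combined["income"] = combined["income"].fillna(overall_mean)

# Split back into the original sets, now prepared identically
train_out = combined[combined["source"] == "train"].drop(columns="source")
test_out = combined[combined["source"] == "test"].drop(columns="source")
```

Applying one set of preparation steps to the combined table is what keeps the two sets consistent; the `source` column is only bookkeeping and is dropped before modeling.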