Please enable JavaScript.
Coggle requires JavaScript to display documents.
Chapter 11: Summarization (11.2: Crosstab (Ex: A table of a customer’s…
Chapter 11: Summarization
11.1: Summarize
Also known as
group
or
group by
Provides summary and aggregate information on
existing
columns
Selects one or more columns and creates a virtual “bucket” for each UNIQUE group
Every row belonging to your specified unique group would be placed inside that bucket
Once all the relevant rows are in the bucket, the data within are available to be summarized
This can include functions, such as:
Count
Sum
Min
Max
First
Last
Average
Median
Mode
Standard deviation
Can also summarize by more than one column
Ex: summarizing by a given employee on a specific day
Summarize by EmployeeID and OrderDate
Many more digital buckets would be created
Summarize can run with any column that does not change - such as CustomerID, CompanyName, EmployeeID, etc
These are all ABOVE the desired level of analysis of OrderID
Will ultimately not affect the summarize result
Can be used because orderID, employeeID, and customerID, etc do not change across the board in respect to orderID
Example
: Order Detail
Summarize by
OrderID
(each unique orderID will be placed into its own bucket)
The preferred function (count, sum, min, max, etc) can now be applied across all buckets
“Sum” OrderDetailPrice - creating a new column called OrderCost
Results in a new table with OrderID and Order Cost
Intro
Most data does not focus on the actual unit of analysis
This data must be aggregated or summarized
Happens after tables are joined
Ex customer & customer support
Data is summarized to display how many times the customer called a customer support line
Summarization reduces both the number of columns and the number of rows
11.2: Crosstab
Uses content inside columns to create new columns
Transforms what is currently listed in rows (“skinny” form) to column form
Ex: employee drinks
Old data: name, date, drink type
Creating new columns: name, tea, coffee, energy drink
Function = “count”
Skinny table is a common shape for data to take
Happens with data from sales when one or more customers make many purchases over time
Happens with IoT devices report their status back to their owner every sub-second
Each “event” is stored as its own row
Every user is involved in a number of different “types” of events
Ex
: A table of a customer’s grocery store purchases over time
Exists at the order detail level, instead of the customer level
Want to know what that specific customer put in their shopping cart
Could give insights into a typical customer’s shopping habits
Must join data via cross-tab to do this
Step 1: categorizing products into a specific product category (dairy, produce, condiments, grains, etc)
Step 2: use cross-tab to count how many of each product category a customer has bought during each trip to the grocery store (using orderID as the row of interest)
These two tables can now be combined to create a larger table for machine learning