Chapter 11. Summarization (11.1 Summarize (The following is a list of…
Chapter 11. Summarization
When data is available that
does not focus directly on the right unit, it must be aggregated or summarized.
Summarize is a feature available in most programming tools and is also known under the name group or group by.
When summarizing, one or more columns by
which to group data is selected, essentially creating a virtual “bucket” for each unique group.
It is possible also to summarize by more than one column.
The following is a list of typical functions that can be
applied to data inside each bucket:
In this case, for every Order Detail row, the orders’ individual costs should be summarized into an overall order cost.
With each set of order details placed in the appropriate bucket, the preferred function can now be applied across the buckets, for example, a sum, count, average, max, min, and so on.
The summarize could have been run with any information
above the level of analysis of OrderID.
Where summarize provides summary and aggregate information on existing columns, crosstab uses the content inside columns to create new columns.
Crosstab is a way to deal with data that is currently in “skinny” form and transform what is currently listed in rows to column-form.
This kind of “skinny” data is seldom at the right level for analysis, and crosstab makes data available in an intuitive and readable fashion, sometimes also creating new features for machine learning to better predict a target.
The skinny table is a common shape for data to take.
It happens with data from sales when one or more customers make many purchases over time.
It is usually unknown specifically how data will be combined at the time of database design, so each “event” is stored as its own row in a database table or file.
This means that every customer, device, or user of the Web is involved in a number of different types of events.
Joining data about what kind of products someone placed in their shopping cart could provide access to valuable information about a customer and his or her shopping habits.
In this case, in applying the cross-tab tool, OrderId will become the rows of interest and CategoryName will provide the content for the columns.
Apply the count function again to determine how many of each type of product category were part of each order.
Now the two tables from the chapter may be joined to create a larger table for machine learning.