Summarization (11.1 Summarize)
Feature also known as "group" or "group by"
When summarizing, one or more columns by which to group data are selected, essentially creating a virtual “bucket” for each unique group.
If you are interested in employees, you would summarize by EmployeeID, and in the given case, the nine unique employees in the dataset would each be assigned a bucket. From there, every row belonging to that employee would then be placed inside the bucket
Ex part 2:
Once all the relevant rows for that employee are in the bucket, the data within are available to be summarized
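The bucket idea above can be sketched in a few lines of plain Python. This is only an illustration, not the tool's actual implementation: the rows, the OrderID and Amount columns, and the helper name bucket_rows are all made up for the example; only EmployeeID comes from the notes.

```python
from collections import defaultdict

# Hypothetical order rows; only the EmployeeID column comes from the notes.
orders = [
    {"OrderID": 1, "EmployeeID": 5, "Amount": 440.0},
    {"OrderID": 2, "EmployeeID": 6, "Amount": 1863.4},
    {"OrderID": 3, "EmployeeID": 5, "Amount": 1552.6},
    {"OrderID": 4, "EmployeeID": 9, "Amount": 654.06},
]

def bucket_rows(rows, key):
    """Place each row in the bucket for its value of `key` (hypothetical helper)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row)
    return dict(buckets)

# One bucket per unique EmployeeID, each holding that employee's rows.
buckets = bucket_rows(orders, "EmployeeID")
for employee_id, rows in sorted(buckets.items()):
    print(employee_id, [r["OrderID"] for r in rows])
```

Each key of buckets plays the role of one bucket; nothing has been summarized yet, the rows have only been sorted into groups.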
It is possible to summarize on more than one column
Typical functions that can be applied to data inside each bucket
With each set of order details placed in the appropriate bucket, the preferred function can now be applied across the buckets, for example, a sum, count, average, max, min, and so on
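As a rough sketch of applying functions across buckets, the example below groups hypothetical order-detail rows by OrderID and computes a sum, count, average, max, and min for each bucket. The data, the LineTotal column, and the summarize helper are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical order-detail rows: several lines per order.
order_details = [
    {"OrderID": 10248, "LineTotal": 168.0},
    {"OrderID": 10248, "LineTotal": 98.0},
    {"OrderID": 10248, "LineTotal": 174.0},
    {"OrderID": 10249, "LineTotal": 167.4},
    {"OrderID": 10249, "LineTotal": 1696.0},
]

def summarize(rows, key, value):
    """Group `value` by `key`, then apply common summary functions per bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key]].append(row[value])
    return {
        group: {
            "sum": sum(vals),
            "count": len(vals),
            "average": mean(vals),
            "max": max(vals),
            "min": min(vals),
        }
        for group, vals in buckets.items()
    }

summary = summarize(order_details, "OrderID", "LineTotal")
for order_id, stats in sorted(summary.items()):
    print(order_id, stats)
```

The result has one entry per bucket rather than one per original row, which is the point of summarizing: many detail rows collapse into a single row per group.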
While we may have information directly about the customer, such as age, gender, and personality factors, most information will be about interactions with that customer
When data is available that does not focus directly on the right unit, it must be aggregated or summarized
If the customer called a support line five times, this information will not exist in the customer table, but rather in a customer support table. After joining these together, the data is summarized to get a new column about each customer stating how many times they called the customer support line
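The support-line example can be sketched as follows. The two tables, their columns, and the SupportCallCount name are hypothetical. For brevity the sketch counts calls per customer first and then attaches the count to each customer row, which gives the same result as joining first and summarizing afterwards.

```python
from collections import Counter

# Hypothetical customer table: one row per customer (the unit of analysis).
customers = [
    {"CustomerID": "C1", "Age": 34},
    {"CustomerID": "C2", "Age": 51},
]

# Hypothetical customer support table: one row per call.
support_calls = [
    {"CallID": 1, "CustomerID": "C1"},
    {"CallID": 2, "CustomerID": "C1"},
    {"CallID": 3, "CustomerID": "C1"},
    {"CallID": 4, "CustomerID": "C1"},
    {"CallID": 5, "CustomerID": "C1"},
]

# Summarize the support table to the customer level (count of calls per customer).
calls_per_customer = Counter(call["CustomerID"] for call in support_calls)

# Attach the count as a new column; customers with no calls get 0.
customers_with_calls = [
    {**customer, "SupportCallCount": calls_per_customer[customer["CustomerID"]]}
    for customer in customers
]

for row in customers_with_calls:
    print(row)
```

Keeping every customer, including those who never called, matters here: a count of zero is itself information about that customer.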
Where summarize provides summary and aggregate information on existing columns, crosstab uses the content inside columns to create new columns.
Crosstab is a way to deal with data that is currently in “skinny” form and transform what is currently listed in rows to column-form.
“Skinny” data is seldom at the right level for analysis, and crosstab makes data available in an intuitive and readable fashion, sometimes also creating new features for machine learning to better predict a target
Skinny data happens with data from sales when one or more customers make many purchases over time
Skinny data also happens with Internet of Things devices that report their status back to their owner at sub-second intervals
At the time of database design, it is not known specifically how data will be combined, so each event is stored as its own row in a database table or file
Meaning every customer, device, user of the web is involved in a number of different types of events
Ex: A grocery store customer over time will purchase products from multiple different categories, such as Dairy Products, Produce, Seafood, and so on, but this information does not exist at the individual customer level; it exists at the order detail level
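A minimal sketch of the crosstab idea for the grocery example: skinny order-detail rows (one row per purchase line) are turned into one row per customer with one new column per product category. The rows, the Amount column, and the crosstab helper are hypothetical, and summing the amounts is just one possible choice of value to place in the new columns.

```python
from collections import defaultdict

# Hypothetical skinny data at the order detail level.
order_details = [
    {"CustomerID": "C1", "Category": "Dairy Products", "Amount": 12.5},
    {"CustomerID": "C1", "Category": "Produce", "Amount": 8.0},
    {"CustomerID": "C1", "Category": "Dairy Products", "Amount": 4.5},
    {"CustomerID": "C2", "Category": "Seafood", "Amount": 30.0},
]

def crosstab(rows, row_key, column_key, value_key):
    """Turn the values found in `column_key` into columns, summing `value_key`."""
    categories = sorted({row[column_key] for row in rows})
    # Every customer gets every category column, starting at 0.0.
    table = defaultdict(lambda: dict.fromkeys(categories, 0.0))
    for row in rows:
        table[row[row_key]][row[column_key]] += row[value_key]
    return dict(table)

wide = crosstab(order_details, "CustomerID", "Category", "Amount")
for customer_id, totals in sorted(wide.items()):
    print(customer_id, totals)
```

The category names were content inside a column in the skinny data; after the crosstab they are column names, and each customer occupies exactly one row.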