Lecture 7
Association
What is Association Analysis?
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Example rules
{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality
Examples
In retail market basket analysis
e.g. grocery shops
What items are often bought together
In linguistics
e.g. text analysis
What words often occur together
In recommender applications
e.g. movie rental
What movies are often watched together
Frequent Item-sets
Item-set
A collection of one or more items
Example: {Milk, Bread, Diaper}
Candidate item-sets
All potential combinations of items
K-item-set
An item-set that contains K items
\[\text{Support count }(\sigma)\]
How many times an item-set occurs
\[\text{e.g. }\sigma(\{\text{Milk, Bread, Diaper}\})=2\]
Support (s)
Fraction of all transactions that contain the item-set (support count divided by total number of transactions)
e.g. s({Milk, Bread, Diaper}) = 2/5
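A minimal Python sketch of both measures. The five transactions are an assumption: the classic example table, chosen here because it is consistent with the counts above (σ({Milk, Bread, Diaper}) = 2, s = 2/5).
```python
# Assumed example transactions, consistent with the counts above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma: number of transactions containing every item of the item-set."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """s: fraction of all transactions containing the item-set."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```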
Frequent Item-set
An item-set whose support is greater than or equal to a minimum support threshold
All items in a basket are treated equally; frequent item-sets describe co-occurrence only and give no implication rules
Data mining task
Given a data-set and min support request, find all frequent item-sets
Example application
Movie Rentals
Output: {LOTR1, LOTR2}
Interpretation: LOTR1 and LOTR2 are often rented together
Possible action: company can bundle these products or offer a special pricing if you rent both
Association Rules
Association Rule: if X then Y in a given data-set
or in another form: X -> Y
Example
{Milk, Diaper} -> {Beer}
Evaluation of association rules
Every rule is assessed individually
Two evaluation measures
Support (s) = transactions that contain X and Y / all transactions
X = banana
Y = apple
10 shopping transactions
4 people bought banana and apple in a single transaction
Support = 4/10 = 0.4
Confidence (c) = transactions that contain X and Y / transactions that contain X
X = banana
Y = apple
4 people bought banana and apple in a single transaction
5 people bought banana
Confidence = 4/5 = 0.8
Example
{Milk, Diaper} -> {Beer}
\[s=\frac{2}{5}=0.4\]
\[c=\frac{2}{3}=0.67\]
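A short sketch computing both measures for a rule, reusing transactions, support_count, and support from the sketch above; it reproduces the s = 0.4 and c ≈ 0.67 figures for {Milk, Diaper} -> {Beer}.
```python
def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X and Y together) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y, transactions))    # 0.4
print(confidence(X, Y, transactions))  # 0.666... ~ 0.67
```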
if... then... rules
Data mining task
Given a data-set, min support, and min confidence, find all frequent and confident association rules
Example application
Movie Rentals
Output: {LOTR2} -> {LOTR1}
Interpretation: if a customer rents LOTR2, often LOTR1 is also rented (other way around is not frequent)
Possible action: a reminder that someone renting LOTR2 should be offered to rent LOTR1
Transactional data
Transactional data records individual transaction events
Typical transactions are:
Shopping
Items bought by customers
Financial
Orders
Invoices
Payments
Work
Plans
Activity records
Logistics
Deliveries
Storage records
Travel records
Transactional data is often stored in databases by having a table with the transaction ID as the primary key
Data in this format needs to be converted to one row per transaction before it can be used to learn classifiers, association rules, or clustering algorithms
Product attributes are binary
1 means that the product was in the basket
0 means that the product wasn't in the basket
There is no target label
(unsupervised learning task)
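A minimal pandas sketch of this conversion; the table, its column names, and its values are hypothetical, standing in for a database table keyed by transaction ID.
```python
import pandas as pd

# Hypothetical long-format records: one row per (transaction, item) pair.
records = pd.DataFrame({
    "tid":  [1, 1, 2, 2, 2, 3],
    "item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk"],
})

# One row per transaction, one binary column per product:
# 1 = the product was in the basket, 0 = it was not.
basket = (pd.crosstab(records["tid"], records["item"]) > 0).astype(int)
print(basket)
# item  Beer  Bread  Diaper  Milk
# tid
# 1        0      1       0     1
# 2        1      1       1     0
# 3        0      0       0     1
```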
Algorithms
Mining association rules
Given a transactional data-set the goal of association rule mining is to find all rules that have high support and high confidence
Support ≥ minsup threshold
Confidence ≥ minconf threshold
Brute-force approach
List all possible item-sets, compute support for each
Discard item-sets that are infrequent
Stop here if you only need frequent item-sets, not rules
List all possible association rules from the remaining item-sets
Compute the confidence for each rule
Discard the rules that are not confident
Problem: impractical and often computationally infeasible
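A sketch of the brute force, reusing transactions and support from the earlier sketches: enumerate every non-empty subset of the d items and count its support. This is exactly the work that explodes, since the candidate list doubles with every additional item.
```python
from itertools import combinations

items = sorted({i for t in transactions for i in t})

# Enumerate all 2^d - 1 non-empty candidate item-sets and count each support.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand), transactions)
        if s >= 0.6:  # an example minsup
            frequent[cand] = s
print(frequent)
```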
Examples of rules
{Milk, Diapers} -> {Beer}
s=0.4
c = 0.67
{Milk, Beer} -> {Diapers}
s=0.4
c=1.0
{Diapers, Beer} -> {Milk}
s=0.4
c=0.67
{Beer} -> {Milk, Diapers}
s=0.4
c=0.67
{Diapers} -> {Milk, Beer}
s=0.4
c=0.5
{Milk} -> {Diapers, Beer}
s=0.4
c=0.5
Observations
All the above rules are binary partitions of the same item-set
{Milk, Diapers, Beer}
Rules originating from the same item-set have identical support but have different confidence
Thus, we may decouple the support and confidence requirements
Two-step approach
Frequent Item-set generation
Generate all item-sets whose support is equal to or greater than minsup
Rule generation
Generate high confidence rules from each frequent item-set, where each rule is a binary partitioning of a frequent item-set
Frequent item-set generation is still computationally expensive
Brute force approach (item-sets)
Given d items, there are 2^d possible candidate item-sets
Each item-set in the lattice is a candidate frequent item-set
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width => expensive since M = 2^d
Computational complexity
Given d unique items:
Total number of item-sets = 2^d
Total number of possible association rules:
\[ R = \sum_{k=1}^{d-1}\left[\binom{d}{k} \times \sum_{j=1}^{d-k}\binom{d-k}{j}\right] = 3^d - 2^{d+1} + 1 \]
If d=6, R = 602 rules
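A quick check of the closed form (the sum counts every way to split d items into a non-empty LHS and a non-empty RHS):
```python
from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)  # 602 602
```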
Frequent Item-set generation strategies
Reduce the number of candidates (M)
Complete search: M=2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce the size of N as the size of item-set increases
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction
Apriori Algorithm
Principle
Helps to reduce the number of item-sets to be checked
Mathematically, no item-set can have higher support than any of its subsets
And, if an item-set is frequent, then all of its subsets must also be frequent
Example
If {beer, diapers, milk} is frequent
Then
{beer, diapers} is frequent
{diapers, milk} is frequent
and {beer, milk} is frequent
Suppose we check an item-set, say {A, B}, and find its support is low (below the threshold)
Then no item-set that contains {A, B} needs to be checked; all of its supersets can be skipped (pruned)
Illustrating Apriori principle
[Figure: level-wise support counting with minimum support count = 3]
Items (1-item-sets): count the support of every single item
Pairs (2-item-sets): no need to generate candidates involving Coke or Eggs, which were infrequent as 1-item-sets
Triplets (3-item-sets): generated only from the surviving frequent pairs
Method
Let k=1 be the length of item-sets
Generate all item-sets of length 1
Repeat:
Count the support of each candidate
Eliminate candidates that are infrequent, leaving only those that are frequent
set k=k+1
Generate length k candidate item-sets from length k-1 frequent item-sets
Until no more frequent item-sets are produced (a runnable sketch follows the worked example below)
Example
Set minsup = 0.6
Item-sets of 1:
{Bread} s=0.8
{Milk} s=0.8
{Diaper} s=0.8
{Coke} s=0.4
{Beer} s=0.6
{Eggs} s=0.2
Keep frequent
{Bread} s=0.8
{Milk} s=0.8
{Diaper} s=0.8
{Beer} s=0.6
Item-sets of 2
{Bread, Milk} s=0.6
{Milk, Diaper} s=0.6
{Bread, Diaper} s=0.6
{Bread, Beer} s=0.4
{Milk, Beer} s=0.4
{Diaper, Beer} s=0.6
Keep frequent
{Bread, Milk} s=0.6
{Milk, Diaper} s=0.6
{Bread, Diaper} s=0.6
{Diaper, Beer} s=0.6
Item-sets of 3 (with subset pruning, only {Bread, Milk, Diaper} would actually be generated as a candidate; each of the other triplets contains an infrequent pair)
{Bread, Milk, Beer} s=0.2
{Bread, Milk, Diaper} s=0.4
{Milk, Diaper, Beer} s=0.4
{Bread, Diaper, Beer} s=0.4
Keep frequent
No item-sets over minsup, stop
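A runnable sketch of the level-wise method above (a teaching implementation, not an optimized one), reusing the transactions list from the first sketch. With minsup = 0.6 it reproduces the frequent item-sets of the worked example; note that subset pruning never even generates the three triplets containing an infrequent pair.
```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent item-set mining (teaching sketch)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # k = 1: single items
    frequent, k = {}, 1
    while candidates:
        # Count the support of each candidate in one pass over the data.
        level = {c: sum(c <= t for t in transactions) / n for c in candidates}
        # Eliminate infrequent candidates, keeping only the frequent ones.
        level = {c: s for c, s in level.items() if s >= minsup}
        frequent.update(level)
        # Generate (k+1)-candidates whose k-subsets are all frequent.
        candidates, seen = [], set()
        for a, b in combinations(sorted(level, key=sorted), 2):
            u = a | b
            if (len(u) == k + 1 and u not in seen
                    and all(frozenset(sub) in level
                            for sub in combinations(u, k))):
                candidates.append(u)
                seen.add(u)
        k += 1
    return frequent

for itemset, s in apriori(transactions, minsup=0.6).items():
    print(sorted(itemset), s)
```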
Advantages
Simple to implement
Limitations
Relies on domain expertise for setting good minsup (and minconf)
May generate a large number of item-sets if the requested thresholds are low
WEKA
Associate tab -> Associations -> Apriori
Factors affecting complexity
Choice of minimum support threshold
Lowering support threshold results in more frequent item-sets
This may increase number of candidates and max length of frequent item-sets
Dimensionality (number of items) of the data-set
More space is needed to store support count of each item
If number of frequent items also increases, both computation and I/O costs may also increase
Size of database
Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
Average transaction width
Transaction width increases with denser data-sets
This may increase max length of frequent item-sets (number of subsets in a transaction increases with its width)
Rule generation for Apriori
When a rule generated from an item-set has low confidence, all rules below it in the lattice (those that move more items to the consequent) can be pruned
Rule Generation
\[\text{Given a frequent item-set } L\text{, find all non-empty subsets } f \subset L \text{ such that } f \to L - f \text{ satisfies the minimum confidence requirement}\]
If {A, B, C, D} is a frequent item-set, candidate rules:
ABC -> D
ABD -> C
ACD -> B
BCD -> A
A -> BCD
B -> ACD
C -> ABD
D -> ABC
AB -> CD
AC ->BD
AD -> BC
BC -> AD
BD -> AC
CD -> AB
If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> null and null -> L)
How to efficiently generate rules from frequent item-sets?
In general, confidence does not have an anti-monotone property
c(ABC -> D) can be larger or smaller than c(AB -> D)
But confidence of rules generated from the same item-set has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC -> D) ≥ c(AB -> CD) ≥ c(A -> BCD)
Confidence is anti-monotone with respect to the number of items on the RHS of the rule: all rules from L share the numerator σ(L), and moving items to the RHS shrinks the antecedent, which can only increase σ(LHS) and hence only lower the confidence σ(L)/σ(LHS)
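A sketch of rule generation exploiting this anti-monotone property, reusing support_count and transactions from earlier (the function name and minconf value are illustrative): it grows consequents level by level and stops extending any consequent whose rule already failed minconf.
```python
from itertools import combinations

def gen_rules(L, transactions, minconf):
    """Generate confident rules f -> L - f from a frequent item-set L."""
    sigma_L = support_count(L, transactions)
    rules, consequents = [], [frozenset([i]) for i in L]
    while consequents:
        passed = []
        for Y in consequents:
            X = L - Y
            if not X:
                continue  # skip the null -> L rule
            c = sigma_L / support_count(X, transactions)
            if c >= minconf:
                rules.append((set(X), set(Y), c))
                passed.append(Y)
        # Grow only consequents whose rules passed: if X -> Y failed,
        # every rule with a superset of Y as consequent fails too.
        if passed:
            size = len(passed[0]) + 1
            consequents = list({a | b for a, b in combinations(passed, 2)
                                if len(a | b) == size})
        else:
            consequents = []
    return rules

for X, Y, c in gen_rules(frozenset({"Milk", "Diaper", "Beer"}),
                         transactions, 0.6):
    print(sorted(X), "->", sorted(Y), round(c, 2))
```
On the example data this prints exactly the rules from the list above that reach c = 0.67, and never evaluates consequents grown from {Milk, Beer} or {Diaper, Beer}, whose rules fall below the threshold.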
Pattern evaluation
Association rule algorithms tend to produce too many rules
Many of them are uninteresting or redundant
Redundant if {A, B, C} -> D and {A, B} -> D have same support and confidence
Interestingness measure can be used to prune/rank the derived patterns
In the original formulation of association rules, support and confidence are the only measures used
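A minimal sketch of the redundancy check described above (the tuple layout is an assumption): a rule is dropped when a rule with a strictly smaller antecedent and the same consequent matches its support and confidence.
```python
def prune_redundant(rules):
    """rules: list of (antecedent, consequent, support, confidence) tuples,
    with antecedent and consequent given as frozensets."""
    kept = []
    for X, Y, s, c in rules:
        # Redundant if some simpler rule X2 -> Y (X2 a proper subset of X)
        # has the same support and confidence.
        if not any(X2 < X and Y2 == Y and s2 == s and c2 == c
                   for X2, Y2, s2, c2 in rules):
            kept.append((X, Y, s, c))
    return kept
```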