Lecture 7
Association
What is Association Analysis?
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Example rules
{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality
Examples
In retail market basket analysis
e.g. grocery shops
What items are often bought together
In linguistics
e.g. text analysis
What words often occur together
In recommender applications
e.g. movie rental
What movies are often watched together
Frequent Item-sets
Item-set
A collection of one or more items
Example: {Milk, Bread, Diaper}
Candidate item-sets
All potential combinations of items
K-item-set
An item-set that contains K items
\[\text{Support count }(\sigma)\]
How many times an item-set occurs
\[\text{e.g. }\sigma(\{\text{Milk, Bread, Diaper}\})=2\]
Support (s)
Fraction of all transactions that contain the item-set (support count divided by total number of transactions)
e.g. s({Milk, Bread, Diaper}) = 2/5
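A minimal Python sketch of both measures. The five transactions are an assumption: the classic example table, chosen here because it is consistent with the counts above (σ({Milk, Bread, Diaper}) = 2, s = 2/5).
```python
# Assumed example transactions, consistent with the counts above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma: number of transactions containing every item of the item-set."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """s: fraction of all transactions containing the item-set."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```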
Frequent Item-set
An item-set whose support is greater than or equal to a minimum support threshold
All items in a basket are treated equally; frequent item-sets describe co-occurrence only and give no implication rules
Data mining task
Given a data-set and min support request, find all frequent item-sets
Example application
Movie Rentals
Output: {LOTR1, LOTR2}
Interpretation: LOTR1 and LOTR2 are often rented together
Possible action: company can bundle these products or offer a special pricing if you rent both
Association Rules
Association Rule: if X then Y in a given data-set
or in another form: X -> Y
Example
{Milk, Diaper} -> {Beer}
Evaluation of association rules
Every rule is assessed individually
Two evaluation measures
Support (s) = transactions that contain X and Y / all transactions
X = banana
Y = apple
10 shopping transactions
4 people bought banana and apple in a single transaction
Support = 4/10 = 0.4
Confidence (c) = transactions that contain X and Y / transactions that contain X
X = banana
Y = apple
4 people bought banana and apple in a single transaction
5 people bought banana
Confidence = 4/5 = 0.8
Example
{Milk, Diaper} -> {Beer}
\[s=\frac{2}{5}=0.4\]
\[c=\frac{2}{3}=0.67\]
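A short sketch computing both measures for a rule, reusing transactions, support_count, and support from the sketch above; it reproduces the s = 0.4 and c ≈ 0.67 figures for {Milk, Diaper} -> {Beer}.
```python
def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X and Y together) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y, transactions))    # 0.4
print(confidence(X, Y, transactions))  # 0.666... ~ 0.67
```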
if... then... rules
Data mining task
Given a data-set, min support, and min confidence, find all frequent and confident association rules
Example application
Movie Rentals
Output: {LOTR2} -> {LOTR1}
Interpretation: if a customer rents LOTR2, often LOTR1 is also rented (other way around is not frequent)
Possible action: a reminder that someone renting LOTR2 should be offered to rent LOTR1
Transactional data
Transactional data records individual transaction events
Typical transactions are:
Shopping
Items bought by customers
Financial
Orders
Invoices
Payments
Work
Plans
Activity records
Logistics
Deliveries
Storage records
Travel records
Transactional data is often stored in databases by having a table with the transaction ID as the primary key
Data in this format needs to be converted to one row per transaction before it can be used to learn classifiers, association rules, or clustering algorithms
Product attributes are binary
1 means that the product was in the basket
0 means that the product wasn't in the basket
There is no target label
(unsupervised learning task)
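A minimal pandas sketch of this conversion; the table, its column names, and its values are hypothetical, standing in for a database table keyed by transaction ID.
```python
import pandas as pd

# Hypothetical long-format records: one row per (transaction, item) pair.
records = pd.DataFrame({
    "tid":  [1, 1, 2, 2, 2, 3],
    "item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk"],
})

# One row per transaction, one binary column per product:
# 1 = the product was in the basket, 0 = it was not.
basket = (pd.crosstab(records["tid"], records["item"]) > 0).astype(int)
print(basket)
# item  Beer  Bread  Diaper  Milk
# tid
# 1        0      1       0     1
# 2        1      1       1     0
# 3        0      0       0     1
```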
Algorithms
Mining association rules
Given a transactional data-set the goal of association rule mining is to find all rules that have high support and high confidence
Support ≥ minsup threshold
Confidence ≥ minconf threshold
Brute-force approach
List all possible item-sets, compute support for each
Discard item-sets that are infrequent
Stop here if you only need frequent item-sets, not rules
List all possible association rules from the remaining item-sets
Compute the confidence for each rule
Discard the rules that are not confident
Problem: impractical and often computationally infeasible
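A sketch of the brute force, reusing transactions and support from the earlier sketches: enumerate every non-empty subset of the d items and count its support. This is exactly the work that explodes, since the candidate list doubles with every additional item.
```python
from itertools import combinations

items = sorted({i for t in transactions for i in t})

# Enumerate all 2^d - 1 non-empty candidate item-sets and count each support.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        s = support(set(cand), transactions)
        if s >= 0.6:  # an example minsup
            frequent[cand] = s
print(frequent)
```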
Examples of rules
{Milk, Diapers} -> {Beer}
s=0.4
c = 0.67
{Milk, Beer} -> {Diapers}
s=0.4
c=1.0
{Diapers, Beer} -> {Milk}
s=0.4
c=0.67
{Beer} -> {Milk, Diapers}
s=0.4
c=0.67
{Diapers} -> {Milk, Beer}
s=0.4
c=0.5
{Milk} -> {Diapers, Beer}
s=0.4
c=0.5
Observations
All the above rules are binary partitions of the same item-set
{Milk, Diapers, Beer}
Rules originating from the same item-set have identical support but have different confidence
Thus, we may decouple the support and confidence requirements
Two-step approach
Frequent Item-set generation
Generate all item-sets whose support is equal to or greater than minsup
Rule generation
Generate high confidence rules from each frequent item-set, where each rule is a binary partitioning of a frequent item-set
Frequent item-set generation is still computationally expensive
Brute force approach (item-sets)
Given d items, there are 2^d possible candidate item-sets
Each item-set in the lattice is a candidate frequent item-set
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width => expensive since M = 2^d
Computational complexity
Given d unique items:
Total number of item-sets = 2^d
Total number of possible association rules:
\[ R = \sum_{k=1}^{d-1}\left[\binom{d}{k} \times \sum_{j=1}^{d-k}\binom{d-k}{j}\right] = 3^d - 2^{d+1} + 1 \]
If d=6, R = 602 rules
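A quick check of the closed form (the sum counts every way to split d items into a non-empty LHS and a non-empty RHS):
```python
from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)  # 602 602
```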
Frequent Item-set generation strategies
Reduce the number of candidates (M)
Complete search: M=2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce the size of N as the size of item-set increases
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction
Apriori Algorithm
Principle
Helps to reduce the number of item-sets to be checked
Mathematically, no item-set can have higher support than any of its subsets
And, if an item-set is frequent, then all of its subsets must also be frequent
Example
If {beer, diapers, milk} is frequent
Then
{beer, diapers} is frequent
{diapers, milk} is frequent
and {beer, milk} is frequent
Suppose we check an item-set, say {A, B}, and find its support is low (below the threshold)
Then no item-set that contains {A, B} needs to be checked; all of its supersets can be skipped (pruned)
Illustrating Apriori principle
[Figure: level-wise support counting with minimum support count = 3]
Items (1-item-sets): count the support of every single item
Pairs (2-item-sets): no need to generate candidates involving Coke or Eggs, which were infrequent as 1-item-sets
Triplets (3-item-sets): generated only from the surviving frequent pairs
Method
Let k=1 be the length of item-sets
Generate all item-sets of length 1
Repeat:
Count the support of each candidate
Eliminate candidates that are infrequent, leaving only those that are frequent
set k=k+1
Generate length k candidate item-sets from length k-1 frequent item-sets
Until no more frequent item-sets are produced (a runnable sketch follows the worked example below)
Example
Set minsup = 0.6
Item-sets of 1:
{Bread} s=0.8
{Milk} s=0.8
{Diaper} s=0.8
{Coke} s=0.4
{Beer} s=0.6
{Eggs} s=0.2
Keep frequent
{Bread} s=0.8
{Milk} s=0.8
{Diaper} s=0.8
{Beer} s=0.6
Item-sets of 2
{Bread, Milk} s=0.6
{Milk, Diaper} s=0.6
{Bread, Diaper} s=0.6
{Bread, Beer} s=0.4
{Milk, Beer} s=0.4
{Diaper, Beer} s=0.6
Keep frequent
{Bread, Milk} s=0.6
{Milk, Diaper} s=0.6
{Bread, Diaper} s=0.6
{Diaper, Beer} s=0.6
Item-sets of 3 (with subset pruning, only {Bread, Milk, Diaper} would actually be generated as a candidate; each of the other triplets contains an infrequent pair)
{Bread, Milk, Beer} s=0.2
{Bread, Milk, Diaper} s=0.4
{Milk, Diaper, Beer} s=0.4
{Bread, Diaper, Beer} s=0.4
Keep frequent
No item-sets over minsup, stop
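A runnable sketch of the level-wise method above (a teaching implementation, not an optimized one), reusing the transactions list from the first sketch. With minsup = 0.6 it reproduces the frequent item-sets of the worked example; note that subset pruning never even generates the three triplets containing an infrequent pair.
```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent item-set mining (teaching sketch)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # k = 1: single items
    frequent, k = {}, 1
    while candidates:
        # Count the support of each candidate in one pass over the data.
        level = {c: sum(c <= t for t in transactions) / n for c in candidates}
        # Eliminate infrequent candidates, keeping only the frequent ones.
        level = {c: s for c, s in level.items() if s >= minsup}
        frequent.update(level)
        # Generate (k+1)-candidates whose k-subsets are all frequent.
        candidates, seen = [], set()
        for a, b in combinations(sorted(level, key=sorted), 2):
            u = a | b
            if (len(u) == k + 1 and u not in seen
                    and all(frozenset(sub) in level
                            for sub in combinations(u, k))):
                candidates.append(u)
                seen.add(u)
        k += 1
    return frequent

for itemset, s in apriori(transactions, minsup=0.6).items():
    print(sorted(itemset), s)
```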
Advantages
Simple to implement
Limitations
Relies on domain expertise for setting good minsup (and minconf)
May generate a large number of item-sets if the requested thresholds are low
WEKA
Associate tab -> Associations -> Apriori
Factors affecting complexity
Choice of minimum support threshold
Lowering support threshold results in more frequent item-sets
This may increase number of candidates and max length of frequent item-sets
Dimensionality (number of items) of the data-set
More space is needed to store support count of each item
If number of frequent items also increases, both computation and I/O costs may also increase
Size of database
Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
Average transaction width
Transaction width increases with denser data-sets
This may increase max length of frequent item-sets (number of subsets in a transaction increases with its width)
Rule generation for Apriori
When a rule generated from an item-set has low confidence, all rules below it in the lattice (those that move more items to the consequent) can be pruned
Rule Generation
\[\text{Given a frequent item-set } L\text{, find all non-empty subsets } f \subset L \text{ such that } f \to L - f \text{ satisfies the minimum confidence requirement}\]
If {A, B, C, D} is a frequent item-set, candidate rules:
ABC -> D
ABD -> C
ACD -> B
BCD -> A
A -> BCD
B -> ACD
C -> ABD
D -> ABC
AB -> CD
AC ->BD
AD -> BC
BC -> AD
BD -> AC
CD -> AB
If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> null and null -> L)
How to efficiently generate rules from frequent item-sets?
In general, confidence does not have an anti-monotone property
c(ABC -> D) can be larger or smaller than c(AB -> D)
But confidence of rules generated from the same item-set has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC -> D) ≥ c(AB -> CD) ≥ c(A -> BCD)
Confidence is anti-monotone with respect to the number of items on the RHS of the rule: all rules from L share the numerator σ(L), and moving items to the RHS shrinks the antecedent, which can only increase σ(LHS) and hence only lower the confidence σ(L)/σ(LHS)
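A sketch of rule generation exploiting this anti-monotone property, reusing support_count and transactions from earlier (the function name and minconf value are illustrative): it grows consequents level by level and stops extending any consequent whose rule already failed minconf.
```python
from itertools import combinations

def gen_rules(L, transactions, minconf):
    """Generate confident rules f -> L - f from a frequent item-set L."""
    sigma_L = support_count(L, transactions)
    rules, consequents = [], [frozenset([i]) for i in L]
    while consequents:
        passed = []
        for Y in consequents:
            X = L - Y
            if not X:
                continue  # skip the null -> L rule
            c = sigma_L / support_count(X, transactions)
            if c >= minconf:
                rules.append((set(X), set(Y), c))
                passed.append(Y)
        # Grow only consequents whose rules passed: if X -> Y failed,
        # every rule with a superset of Y as consequent fails too.
        if passed:
            size = len(passed[0]) + 1
            consequents = list({a | b for a, b in combinations(passed, 2)
                                if len(a | b) == size})
        else:
            consequents = []
    return rules

for X, Y, c in gen_rules(frozenset({"Milk", "Diaper", "Beer"}),
                         transactions, 0.6):
    print(sorted(X), "->", sorted(Y), round(c, 2))
```
On the example data this prints exactly the rules from the list above that reach c = 0.67, and never evaluates consequents grown from {Milk, Beer} or {Diaper, Beer}, whose rules fall below the threshold.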
Pattern evaluation
Association rule algorithms tend to produce too many rules
Many of them are uninteresting or redundant
Redundant if {A, B, C} -> D and {A, B} -> D have same support and confidence
Interestingness measure can be used to prune/rank the derived patterns
In the original formulation of association rules, support and confidence are the only measures used
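A minimal sketch of the redundancy check described above (the tuple layout is an assumption): a rule is dropped when a rule with a strictly smaller antecedent and the same consequent matches its support and confidence.
```python
def prune_redundant(rules):
    """rules: list of (antecedent, consequent, support, confidence) tuples,
    with antecedent and consequent given as frozensets."""
    kept = []
    for X, Y, s, c in rules:
        # Redundant if some simpler rule X2 -> Y (X2 a proper subset of X)
        # has the same support and confidence.
        if not any(X2 < X and Y2 == Y and s2 == s and c2 == c
                   for X2, Y2, s2, c2 in rules):
            kept.append((X, Y, s, c))
    return kept
```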