FIM (Introduction (Definition of FIM (Given a database of customer…

- - - - Transaction
        
        The problem of FIM is formally defined as follows. Let I = {i1, i2, …, im} be a set of items (symbols). A transaction database D = {T1, T2, …, Tn} is a set of transactions such that each transaction Tq ⊆ I (1 ≤ q ≤ n) is a set of distinct items, and each transaction Tq has a unique identifier q called its Transaction IDentifier (TID).
        
        Example
        
        This database contains five transactions, where the letters a, b, c, d, e represent items bought by customers. For example, the first transaction T1 represents a customer who has bought the items a, c, and d.
      - Itemset X
        
        An itemset X is a set of items such that X ⊆ I.
        Let |X| denote the set cardinality or, in other words, the number of items in an itemset X. An itemset X is said to be of length k, or a k-itemset, if it contains k items (|X| = k).
        
        Objective of itemset mining
        
        The objective is to discover interesting itemsets in a transaction database, i.e., interesting associations between items.
      - Support
        
        The support (or absolute support) of an itemset X in a database D is denoted as sup(X) and defined as the number of transactions containing X, i.e., sup(X) = |{T | X ⊆ T ∧ T ∈ D}|.
        
        Another definition
        
        Note that some authors prefer to define the support of an itemset X as a ratio. This definition, called the relative support, is relSup(X) = sup(X)/|D|. For example, the relative support of the itemset {a, b} is 0.4.
      - Task of FIM
        
        The task of FIM consists of discovering all frequent itemsets in a given transaction database. An itemset X is frequent if it has a support that is no less than a given minimum support threshold minsup set by the user (i.e., sup(X) ≥ minsup).
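
The definitions above can be illustrated with a minimal Python sketch. Only the first transaction (a, c, d) and the relative support of {a, b} (0.4) come from the notes; the other four transactions and the names `DATABASE`, `support`, `rel_support`, and `is_frequent` are assumptions chosen to be consistent with those two facts.

```python
# Hypothetical five-transaction database keyed by TID. Only T1 = {a, c, d}
# is stated in the notes; T2..T5 are assumed for illustration.
DATABASE = {
    1: {"a", "c", "d"},
    2: {"b", "c", "e"},
    3: {"a", "b", "c", "e"},
    4: {"b", "e"},
    5: {"a", "b", "c", "e"},
}


def support(itemset, database):
    """Absolute support: number of transactions containing every item of itemset."""
    items = set(itemset)
    return sum(1 for transaction in database.values() if items <= transaction)


def rel_support(itemset, database):
    """Relative support: absolute support divided by the number of transactions."""
    return support(itemset, database) / len(database)


def is_frequent(itemset, database, minsup):
    """An itemset is frequent if its support is no less than minsup."""
    return support(itemset, database) >= minsup


print(support({"a", "b"}, DATABASE))                # 2
print(rel_support({"a", "b"}, DATABASE))            # 0.4
print(is_frequent({"b", "e"}, DATABASE, minsup=3))  # True
```

Here `minsup` is an absolute count; a relative threshold would compare `rel_support` against a ratio instead.
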
    - - Objective
        
        The goal is to enumerate all patterns that meet the minimum support constraint specified by the user.
      - Naive approaches have an exponential cost because they consider every possible itemset and only then output those that meet the minimum support constraint specified by the user. With m items, there are 2^m − 1 possible non-empty itemsets.
      - The number of itemsets in the search space generally matters more than the size of the data in FIM
        
        But what influences the number of itemsets in the search space?
        
        The number of itemsets depends on how similar the transactions are in the database, and also on how low the minsup threshold is set by the user.
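
To make the cost of the naive approach concrete, here is a rough sketch that enumerates every non-empty itemset and counts how many it had to explore. The database (apart from its first transaction) and the function name `naive_fim` are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical database: only the first transaction (a, c, d) comes from the notes.
DATABASE = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]


def naive_fim(database, minsup):
    """Test every possible itemset, then keep only the frequent ones."""
    items = sorted(set().union(*database))
    frequent = {}
    explored = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            explored += 1
            sup = sum(1 for t in database if set(candidate) <= t)
            if sup >= minsup:
                frequent[candidate] = sup
    return frequent, explored


frequent, explored = naive_fim(DATABASE, minsup=3)
print(explored)       # 31, i.e. 2**5 - 1 itemsets for 5 items
print(len(frequent))  # 9
```

With 5 items the search space is tiny, but each additional item doubles it, which is why the size of the search space tends to dominate the running time.
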
    - - It is thus necessary to design algorithms that avoid exploring the search space of all possible itemsets and that process each itemset in the search space as efficiently as possible.
      - Example
        
        Apriori
        
        FP-Growth
        
        Eclat
        
        H-Mine
        
        LCM
      - The algorithms differ in
        
        (1) whether they use a depth-first or breadth-first search,
        
        (2) the type of database representation that they use internally or externally,
        
        (3) how they generate or determine the next itemsets to be explored in the search space, and
        
        (4) how they count the support of itemsets to determine if they satisfy the minimum support constraint.
  - - - A breadth-first search algorithm (also called a level-wise algorithm) such as Apriori explores the search space of itemsets by first considering 1-itemsets, then 2-itemsets, then 3-itemsets, and so on up to m-itemsets.
      - Example
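
A level-wise search in the spirit of Apriori can be sketched as follows. This is a simplified illustration, not a faithful or optimized implementation: the database (apart from its first transaction), the function name `apriori_sketch`, and the way candidates are joined are all assumptions.

```python
from itertools import combinations

# Hypothetical database: only the first transaction (a, c, d) comes from the notes.
DATABASE = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]


def apriori_sketch(database, minsup):
    """Breadth-first search: all frequent k-itemsets are found before any (k+1)-itemset."""
    def sup(itemset):
        return sum(1 for t in database if itemset <= t)

    items = sorted(set().union(*database))
    frequent = {}
    level = []
    for item in items:
        single = frozenset([item])
        s = sup(single)
        if s >= minsup:
            frequent[single] = s
            level.append(single)

    k = 2
    while level:
        previous = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets
        # without looking at the database.
        candidates = {x | y for x, y in combinations(level, 2) if len(x | y) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k - 1))}
        level = []
        for c in candidates:
            s = sup(c)  # counting requires scanning the database
            if s >= minsup:
                frequent[c] = s
                level.append(c)
        k += 1
    return frequent


result = apriori_sketch(DATABASE, minsup=3)
print(len(result))                     # 9
print(result[frozenset({"b", "c", "e"})])  # 3
```
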
    - - In contrast, depth-first search algorithms such as FP-Growth, H-Mine, and LCM start from each 1-itemset and then recursively try to append items to the current itemset to generate larger itemsets.
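
The depth-first exploration order can be sketched generically as below. This sketch does not use the data structures of FP-Growth, H-Mine, or LCM; it only shows the recursion that appends items to the current itemset. The database (apart from its first transaction) and the name `dfs_mine` are assumptions.

```python
# Hypothetical database: only the first transaction (a, c, d) comes from the notes.
DATABASE = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]


def dfs_mine(database, minsup):
    """Depth-first search: extend the current itemset before trying its siblings."""
    items = sorted(set().union(*database))
    frequent = {}  # insertion order records the order of discovery

    def sup(itemset):
        return sum(1 for t in database if itemset <= t)

    def extend(prefix, start):
        # Only append items that come later in the order, so that each
        # itemset is generated exactly once.
        for i in range(start, len(items)):
            candidate = prefix | {items[i]}
            s = sup(candidate)
            if s >= minsup:
                frequent[candidate] = s
                extend(candidate, i + 1)

    extend(frozenset(), 0)
    return frequent


result = dfs_mine(DATABASE, minsup=3)
order = ["".join(sorted(itemset)) for itemset in result]
print(order)  # ['a', 'ac', 'b', 'bc', 'bce', 'be', 'c', 'ce', 'e']
```

The printed order shows that {b, c, e} is reached before {b, e} or {c}, which a level-wise search would have visited first.
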
    - - It is important that the algorithm avoids exploring the whole search space of itemsets.
      - To reduce the search space, search space pruning techniques are used.
      - Properties
        
        downward-closure property
        
        anti-monotonicity property
        
        Apriori property
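
The three names above refer to the same property: the support of an itemset can never be higher than the support of any of its subsets, so every superset of an infrequent itemset is also infrequent. The following sketch checks this exhaustively on a small database; the database (apart from its first transaction) and the helper names are assumptions.

```python
from itertools import combinations

# Hypothetical database: only the first transaction (a, c, d) comes from the notes.
DATABASE = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
    {"a", "b", "c", "e"},
]


def support(itemset):
    return sum(1 for t in DATABASE if itemset <= t)


items = sorted(set().union(*DATABASE))
all_itemsets = [frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

# Anti-monotonicity: a proper superset never has a higher support.
violations = [(x, y) for x in all_itemsets for y in all_itemsets
              if x < y and support(x) < support(y)]
print(violations)  # []

# Pruning consequence: {d} is infrequent for minsup = 3, so none of its
# supersets needs to be counted at all.
minsup = 3
d_supersets = [y for y in all_itemsets if frozenset("d") < y]
print(len(d_supersets), any(support(y) >= minsup for y in d_supersets))  # 15 False
```
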
  - - - The first limitation of Apriori is that, because it generates candidates by combining itemsets without looking at the database, it can generate patterns that do not appear in the database at all.
      - The second limitation is that Apriori has to repeatedly scan the database to count the support of candidates, which is very costly.
      - The third limitation is that the breadth-first search approach can be quite costly in terms of memory, as in the worst case it must keep all k-itemsets and (k − 1)-itemsets in memory at any moment (for k > 1).
  - - - First, because Eclat also generates candidates without scanning the database, it can spend time considering itemsets that do not exist in the database.
      - Second, although TID-lists are useful, they can consume a lot of memory, especially for dense datasets (datasets where all items appear in almost all transactions).
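
The TID-list idea can be sketched as follows: each item is mapped to the set of TIDs of the transactions that contain it, and the support of a larger itemset is obtained by intersecting TID-lists instead of rescanning the database. This is a simplified illustration in the style of Eclat, not a reference implementation; the database (apart from its first transaction) and the names `build_tid_lists`, `eclat_sketch`, and `mine` are assumptions.

```python
# Hypothetical database keyed by TID: only T1 = {a, c, d} comes from the notes.
DATABASE = {
    1: {"a", "c", "d"},
    2: {"b", "c", "e"},
    3: {"a", "b", "c", "e"},
    4: {"b", "e"},
    5: {"a", "b", "c", "e"},
}


def build_tid_lists(database):
    """Vertical representation: item -> set of TIDs of transactions containing it."""
    tid_lists = {}
    for tid, transaction in database.items():
        for item in transaction:
            tid_lists.setdefault(item, set()).add(tid)
    return tid_lists


def eclat_sketch(prefix, candidates, minsup, out):
    """candidates: (item, tids) pairs that are frequent when appended to prefix."""
    for idx, (item, tids) in enumerate(candidates):
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support is the length of the TID-list
        extensions = []
        for other, other_tids in candidates[idx + 1:]:
            shared = tids & other_tids  # TID-list of itemset + (other,)
            if len(shared) >= minsup:
                extensions.append((other, shared))
        eclat_sketch(itemset, extensions, minsup, out)


def mine(database, minsup):
    tid_lists = build_tid_lists(database)
    start = [(item, tids) for item, tids in sorted(tid_lists.items())
             if len(tids) >= minsup]
    out = {}
    eclat_sketch((), start, minsup, out)
    return out


result = mine(DATABASE, minsup=3)
print(result[("b", "c", "e")])  # 3
print(sum(len(tids) for tids in build_tid_lists(DATABASE).values()))  # 16
```

The second printed value is the total number of TIDs stored for single items; in a dense dataset most TID-lists approach the size of the database, which is the memory concern raised above.
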
  - - - Is it costly to create all these copies of the original database?
        
        The answer is no if an optimization called pseudo-projection is used, which consists of implementing a projected database as a set of pointers to the original database.
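
One possible way to picture pseudo-projection is sketched below: a projected database is a list of (row, offset) pointers into the original transactions, so no transaction is ever copied. This layout is an assumption made for illustration and is not the exact structure used by any particular algorithm; the database (apart from its first transaction) and the function name `project` are also assumed.

```python
from bisect import bisect_left

# Hypothetical database: only the first transaction (a, c, d) comes from the notes.
# Items are kept sorted inside each transaction so that a suffix is well defined.
TRANSACTIONS = [
    ["a", "c", "d"],
    ["b", "c", "e"],
    ["a", "b", "c", "e"],
    ["b", "e"],
    ["a", "b", "c", "e"],
]


def project(pointers, item, transactions):
    """Return (row, offset) pointers to the suffixes that follow item.

    Nothing is copied: each pointer refers to a position in the original data.
    """
    projected = []
    for row, offset in pointers:
        transaction = transactions[row]
        pos = bisect_left(transaction, item, offset)
        if pos < len(transaction) and transaction[pos] == item:
            projected.append((row, pos + 1))
    return projected


# The unprojected database: every transaction, starting at its first item.
root = [(row, 0) for row in range(len(TRANSACTIONS))]

proj_b = project(root, "b", TRANSACTIONS)
print(proj_b)   # [(1, 1), (2, 2), (3, 1), (4, 2)]

proj_bc = project(proj_b, "c", TRANSACTIONS)
print(proj_bc)  # [(1, 2), (2, 3), (4, 3)]
```

The number of pointers in a projected database equals the support of the corresponding itemset (4 for {b}, 3 for {b, c}), and each projection costs only a few integers per transaction.
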