Please enable JavaScript.
Coggle requires JavaScript to display documents.
FIM (Introduction (Definition of FIM (Given a database of customer…
FIM
Introduction
Foco
This study is a survey that focuses on the discovery of item-sets in databases, a popular data mining task for analyzing symbolic data.
Begin of the theme
in 1993 by Agrawal and Srikant as large itemset mining, but it is nowadays called frequent itemset mining (FIM).
Definition of FIM
Given a database of customer transactions, FIM consists of discovering groups of items (itemsets) that are frequently purchased together by customers.
FIM can be equivalently defined as the task of finding attribute values that frequently co-occur in a database.
Application
FIM has many applications in a wide range of domains, such as bioinformatics, image classification, network traffic analysis, analyzing customer reviews, activity monitoring, malware detection, and e-learning, to name just a few
FIM has also been extended in many ways to address specific needs. For example, some extensions of FIM are to discover rare patterns, correlated patterns, patterns in sequences and graphs, and patterns that generate a high profit.
Implication of the study
This study provides an up-to-date survey that can serve both as an introduction and as a guide to recent advances and opportunities in the field.
Frequent itemset Mining FIM
Problem
Formal definition
Transaction
The problem of FIM is formally defined as follows. Let there be a set of items (symbols) I ={i1, i2, … im}. A transaction database D ={T1, T2 … Tn}is a set of transactions such that each transaction Tq ? I(1 ≤ q ≤ m) is a set of distinct items, and each transaction Tq has a unique identifier q called its Transaction IDentifier (TID).
Exemple
This database contains five transactions, where the letters a, b, c, d, e represent items bought by customers. For example, the first transaction T1 represents a customer that has bought the items a, c,and d.
Itemset X
An itemset X is a set of items such that X ? I.
Let the notation |X| denotes the set cardinality or, in other words, the number of items in an itemset X. An itemset X is said to be of length k or a k-itemset if it contains k items (|X|= k).
Objective of itemset mining
Is to discover interesting itemsets in a transaction database, i.e., interesting associations between items.
Support
The support (or absolute support) of an itemset X in a database D is denoted as sup(X) and defined as the number of transactions containing X, i.e., sup(X)= |{T|X ? T ^ T 2 D}|.
Another definiton
Note that some authors prefer to define the support of an itemset X as a ratio. This definition called the relative sup-port is relSup(X)= sup(X)/|D|. For example, the relative support of the itemset {a, b} is 0.4.
Task of FIM
The task of FIM consists of discovering all frequent itemsets in a given transaction database. An itemset X is frequent if it has a support that is no less than a given minimum support threshold minsup set by the user (i.e., sup(X) ≥ minsup).
Enumeration problem
Objective
The goal is to enumerate all patterns that meet the minimum support constraint specified by the user.
The naive approaches have a exponential cost, because this solution consider all possible itemsets to then output only those meeting the minimum support constraint specified by the user.
The number of itemsets in the search space generally matters more than the size of the data in FIM
But what influences the number of itemsets in the search space?
The number of itemsets depends on how similar the transactions are in the database, and also on how low the minsup threshold is set by the user.
Discover frequent itemsets efficiently
is thus necessary to design algorithms that avoid exploring the search space of all possible itemsets and that process each itemset in the search space as efficiently as possible
Example
Apriori
FP-Growth,
Eclat,
H-Mine
LCM
The algorithms differ in
(1) whether they use a depth-first or breadth-first search,
(2) the type of database repre-sentation that they use internally or externally,
(3) how they generate or determine the next itemsets to be explored in the search space
(4) how they count the support of itemsets to determine if they satisfy the minimum support constraint.
Breadth-First Search and Depth-First Search
Breadth-First Search
A breadth-first search algorithm (also called a level-wise algorithm) such as Apriori explores the search space of itemsets by first consid-ering 1-itemsets, then 2-itemsets, 3-itemsets, and lastly m-itemsets.
Example
Depth-First Search
However, depth-first search algorithms such as FPGrowth, H-Mine, and LCM start from each 1-itemset and then recursively try to append items to the current itemset to generate larger itemsets.
Design an efficient FIM algorithm
Is important that the algorithm avoid exploring the whole search space of itemsets
To reduce the search space, search space pruning techniques are used.
Properties
downward-closure property
anti-monotonicity
property
Apriori property
Apriori: An Horizontal Breadth-First Search Algorithm
The first FIM algorithm
Takes a transaction database and the minsup threshold as input. Apriori uses a standard database representation, called a horizontal database.
Limitations
The first one is that because Apriori generates candidates by combining itemsets without looking at the database, it can generate some patterns that do not even appear in the database.
The second limitation is that Apriori has to repeatedly scan the database to count the support of candidates, which is very costly.
The third limitation is that the breadth-first search approach can be quite costly in terms of memory as it requires at any moment to keep in the worst case all k and k − 1 itemsets in memory (for k > 1).
Eclat: A Vertical Depth-First Search
Algorithm
The Eclat algorithm improves upon the Apriori approach by using a depth-first search to avoid keeping many itemsets in memory.
Use a vertical database representation
This vertical representation can be obtained by scanning the original horizontal database only once.
Vertical algorithms such as Eclat can explore the search space by scanning the database only once to create the initial list of transaction containing the item i TID-lists. Candidate generation and support counting are carried out directly without scanning the database.
The Eclat algorithm is considered to be a depth-first search algorithm as it outputs all frequent itemsets according to the depth-first search order.
Drawbacks
First, because Eclat also generates candidates without scanning the database, it can spend time considering itemsets that do not exist in the database.
Second, although TID-lists are useful, they can consume a lot of memory especially for dense datasets (datasets where all items appear in almost all transactions).
Pattern-Growth Algorithms
The main idea of pattern-growth algorithms is to scan a database to find itemsets, and thus avoid generating candidates that do not appear in the database.
To reduce the cost of scanning the data-base, pattern-growth algorithms have introduced the concept of projected database to reduce the size of databases as an algorithm explore larger itemsets with its depth-first search.
A pattern-growth algorithm explores the search space using a depth-first search by recursively appending items according to the minor order to frequent itemsets, to obtain larger frequent itemsets.
A major advantage of pattern-growth algo-rithms is that they only explore the frequent itemsets in the search space thus avoiding considering many itemsets not appearing in the database, or infrequent itemsets.
The concept of projected database is also useful to reduce the cost of database scans.
Is it costly to create all these copies of the original database?
The answer is no if an optimi-zation called pseudo-projection is used, which consists of implementing a projected database as a set of pointers on the original database.