An introduction to frequent pattern mining research
Updated: Jun 22, 2019
Summary of Apriori, Eclat and FP tree algorithms
What are frequent patterns?
Frequent patterns are collections of items which appear in a data set at an important frequency (usually greater than a predefined threshold)and can thus reveal association rules and relations between variables. Frequent pattern mining is a research area in data science applied to many domains such as recommender systems (what are the set of items usually ordered together), bioinformatics (what are the genes co-expressed in a given condition), decision making, clustering, website navigation.
This section will introduce the recurrent terminology of frequent pattern mining domain.
Input data is usually stored in a database or as a collection of transactions. A transaction is a collection of items which have been observed together (e.g. the list of products ordered by a customer during a shopping session or the list of expressed genes in a condition). Each transaction is identified by a unique transaction id. Typically there are 2 types of architectures (layouts) used for storing such transactional data:
horizontal layout: 2 columns structure in which one column contains transaction ids and the second one the list of associated items, for instance:
transaction1: [item1, item2, item7]
transaction2: [item1, item2, item5]
transactionk: [item1, itemk]
vertical layout: 2 columns structure in which one column contains individual item ids and the second one the associated transaction ids, for instance
item1: [transaction1, transaction5]
item2: [transaction2, transaction4]
A frequent pattern represents a set of items co-occurring in comparatively more transactions, for instance in the horizontal layout example item1 and item2 appear frequently together. This frequency is quantified using the support metric. Itemset support is the number of transactions where the itemset elements appear together divided by the total number of transactions.
Support(item1) = count(item1)/count(all transactions)
Minimum support is a threshold used by the following algorithms in order to discard sets of items from the analysis which don’t appear frequently enough.
The strength of the association rule between 2 items, (for instance item1 and item2) or the association confidence represents the number of transactions containing item1 and item 2 divided by the number of transactions containing item1
Confidence (item1→item2)=count(item1 & item2)/count(item1)
The confidence metric estimates the likelihood that a transaction containing item1 will include also item2.
The lift is the ratio of observing two items together to the likelihood of seeing just one of them. A lift greater than 1 means that items1 and 2 are more likely present together in transactions while values inferior to 1 apply to the cases when the two items are rarely associated.
Lift(item1→item2) = (Confidence (item1→item2))/(Support (item2))
There are many very well written posts on frequent pattern matching topics with well detailed examples. In this post I will give them credit and provide you with links you can follow for more in-depth details on how these algorithms work with step by step examples. I will also provide references to libraries in python if you are looking for an out of the box implementation.
Apriori algorithm uses data organized by horizontal layout. It is founded on the fact that if a subset S appears k times in a database, any other subset S1 which contains S will appear k times or less. This implies that when deciding on a minimum support threshold(minimum frequency an item set needs to have in order to not be discarded)we can avoid calculating S1 or any other superset of S if support(s) < minimum support. It can be said that all such candidates are being discarded a priori.
The algorithm computes the counts for all itemsets of k elements (starting with k = 1). During the next iterations the previous sets are being joined and thus we create all possible k + 1 itemsets. Only the combinations appearing at a frequency inferior to the minimum support rate are being discarded. The iterations end when no further extensions (joins) are being found.
An example of running this algorithm step by step on a dummy data set can be found here.
Apriori algorithm produces a large number of candidates items (with possible duplicates) and performs many database scans (equal to the maximum length of frequent itemset). It is thus very expensive to run on large databases.
Eclat (Equivalence Class Clustering and bottom up Lattice traversal) algorithm uses data organized by vertical layout which associates each element with the list of underlying transactions. In an iterative depth-first search way the algorithm continues by calculating for all combinations of k items (starting from k=1, it calculates all pairs of 2 items) the list of common transactions. In a nutshell, during the k step all combinations of k items are calculated by intersecting the lists of transactions associated with the k-1 itemsets. K will be incremented by 1 each time until no frequent items or no candidate items can still be found.
Eclat algorithm is generally faster than apriori and requires only one database scan which will find the support for all itemsets with 1 element. All k>1 iterations rely only on previously stored data.
An example of running this algorithm step by step on a dummy data set can be found here.Python implementation
FP tree algorithm
FP tree algorithm uses data organized by horizontal layout. It is the most computationally efficient algorithm from the 3 presented in this post. It only performs 2 database scans and keeps the data in an easily exploitable tree structure.
The first database scan sorts all items in the order of their global occurrence in the database (the equivalent of applying a counter to the unraveled data of all transactions). The second pass iterates line by line through the list of transactions and for each transaction it sorts the elements by the global order (previous step corresponding to the first database pass) and introduces them as nodes of a tree grown in depth. These nodes are introduced with a count value of 1. Continuing the iterations, for each line new nodes are being added to the tree at the point where the ordered items differ from the existing tree. If the same pattern already exists, all common nodes will increase their count value by one.
The FP tree can be pruned by removing all nodes having a count value inferior to a minimum threshold occurrence. The remaining tree can be traversed and for instance all paths from the root node to leaves correspond to clusters of frequently occurring items.
An example of running this algorithm step by step on a dummy data set can be found here.Python implementations