Itemset mining
-
itemset patterns (sets that follow itemset structure)
-
Customer -> Browsing + logs
-
Frequently occuring sets (frequet co-occurrences)
-
Helps with shelf placement (bananas and cereal frequently bought together)
Terminlogy
-
I = items (books)
-
T = transaction ids
-
x is an itemset \(x \subseteq I\)
-
\(2^I \) is the powerset (all possible subsets)
-
k itemset \(\Rightarrow |x| = k\), cardinality of x is k
-
Tid set y \(\subseteq T\)
-
sup(x) = number of times x occurs in the database
-
Absolute support : number of times
-
relative support : number of times/n
-
sup(x) = |t(y)|
Example
-
ABCDE
-
BCE
-
ABDEE
-
ADCE
-
ABCDE
-
BDC
Apriori algorithm
-
minsp (s)
-
2I subsets (exponential search space)
-
Naive algorithm: for all x subset of I, compute the support, if support is larger than min support, print
-
\(O(2^{|I|} * |D| * |I|)\)
-
Want to avoid duplicate elimination
-
Prune the space
-
if x is frequent, then any subset of x must be frequent
-
level wise/breadth-first (sum all occurences of substring, increasing the size of each string at each level)
-
O(l * (dadabase size in bytes))
-
O(2l * |D| * l)
Vertical approach
-
log each single string occurance in vertical set
-
Successivley
Projection + prefix tree (FP growth)
-
sort items in decreasing order of support (optional)
-
create a prefix tree with counts
-
This tree is the tree-index
-
Project the tree from the end of each long branch
-
O(2l * |D|)