lec9

Itemset mining

  • Itemset patterns: sets of items that occur together in the data
  • Source data: customers -> browsing logs
  • Frequently occurring sets (frequent co-occurrences)
  • Helps with shelf placement (bananas and cereal frequently bought together)

Terminology

  • I = set of items (e.g., books)
  • T = set of transaction ids (tids)
  • x is an itemset: \(x \subseteq I\)
  • \(2^I\) is the power set of I (all possible itemsets)
  • k-itemset: \(|x| = k\), i.e., the cardinality of x is k
  • y is a tidset: \(y \subseteq T\)
  • t(x) = set of tids of the transactions that contain x
  • sup(x) = |t(x)|, the number of transactions in the database that contain x
  • Absolute support: sup(x) itself
  • Relative support: sup(x)/n, where n is the number of transactions in the database

Example

  • ABCDE
  • BCE
  • ABDE
  • ADCE
  • ABCDE
  • BDC
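
The definitions of t(x) and sup(x) can be checked against the example database with a short sketch. The tids 1..6 (assigned in listing order) and the function names `tidset`, `sup`, and `rsup` are assumptions made for illustration; the notes do not number the transactions.

```python
# Example transactions, keyed by an assumed tid (1..6 in listing order).
# Each transaction is stored as a set, so a repeated item counts only once.
DB = {
    1: set("ABCDE"),
    2: set("BCE"),
    3: set("ABDEE"),
    4: set("ADCE"),
    5: set("ABCDE"),
    6: set("BDC"),
}


def tidset(x, db):
    """t(x): tids of the transactions that contain every item of x."""
    x = set(x)
    return {tid for tid, items in db.items() if x <= items}


def sup(x, db):
    """Absolute support: sup(x) = |t(x)|."""
    return len(tidset(x, db))


def rsup(x, db):
    """Relative support: sup(x) / n, with n the number of transactions."""
    return sup(x, db) / len(db)


print(sorted(tidset("AB", DB)))  # [1, 3, 5]
print(sup("AB", DB))             # 3
print(rsup("BCE", DB))           # 0.5
```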

Apriori algorithm

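  • minsup (s): minimum support threshold; x is frequent if \(sup(x) \geq s\)
  • \(2^{|I|}\) subsets of I (exponential search space)
  • Naive algorithm: for every \(x \subseteq I\), compute sup(x); if sup(x) is at least minsup, print x
  • Cost of the naive algorithm: \(O(2^{|I|} \cdot |D| \cdot |I|)\), where D is the database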
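  • Want to avoid generating the same candidate twice, so that no duplicate elimination is needed
  • Prune the search space
  • If x is frequent, then every subset of x must be frequent; equivalently, if x is infrequent, no superset of x can be frequent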
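  • Level-wise/breadth-first search: count the support of all candidate k-itemsets, then extend the frequent ones to candidates of size k + 1 at the next level
  • I/O cost: O(l * (database size in bytes)), one scan of the database per level
  • Computational cost: \(O(2^{l} \cdot |D| \cdot l)\)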
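
The level-wise search with subset pruning can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the join of pairs of frequent k-itemsets, and the choice minsup = 3 are assumptions. The database is the example from these notes, with each transaction treated as a set.

```python
from itertools import combinations


def apriori(db, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in db]
    # Level 1 candidates: all single items.
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # One pass over the database per level to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(level)
        # Build (k+1)-candidates from pairs of frequent k-itemsets.
        # Storing them in a set means no candidate is counted twice.
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            # Prune: every k-subset of a candidate must itself be frequent.
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


DB = ["ABCDE", "BCE", "ABDE", "ADCE", "ABCDE", "BDC"]
frequent = apriori(DB, minsup=3)
print(len(frequent))                    # 26
print(frequent[frozenset("ADE")])       # 4
print(frozenset("ABC") in frequent)     # False, sup(ABC) = 2
```

Because ABC is infrequent, none of its supersets (ABCD, ABCE, ABCDE) is ever generated as a candidate.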

Vertical approach

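  • Record, for each single item, the tidset of the transactions in which it occurs (vertical format)
  • Successively intersect tidsets to obtain the support of larger itemsets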
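
A minimal sketch of the vertical approach, assuming tids 1..6 in listing order and a depth-first extension order; the names `vertical` and `eclat` are illustrative. The support of an extended itemset is the size of the intersection of the tidsets, so the database is not rescanned.

```python
def vertical(db):
    """Map each single item to its tidset (tids assumed to be 1..n)."""
    tidsets = {}
    for tid, transaction in enumerate(db, start=1):
        for item in set(transaction):
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix, items, minsup, out):
    """Extend prefix by each item in turn, intersecting tidsets.

    items is a list of (item, tidset) pairs, where each tidset is already
    the tidset of prefix + item.
    """
    for i, (item, tids) in enumerate(items):
        if len(tids) < minsup:
            continue  # infrequent, so no extension can be frequent
        new_prefix = prefix + (item,)
        out[new_prefix] = len(tids)
        # Tidsets of new_prefix + other, obtained by intersection.
        extensions = [(other, tids & otids) for other, otids in items[i + 1:]]
        eclat(new_prefix, extensions, minsup, out)
    return out


DB = ["ABCDE", "BCE", "ABDE", "ADCE", "ABCDE", "BDC"]
tidsets = vertical(DB)
print(sorted(tidsets["A"]))                # [1, 3, 4, 5]
print(sorted(tidsets["A"] & tidsets["B"])) # [1, 3, 5]

frequent = eclat((), sorted(tidsets.items()), 3, {})
print(len(frequent))               # 26
print(frequent[("A", "D", "E")])   # 4
```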

Projection + prefix tree (FP growth)

  • sort items in decreasing order of support (optional)
  • create a prefix tree with counts
  • This tree is the tree-index
  • Project the tree from the end of each long branch: for each item, collect the prefix paths that lead to it and mine them recursively
  • \(O(2^{l} \cdot |D|)\)
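
The prefix tree and the projection step can be sketched as below. This is a simplified illustration under stated assumptions: the names `Node`, `build_tree`, and `fp_growth` are hypothetical, ties in support are broken alphabetically, and each projected database is rebuilt into a new tree rather than reusing the original one.

```python
from collections import Counter


class Node:
    """Prefix-tree node: an item, a count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted, minsup):
    """Build a prefix tree from (items, count) pairs.

    Returns (root, header, support); header maps each frequent item to
    the list of tree nodes that carry it.
    """
    support = Counter()
    for items, n in weighted:
        for i in set(items):
            support[i] += n
    # Frequent items only, sorted by decreasing support (ties: alphabetical).
    order = {i: (-s, i) for i, s in support.items() if s >= minsup}
    root, header = Node(None, None), {}
    for items, n in weighted:
        node = root
        for i in sorted((i for i in set(items) if i in order), key=order.get):
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = Node(i, node)
                header.setdefault(i, []).append(child)
            child.count += n
            node = child
    return root, header, support


def fp_growth(weighted, minsup, suffix=(), out=None):
    """Mine frequent itemsets by recursive projection of the prefix tree."""
    out = {} if out is None else out
    _, header, support = build_tree(weighted, minsup)
    for item, nodes in header.items():
        out[frozenset(suffix + (item,))] = support[item]
        # Projection: the prefix path above every node that carries item,
        # weighted by that node's count.
        projected = []
        for node in nodes:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                projected.append((path, node.count))
        fp_growth(projected, minsup, suffix + (item,), out)
    return out


DB = ["ABCDE", "BCE", "ABDE", "ADCE", "ABCDE", "BDC"]
patterns = fp_growth([(t, 1) for t in DB], minsup=3)
print(len(patterns))                 # 26
print(patterns[frozenset("ADE")])    # 4
```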