Please see the source code in the GitHub repository.
Q1. Hash Tree
We use the hash function shown in the figure below.
(a) Please write a program to generate a hash tree with max leaf size 3, output the nested list (or nested dict) of the hash tree hierarchically, and draw the structure of the hash tree (you may write a program to draw it, or draw it manually from the nested list you output).
Build a LinkedList data structure
At the deepest level, a leaf node may hold more than 3 items even though the max leaf size is 3. Therefore, we use a LinkedList to store the items whose index >= 2. We first build the LinkedList structure.
class Node(object):
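The cell above is truncated in this export; a minimal sketch of the linked-list structure it likely defines (class and method names beyond `Node` are assumptions) is:

```python
class Node(object):
    """A singly linked list node holding one candidate itemset."""
    def __init__(self, data=None):
        self.data = data
        self.next = None

class LinkedList(object):
    """Chains the overflow itemsets (index >= 2) of a full deepest-level leaf."""
    def __init__(self):
        self.head = None

    def append(self, data):
        node = Node(data)
        if self.head is None:
            self.head = node
            return
        cur = self.head
        while cur.next is not None:  # walk to the tail
            cur = cur.next
        cur.next = node

    def to_list(self):
        """Collect the stored itemsets into a plain Python list."""
        items, cur = [], self.head
        while cur is not None:
            items.append(cur.data)
            cur = cur.next
        return items
```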
Define the Hash Tree function
import sys
# The itemsets are 3-itemsets, so we build a hash tree with max level 3, starting from depth 0
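The build cell itself is truncated here. A sketch of the insertion logic under stated assumptions (the figure's hash function is taken to be h(x) = x mod 3; interior nodes branch on successive items of the sorted itemset; a full leaf splits until max depth is reached) could look like the following. At the deepest level the report chains overflow items in a LinkedList; this sketch simply appends.

```python
def hash_fn(item):
    # Assumption from the figure: branch on the item value mod 3
    return item % 3

def insert(node, itemset, depth=0, max_leaf=3, max_depth=3):
    """Insert a sorted 3-itemset into a hash tree.

    Interior nodes are dicts mapping hash bucket -> child; leaves are lists.
    """
    key = hash_fn(itemset[depth])
    child = node.setdefault(key, [])
    if isinstance(child, dict):
        # interior node: descend one level
        insert(child, itemset, depth + 1, max_leaf, max_depth)
    elif len(child) < max_leaf or depth == max_depth - 1:
        # room left, or already at the deepest level (report uses a
        # LinkedList for the overflow there; we just append)
        child.append(itemset)
    else:
        # leaf is full: split it into an interior node one level deeper
        grown = {}
        node[key] = grown
        for old in child + [itemset]:
            insert(grown, old, depth + 1, max_leaf, max_depth)
```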
[1, 8, 9]
[3, 4, 6]
[7, 8, 9]
[2, 6, 9]
[4, 5, 6]
[4, 5, 9]
[[[[1, 3, 9], [3, 7, 8]],
[[[1, 2, 3]],
[[3, 4, 8]],
[[1, 4, 5], [1, 4, 6], <__main__.LinkedList at 0x2b9156641d0>]],
[[[1, 5, 7]], [[1, 6, 8]], [[1, 5, 9], [1, 6, 9]]]],
[[[], [[2, 7, 8], [4, 7, 8]], [[2, 3, 9], [2, 7, 9], [4, 7, 9]]],
[[2, 8, 9], [4, 8, 9]],
[[[2, 5, 7], [2, 6, 7]],
[[2, 6, 8], [4, 5, 8]],
[[2, 5, 6], [2, 5, 9], <__main__.LinkedList at 0x2b915664518>]]],
[[[5, 7, 8], [5, 7, 9], [6, 7, 9]],
[[5, 8, 9], [6, 8, 9]],
[[5, 6, 7], [5, 6, 8]]]]
Therefore, we can draw the hash tree below
(b) Given a transaction that contains items {1, 3, 5, 6, 7, 9}, how many comparisons are needed using the hash tree which you generated above? Please circle these candidates in the hash tree. No programming required.
The transaction matches 10 candidates in the hash tree, which are circled. It takes 34 comparisons in total using my generated hash tree:

1 + 35679
13 + 5679    2+2+2+1 = 7
15 + 679     2+1+1 = 4
16 + 79      1+2 = 3
17 + 9       2
3 + 5679
35 + 679     2+1+2 = 5
36 + 79      1+2 = 3
37 + 9       2
5 + 679
56 + 79      1+2 = 3
57 + 9       2
679          3
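As a quick arithmetic check, the per-branch comparison counts above sum to the stated total:

```python
# Comparison counts from each visited branch in the breakdown above
counts = [7, 4, 3, 2, 5, 3, 2, 3, 2, 3]
print(sum(counts))  # 34 comparisons in total
```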
Q2. FP-Tree
Suppose you are given some transactions and a vocabulary that maps terms to indexes. Please use the FP-Tree algorithm to discover the frequent itemsets.
Data Description:
- topic-i.txt
Input file of frequent pattern mining algorithms. Each line represents a transaction with indices of terms.
format: term1_index term2_index term3_index …
Columns are separated by a space.
- vocab.txt
Dictionary that maps term index to term.
format: term_index term
Columns are separated by a space.
- pattern-i.txt:
The file you need to submit, which contains your results for this frequent pattern mining task. Each line represents one frequent itemset; lines are sorted in descending order of support count.
format: support_count term1 term2 …
support_count and term1 are separated by a tab, while terms are separated by a space.
Here we give an example:

233 rule association
230 random
227 finding
203 mining pattern
Questions:
(a) Please write a program to implement the FP-growth algorithm and find all frequent itemsets with support >= 400 in the given dataset.
(b) Based on the result of (a), please print out those FP-conditional trees whose height is larger than 1.
Define preprocessing functions
- read_file(): converts the txt file into a nested list
- create_init_set(): generates the transaction dictionary used as the initial input
- create_fp_tree(): builds the FP-tree
def read_file(file):
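The cell is truncated in this export; a plausible sketch of read_file() together with create_init_set(), with behavior inferred from the data description above (implementation details are assumptions):

```python
def read_file(file):
    """Read a topic file: each line is a transaction of space-separated term indices."""
    with open(file) as f:
        return [line.split() for line in f if line.strip()]

def create_init_set(data):
    """Map each transaction (as a frozenset of indices) to its occurrence count."""
    init_set = {}
    for trans in data:
        key = frozenset(trans)
        init_set[key] = init_set.get(key, 0) + 1
    return init_set
```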
Define the class of TreeNode and make a vocab dict
# Define the vocab_dict which is used to transform index to vocabulary
# Define the structure of TreeNode
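The TreeNode cell is truncated; a minimal sketch of the node structure (attribute and method names are assumptions, chosen to match the header-table and display output shown later) is:

```python
class TreeNode:
    """FP-tree node: item name, support count, and links for traversal."""
    def __init__(self, name, count, parent):
        self.name = name
        self.count = count
        self.parent = parent        # pointer up, used when ascending the tree
        self.children = {}          # item name -> child TreeNode
        self.node_link = None       # next node carrying the same item name

    def inc(self, count):
        self.count += count

    def display(self, depth=1):
        # indented text rendering, in the style of the "Null Set 1" output
        print('  ' * (depth - 1) + '%s %d' % (self.name, self.count))
        for child in self.children.values():
            child.display(depth + 1)
```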
Define FP-growth function
def update_header(node_to_test, target_node):
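Only the signature of update_header survives in this export. A self-contained sketch of the tree-building step is shown below; update_tree and create_fp_tree are assumed names following the text, and a minimal TreeNode is repeated so the snippet runs on its own.

```python
class TreeNode:
    """Minimal FP-tree node (repeated here so the sketch is self-contained)."""
    def __init__(self, name, count, parent):
        self.name, self.count, self.parent = name, count, parent
        self.children = {}
        self.node_link = None

    def inc(self, count):
        self.count += count

def update_header(node_to_test, target_node):
    """Append target_node to the end of an item's node-link chain."""
    while node_to_test.node_link is not None:
        node_to_test = node_to_test.node_link
    node_to_test.node_link = target_node

def update_tree(items, tree, header_table, count):
    """Insert one filtered, frequency-ordered transaction into the tree."""
    first = items[0]
    if first in tree.children:
        tree.children[first].inc(count)
    else:
        tree.children[first] = TreeNode(first, count, tree)
        if header_table[first][1] is None:
            header_table[first][1] = tree.children[first]
        else:
            update_header(header_table[first][1], tree.children[first])
    if len(items) > 1:
        update_tree(items[1:], tree.children[first], header_table, count)

def create_fp_tree(data_set, threshold=400):
    """Two passes: count item supports, then insert ordered transactions."""
    support = {}
    for trans, cnt in data_set.items():
        for item in trans:
            support[item] = support.get(item, 0) + cnt
    header_table = {k: [v, None] for k, v in support.items() if v >= threshold}
    if not header_table:
        return None, None
    root = TreeNode('Null Set', 1, None)
    for trans, cnt in data_set.items():
        frequent = {item: header_table[item][0] for item in trans
                    if item in header_table}
        if frequent:
            # sort by descending support (ties broken by name for determinism)
            ordered = [item for item, _ in
                       sorted(frequent.items(), key=lambda p: (-p[1], p[0]))]
            update_tree(ordered, root, header_table, cnt)
    return root, header_table
```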
Generate FP-Tree from txt file
# Test for topic-0.txt
{frozenset({'248'}): 668, frozenset({'390'}): 624, frozenset({'458'}): 500, frozenset({'298'}): 328, frozenset({'382'}): 324, frozenset({'118'}): 303, frozenset({'390', '382'}): 296, frozenset({'473'}): 217, frozenset({'723'}): 208, frozenset({'225'}): 170, frozenset({'382', '723'}): 123, frozenset({'382', '225'}): 108, frozenset({'473', '248'}): 67, frozenset({'458', '248'}): 52, frozenset({'298', '248'}): 47, frozenset({'390', '248'}): 45, frozenset({'390', '298'}): 38, frozenset({'473', '390'}): 37, frozenset({'118', '248'}): 37, frozenset({'382', '458'}): 33, frozenset({'382', '248'}): 28, frozenset({'118', '390'}): 26, frozenset({'225', '248'}): 23, frozenset({'390', '458'}): 23, frozenset({'390', '225'}): 23, frozenset({'390', '723'}): 23, frozenset({'390', '382', '723'}): 23, frozenset({'298', '458'}): 21, frozenset({'390', '382', '248'}): 20, frozenset({'390', '382', '298'}): 20, frozenset({'118', '723'}): 20, frozenset({'723', '298'}): 18, frozenset({'118', '458'}): 18, frozenset({'473', '382'}): 18, frozenset({'390', '382', '225'}): 15, frozenset({'382', '225', '248'}): 14, frozenset({'473', '118'}): 13, frozenset({'118', '298'}): 13, frozenset({'473', '382', '723'}): 12, frozenset({'118', '390', '382'}): 12, frozenset({'473', '723'}): 11, frozenset({'473', '225'}): 11, frozenset({'382', '723', '458'}): 11, frozenset({'382', '298'}): 11, frozenset({'473', '298'}): 9, frozenset({'458', '298', '248'}): 9, frozenset({'473', '458'}): 9, frozenset({'473', '390', '382'}): 9, frozenset({'225', '298'}): 9, frozenset({'118', '382'}): 9, frozenset({'723', '458'}): 9, frozenset({'473', '382', '225'}): 8, frozenset({'473', '118', '248'}): 8, frozenset({'118', '382', '723'}): 7, frozenset({'723', '248'}): 7, frozenset({'382', '723', '248'}): 7, frozenset({'458', '473', '248'}): 7, frozenset({'473', '390', '248'}): 6, frozenset({'382', '723', '298'}): 6, frozenset({'473', '382', '248'}): 5, frozenset({'118', '225'}): 5, frozenset({'118', '390', '723'}): 4, 
frozenset({'473', '390', '298'}): 4, frozenset({'390', '382', '458'}): 4, frozenset({'225', '723'}): 4, frozenset({'390', '382', '225', '248'}): 4, frozenset({'473', '118', '390'}): 3, frozenset({'118', '382', '458'}): 3, frozenset({'390', '382', '723', '298'}): 3, frozenset({'118', '382', '225'}): 3, frozenset({'473', '298', '458'}): 2, frozenset({'118', '390', '248'}): 2, frozenset({'458', '473', '723', '248'}): 2, frozenset({'382', '225', '298'}): 2, frozenset({'473', '118', '382'}): 2, frozenset({'390', '225', '248'}): 2, frozenset({'390', '723', '458'}): 2, frozenset({'473', '118', '382', '723'}): 2, frozenset({'473', '382', '225', '248'}): 2, frozenset({'473', '390', '382', '723'}): 2, frozenset({'118', '298', '458'}): 2, frozenset({'473', '390', '382', '225'}): 1, frozenset({'473', '723', '298'}): 1, frozenset({'473', '390', '225', '723'}): 1, frozenset({'225', '458'}): 1, frozenset({'118', '382', '248'}): 1, frozenset({'118', '723', '458'}): 1, frozenset({'473', '118', '298'}): 1, frozenset({'118', '390', '298'}): 1, frozenset({'473', '382', '723', '298'}): 1, frozenset({'118', '723', '248'}): 1, frozenset({'390', '723', '298'}): 1, frozenset({'298', '382', '248'}): 1, frozenset({'473', '225', '248'}): 1, frozenset({'225', '382', '723'}): 1, frozenset({'298', '382', '458'}): 1, frozenset({'473', '382', '723', '248'}): 1, frozenset({'225', '382', '723', '298'}): 1, frozenset({'473', '382', '298'}): 1, frozenset({'298', '390', '248'}): 1, frozenset({'118', '225', '248'}): 1, frozenset({'118', '298', '723', '248'}): 1, frozenset({'458', '225', '248'}): 1, frozenset({'390', '723', '248'}): 1, frozenset({'118', '390', '458'}): 1, frozenset({'118', '390', '225'}): 1, frozenset({'382', '248', '723', '473', '390'}): 1, frozenset({'473', '382', '723', '458'}): 1, frozenset({'118', '390', '382', '723'}): 1, frozenset({'225', '723', '298'}): 1, frozenset({'458', '390', '382', '248'}): 1, frozenset({'225', '723', '248'}): 1, frozenset({'473', '390', '723'}): 1, 
frozenset({'298', '390', '473', '248'}): 1, frozenset({'473', '118', '390', '723'}): 1, frozenset({'458', '118', '248'}): 1, frozenset({'473', '118', '723', '248'}): 1, frozenset({'473', '390', '382', '248'}): 1, frozenset({'473', '118', '723'}): 1, frozenset({'473', '118', '225'}): 1, frozenset({'118', '298', '248'}): 1, frozenset({'390', '225', '458'}): 1, frozenset({'458', '382', '248'}): 1, frozenset({'458', '390', '248'}): 1, frozenset({'298', '723', '248'}): 1, frozenset({'473', '382', '458'}): 1, frozenset({'458', '473', '382', '248'}): 1, frozenset({'473', '723', '248'}): 1, frozenset({'118', '723', '298'}): 1, frozenset({'473', '298', '248'}): 1, frozenset({'298', '248', '723', '473', '390'}): 1}
{'390': [1288, <__main__.TreeNode at 0x2b915671518>],
'382': [1163, <__main__.TreeNode at 0x2b915671320>],
'248': [1087, <__main__.TreeNode at 0x2b915671278>],
'458': [720, <__main__.TreeNode at 0x2b915671630>],
'298': [560, <__main__.TreeNode at 0x2b915671588>],
'723': [528, <__main__.TreeNode at 0x2b9156715f8>],
'118': [509, <__main__.TreeNode at 0x2b915671208>],
'473': [488, <__main__.TreeNode at 0x2b915671550>],
'225': [416, <__main__.TreeNode at 0x2b915671400>]}
Begin to mine the FP-tree
To mine the FP-tree, we start from the nodes of a given item and walk toward the root to build its conditional pattern base. We then use this pattern base to generate the FP-conditional tree. Finally, by recursively mining the FP-conditional trees, we obtain the frequent itemset patterns.
# Ascend FP Tree
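The ascend cell is truncated; a sketch of this step (ascend_tree and find_prefix_path are assumed names), which climbs parent pointers and follows node links to collect an item's conditional pattern base:

```python
def ascend_tree(leaf_node, prefix_path):
    """Climb parent pointers from a node toward the root, collecting item names."""
    if leaf_node.parent is not None:
        prefix_path.append(leaf_node.name)
        ascend_tree(leaf_node.parent, prefix_path)

def find_prefix_path(base_item, tree_node):
    """Follow node links of base_item to build its conditional pattern base."""
    cond_patterns = {}
    while tree_node is not None:
        prefix_path = []
        ascend_tree(tree_node, prefix_path)
        if len(prefix_path) > 1:
            # drop the item itself; weight the path by this node's count
            cond_patterns[frozenset(prefix_path[1:])] = tree_node.count
        tree_node = tree_node.node_link
    return cond_patterns
```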
def mine_fp_tree(in_tree, header_table, threshold, pre_fix, freq_item_list, min_height):
def mine_from_txt(topic_file, pattern_file, threshold=400, min_height=1):
Print the FP-conditional trees and write patterns to file
According to the question description, for each topic we need to write a pattern.txt file, and we also need to find the FP-conditional trees whose height is more than 1. Therefore we do the following:
# Process all topic files
processing topic-0.txt
height = 2
list form of cond tree:
['Null Set 1', ['data 413']]
display cond tree
Null Set 1
390 413
frequent items pattern:
[[1288, {'data'}], [1163, {'mining'}], [1087, {'algorithm'}], [720, {'graph'}], [560, {'time'}], [528, {'pattern'}], [509, {'tree'}], [488, {'efficient'}], [416, {'rule'}], [413, {'mining', 'data'}]]
write the pattern in pattern-0.txt
processing topic-1.txt
frequent items pattern:
[[1488, {'learning'}], [1050, {'using'}], [819, {'model'}], [715, {'based'}], [582, {'classification'}], [488, {'feature'}], [474, {'clustering'}], [463, {'network'}]]
write the pattern in pattern-1.txt
processing topic-2.txt
height = 2
list form of cond tree:
['Null Set 1', ['information 475']]
display cond tree
Null Set 1
190 475
frequent items pattern:
[[1226, {'web'}], [1211, {'information'}], [1114, {'retrieval'}], [863, {'based'}], [757, {'system'}], [707, {'search'}], [564, {'document'}], [490, {'language'}], [475, {'information', 'retrieval'}], [421, {'model'}], [414, {'semantic'}]]
write the pattern in pattern-2.txt
processing topic-3.txt
frequent items pattern:
[[1074, {'database'}], [928, {'system'}], [743, {'knowledge'}], [558, {'learning'}], [514, {'data'}], [506, {'logic'}], [493, {'reasoning'}], [446, {'model'}], [426, {'constraint'}]]
write the pattern in pattern-3.txt
processing topic-4.txt
frequent items pattern:
[[1713, {'query'}], [1163, {'database'}], [1040, {'data'}], [767, {'system'}], [637, {'processing'}], [567, {'distributed'}], [528, {'object'}], [508, {'efficient'}], [458, {'xml'}]]
write the pattern in pattern-4.txt