Data Mining Algorithms In R/Sequence Mining/SPADE
Frequent Sequence Mining is used to discover a set of patterns shared among objects which have between them a specific order. For instance, a retail shop may possess a transaction database which specifies which products were acquired by each customer over time. In this case, the store may use Frequent Sequence Mining to find that 40% of the people who bought the first volume of Lord of the Rings came back to buy the second volume a month later. This kind of information may be used to support directed advertising campaigns or recommendation systems.
Another application domain where Frequent Sequence Mining may be used is Web click log analysis in Information Retrieval systems, in which case the system performance may be refined by analyzing the sequence of interactions that the user exposed while searching or browsing for a specific information. This kind of usage becomes especially clear when we consider the huge amount of data obtained by industrial search engines in the form of query logs. Google alone was reported to answer 5.42 billion queries during December 2008 (Telecom Paper, 2009)
In biology Frequent Sequence Mining may be used to extract information hidden in DNA sequences. Sequence databases in biology are often huge and complex due to variations from genetic mutations and evolution (Li et al., 2007). For example, Frequent Sequence Mining can be used to extract patterns which may be determinant to the development of genetic conditions.
A sequence α is an ordered list of events <a1,a2,...,am>. An event is a non-empty unordered set of items ai ⊆ i1,i2,...,ik. A sequence α = <a1,a2,...,am> is a subsequence of β = < b1, b2,...,bn > if and only if exists i1,i2,...,im such that 1 ≤ i1 < i2 < ... < im ≤ n and a1 ⊆ bi1, a2 ⊆ bi2 and am ⊆ bim. Given a sequence database D = s1,s2,...,sn, the support of a sequence α is the number of sequences of D which contains α as a subsequence. If the support of α is bigger than a threshold minsup, then α is a frequent sequence (Peng and Liao, 2009).
Algorithm
editAn algorithm to Frequent Sequence Mining is the SPADE (Sequential PAttern Discovery using Equivalence classes) algorithm. It uses a vertical id-list database format, where we associate to each sequence a list of objects in which it occurs. Then, frequent sequences can be found efficiently using intersections on id-lists. The method also reduces the number of databases scans, and therefore also reduces the execution time.
The first step of SPADE is to compute the frequencies of 1-sequences, which are sequences with only one item. This is done in a single database scan. The second step consists of counting 2-sequences. This is done by transforming the vertical representation into an horizontal representation in memory, and counting the number of sequences for each pair of items using a bidimensional matrix. Therefore, this step can also be executed in only one scan.
Subsequent n-sequences can then be formed by joining (n-1)-sequences using their id-lists. The size of the id-lists is the number of sequences in which an item appears. If this number is greater than minsup, the sequence is a frequent one. The algorithm stops when no frequent sequences can be found anymore. The algorithm can use a breadth-first or a depth-first search method for finding new sequences (Zaki, 2001)
Implementation
editThe first step is to install the arulesSequences package (Buchta and Hahsler, 2010).
> install.packages("arules") > install.packages("arulesSequences")
To illustrate the use of CSPADE, we are going to use an example file that can be found inside the package arulesSequence. This file is listed bellow:
$ cat /usr/local/lib/R/site-library/arulesSequences/misc/zaki.txt 1 10 2 C D 1 15 3 A B C 1 20 3 A B F 1 25 4 A C D F 2 15 3 A B F 2 20 1 E 3 10 3 A B F 4 10 3 D G H 4 20 2 B F 4 25 3 A G H
Each line in the zaki.txt file is an event. The first column is the sequence id, that is, the sequence to which this event belongs. The second column is the event timestamp, which is the moment in time when the event has occurred. The third column is the number n of items in the event, and it is followed by n additional columns, one for each item.
First, we need to load the necessary packages:
> library(Matrix) > library(arules) > library(arulesSequences)
To read the zaki.txt file, issue the following commands:
> x <- read_baskets(con = system.file("misc", "zaki.txt", package = "arulesSequences"), info = c("sequenceID","eventID","SIZE")) > as(x, "data.frame") items sequenceID eventID SIZE 1 {C,D} 1 10 2 2 {A,B,C} 1 15 3 3 {A,B,F} 1 20 3 4 {A,C,D,F} 1 25 4 5 {A,B,F} 2 15 3 6 {E} 2 20 1 7 {A,B,F} 3 10 3 8 {D,G,H} 4 10 3 9 {B,F} 4 20 2 10 {A,G,H} 4 25 3
Next, we execute the CSPADE algorithm:
> s1 <- cspade(x, parameter = list(support = 0.4), control = list(verbose = TRUE))
Note that we executed the cpade algorithm with the data contained in zaki object. We have set the support parameter to 0.4, and also have instructed the algorithm to show a verbose output.
The algorithm output will be the following:
preprocessing ... 1 partition(s), 0 MB [0.046s] mining transactions ... 0 MB [0.022s] reading sequences ... [0.07s] total elapsed time: 0.138s
Visualization
editWe can use the command summary and as to see the results:
cspade> summary(s1) set of 18 sequences with most frequent items: A B F D (Other) 11 10 10 8 28 most frequent elements: {A} {D} {B} {F} {B,F} (Other) 8 8 4 4 4 3 element (sequence) size distribution: sizes 1 2 3 8 7 3 sequence length distribution: lengths 1 2 3 4 4 8 5 1 summary of quality measures: support Min. :0.5000 1st Qu.:0.5000 Median :0.5000 Mean :0.6528 3rd Qu.:0.7500 Max. :1.0000 mining info: data ntransactions nsequences support x 10 4 0.4 cspade> as(s1, "data.frame") sequence support 1 <{A}> 1.00 2 <{B}> 1.00 3 <{D}> 0.50 4 <{F}> 1.00 5 <{A,F}> 0.75 6 <{B,F}> 1.00 7 <{D},{F}> 0.50 8 <{D},{B,F}> 0.50 9 <{A,B,F}> 0.75 10 <{A,B}> 0.75 11 <{D},{B}> 0.50 12 <{B},{A}> 0.50 13 <{D},{A}> 0.50 14 <{F},{A}> 0.50 15 <{D},{F},{A}> 0.50 16 <{B,F},{A}> 0.50 17 <{D},{B,F},{A}> 0.50 18 <{D},{B},{A}> 0.50
This output shows (1) the list of the most frequent isolated items (A,B, ..), (2) the list of the most frequent set of items that occur in events (referred to as elements), (3) the distribution of the sizes of the set of items, (4) the distribution of the number of events in a sequence (referred to as sequence length), (5) the minimum, maximum, mean and median support values, and (6) the set of frequent sequences mined ordered by its support value.
Case Study
editScenario
editIn this case study we analyse the application of the CSPADE algorithm to a Tag Recommendation problem. Tags are keywords assigned by users to items in the context of the Web 2.0. A Tag Recommendation System is used to suggest new tags to users with the objective of enhancing their browsing experience and enrich item description. The dataset used in this case study was obtained from Delicious using its public time-line, which shows bookmarks from all the system users during a given period of time. The SPADE algorithms were executed and some results are presented in the discussion bellow.
Datasets
editThe dataset used in this case study was collected from Delicious in October 2009. We collected periodically the Delicious public time-line which shows the bookmarks from all the system users. Each bookmark consists of a user, the URL which was bookmarked, and a set of tags that user chose to describe the URL.
We show some bookmark examples:
gmenezes http://www.traderslog.com/forum/ 5 education investment forex CFD trading gmenezes http://bikebins.com/index.html 6 pannier bike bicycle cycling bikes commuting osvaldo http://www.noycefdn.org/ecrwresources.php 6 literacy math foundation education webdesign professionaldevelopment
In this example, the user gmenezes has bookmarked the URL http://www.traderslog.com/forum/ and used 5 tags to describe its content: education investment forex CFD trading.
The public time-line shows the system bookmarks in a time sequence, so that we can obtain sequences of bookmarks for specific users. We can use these sequences as input for the CSPADE algorithm. In this case, each bookmark is an event, and each tag is an item. In other words, we will mine temporal patterns with the objective of generating rules that predict useful tags for a specific user by using the wisdom of crowds.
Before applying the algorithm to the data we had to perform some pre-processing in our raw data. First, we extracted a sample, which consisted of 31,289 sequential bookmarks. Next, we performed data cleansing and duplicate removal. We ended up with 7,559 sequential bookmarks. We grouped each user`s bookmarks as sequences and each individual bookmark as an event. For instance, the example above yields:
1 1 5 education investment forex CFD trading 1 2 6 pannier bike bicycle cycling bikes commuting 2 1 6 literacy math foundation education webdesign professionaldevelopment
The dataset used can be downloaded from here.
Execution
editWe used the dataset described above in the experiments. To execute the algorithm we first execute read_baskets() to load the dataset file from disk as temporal transaction data. Note that we need to load the required libraries (as above).
n <- read_baskets(con = system.file("misc", "delicious.sequence", package = "arulesSequences"), info = c("sequenceID","eventID","SIZE"))
We can see data statistics with the command summary():
> summary(n) transactions as itemMatrix in sparse format with 7559 rows (elements/itemsets/transactions) and 7496 columns (items) and a density of 0.0004482878 most frequent items: design tools blog webdesign inspiration (Other) 469 301 233 229 220 23949 element (itemset/transaction) length distribution: sizes 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 2283 1432 1172 825 560 343 230 273 171 100 60 34 25 14 5 5 17 18 19 20 5 7 8 7 Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 1.00 3.00 3.36 4.00 20.00 includes extended item information - examples: labels 1 | 2 - 3 , includes extended transaction information - examples: sequenceID eventID SIZE 1 1 1 2 2 2 1 5 3 3 1 3
Next, we execute cspade() to generate the rules. We have set a support of 0.002 to obtain a larger number of patterns.
s1 <- cspade(n, parameter = list(support = 0.002), control = list(verbose = TRUE))
Caveats
editThe use of higher support levels in this dataset does not generate any rule and can cause non-intuitive errors. For example, if the support is set to 0.1, the system outputs:
> s1 <- cspade(n, parameter = list(support = 0.1), control = list(verbose = TRUE)) parameter specification: support : 0.1 maxsize : 10 maxlen : 10 algorithmic control: bfstype : FALSE verbose : TRUE summary : FALSE preprocessing ... 1 partition(s), 0.18 MB [0.05s] mining transactions ... can't open data file: No such file or directory Error in cspade(n, parameter = list(support = 0.1), control = list(verbose = TRUE)) : system invocation failed
Output
editWe can see the generated rules by issuing the command as().
as(s1, "data.frame") (...) 845 <{webdesign},{design}> 0.004675877 846 <{inspiration,webdesign},{design}> 0.001912859 847 <{design,webdesign},{design}> 0.004250797 848 <{design,typography},{design}> 0.002337938 849 <{design,tools},{design}> 0.002337938 850 <{inspiration},{inspiration},{design}> 0.001912859 851 <{inspiration},{design},{design}> 0.002125399 852 <{design,inspiration},{design}> 0.004675877 853 <{design},{inspiration},{design}> 0.001912859 854 <{inspiration,art},{design}> 0.001912859 855 <{design,inspiration},{inspiration},{design}> 0.001912859 856 <{design},{design,inspiration},{design}> 0.001912859 857 <{design},{design},{design}> 0.004250797 858 <{design,blog},{design}> 0.002337938 859 <{design,art},{design}> 0.002550478 860 <{design},{design},{design},{design}> 0.002337938 861 <{design},{design,blog}> 0.001912859 862 <{design,art,blog}> 0.001912859 863 <{design},{design,art}> 0.002763018 864 <{art},{design,art}> 0.002125399 865 <{culture,art}> 0.001912859 866 <{css},{css}> 0.001912859 867 <{design},{css}> 0.002125399 868 <{blog,blogs}> 0.003400638 869 <{blog,blogging}> 0.001912859 (...)
To see the complete set of rules, download it from here.
Analysis
editWe have observed that CSPADE found many trivial sequences from user behaviour. For example, it has found many unitary sequences, such as <{design}>, <{ajax}>, <{css}>, among others. These unitary sequences are really frequently used, but they may not be useful in the particular application, which is Tag Recommendation.
Furthermore, other trivial sequences were found, such as <{design}.{design}> and <{webdesign},{design}>. These sequences indicates that the same users tend to bookmarks pages in the same subject subsequently. However, some interesting patters were also found. We can cite <{library},{books}>, <{javascript},{ajax}> and <{video},{youtube}>.
We can also observe that many frequent patterns are related to design, art and web_development. These tags are also the most popular tags in the whole Delicious system, as can be seen here.
References
edit- ^ Buchta, C., Hahsler, M., 2010. "arulesSequences: Mining frequent sequences". R package version 0.1-11. http://CRAN.R-project.org/package=arulesSequences
- ^ Li, K.-C., Chang, D.-J., Rouchka, E. C., Chen, Y. Y., 2007. "Biological sequence mining using plausible neural network and its application to exon/intron boundaries prediction". In: CIBCB. IEEE, pp. 165–169.
- ^ Peng, W.-C., Liao, Z.-X., 2009. "Mining sequential patterns across multiple sequence databases". Data Knowl. Eng. 68 (10), 1014–1033.
- ^ Telecom Paper, January 2009. "Google query volume".
- ^ Zaki, M. J., 2001. "Spade: An efficient algorithm for mining frequent sequences". In: Machine Learning. Vol. 42. pp. 31–60.