Data Mining Algorithms In R/Frequent Pattern Mining/arulesNBMiner
Introduction
editThe technique to be discussed in this chapter is used in frequent itemset mining. There are several situations which people are interested in co-occurrence of two or more items of a set. It is important to establish which items co-occur, since association rules can be extracted from frequent itemsets [1]. A typical example of an application is a supermarket where one discovers customers who buy meat and beer also tend to buy coal. Thus, a frequent item set would be meat-beer-coal and an association rule would be that, in general, customers who buy meat and beer have more chances to buy coal.
Many works have dealt with the problem of frequent itemset mining. Most of them show the necessity of a min_support threshold, which is an itemset minimum frequency in the data and in general defined by the miner user. In addition, these studies have as goal to mine a complete set of frequent itemsets that satisfy min_support [2]. The application of a minimum support results in several assumptions which are rarely discussed or verified. One assumption is that items occur in the database following a possibly unknown but stable process, and that the items occur in the database with roughly similar frequencies. Nevertheless, in the real world, transactions data have a frequency distribution highly skewed with most items occurring infrequently, while just some of them occur with high frequency. In a database where this phenomenon happens, interesting patterns are not found since some of the associated items are too infrequent to satisfy the user-specified minimum support [3].
Some algorithms such as TFP were developed in such a way that a user does not need to determine a min_support. However, he needs to inform the minimum size of itemsets (min_l) and the number of itemsets (k) he wishes to mine. Furthermore, the TFP algorithms only mine frequent closed itemsets.[2]. Again, the user has a parameter (min_l) which he should specify before mining the data, which is a subtle decision.
Therefore, this chapter presents an algorithm, which is implemented in an R-package and uses a simple stochastic model (Negative Binomal model or NB-model) to estimate a minimum support, utilizing knowledge of the process which generates transaction data, and allows for highly skewed frequency distributions. The name of the package is arulesNBMiner, which is a Java implementation of a depth first search algorithm to mine NB-frequent itemsets of NB-precise rules [4]. The algorithm utilizes the information contained in its own data structure to estimate the minimum support. It then uses a precision limit to estimate the min_support for each k-itemset, plus one extension it calculates at a different minimum support.
Technique to be discussed
editAlgorithm
editFunction NBSelect 1. rmax = max(c.count) in L 2. rrescale = sum(c.count) in L 3. for each tuple [c,c.count] ∈ L do nobs [c.count]++ 4. do 5. 6. while (precision ≥ π ∧ (p−−) > 0) 7. return {c| [c,c.count] ∈ L ∧ c.count > p} where l = the pattern for which candidate items are selected L = a data structure which holds count information for items which co-occur with pattern l; we use tuples c, c.count , where c represents a candidate item and c.count the count. n = the total number of available items = estimated parameters for the data set π = user-specified precision threshold
Implementation (description of used modules and packages)
editThe first thing one has to do to use NBMiner is installing the NBMiner package. This package has three dependencies packages: arules (http://cran.r-project.org/web/packages/arules/index.html) which provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules), Matrix (http://cran.r-project.org/web/packages/Matrix/index.html) which is classes and methods for dense and sparse matrices and operations on them using Lapack and SuiteSparse, and lettice (http://cran.r-project.org/web/packages/lattice/index.html) which is a framework for data visualization developed at the Bell Labs. Besides, the user has to install on computer the Sun Java(TM) Development Kit (JDK) 6, since the NBMiner uses a package called rJava which is part of JDK. This must be installed in the user operational system.
Installation of the dependencies packages can be performed within the R environment using the function "install.packages(“package name”)". The name of NBMiner package is arulesNBMiner.
install.packages(“arulesNBMiner”)
After the user installed the necessaries packages, he must load them. This can be done using the function "library(package name)".
library(arulesNBMiner)
However, before we show how to use the NBMiner package, it is necessary to show how to load the data. Here, we are using a data on csv format. In this format each line is a transaction like the example below:
Employment | Education | Marital | Occupation | Sex | Accounts |
---|---|---|---|---|---|
Private | College | Separated | Service | Female | USA |
Private | Associate | Unmarried | Transport | Male | Jamaica |
Private | HSgrad | Divorced | Clerical | Male | USA |
Private | Bachelor | Civil | Repair | Male | USA |
Private | College | Civil | Executive | Male | USA |
Private | HSgrad | Civil | Service | Male | USA |
Using the function "read.table()" on R environment, we are able to load the data. This function creates a "data.frame" object. Its syntax is:
object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”)
The argument header indicates the data has or does not a header while the sep indicates which character is being used to separate the elements. Later, the user has to transform this data frame object in a transaction object. For that, he can use the function "as()". Its syntax is:
object name<-as(data frame object,”transactions”)
It is important to mention the user has to load the NBMiner package before he uses the function as with parameter transaction. Furthermore, there is another data format that some R functions can read and transform it in a transaction object. The user can utilize a binary format like the example below:
1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Here, each line represents a transaction and each column represents a term of the database. The number 1 means presence of that term in that transaction, while 0 means no presence. To read this data, user can first utilize the function "scan()". It creates a vector object, and its syntax is:
object name<-scan(“file name”,sep=”,”)
After that, it is necessary to transform the vector object in a matrix object. This is performed using the function "dim()". Its syntax is:
dim(vector object)<-c(dim1,dim2)
Dim1 and Dim2 are the dimensions of the matrix which the user want creating. If the user wants, he can name each row and column of the matrix using the function "dimnames()". Its syntax is:
dimnames(matrix object)<-list(pridim=c(names),segdim=c(names))
Where pridim is a vector with first dimension names, and segdim is a vector with second dimension names. Finally, this object can be transformed in transaction object using the function "as()". Its syntax is:
object name<-as(matrix object,”ItemMatrix”)
Soon after the user loaded the package and the data, it is necessary to create the parameters object using the function of NBMiner called "NBMinerParameters()". This function reads the data and extracts from them some parameters which will be used in the next step. The syntax of this function is:
object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE) where: # = numbers data object = the data as an object of class transaction pi = precision threshold theta = pruning parameter maxlen = maximal number of items in found itemsets (default = 5) minlen = minimum number of items in found itemsets (default = 1) trim = fraction of incidences to trim off the tail of the frequency distribution of the data verbose = use verbose output for the estimation procedure plot = plot the model rules = mine NB-precise rules instead of NB-frequent itemsets
After the user perform this procedure, he is able to run the algorithm to mine his data. For that, one will use the function called "NBMiner()". Its syntax is:
object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE))
If the user uses the option rules=TRUE when he created the object parameter, the algorithm will mine the NB-precise rules. Otherwise, the same algorithm will mine the NB-frequent itemsets.
Finally, we summarized the necessaries commands to use the NBMiner package for mining tasks using the first data structure mentioned above, these are below:
library(package name) object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”) object name<-as(data frame object,”transaction”) object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE) object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE
Visualization
editTo visualize data information contained in a “data.frame” object just use the object name. Here, we use table as object name:
table = read.table("data",sep=",",header=TRUE) table
# | Employment | Education | Marital | Occupation | Sex | Accounts |
---|---|---|---|---|---|---|
1 | Private | College | Separated | Service | Female | USA |
2 | Private | Associate | Unmarried | Transport | Male | Jamaica |
3 | Private | HSgrad | Divorced | Clerical | Male | USA |
4 | Private | Bachelor | Civil | Repair | Male | USA |
5 | Private | College | Civil | Executive | Male | USA |
6 | Private | Hsgrad | Civil | Service | USA |
However, when the same principle is used in “transactions” format, we have just the number of transactions and items generated by the "as()" function:
tableT = as(table,"transactions") tableT transactions in sparse format with 6 transactions (rows) and 18 items (columns)
Use “inspect()” function to visualize the result of the conversion to the “transactions” format:
inspect(tableT)
# | items | transactionID |
---|---|---|
1 | {Employment=Private,Education=College,Marital=Separated,Occupation=Service,Sex=Female,Accounts=USA} | 1 |
2 | {Employment=Private,Education=Associate,Marital=Unmarried,Occupation=Transport,Sex=Male,Accounts=Jamaica} | 2 |
(...) |
To summarize the data information in "transaction" format use "summary()".
summary(tableT) transactions as itemMatrix in sparse format with 6 rows (elements/itemsets/transactions) and 18 columns (items) and a density of 0.3333333 most frequent items: Employment=Private Sex=Male Accounts=USA 6 5 5 Marital=Civil Education=College (Other) 3 2 15 element (itemset/transaction) length distribution: sizes 6 6 Min. 1st Qu. Median Mean 3rd Qu. Max. 6 6 6 6 6 6 includes extended item information - examples: labels variables levels 1 Employment=Private Employment Private 2 Education=Associate Education Associate 3 Education=Bachelor Education Bachelor includes extended transaction information - examples: transaction ID 1 1 2 2 3 3
To see labels of the items generated in the conversion to transaction format use “itemInfo()”:
itemInfo(tableT) labels variables levels 1 Employment=Private Employment Private 2 Education=Associate Education Associate 3 Education=Bachelor Education Bachelor 4 Education=College Education College 5 Education=HSgrad Education HSgrad 6 Marital=Civil Marital Civil 7 Marital=Divorced Marital Divorced 8 Marital=Separated Marital Separated 9 Marital=Unmarried Marital Unmarried 10 Occupation=Clerical Occupation Clerical 11 Occupation=Executive Occupation Executive 12 Occupation=Repair Occupation Repair 13 Occupation=Service Occupation Service 14 Occupation=Transport Occupation Transport 15 Sex=Female Sex Female 16 Sex=Male Sex Male 17 Accounts=Jamaica Accounts Jamaica 18 Accounts=USA Accounts USA
Use “labels()” to see only the labels of the itens and transactions:
labels(tableT) $items [1] "Employment=Private" "Education=Associate" "Education=Bachelor" [4] "Education=College" "Education=HSgrad" "Marital=Civil" [7] "Marital=Divorced" "Marital=Separated" "Marital=Unmarried" [10] "Occupation=Clerical" "Occupation=Executive" "Occupation=Repair" [13] "Occupation=Service" "Occupation=Transport" "Sex=Female" [16] "Sex=Male" "Accounts=Jamaica" "Accounts=USA" $transactionID [1] "1" "2" "3" "4" "5" "6"
To see NB mined results use the same commands above mentioned to “transactions” data:
paramA <- NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5) tableNB <- NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE)) inspect(head(tableNB)) lhs rhs precision 1 {Education=HSgrad, Sex=Male} => {Employment=Private} 0.9991467 2 {Sex=Male, Accounts=USA} => {Marital=Civil} 0.9999421 3 {Sex=Male, Accounts=USA} => {Education=HSgrad} 0.9977636 4 {Education=College, Accounts=USA} => {Employment=Private} 0.9982934 5 {Marital=Civil, Accounts=USA} => {Sex=Male} 0.9999752 6 {Employment=Private, Sex=Male} => {Accounts=USA} 0.9999948
Where "lhs" is the antecedent of the rules and de "rhs" is the consequent of the rules.
We can also show the data distribution in the space with “image()” function:
image(tableNB)
Case study
editScenario
editIt contains artificial data created by Graham Williams (developer of Rattle) and is supplied with Rattle. To quote from the Rattle documentation: "It consists of 2,000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded."
Input data no longer exists to download
editAvailable on http://cs.anu.edu.au/Student/comp3420/mining/audit.csv.
Execution
edittable = read.table("audit.csv",sep=",",header=TRUE) tableT = as(table,"transactions") paramA = NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5) transNB = NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE))
Output
edittransNB set of 18158 rules
tableNB = as(transNB,"data.frame") write.table(tableNB,file="auditNB.csv",sep=",")
Analysis
editrules | consequent | precedent A | precedent B | precedent C | precedent D |
---|---|---|---|---|---|
92 | Accounts=China | Employment=PSState | Education=Master | Occupation=Professional | |
4721 | Accounts=China | Employment=PSState | Education=Master | ||
4762 | Accounts=China | Employment=PSState | Occupation=Professional | ||
4857 | Accounts=China | Education=Master | Marital=Civil | Occupation=Professional | Sex=Male |
5871 | Accounts=China | Education=Master | Occupation=Professional | ||
6131 | Accounts=China | Marital=Civil | Occupation=Professional | ||
10269 | Accounts=China | Education=Master | Occupation=Professional | Sex=Male | |
10386 | Accounts=China | Education=Master | Marital=Civil | Occupation=Professional | |
14791 | Accounts=China | Marital=Civil | Occupation=Professional | Sex=Male |
Based on the data above we can note that when the Marital state is “Civil” and the "Occupation" is "Professional" we have an "Chinese" account.
References
edit- [1]Mohammed j. Zaki and Wagner Meira Jr. Fundamentals of Data Mining Algorithms. Chaper 11 – Itemset Mining, 74-93
- [2]Jianyong Wang, Jiawei Han, Ying Lu and Petre Tzvetkov. TFP: An efficient algorithm for mining Top-K frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 17(5):652-665, May 2005
- [3]Michael Hahsler. A model-based frequency constraint for mining associations from transaction data. Data Mining and Knowledge Discovery, 13(2):137-166, September 2006.
- [4]http://cran.fiocruz.br/web/packages/arulesNBMiner/index.html