Data Mining Algorithms In R/Frequent Pattern Mining/arulesNBMiner

Introduction

edit

The technique to be discussed in this chapter is used in frequent itemset mining. There are several situations which people are interested in co-occurrence of two or more items of a set. It is important to establish which items co-occur, since association rules can be extracted from frequent itemsets [1]. A typical example of an application is a supermarket where one discovers customers who buy meat and beer also tend to buy coal. Thus, a frequent item set would be meat-beer-coal and an association rule would be that, in general, customers who buy meat and beer have more chances to buy coal.

Many works have dealt with the problem of frequent itemset mining. Most of them show the necessity of a min_support threshold, which is an itemset minimum frequency in the data and in general defined by the miner user. In addition, these studies have as goal to mine a complete set of frequent itemsets that satisfy min_support [2]. The application of a minimum support results in several assumptions which are rarely discussed or verified. One assumption is that items occur in the database following a possibly unknown but stable process, and that the items occur in the database with roughly similar frequencies. Nevertheless, in the real world, transactions data have a frequency distribution highly skewed with most items occurring infrequently, while just some of them occur with high frequency. In a database where this phenomenon happens, interesting patterns are not found since some of the associated items are too infrequent to satisfy the user-specified minimum support [3].

Some algorithms such as TFP were developed in such a way that a user does not need to determine a min_support. However, he needs to inform the minimum size of itemsets (min_l) and the number of itemsets (k) he wishes to mine. Furthermore, the TFP algorithms only mine frequent closed itemsets.[2]. Again, the user has a parameter (min_l) which he should specify before mining the data, which is a subtle decision.

Therefore, this chapter presents an algorithm, which is implemented in an R-package and uses a simple stochastic model (Negative Binomal model or NB-model) to estimate a minimum support, utilizing knowledge of the process which generates transaction data, and allows for highly skewed frequency distributions. The name of the package is arulesNBMiner, which is a Java implementation of a depth first search algorithm to mine NB-frequent itemsets of NB-precise rules [4]. The algorithm utilizes the information contained in its own data structure to estimate the minimum support. It then uses a precision limit to estimate the min_support for each k-itemset, plus one extension it calculates at a different minimum support.

Technique to be discussed

edit

Algorithm

edit
Function NBSelect 
1. rmax = max(c.count) in L
2. rrescale = sum(c.count) in L
3. for each tuple [c,c.count] ∈ L do nobs [c.count]++
4. do
5.  
6. while (precision ≥ π ∧ (p−−) > 0)
7. return {c| [c,c.count] ∈ L ∧ c.count > p}
where
l = the pattern for which candidate items are selected
L = a data structure which holds count information for items which co-occur with pattern l; 
we use tuples c, c.count , where c represents a candidate  item and c.count the count.
n = the total number of available items
  = estimated parameters for the data set
π = user-specified precision threshold

Implementation (description of used modules and packages)

edit

The first thing one has to do to use NBMiner is installing the NBMiner package. This package has three dependencies packages: arules (http://cran.r-project.org/web/packages/arules/index.html) which provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules), Matrix (http://cran.r-project.org/web/packages/Matrix/index.html) which is classes and methods for dense and sparse matrices and operations on them using Lapack and SuiteSparse, and lettice (http://cran.r-project.org/web/packages/lattice/index.html) which is a framework for data visualization developed at the Bell Labs. Besides, the user has to install on computer the Sun Java(TM) Development Kit (JDK) 6, since the NBMiner uses a package called rJava which is part of JDK. This must be installed in the user operational system.

Installation of the dependencies packages can be performed within the R environment using the function "install.packages(“package name”)". The name of NBMiner package is arulesNBMiner.

install.packages(“arulesNBMiner”)	

After the user installed the necessaries packages, he must load them. This can be done using the function "library(package name)".

library(arulesNBMiner)	

However, before we show how to use the NBMiner package, it is necessary to show how to load the data. Here, we are using a data on csv format. In this format each line is a transaction like the example below:

Employment Education Marital Occupation Sex Accounts
Private College Separated Service Female USA
Private Associate Unmarried Transport Male Jamaica
Private HSgrad Divorced Clerical Male USA
Private Bachelor Civil Repair Male USA
Private College Civil Executive Male USA
Private HSgrad Civil Service Male USA

Using the function "read.table()" on R environment, we are able to load the data. This function creates a "data.frame" object. Its syntax is:

object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”)

The argument header indicates the data has or does not a header while the sep indicates which character is being used to separate the elements. Later, the user has to transform this data frame object in a transaction object. For that, he can use the function "as()". Its syntax is:

object name<-as(data frame object,”transactions”)

It is important to mention the user has to load the NBMiner package before he uses the function as with parameter transaction. Furthermore, there is another data format that some R functions can read and transform it in a transaction object. The user can utilize a binary format like the example below:

1 1 0 0 0 1 0 0 0
1 0 1 0 0 0 1 0 0
1 0 0 1 0 0 0 1 0
1 0 0 0 1 0 0 0 1
1 1 0 0 0 0 0 0 1
1 0 0 1 0 0 0 0 1
1 0 0 0 1 0 0 0 1

Here, each line represents a transaction and each column represents a term of the database. The number 1 means presence of that term in that transaction, while 0 means no presence. To read this data, user can first utilize the function "scan()". It creates a vector object, and its syntax is:

object name<-scan(“file name”,sep=”,”)

After that, it is necessary to transform the vector object in a matrix object. This is performed using the function "dim()". Its syntax is:

dim(vector object)<-c(dim1,dim2) 

Dim1 and Dim2 are the dimensions of the matrix which the user want creating. If the user wants, he can name each row and column of the matrix using the function "dimnames()". Its syntax is:

dimnames(matrix object)<-list(pridim=c(names),segdim=c(names))

Where pridim is a vector with first dimension names, and segdim is a vector with second dimension names. Finally, this object can be transformed in transaction object using the function "as()". Its syntax is:

object name<-as(matrix object,”ItemMatrix”)

Soon after the user loaded the package and the data, it is necessary to create the parameters object using the function of NBMiner called "NBMinerParameters()". This function reads the data and extracts from them some parameters which will be used in the next step. The syntax of this function is:

object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE)
where: 
# = numbers
data object = the data as an object of class transaction
pi = precision threshold
theta = pruning parameter
maxlen = maximal number of items in found itemsets (default = 5)
minlen = minimum number of items in found itemsets (default = 1)
trim = fraction of incidences to trim off the tail of the frequency distribution of the data
verbose = use verbose output for the estimation procedure
plot = plot the model
rules = mine NB-precise rules instead of NB-frequent itemsets

After the user perform this procedure, he is able to run the algorithm to mine his data. For that, one will use the function called "NBMiner()". Its syntax is:

object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE))

If the user uses the option rules=TRUE when he created the object parameter, the algorithm will mine the NB-precise rules. Otherwise, the same algorithm will mine the NB-frequent itemsets.

Finally, we summarized the necessaries commands to use the NBMiner package for mining tasks using the first data structure mentioned above, these are below:

library(package name)
object name<-read.table(“filename”, header=TRUE/FALSE,sep=”,”)
object name<-as(data frame object,”transaction”)
object name<-NBMinerParameters(data object, pi=#, theta=#, maxlen=#, minlen=#, trim=#, verb=TRUE/FALSE, plot=TRUE/FALSE, rules=TRUE/FALSE)
object name<-NBMiner(data object, parameter=object parameter, control = list(verb=TRUE, debug=FALSE

Visualization

edit

To visualize data information contained in a “data.frame” object just use the object name. Here, we use table as object name:

table = read.table("data",sep=",",header=TRUE)
table
# Employment Education Marital Occupation Sex Accounts
1 Private College Separated Service Female USA
2 Private Associate Unmarried Transport Male Jamaica
3 Private HSgrad Divorced Clerical Male USA
4 Private Bachelor Civil Repair Male USA
5 Private College Civil Executive Male USA
6 Private Hsgrad Civil Service USA

However, when the same principle is used in “transactions” format, we have just the number of transactions and items generated by the "as()" function:

tableT = as(table,"transactions")
tableT
 transactions in sparse format with
 6 transactions (rows) and
 18 items (columns)

Use “inspect()” function to visualize the result of the conversion to the “transactions” format:

inspect(tableT)
# items transactionID
1 {Employment=Private,Education=College,Marital=Separated,Occupation=Service,Sex=Female,Accounts=USA} 1
2 {Employment=Private,Education=Associate,Marital=Unmarried,Occupation=Transport,Sex=Male,Accounts=Jamaica} 2
(...)

To summarize the data information in "transaction" format use "summary()".

summary(tableT)
transactions as itemMatrix in sparse format with
6 rows (elements/itemsets/transactions) and
18 columns (items) and a density of 0.3333333 
most frequent items:
Employment=Private    Sex=Male    Accounts=USA 
        6                5            5 
Marital=Civil    Education=College  (Other) 
        3                2            15 
element (itemset/transaction) length distribution:
sizes
6 
6 
  Min.    1st Qu.  Median  Mean   3rd Qu.   Max. 
     6       6       6       6       6       6 
includes extended item information - examples:
       labels			variables    	levels
1  	Employment=Private 	Employment   	Private
2 	Education=Associate  	Education 	Associate
3  	Education=Bachelor  	Education  	Bachelor
includes extended transaction information - examples:
transaction		ID
1            		1
2             		2
3             		3

To see labels of the items generated in the conversion to transaction format use “itemInfo()”:

itemInfo(tableT)
	labels  		variables		levels
1	Employment=Private 	Employment      	Private
2    	Education=Associate  	Education    		Associate
3     	Education=Bachelor  	Education     		Bachelor
4	Education=College  	Education      		College
5	Education=HSgrad  	Education       	HSgrad
6	Marital=Civil    	Marital        		Civil
7	Marital=Divorced 	Marital     		Divorced
8	Marital=Separated    	Marital    		Separated
9	Marital=Unmarried    	Marital    		Unmarried
10   	Occupation=Clerical 	Occupation     		Clerical
11  	Occupation=Executive 	Occupation    		Executive
12	Occupation=Repair 	Occupation       	Repair
13    	Occupation=Service 	Occupation      	Service
14	Occupation=Transport 	Occupation    		Transport
15	Sex=Female       	Sex       		Female
16	Sex=Male        	Sex         		Male
17	Accounts=Jamaica   	Accounts      		Jamaica
18	Accounts=USA   		Accounts 		USA

Use “labels()” to see only the labels of the itens and transactions:

labels(tableT)
$items
[1]	"Employment=Private"	"Education=Associate"		"Education=Bachelor"   
[4]	"Education=College"     "Education=HSgrad"      	"Marital=Civil"        
[7]	"Marital=Divorced"      "Marital=Separated"  		"Marital=Unmarried"    
[10]	"Occupation=Clerical"   "Occupation=Executive"  	"Occupation=Repair"    
[13]	"Occupation=Service"    "Occupation=Transport"  	"Sex=Female"           
[16]	"Sex=Male"              "Accounts=Jamaica"      	"Accounts=USA"
$transactionID
[1] "1" "2" "3" "4" "5" "6"

To see NB mined results use the same commands above mentioned to “transactions” data:

paramA <- NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5)
tableNB <- NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE))
inspect(head(tableNB))
	lhs                     		rhs		                precision
1	{Education=HSgrad, Sex=Male}       	=> {Employment=Private}    	0.9991467
2	{Sex=Male, Accounts=USA} 		=> {Marital=Civil}		0.9999421
3	{Sex=Male, Accounts=USA} 		=> {Education=HSgrad}      	0.9977636
4	{Education=College, Accounts=USA} 	=> {Employment=Private}		0.9982934
5	{Marital=Civil, Accounts=USA} 		=> {Sex=Male}              	0.9999752
6	{Employment=Private, Sex=Male}		=> {Accounts=USA}		0.9999948

Where "lhs" is the antecedent of the rules and de "rhs" is the consequent of the rules.

We can also show the data distribution in the space with “image()” function:

image(tableNB)

File:ImageFunction.png

Case study

edit

Scenario

edit

It contains artificial data created by Graham Williams (developer of Rattle) and is supplied with Rattle. To quote from the Rattle documentation: "It consists of 2,000 fictional clients who have been audited, perhaps for compliance with regard to the amount of a tax refund that is being claimed. For each case an outcome is recorded (whether the taxpayer's claims had to be adjusted or not) and any amount of adjustment that resulted is also recorded."

Input data no longer exists to download

edit

Available on http://cs.anu.edu.au/Student/comp3420/mining/audit.csv.

Execution

edit
table = read.table("audit.csv",sep=",",header=TRUE)
tableT = as(table,"transactions")
paramA = NBMinerParameters(tableT, trim = 0.01, pi = 0.999, theta = 0.8, rules = TRUE, plot = FALSE, verbose = FALSE, minlen=3, maxlen=5)
transNB = NBMiner(tableT, parameter = paramA, control = list(verb = FALSE, debug=FALSE))

Output

edit
transNB
set of 18158 rules 
tableNB = as(transNB,"data.frame")
write.table(tableNB,file="auditNB.csv",sep=",")

Analysis

edit
rules consequent precedent A precedent B precedent C precedent D
92 Accounts=China Employment=PSState Education=Master Occupation=Professional
4721 Accounts=China Employment=PSState Education=Master
4762 Accounts=China Employment=PSState Occupation=Professional
4857 Accounts=China Education=Master Marital=Civil Occupation=Professional Sex=Male
5871 Accounts=China Education=Master Occupation=Professional
6131 Accounts=China Marital=Civil Occupation=Professional
10269 Accounts=China Education=Master Occupation=Professional Sex=Male
10386 Accounts=China Education=Master Marital=Civil Occupation=Professional
14791 Accounts=China Marital=Civil Occupation=Professional Sex=Male

Based on the data above we can note that when the Marital state is “Civil” and the "Occupation" is "Professional" we have an "Chinese" account.

References

edit
  • [1]Mohammed j. Zaki and Wagner Meira Jr. Fundamentals of Data Mining Algorithms. Chaper 11 – Itemset Mining, 74-93
  • [2]Jianyong Wang, Jiawei Han, Ying Lu and Petre Tzvetkov. TFP: An efficient algorithm for mining Top-K frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 17(5):652-665, May 2005
  • [3]Michael Hahsler. A model-based frequency constraint for mining associations from transaction data. Data Mining and Knowledge Discovery, 13(2):137-166, September 2006.
  • [4]http://cran.fiocruz.br/web/packages/arulesNBMiner/index.html