Data Mining Algorithms In R/Clustering/Expectation Maximization

Introduction

In statistics, an optimization problem consists of maximizing or minimizing a function over its variables within a specific space. Since these problems come in many different forms, each with its own characteristics, many techniques exist for solving them.

Maximum Likelihood is one of these techniques. Its principal goal is to fit a statistical model to a given data set by estimating the model's unknown parameters, so that the resulting function describes the data. In other words, the method adjusts the parameters of a statistical model from a data set or a known distribution, so that the model can “describe” each data sample and predict new ones. This technique is very important in data mining and knowledge discovery, as it serves as the basis for more complex and powerful methods.
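As a concrete illustration (a minimal sketch, not part of the original chapter), the R code below fits a normal distribution to synthetic data by minimizing the negative log-likelihood with optim(); the data set and starting values are assumptions chosen only for the example.

# Maximum-likelihood estimation of a normal distribution's mean and
# standard deviation by minimizing the negative log-likelihood.
set.seed(42)
x <- rnorm(200, mean = 5, sd = 2)    # synthetic data with known parameters

neg_log_lik <- function(par) {
  mu    <- par[1]
  sigma <- par[2]
  if (sigma <= 0) return(Inf)        # keep the standard deviation positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(par = c(mean(x), sd(x)), fn = neg_log_lik)
fit$par                              # estimates should be close to 5 and 2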

The Expectation-Maximization method builds on Maximum Likelihood and estimates the likelihood in problems where some variables are unobserved. The method was first documented in 1977 by [2]; although the technique had already been proposed informally in the literature, as the authors point out, that work was its first formalization.

The parameters are approximated by an iterative process in which two steps, Expectation (E-step) and Maximization (M-step), are executed in turn: the E-step estimates the unobserved variables, and the M-step re-estimates the parameters that maximize the likelihood. The next section describes the method in more detail, presenting an algorithm with all steps of the calculation.
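To make the E-step/M-step alternation tangible before the formal presentation, here is a minimal sketch in R, assuming a two-component univariate Gaussian mixture with synthetic data and ad hoc starting values (all of which are assumptions made only for this illustration, not the chapter's own algorithm). The responsibilities computed in the E-step play the role of the unobserved component labels.

# A minimal EM loop for a two-component univariate Gaussian mixture.
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 0.8))

# Initial guesses for the mixing weight, means and standard deviations
pi1 <- 0.5; mu <- c(-1, 5); sigma <- c(1, 1)

for (iter in 1:100) {
  # E-step: posterior probability that each point belongs to component 1
  d1 <- pi1       * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2)

  # M-step: re-estimate the parameters from the responsibility-weighted data
  pi1      <- mean(r1)
  mu[1]    <- sum(r1 * x) / sum(r1)
  mu[2]    <- sum((1 - r1) * x) / sum(1 - r1)
  sigma[1] <- sqrt(sum(r1 * (x - mu[1])^2) / sum(r1))
  sigma[2] <- sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1))
}

round(c(pi1 = pi1, mu = mu, sigma = sigma), 3)   # recovered mixture parameters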

Algorithm: Brief Overview

(Under Construction: Translation)

Algorithm in R

(Under Construction: Translation)

Algorithm: Visualization

(Under Construction: Translation)

Case study

(Under Construction: Translation)

References

(Under Construction: Translation)