Introduction

Classification in machine learning deals with the following problem: given a set of noisy training data x and their labels y, can we train a model that predicts the labels of future data? The difficulty lies in the noise: the same input x may correspond to different outputs y, because x alone is not enough to fully determine the value of y. Therefore, once some inaccuracy in prediction is tolerated, a variety of statistical models come into play. Based on different assumptions about the relationship between data and labels, they vary in their intuition, mathematical model, interpretation, and optimization method. In this blog, I try to summarize some ‘familiar faces’ in classification problems, not only to identify what they have in common but also to compare where they differ.

f(x) and p(y|x)

It is natural to consider the prediction from a data point x_i to its label y_i, the most likely output, as a mapping
f: x_{i} \to y_{i}

From the probabilistic view, f can be thought of as a function that, given an input x, picks the most probable output among all possible outputs, i.e.

y_{i}^{*} = f(x_{i}) =\arg \max_{y_{i}} p(y_{i}|x_{i})

Such a probabilistic interpretation forms the basis for the statistical methods widely adopted among machine learning algorithms, such as maximum-likelihood (ML) estimation and Bayesian methods.

Apart from the probabilistic interpretation, f can also be viewed as an indicator function that decides by exploring the known samples in the neighborhood of the input x_i. In other words, the input is classified by comparing its similarity with nearby samples, and no statistical model is assumed.

y_{i}^{*} = f(x_{i}) = \operatorname{Sim}_{x \in \operatorname{Neigh}(x_{i})}(x_{i}, x)

Popular models of this kind include the SVM and the KNN.
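
To make this neighborhood-based view concrete, here is a minimal k-nearest-neighbor sketch in Python; the Euclidean distance, the value k = 3, and the toy data are my own illustrative choices rather than anything prescribed above.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two 2-D clusters labeled 0 and 1
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.1, 0.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.0, 0.9])))  # expected: 1
```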

The Generative Model vs the Discriminative Model

The Generative Model

The philosophies reflected in the two approaches above are clearly different. The former assumes the existence of an underlying joint probability distribution that dictates the relationship between any pair of x and y. In this framework, both the training data and future test data are regarded as samples from one and the same, though unknown, distribution. This framework is called the generative model. As a result, only after the joint distribution of x and y is known can one accurately predict the label of future data via the Bayes rule, so this is an indirect way of predicting y from x. p(y_{i}|x_{i}) can be obtained simply as follows,

p(y_{i}|x_{i}) = \frac{p(x_{i},y_{i})}{p(x_{i})}

Here we are going to play a little trick to see what the generative model really cares about. By the Bayes rule the original problem can be written as:

y_{i}^{*} =\arg \max_{y_{i}} p(y_{i}|x_{i}) = \arg \max_{y_{i}} \frac{p(x_{i},y_{i})}{p(x_{i})} = \arg \max_{y_{i}} p(x_{i},y_{i})

Note that the last term can also be decomposed as

\arg \max_{y_{i}}p(x_{i}, y_{i}) = \arg \max_{y_{i}}p(y_{i}|x_{i})*p(x_{i})

From the above equation, we can see that the generative model does not simply pick the maximum-likelihood output y for the input x. Instead, it seeks a solution for y that reflects the combined effect of the conditional likelihood and the probability of x occurring. In other words, it fits the model with weighted input data, where the weight is the probability of x. This weighting strategy avoids overfitting to a certain extent and makes generative models more robust against outliers or noise in the input. An example comparing the conditional likelihood and the joint distribution will be presented in a later section.

The Bayesian Method in the Generative Model

In most cases, however, building a statistical model for the joint probability p(x,y) directly is difficult because of the sparsity of labeled data. This is where the Bayesian method comes into play. Recalling the last equation, the joint probability can be written as follows,

p(x_{i},y_{i}) = p(x_{i}|y_{i}) * p(y_{i})

Therefore, one can model the right-hand side instead. Fortunately, both factors can be described by off-the-shelf statistical models, such as the Gaussian and hidden Markov models for p(x_{i}|y_{i}) and the Bernoulli and uniform distributions for p(y_{i}), and the data needed to fit these models are much easier to obtain.
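
As a hedged sketch of this decomposition, the snippet below fits a Gaussian model for p(x|y) per class and a simple frequency estimate for p(y), then classifies by the arg max of p(x|y) * p(y). The full-covariance Gaussian and the small regularization term are my own assumptions for illustration, not requirements of the method.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y):
    """Estimate the prior p(y) and a Gaussian class-conditional p(x|y) per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),                                     # p(y = c)
            "mean": Xc.mean(axis=0),                                       # mean of p(x|y = c)
            "cov": np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]),   # regularized covariance
        }
    return params

def predict(params, x):
    """Classify by arg max_y p(x|y) * p(y), i.e. the joint probability."""
    scores = {
        c: p["prior"] * multivariate_normal.pdf(x, p["mean"], p["cov"])
        for c, p in params.items()
    }
    return max(scores, key=scores.get)
```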

The Naive Bayes Method

When the input x has too many dimensions, the system suffers from the ‘curse of dimensionality’: the available data-label pairs become sparse relative to the enormous input space. To cope with this problem, the naive Bayes model is used, which assumes conditional independence among the input dimensions. Under this assumption, the class-conditional likelihood factorizes via the chain rule as follows,

p(x_{1}, x_{2}, ..., x_{n}|y_{i}) = p(x_{1}|y_{i}) * p(x_{2}|y_{i}) * ... * p(x_{n}|y_{i})

Then one can turn to modeling each p(x_{j}|y_{i}) instead. In addition, if the dimensions also obey the same distribution (i.e. they are conditionally i.i.d.), the resulting model of p(x_{1}, x_{2}, ..., x_{n}|y_{i}) is just the product of n copies of the same p(x_{j}|y_{i}).
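
A minimal sketch of this factorization, assuming binary features, a Bernoulli model for each p(x_j|y), and Laplace smoothing (all of which are my own illustrative choices):

```python
import numpy as np

def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate p(y) and a Bernoulli parameter p(x_j = 1 | y) per feature,
    with Laplace smoothing controlled by alpha."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),                                   # p(y = c)
            "theta": (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha),   # p(x_j = 1 | y = c)
        }
    return model

def predict(model, x):
    """Classify by arg max_y  log p(y) + sum_j log p(x_j | y)."""
    scores = {}
    for c, p in model.items():
        log_lik = np.sum(x * np.log(p["theta"]) + (1 - x) * np.log(1 - p["theta"]))
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)
```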

The Discriminative Model

The discriminative model refers to any model that is not generative. Unlike the generative model, it assumes no prior probabilistic distribution over x and y; instead, it classifies by directly modeling the boundary or the shape differences between classes in the input space. To this end, it either explores the neighborhood of a test point in the training dataset or directly models the conditional likelihood p(y|x), rather than the joint p(x,y) of the generative model. Popular discriminative models include the SVM, KNN, decision trees, and AdaBoost.
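
For the p(y|x) route, logistic regression is one minimal example of modeling the conditional likelihood directly, without any model of p(x). The sketch below assumes the weights w and bias b have already been learned (e.g. by the optimization methods discussed in later sections).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Directly model the conditional likelihood p(y = 1 | x)."""
    return sigmoid(np.dot(w, x) + b)

def predict(w, b, x):
    """Pick the more probable of the two labels."""
    return int(predict_proba(w, b, x) >= 0.5)
```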

Which One is Better – How to Select a Model

It can be inferred that the generative model takes a more macroscopic view of the relationship between input and output data, while its counterpart, the discriminative model, focuses primarily on the specific situation at hand. Although it relies on certain pre-assumptions, the generative model is well suited to general problems, since its pre-assumptions are usually widely accepted statistical rules such as the central limit theorem. But its performance may degrade, not least when the data do not fit the pre-assumptions. On the other hand, the discriminative model is more adaptive to different datasets and is likely to be more effective in certain specific cases. The difference between the two can be likened to the comparison between Chinese and Western medicine, where the former seeks the overall recuperation of one’s body while the latter pursues targeted solutions for a particular disease.

Moreover, there have been constant disputes between the frequentists and the Bayesians over which of the two models is superior. The frequentists hold that a good model should be loyal to the data, so they favor the discriminative model, where only the likelihood is maximized. On the contrary, the Bayesians believe that a good model should also encode what the data fail to tell, and hence they advocate the generative model, where a pre-assumption about the data is made. Which one is better? I will use an example of two doctors to illustrate my understanding.

Doctor A wants to judge whether his coughing patient has caught a cold or the flu. Suppose he then finds from pathology that the flu is more likely to cause coughing than a cold is. With that information alone, he concludes that the patient has the flu rather than a cold. This is a typical decision based on the conditional probability, the direct link between input and output. It is this cause-and-effect logic that underpins the discriminative model.

Doctor B has the same patient. Besides the pathology knowledge, however, Doctor B also harnesses demographic clues for his diagnosis. He notices that it is the high season for colds and there are already a number of cold patients around, which implies a much higher prior probability of catching a cold than the flu. He therefore concludes that the patient has probably caught a cold from other victims, even though the flu is more likely to cause a cough. This decision is based on the joint probability and corresponds to the generative model.
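
To make the contrast concrete, here is a small worked example with made-up numbers (purely illustrative, not taken from any real data). Doctor A compares only the conditional likelihoods:

p(\text{cough}|\text{flu}) = 0.8 > p(\text{cough}|\text{cold}) = 0.6 \Rightarrow \text{flu}

Doctor B weighs them by the seasonal priors p(\text{flu}) = 0.05 and p(\text{cold}) = 0.4:

p(\text{cough},\text{flu}) = 0.8 \times 0.05 = 0.04 < p(\text{cough},\text{cold}) = 0.6 \times 0.4 = 0.24 \Rightarrow \text{cold}
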
Most of the time, the generative model, which relies on the prior probability, is a more plausible way of making decisions, since it takes more information into account. However, when the prior probability is missing or hard to obtain, the discriminative way is more direct and efficient, though it may not generalize as well as the generative model. Other considerations include the learning curve of the model. In short, it depends on the specific problem.

Parametric Model vs Non-parametric Model

Depending on the nature of their parameters, machine learning models can be divided into parametric models and non-parametric models. Specifically, a parametric model is a statistical model that can be described by a finite number of parameters, a number independent of the size of the input data. For example, the perceptron and logistic regression are both parametric models. In contrast, a non-parametric model is either a non-statistical model or a statistical model whose number of parameters grows with the size of the input data. Representatives include the SVM and KNN.

Parameters to Optimize

The follow-up question is how to optimize the parameters of parametric and non-parametric models, respectively. For non-parametric models, methods such as minimizing a loss or maximizing the margin/distance are normally used to optimize the decision boundary between classes directly. For parametric models, maximum likelihood (ML) is employed to fit the parameters by statistical estimation. Logistic regression and the perceptron are among the popular models optimized in the ML way.
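
Concretely, for a parametric model p(y|x; w), the ML estimate of the parameters w over the training pairs (x_i, y_i) maximizes the (log-)likelihood, stated here in its standard form for reference:

w^{*} = \arg \max_{w} \prod_{i} p(y_{i}|x_{i}; w) = \arg \max_{w} \sum_{i} \log p(y_{i}|x_{i}; w)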

Learning Process

The word ‘learning’ here refers to the numerical process of optimizing the model parameters. Among the plethora of methods for this goal, the most classical is gradient descent (also called batch gradient descent or steepest descent), designed for convex optimization problems. When the objective’s level sets are elongated (i.e. the problem is ill-conditioned), however, its convergence slows down, so variations that follow the same logic, such as stochastic gradient descent (SGD) and conjugate gradient descent, have been proposed. In particular, SGD also provides an efficient solution for neural networks, complex structures that require the optimization of an enormous number of parameters. Moreover, many non-gradient-based methods, such as coordinate descent, have also won their places among machine-learning optimization techniques.
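
As a minimal sketch of SGD in this setting, the loop below minimizes the logistic-regression negative log-likelihood one sample at a time; the learning rate, number of epochs, and toy data are my own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Fit w, b for p(y = 1 | x) = sigmoid(w.x + b) by stochastic gradient
    descent on the negative log-likelihood, one training sample per update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):              # shuffle samples each epoch
            err = sigmoid(np.dot(w, X[i]) + b) - y[i]  # gradient factor of the per-sample loss
            w -= lr * err * X[i]
            b -= lr * err
    return w, b

# Toy usage on a small linearly separable set
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array([0, 0, 1, 1])
w, b = sgd_logistic(X, y)
print(sigmoid(X @ w + b))  # should be low for the first two points, high for the last two
```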

A detailed summary of learning (optimization) methods from my past post can be found here.

Conclusion

The typical process of machine learning classification involves a series of steps from raw data input to predicted label output. Determining the form of the target function is the first task. Then, depending on that function, machine learning uses either direct or indirect methods for the classification task. Although the model parameters or mathematical expressions of these methods differ considerably, they can be implicitly related, as in the case of logistic regression and naive Bayes pointed out in []. Finally, the parameters are optimized by various optimization methods and the model is learned.

References

[1] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 2010.
[2] “Generative vs. discriminative.” [Online]. Available: http://stats.stackexchange.com/questions/12421/generative-vs-discriminative. [Accessed: 23-Nov-2013].
