ML Algorithms addendum: Passive Aggressive Algorithms

Passive Aggressive Algorithms are a family of online learning algorithms (for both classification and regression) proposed by Crammer et al. The idea is very simple, and their performance has been shown to be superior to many alternative methods such as the Online Perceptron and MIRA (see the original paper in the reference section).

Classification

Suppose we have a dataset:
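A form consistent with the rest of the text (the dimensionality n and the exact set notation are assumptions) would be:

```latex
X = \{ x(0), x(1), \ldots, x(t), \ldots \} \quad \text{with} \quad x(i) \in \mathbb{R}^n
\qquad
Y = \{ y(0), y(1), \ldots, y(t), \ldots \} \quad \text{with} \quad y(i) \in \{-1, +1\}
```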

The index t has been chosen to mark the temporal dimension. In this setting, in fact, samples can keep arriving for an indefinite time. Of course, if they are drawn from the same data-generating distribution, the algorithm will keep learning (probably without large parameter modifications), but if they are drawn from a completely different distribution, the weights will slowly forget the previous one and learn the new distribution. For simplicity, we also assume we’re working with a binary classification problem based on bipolar labels (-1 and +1).

Given a weight vector w, the prediction is simply obtained as:
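For a linear model with bipolar labels, the usual way to write this prediction (assuming no separate bias term) is:

```latex
\tilde{y}(t) = \operatorname{sign}\left( w^T x(t) \right)
```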

All these algorithms are based on the Hinge loss function (the same used by SVM):
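In its standard form, and assuming the linear model f(x(t), θ) = wᵀx(t) with θ standing for the parameters w, the hinge loss can be written as:

```latex
L(\theta) = \max\left( 0,\; 1 - y(t) \cdot f(x(t), \theta) \right)
```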

The value of L is bounded between 0 (the sample is classified correctly, with a margin of at least 1) and K, which depends on f(x(t),θ), with K>0 (completely wrong prediction). A Passive-Aggressive algorithm works generically with this update rule:
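One formulation that matches the description in the following paragraphs is the variant that Crammer et al. call PA-II, where the slack variable is penalized quadratically. Choosing this variant is an assumption here; the PA-I variant uses the linear penalty Cξ with ξ ≥ 0 instead.

```latex
w(t+1) = \underset{w}{\operatorname{argmin}} \; \frac{1}{2} \lVert w - w(t) \rVert^2 + C \xi^2
\quad \text{subject to} \quad L(w;\, x(t), y(t)) \le \xi
```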

To understand this rule, let’s assume the slack variable ξ=0 (and L constrained to be 0). When a sample x(t) is presented, the classifier uses the current weight vector to determine the sign. If the sign is correct (and the margin is large enough), the loss function is 0 and the argmin is w(t). This means that the algorithm is passive when a correct classification occurs. Let’s now assume that a misclassification occurred:

The angle θ between w(t) and x(t) is greater than 90°; therefore, the dot product is negative and the sample is classified as -1, even though its label is +1. In this case, the update rule becomes very aggressive, because it looks for a new w which must be as close as possible to the previous one (otherwise the existing knowledge is immediately lost), but which must also satisfy L=0 (in other words, the classification must be correct).

The introduction of the slack variable allows soft margins (as in SVMs) and a degree of tolerance controlled by the parameter C. In particular, the constraint on the loss function becomes L <= ξ, allowing a larger error. Higher C values yield stronger aggressiveness (with a consequent higher risk of destabilization in the presence of noise), while lower values yield smaller, more gradual updates. In fact, algorithms of this kind, when working online, must cope with noisy samples (with wrong labels). Good robustness is necessary; otherwise, overly rapid changes lead to higher misclassification rates.
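The role of C can be illustrated numerically. The sketch below assumes the PA-II style step size τ = L / (‖x‖² + 1/(2C)); the helper name `pa_step_size` and the sample values are hypothetical and chosen only for illustration.

```python
def pa_step_size(loss, x, C):
    """Step size tau for a PA-II style update (assumed form)."""
    squared_norm = sum(v * v for v in x)
    return loss / (squared_norm + 1.0 / (2.0 * C))


x = [1.0, 2.0]   # squared norm is 5
loss = 1.5       # hinge loss suffered on this sample

# Larger C shrinks the 1/(2C) term, so the step gets larger (more aggressive).
taus = {C: pa_step_size(loss, x, C) for C in (100.0, 1.0, 0.01)}
for C, tau in taus.items():
    print(f"C={C:>6}: tau={tau:.4f}")
```

With a large C the denominator is dominated by ‖x‖², so a single (possibly mislabeled) sample can move the weights a lot; with a small C the 1/(2C) term dominates and each update is damped.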

After solving both update conditions, we get the closed-form update rule:
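Assuming the quadratic slack penalty Cξ² (the PA-II variant of Crammer et al.), the closed-form solution takes this shape:

```latex
w(t+1) = w(t) + \frac{\max\left( 0,\; 1 - y(t) \cdot w(t)^T x(t) \right)}{\lVert x(t) \rVert^2 + \frac{1}{2C}} \; y(t) \, x(t)
```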

This rule confirms our expectations: the weight vector is updated with a factor whose sign is determined by y(t) and whose magnitude is proportional to the error. Note that if there’s no misclassification the numerator becomes 0, so w(t+1) = w(t), while, in case of misclassification, w rotates towards x(t) and stops with a loss L <= ξ. In the next figure, the rotation has been emphasized to make it visible; in practice, it’s normally as small as possible:
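A single update step can be checked by hand. This is a minimal sketch assuming the PA-II style rule w ← w + τ·y·x with τ = L / (‖x‖² + 1/(2C)); the function names and the numeric values are illustrative assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * dot(w, x))


def pa_update(w, x, y, C):
    """One Passive-Aggressive step (PA-II style, assumed form)."""
    loss = hinge_loss(w, x, y)
    tau = loss / (dot(x, x) + 1.0 / (2.0 * C))
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w = [1.0, 0.0]

# Aggressive case: the angle between w and x is 135 degrees, label is +1.
x = [-1.0, 1.0]
y = 1
loss_before = hinge_loss(w, x, y)
w_new = pa_update(w, x, y, C=10.0)
loss_after = hinge_loss(w_new, x, y)
print("dot before:", dot(w, x), "dot after:", dot(w_new, x))
print("loss before:", loss_before, "loss after:", loss_after)

# Passive case: the sample is already classified with margin >= 1.
w_passive = pa_update(w, [2.0, 0.0], 1, C=10.0)
print("passive update leaves w unchanged:", w_passive == w)
```

After the update the dot product with x is positive, so the sample is classified as +1, and a small residual loss remains because C is finite.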

After the rotation, θ < 90° and the dot product becomes positive, so the sample is correctly classified as +1. Scikit-Learn implements Passive Aggressive algorithms, but I preferred to implement the code from scratch, just to show how simple they are. In the next snippet (also available in this GIST), I first create a dataset, then compute the score with a Logistic Regression and finally apply the PA and measure the final score on a test set:
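As a dependency-free illustration of the procedure (this is not the GIST code, and it skips the Logistic Regression baseline), the sketch below trains a PA-II style classifier online on a synthetic stream. The dataset generator, the seed, the absence of a bias term and the value C=0.01 are all assumptions made for the example.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def make_blobs(n_samples, rng):
    """Two Gaussian clusters in 2D centred at (+2, +2) and (-2, -2)."""
    data = []
    for _ in range(n_samples):
        y = rng.choice((-1, 1))
        x = [rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0)]
        data.append((x, y))
    return data


def train_pa(samples, C=0.01):
    """Process the samples one at a time, as they would arrive online."""
    w = [0.0, 0.0]
    for x, y in samples:
        loss = max(0.0, 1.0 - y * dot(w, x))
        tau = loss / (dot(x, x) + 1.0 / (2.0 * C))
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w


def accuracy(w, samples):
    correct = sum(1 for x, y in samples if (1 if dot(w, x) >= 0 else -1) == y)
    return correct / len(samples)


rng = random.Random(1000)
train_set = make_blobs(1000, rng)
test_set = make_blobs(300, rng)

w = train_pa(train_set, C=0.01)
test_accuracy = accuracy(w, test_set)
print("weights:", w)
print("test accuracy:", test_accuracy)
```

The clusters are symmetric around the origin, which is why a separating hyperplane through the origin is enough here; real data would normally need a bias term (for example, a constant feature appended to every sample).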

Regression

For regression, the algorithm is very similar, but it’s now based on a slightly different Hinge loss function (called ε-insensitive):
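In its standard form, and again assuming f(x(t), θ) = wᵀx(t), the ε-insensitive loss is:

```latex
L(\theta; \varepsilon) = \max\left( 0,\; \left| y(t) - f(x(t), \theta) \right| - \varepsilon \right)
```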

The parameter ε determines a tolerance for prediction errors. The update conditions are the same as those adopted for classification problems, and the resulting update rule is:
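Under the same assumption made for classification (quadratic slack penalty, i.e. the PA-II variant), the rule can be written as:

```latex
w(t+1) = w(t) + \frac{\max\left( 0,\; \left| y(t) - w(t)^T x(t) \right| - \varepsilon \right)}{\lVert x(t) \rVert^2 + \frac{1}{2C}} \; \operatorname{sign}\left( y(t) - w(t)^T x(t) \right) x(t)
```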

Just like for classification, Scikit-Learn also implements a Passive Aggressive regressor; however, in the next snippet (also available in this GIST), there’s a custom implementation:
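A minimal, dependency-free sketch of such a regressor is shown below (it is not the GIST code). It assumes the PA-II style update with the ε-insensitive loss; the target weights `true_w`, the noise level, the seed and the values C=1.0 and ε=0.1 are hypothetical.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def pa_regression_update(w, x, y, C, eps):
    """One PA step for regression; returns the new weights and |error|."""
    error = y - dot(w, x)
    loss = max(0.0, abs(error) - eps)          # epsilon-insensitive loss
    tau = loss / (dot(x, x) + 1.0 / (2.0 * C))
    sign = 1.0 if error > 0 else -1.0
    w_new = [wi + tau * sign * xi for wi, xi in zip(w, x)]
    return w_new, abs(error)


rng = random.Random(1000)
true_w = [1.5, -0.5, 2.0]      # hypothetical ground-truth weights
w = [0.0, 0.0, 0.0]
errors = []

for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0) for _ in true_w]
    y = dot(true_w, x) + rng.gauss(0.0, 0.05)   # noisy linear target
    w, abs_error = pa_regression_update(w, x, y, C=1.0, eps=0.1)
    errors.append(abs_error)

early_error = sum(errors[:10]) / 10       # transient period
late_error = sum(errors[-100:]) / 100     # after the weights have settled
print("learned weights:", w)
print("mean |error| first 10 samples:", early_error)
print("mean |error| last 100 samples:", late_error)
```

The `errors` list records the absolute error before each update, so plotting it against the sample index gives an error curve with an initial transient followed by a flat region.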

The error plot is shown in the following figure:

The quality of the regression (in particular, the length of the transient period when the error is high) can be controlled by picking better C and ε values. In particular, I suggest trying C values of different orders of magnitude (100, 10, 1, 0.1, 0.01), in order to determine whether a higher aggressiveness is preferable.
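Such a sweep over C can be sketched as follows. The measure of the transient used here (the index of the last sample whose absolute error exceeds a threshold) is just one possible choice, and the noise-free synthetic stream, the threshold and the PA-II style update are assumptions.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def transient_length(C, eps=0.1, n_samples=2000, threshold=0.5, seed=1000):
    """Index of the last sample whose absolute error exceeds `threshold`."""
    rng = random.Random(seed)        # same sample stream for every C
    true_w = [1.5, -0.5, 2.0]        # hypothetical target weights
    w = [0.0] * len(true_w)
    last_large_error = 0
    for t in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        error = dot(true_w, x) - dot(w, x)      # noise-free target
        if abs(error) > threshold:
            last_large_error = t
        loss = max(0.0, abs(error) - eps)
        tau = loss / (dot(x, x) + 1.0 / (2.0 * C))
        sign = 1.0 if error > 0 else -1.0
        w = [wi + tau * sign * xi for wi, xi in zip(w, x)]
    return last_large_error


transients = {C: transient_length(C) for C in (100.0, 10.0, 1.0, 0.1, 0.01)}
for C, length in transients.items():
    print(f"C={C:>6}: transient ends around sample {length}")
```

On clean data like this, higher aggressiveness shortens the transient; with noisy labels or targets the trade-off can go the other way, which is why it is worth checking several orders of magnitude.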

References: