In the decision function of an SVM, you compute the scalar product between each support vector (a point that lies on the margin of your hyperplane, or more precisely, a point that constrains your hyperplane) and your new sample point:
x · sv
The "z" the article defines is a new component that will be taken into account in the scalar product. A more mathematical way of seeing it is that you define a function phi that takes an original sample of your dataset and transforms it into a new vector. In our case, we simply add a new dimension (x3), computed from the two original dimensions (x1, x2), as a third component of our vector:
phi(x) = [x1, x2, x1² + x2²]
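To make that concrete, here is a tiny Python sketch of the mapping (the name "phi" and the toy 2-D input are just for illustration):

    import numpy as np

    def phi(x):
        # Map a 2-D point [x1, x2] to 3-D by appending x1^2 + x2^2.
        x1, x2 = x
        return np.array([x1, x2, x1**2 + x2**2])

    print(phi(np.array([1.0, 2.0])))  # -> [1. 2. 5.]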
The scalar product we will have to compute in our decision function can then be expressed as follows (these are the a and b in the article, i.e. the sample and the support vector in our new space):
phi(x) · phi(sv)
The SVM doesn't need phi(x) or phi(sv) themselves, only the scalar product of those two vectors. The kernel trick is to find a function k that satisfies
k(x, sv) = phi(x) · phi(sv)
and that satisfies Mercer's condition (I'll let Google explain what that is).
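For this particular phi, the scalar product even has a closed form in the original 2-D space: phi(a) · phi(b) = a · b + (a · a)(b · b). A quick numerical check (this k is just an illustration for the toy mapping above, not one of the standard named kernels):

    import numpy as np

    def phi(x):
        # Explicit feature map: append x1^2 + x2^2 as a third component.
        return np.array([x[0], x[1], x[0]**2 + x[1]**2])

    def k(a, b):
        # Same scalar product, computed directly in the original 2-D space.
        return a @ b + (a @ a) * (b @ b)

    a = np.array([1.0, 2.0])
    b = np.array([3.0, -1.0])
    print(phi(a) @ phi(b))  # 1*3 + 2*(-1) + 5*10 = 51.0
    print(k(a, b))          # 51.0 as well, without ever forming phi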
Your SVM will compute this (simpler) k function instead of the full scalar product. There are multiple "common" kernel functions in use (Wikipedia has examples of them[1]), and choosing one is a parameter of your model (ideally, you would then set up a testing protocol to find the best one).
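With scikit-learn, for example, the kernel really is just a hyperparameter you can search over; a rough sketch (assuming sklearn is available, with the dataset and grid values picked arbitrarily):

    from sklearn.datasets import make_circles
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Ring-shaped data that a linear kernel cannot separate.
    X, y = make_circles(noise=0.1, factor=0.3, random_state=0)

    search = GridSearchCV(
        SVC(),
        param_grid={"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 10]},
        cv=5,
    )
    search.fit(X, y)
    print(search.best_params_)  # cross-validation picks the kernel for you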
Thank you. This was an amazing explanation. I am new to SVMs but did not make the connection that margin points (observations along the margin of the hyperplane) become your support vectors. This makes a lot more sense.
And if I am following correctly, it would make sense that the final step would then be:
We would maximize the dot product of a new observation with the support vectors to determine its classification (red or blue)
During the learning phase of the SVM, you try to find a hyperplane that maximizes the margin.
The decision function of an SVM can be written as:
f(x) = sign(sum alpha_sv y_sv k(x, sv))
Where "sum" is the sum over all support vectors "sv", "y_sv" is the class of the support vector (red = 1, blue = -1, for example), and "alpha_sv" is the result of the optimization during the learning phase (it is equal to zero for a point that is not a support vector, and positive otherwise).
The decision function is a sum over all support vectors, weighted by the "k" function (which can thus be seen as a similarity function between 2 points in your kernel space). The y_sv makes each term positive or negative depending on the class of the support vector. You take the sign of this sum (1 -> red, -1 -> blue, in our example), and it gives you the predicted class of your sample.
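If you want to see that sum concretely, here is a rough sketch that rebuilds the decision value by hand from a fitted scikit-learn SVC (an illustration only, assuming sklearn; note the library also adds a bias term, stored in intercept_, which the simplified formula above leaves out):

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(noise=0.1, factor=0.3, random_state=0)
    clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

    def rbf(a, b, gamma=1.0):
        # The k(x, sv) used by the model above (RBF kernel).
        return np.exp(-gamma * np.sum((a - b) ** 2))

    x_new = np.array([0.0, 0.1])
    # dual_coef_ already stores alpha_sv * y_sv for each support vector.
    s = sum(coef * rbf(x_new, sv)
            for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
    s += clf.intercept_[0]

    print(s)
    print(clf.decision_function([x_new])[0])  # same value; its sign gives the class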
Thanks for bringing some saner notation in here. I feel like blog posts and journal articles that abuse notation like this one just make people more allergic to math.