\documentclass[a4paper]{article}
\usepackage{hyperref, graphicx, color, alltt}
\usepackage{Sweave}
\usepackage[round]{natbib}
\definecolor{Red}{rgb}{0.7,0,0}
\definecolor{Blue}{rgb}{0,0,0.8}
\definecolor{hellgrau}{rgb}{0.55,0.55,0.55}
\newcommand{\pkg}[1]{\texttt{#1}}
\newenvironment{smallexample}{\begin{alltt}\small}{\end{alltt}}
\begin{document}
%\VignetteIndexEntry{Support Vector Machines---the Interface to libsvm in package e1071}
%\VignetteDepends{e1071,randomForest,xtable}
%\VignetteKeywords{classification, regression, machine learning, benchmarking, support vector machines}
%\VignettePackage{e1071}
\SweaveOpts{engine=R,eps=FALSE}
\setkeys{Gin}{width=0.8\textwidth}
\title{Support Vector Machines
\footnote{A smaller version of this
article appeared in R-News, Vol.1/3, 9.2001}\\
\large The Interface to \texttt{libsvm} in package \pkg{e1071}}
\author{by David Meyer\\
FH Technikum Wien, Austria\\
\url{David.Meyer@R-Project.org}
}
\maketitle
\sloppy
``Hype or Hallelujah?'' is the provocative title used by
\cite{svm:bennett+campbell:2000} in an overview of Support Vector
Machines (SVM).
SVMs are currently a hot topic in the machine learning community,
creating a similar enthusiasm at the moment as Artificial Neural
Networks used to do before. Far from being a panacea, SVMs yet
represent a powerful technique for
general (nonlinear) classification, regression and outlier detection with an intuitive
model representation.
The package \pkg{e1071} offers an interface to the
award-winning\footnote{The library won the IJCNN 2001 Challenge by solving two of
three problems: the Generalization Ability Challenge (GAC) and the Text
Decoding Challenge (TDC). For more information, see: \url{http://www.csie.ntu.edu.tw/~cjlin/papers/ijcnn.ps.gz}.}
C++-implementation by Chih-Chung Chang and Chih-Jen Lin,
\texttt{libsvm} (current version: 2.6), featuring:
\begin{itemize}
\item $C$- and $\nu$-classification
\item one-class-classification (novelty detection)
\item $\epsilon$- and $\nu$-regression
\end{itemize}
and includes:
\begin{itemize}
\item linear, polynomial, radial basis function, and sigmoidal kernels
\item formula interface
\item $k$-fold cross validation
\end{itemize}
For further implementation details on \texttt{libsvm}, see \cite{svm:chang+lin:2001}.
\section*{Basic concept}
SVMs were developed by \cite{svm:cortes+vapnik:1995} for binary
classification. Their approach may be roughly sketched as follows:
\begin{description}
\item[Class separation:] basically, we are looking for the optimal separating hyperplane
between the two classes by maximizing the
\textit{margin} between the classes' closest points (see Figure
\ref{fig:svm1})---the points lying on the boundaries are called \textit{support vectors}, and
the middle of the margin is our optimal separating hyperplane;
\item[Overlapping classes:] data points on the ``wrong'' side
of the discriminant margin are weighted down to reduce their influence (\textit{``soft margin''});
\item[Nonlinearity:] when we cannot
find a \textit{linear} separator, data points are projected into an
(usually) higher-dimensional space where the data points effectively
become linearly separable (this projection is realised via \textit{kernel
techniques});
\item[Problem solution:] the whole task can be formulated as a
quadratic optimization problem which can be solved by known techniques.
\end{description}
\noindent A program able to perform all these tasks is called a \textit{Support
Vector Machine}.
\begin{figure}[htbp]
\begin{center}
\includegraphics[width=8cm]{svm}
\caption{Classification (linear separable case)}
\label{fig:svm1}
\end{center}
\end{figure}
Several extensions have been developed; the ones currently
included in \texttt{libsvm} are:
\begin{description}
\item[$\nu$-classification:] this model allows for more control over
the number of support vectors
\cite[see][]{svm:scholkopf+smola+williamson:2000} by specifying an
additional parameter $\nu$ which approximates the fraction of
support vectors;
\item[One-class-classification:] this model tries to find the support of a
distribution and thus allows for outlier/novelty detection;
\item[Multi-class classification:] basically, SVMs can only solve
binary classification problems. To allow for multi-class
classification, \texttt{libsvm} uses the
\textit{one-against-one} technique by fitting all binary
subclassifiers and finding the correct class by a voting mechanism;
\item[$\epsilon$-regression:] here, the data points lie
\textit{in between} the two borders of the margin which is maximized under
suitable conditions to avoid outlier inclusion;
\item[$\nu$-regression:] with analogue modifications of the regression
model as in the classification case.
\end{description}
\section*{Usage in R}
The R interface to \texttt{libsvm} in package \pkg{e1071}, \texttt{svm()},
was designed to be as intuitive as possible. Models are fitted
and new data are predicted as usual, and both the vector/matrix and the
formula interface are implemented. As expected for R's statistical
functions, the engine tries to be smart about
the mode to be chosen, using the dependent variable's type ($y$):
if $y$ is a factor, the engine switches to
classification mode, otherwise, it behaves as a regression
machine; if $y$ is omitted, the engine assumes a novelty detection task.
\section*{Examples}
In the following two examples, we demonstrate the practical use of
\texttt{svm()} along with a comparison to classification and
regression forests as implemented in \texttt{randomForest()}.
\subsection*{Classification}
In this example, we use the glass data from the \href{http://www.ics.uci.edu/mlearn/MLRepository.html}{UCI Repository of
Machine Learning Databases} for
classification. The task is to predict the type of a glass on basis
of its chemical analysis.
We start by splitting the data into a train and test set:
<<>>=
library(e1071)
library(randomForest)
data(Glass, package="mlbench")
## split data into a train and test set
index <- 1:nrow(Glass)
N <- trunc(length(index)/3)
testindex <- sample(index, N)
testset <- Glass[testindex,]
trainset <- Glass[-testindex,]
@
Both for SVM and randomForest (via
\texttt{randomForest()}), we fit the model and predict the test set
values:
<<>>=
## svm
svm.model <- svm(Type ~ ., data = trainset, cost = 100, gamma = 1)
svm.pred <- predict(svm.model, testset[,-10])
@
(The dependent variable, \texttt{Type}, has column number
10. \texttt{cost} is a general penalizing parameter for $C$-classification and
\texttt{gamma} is the radial basis function-specific kernel parameter.)
<<>>=
## randomForest
rf.model <- randomForest(Type ~ ., data = trainset)
rf.pred <- predict(rf.model, testset[,-10])
@
A cross-tabulation of the true versus the predicted values yields:
<<>>=
## compute svm confusion matrix
table(pred = svm.pred, true = testset[,10])
## compute randomForest confusion matrix
table(pred = rf.pred, true = testset[,10])
@
%% results table
<>=
library(xtable)
rf.acc <- c()
sv.acc <- c()
rf.kap <- c()
sv.kap <- c()
reps <- 10
for (i in 1:reps) {
## split data into a train and test set
index <- 1:nrow(Glass)
N <- trunc(length(index)/3)
testindex <- sample(index, N)
testset <- na.omit(Glass[testindex,])
trainset <- na.omit(Glass[-testindex,])
## svm
svm.model <- svm(Type ~ ., data = trainset, cost = 8, gamma = 0.0625)
svm.pred <- predict(svm.model, testset[,-10])
tab <- classAgreement(table(svm.pred, testset[,10]))
sv.acc[i] <- tab$diag
sv.kap[i] <- tab$kappa
## randomForest
rf.model <- randomForest(Type ~ ., data = trainset)
rf.pred <- predict(rf.model, testset[,-10])
tab <- classAgreement(table(rf.pred, testset[,10]))
rf.acc[i] <- tab$diag
rf.kap[i] <- tab$kappa
}
x <- rbind(summary(sv.acc), summary(rf.acc), summary(sv.kap), summary(rf.kap))
rownames <- c()
tab <- cbind(rep(c("svm","randomForest"),2), round(x,2))
colnames(tab)[1] <- "method"
rownames(tab) <- c("Accuracy","","Kappa"," ")
xtable(tab, label = "tab:class", caption = "Performance of \\texttt{svm()} and\
\\texttt{randomForest()} for classification (10 replications)")
@
\noindent Finally, we compare the performance of the two methods by computing the
respective accuracy rates and the kappa indices (as computed by \texttt{classAgreement()}
also contained in package \pkg{e1071}). In Table \ref{tab:class}, we
summarize the results of \Sexpr{reps} replications---Support Vector Machines show worse results.
\subsection*{Non-linear $\epsilon$-Regression}
The regression capabilities of SVMs are demonstrated on the
ozone data. Again, we split the data
into a train and test set.
<<>>=
library(e1071)
library(randomForest)
data(Ozone, package="mlbench")
## split data into a train and test set
index <- 1:nrow(Ozone)
N <- trunc(length(index)/3)
testindex <- sample(index, N)
testset <- na.omit(Ozone[testindex,-3])
trainset <- na.omit(Ozone[-testindex,-3])
## svm
svm.model <- svm(V4 ~ ., data = trainset, cost = 1000, gamma = 0.0001)
svm.pred <- predict(svm.model, testset[,-3])
sqrt(crossprod(svm.pred - testset[,3]) / N)
## random Forest
rf.model <- randomForest(V4 ~ ., data = trainset)
rf.pred <- predict(rf.model, testset[,-3])
sqrt(crossprod(rf.pred - testset[,3]) / N)
@
<>=
rf.res <- c()
sv.res <- c()
reps <- 10
for (i in 1:reps) {
## split data into a train and test set
index <- 1:nrow(Ozone)
N <- trunc(length(index)/3)
testindex <- sample(index, N)
testset <- na.omit(Ozone[testindex,-3])
trainset <- na.omit(Ozone[-testindex,-3])
## svm
svm.model <- svm(V4 ~ ., data = trainset, cost = 1000, gamma = 0.0001)
svm.pred <- predict(svm.model, testset[,-3])
sv.res[i] <- sqrt(crossprod(svm.pred - testset[,3]) / N)
## randomForest
rf.model <- randomForest(V4 ~ ., data = trainset)
rf.pred <- predict(rf.model, testset[,-3])
rf.res[i] <- sqrt(crossprod(rf.pred - testset[,3]) / N)
}
xtable(rbind(svm = summary(sv.res), randomForest = summary(rf.res)),
label = "tab:reg", caption = "Performance of \\texttt{svm()} and\
\\texttt{randomForest()} for regression (Root Mean Squared Error, 10 replications)")
@
\noindent We compare the two methods by the root mean squared error (RMSE)---see Table
\ref{tab:reg} for a summary of \Sexpr{reps} replications.
In this case, \texttt{svm()}
does a better job than \texttt{randomForest()}.
\section*{Elements of the \texttt{svm} object}
The function \texttt{svm()} returns an object of class ``\texttt{svm}'',
which partly includes the following components:
\begin{description}
\item[\textbf{\texttt{SV}:}] matrix of support vectors found;
\item[\textbf{\texttt{labels}:}] their labels in classification mode;
\item[\textbf{\texttt{index}:}] index of the support vectors in the input data (could
be used e.g., for their visualization as part of the data set).
\end{description}
If the cross-classification feature is enabled, the
\texttt{svm} object will contain some additional information described below.
\section*{Other main features}
\begin{description}
\item[Class Weighting:] if one wishes to weight the classes
differently
(e.g., in case of asymmetric class sizes to avoid possibly
overproportional influence of bigger classes on the margin),
weights may be specified in a vector with named components.
In case of two classes A and B, we could use
something like: \texttt{m <- svm(x, y, class.weights = c(A = 0.3, B
= 0.7))}
\item[Cross-classification:] to assess the quality of the
training result, we can perform a $k$-fold cross-classification on
the training data by setting the parameter \texttt{cross} to $k$
(default: 0). The \texttt{svm} object will then contain some
additional values, depending on whether classification or regression is
performed.
Values for classification:
\begin{description}
\item[\texttt{accuracies}:] vector of accuracy values for each of the $k$
predictions
\item[\texttt{tot.accuracy}:] total accuracy
\end{description}
Values for regression:
\begin{description}
\item[\texttt{MSE}:] vector of mean squared errors for each of the $k$
predictions
\item[\texttt{tot.MSE}:] total mean squared error
\item[\texttt{scorrcoef}:] Squared correlation coefficient (of
the predicted and the true values of the dependent variable)
\end{description}
\end{description}
\section*{Tips on practical use}
\begin{itemize}
\item Note that SVMs may be very
sensitive to the proper choice of parameters, so allways check a range
of parameter combinations, at least on a reasonable subset of your
data.
\item For classification tasks, you will most likely use
$C$-classification with the RBF kernel (default), because of its good
general performance and the few number of parameters
(only two: $C$ and $\gamma$). The authors of \pkg{libsvm} suggest
to try small and large values for $C$---like 1 to 1000---first,
then to decide which are
better for the data by cross validation, and finally to try
several $\gamma$'s for the better $C$'s.
\item However, better results are obtained by using a grid search
over all parameters. For this, we recommend to use the
\texttt{tune.svm()} function in \pkg{e1071}.
\item Be careful with large datasets as training times may increase
rather fast.
\item Scaling of the data usually drastically improves the
results. Therefore, \texttt{svm()} scales the data by default.
\end{itemize}
\section*{Model Formulations and Kernels}
Dual representation of models implemented:
\begin{itemize}
\item $C$-classification:\\
\begin{eqnarray}
\min_\alpha&&\frac{1}{2}\alpha^\top \mathbf{Q} \alpha-\mathbf{e}^\top\alpha \nonumber\\
\mbox{s.t.} &&0\le\alpha_i\le C,~i=1,\ldots,l,\\
&&\mathbf{y}^\top\alpha=0~, \nonumber
\end{eqnarray}
where $\mathbf{e}$ is the unity vector, $C$ is the upper bound, $\mathbf{Q}$ is an $l$
by $l$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i,
x_j)$, and $K(x_i, x_j) \equiv \phi(x_i)^\top\phi(x_j)$ is the
kernel.
\item $\nu$-classification:\\
\begin{eqnarray}
\min_\alpha&&\frac{1}{2}\alpha^\top \mathbf{Q} \alpha \nonumber\\
\mbox{s.t.}&&0\le\alpha_i\le 1/l,~i=1,\ldots,l,\\
&&\mathbf{e}^\top \alpha \ge \nu, \nonumber\\
&&\mathbf{y}^\top\alpha=0~. \nonumber
\end{eqnarray}
where $\nu \in (0,1]$.
\item one-class classification:\\
\begin{eqnarray}
\min_\alpha&&\frac{1}{2}\alpha^\top \mathbf{Q} \alpha \nonumber\\
\mbox{s.t.} &&0\le\alpha_i\le 1/(\nu l),~i=1,\ldots,l,\\
&&\mathbf{e}^\top\alpha=1~,\nonumber
\end{eqnarray}
\item $\epsilon$-regression:\\
\begin{eqnarray}
\min_{\alpha, \alpha^*}&&\frac{1}{2}(\alpha-\alpha^*)^\top \mathbf{Q}
(\alpha-\alpha^*) + \nonumber\\
&&\epsilon\sum_{i=1}^{l}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{l}y_i(\alpha_i-\alpha_i^*) \nonumber\\
\mbox{s.t.} &&0\le\alpha_i, \alpha_i^*\le C,~i=1,\ldots,l,\\
&&\sum_{i=1}^{l}(\alpha_i-\alpha_i^*)=0~.\nonumber
\end{eqnarray}
\item $\nu$-regression:\\
\begin{eqnarray}
\min_{\alpha, \alpha^*}&&\frac{1}{2}(\alpha-\alpha^*)^\top \mathbf{Q}
(\alpha-\alpha^*) + \mathbf{z}^\top(\alpha_i-\alpha_i^*) \nonumber\\
\mbox{s.t.} &&0\le\alpha_i, \alpha_i^*\le C,~i=1,\ldots,l,\\
&&\mathbf{e}^\top(\alpha-\alpha^*)=0\nonumber\\
&&\mathbf{e}^\top(\alpha+\alpha^*)=C\nu~.\nonumber
\end{eqnarray}
\end{itemize}
\noindent Available kernels:\\
\\
\noindent
\begin{table}[h]
\centering
\begin{tabular}{|l|l|l|} \hline
kernel & formula & parameters \\ \hline \hline
linear & $\bf u^\top v$& (none) \\
polynomial & $(\gamma \mathbf{u^\top v}+c_0)^d$ & $\gamma, d, c_0$\\
radial basis fct. & $\exp\{-\gamma|\mathbf{u-v}|^2\}$&$\gamma$\\
sigmoid & $\tanh\{\gamma \mathbf{u^\top v}+c_0\}$ &$\gamma, c_0$\\ \hline
\end{tabular}
\end{table}
\section*{Conclusion}
We hope that \texttt{svm} provides an easy-to-use interface to the
world of SVMs, which nowadays have become a popular technique in
flexible modelling. There are some drawbacks, though:
SVMs scale rather badly with the data size due to the quadratic
optimization algorithm and the kernel transformation. Furthermore,
the correct choice of kernel parameters is crucial for obtaining good
results, which practically means that an extensive
search must be conducted on the parameter space before results can be
trusted, and this often complicates the task
(the authors of \texttt{libsvm} currently conduct some work on methods of efficient automatic
parameter selection). Finally, the current implementation is optimized
for the radial basis function kernel only, which clearly might be
suboptimal for your data.
\begin{thebibliography}{5}
\bibitem[Bennett \& Campbell(2000)]{svm:bennett+campbell:2000}
Bennett, K.~P. \& Campbell, C. (2000).
\newblock Support vector machines: Hype or hallelujah?
\newblock \emph{SIGKDD Explorations}, \textbf{2}(2).
\newblock
\url{http://www.acm.org/sigs/sigkdd/explorations/issue2-2/bennett.pdf}.
\bibitem[Chang \& Lin(2001)]{svm:chang+lin:2001}
Chang, C.-C. \& Lin, C.-J. (2001).
\newblock {LIBSVM}: a library for support vector machines.
\newblock Software available at
\url{http://www.csie.ntu.edu.tw/~cjlin/libsvm}, detailed documentation
(algorithms, formulae, \dots) can be found
in \url{http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.ps.gz}
\bibitem[Cortes \& Vapnik(1995)]{svm:cortes+vapnik:1995}
Cortes, C. \& Vapnik, V. (1995).
\newblock Support-vector network.
\newblock \emph{Machine Learning}, \textbf{20}, 1--25.
\bibitem[Sch\"olkopf et~al.(2000)Sch\"olkopf, Smola, Williamson, \&
Bartlett]{svm:scholkopf+smola+williamson:2000}
Sch\"olkopf, B., Smola, A., Williamson, R.~C., \& Bartlett, P. (2000).
\newblock New support vector algorithms.
\newblock \emph{Neural Computation}, \textbf{12}, 1207--1245.
\bibitem[Vapnik(1998)]{svm:vapnik:1998}
Vapnik, V. (1998).
\newblock \emph{Statistical learning theory}.
\newblock New York: Wiley.
\end{thebibliography}
\end{document}