Abstract
The extrapolation strategy proposed by Nesterov, which can accelerate the
convergence rate of gradient descent methods by orders of magnitude when
dealing with smooth convex objectives, has led to tremendous success in
training machine learning models. In this paper, we theoretically study
its effect on the convergence of individual iterates for general
non-smooth convex optimization problems, which we call
\textit{individual convergence}. We prove that
Nesterov’s extrapolation is capable of making the individual convergence
of projected gradient methods optimal for general convex problems, which
is currently a challenging problem in the machine learning community. In light
of this result, we show that a simple modification of the gradient operation
suffices to achieve optimal individual convergence for strongly convex
problems, which can be regarded as an interesting step towards
the open question about SGD posed by Shamir
\cite{shamir2012open}. Furthermore, the derived
algorithms are extended to solve regularized non-smooth learning
problems in stochastic settings. {\color{blue}They can
serve as an alternative to the most basic SGD, especially for
machine learning problems in which an individual output is needed to
preserve the regularization structure while keeping an optimal rate of
convergence.} In particular, our method is an efficient tool
for solving large-scale $l_1$-regularized hinge-loss learning
problems. Several real-world experiments demonstrate that the derived
algorithms not only achieve optimal individual convergence rates but
also guarantee better sparsity than the averaged solution.