Walid Krichene; Continuous-time dynamics for convex optimization

Feb 20th, 2018, 12:00pm
CSE 403
Walid Krichene, Google

Abstract: Many optimization algorithms can be viewed as a discretization of a continuous-time process (described using an ordinary differential equation, or a stochastic differential equation in the presence of noise). The continuous-time point of view can be useful for simplifying the analysis, drawing connections with physics, and streamlining the design of new algorithms and heuristics. We will review results from continuous-time optimization, from the simple case of gradient flow, to more recent results on accelerated methods. In particular, we give simple interpretations of acceleration, and show how these interpretations motivate heuristics (restarting and adaptive averaging) which, empirically, can significantly improve the speed of convergence. We will then focus on the stochastic case, and study the interaction between acceleration and noise, and their effect on the convergence rates. We will conclude with a brief review of how the same tools can be applied in other problems, such as approximate sampling and non-convex optimization.
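
As a minimal illustration of the discretization viewpoint in the abstract, the Python sketch below treats gradient descent as a forward-Euler discretization of the gradient flow ODE dx/dt = -grad f(x); the quadratic objective, initial point, and step size are illustrative assumptions, not material from the talk.

import numpy as np

# Minimal sketch (illustrative assumptions, not the speaker's code): gradient
# descent as a forward-Euler discretization of the gradient flow ODE
#     dx/dt = -grad f(x),
# for a simple convex quadratic f(x) = 0.5 * x^T A x.

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # positive definite, so f is convex

def grad_f(x):
    return A @ x                    # gradient of 0.5 * x^T A x

x = np.array([2.0, -1.5])           # initial condition of the ODE
h = 0.1                             # step size = Euler discretization step
for _ in range(200):
    x = x - h * grad_f(x)           # x_{k+1} = x_k - h * grad f(x_k)

print(0.5 * x @ A @ x)              # f(x) approaches the minimum value 0

Accelerated methods correspond, in the same way, to discretizations of second-order ODEs, and it is in that setting that the restarting and adaptive averaging heuristics mentioned in the abstract operate.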

Bio: Walid Krichene is at Google Research, where he works on large-scale optimization and recommendation. He received his Ph.D. in EECS in 2016 from UC Berkeley, where he was advised by Alex Bayen and Peter Bartlett, an M.A. in Mathematics from UC Berkeley, and an M.S. in Engineering and Applied Math from the Ecole des Mines ParisTech. He received the Leon Chua Award and two outstanding GSI awards from UC Berkeley. His research interests include convex optimization, stochastic approximation, recommender systems, and online learning.

Jakub Konečný; Federated Learning: Privacy-Preserving Collaborative Machine Learning without Centralized Training Data

Jan 30th, 2018, 12:00pm
CSE 403
Jakub Konečný, Google

Abstract: Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model while the training data remains distributed over a large number of clients, each with an unreliable and relatively slow network connection. We consider learning algorithms for this setting in which, on each round, each client independently computes an update to the current model based on its local data and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model.

In this talk, I will introduce the underlying algorithms and present several ideas for improving the overall system in terms of communication efficiency, security, and differential privacy.
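
A minimal sketch of the round structure described in the abstract is given below in Python; the linear least-squares model, the local solver, the synthetic client data, and the equal-weight averaging are illustrative assumptions rather than the production system.

import numpy as np

# Minimal sketch (illustrative assumptions, not the production system) of the
# round structure in the abstract: each client computes an update to the
# current model on its own data, and the server aggregates the client updates
# into a new global model.

def local_update(w, X, y, lr=0.1, steps=5):
    """One client's update: a few gradient steps on its local data."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)   # least-squares gradient step
    return w

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(10)]

w_global = np.zeros(3)
for _ in range(50):                            # communication rounds
    # clients start from the current global model and train independently
    local_models = [local_update(w_global, X, y) for X, y in clients]
    # server aggregates the client-side updates into a new global model
    w_global = np.mean(local_models, axis=0)

print(w_global)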

Bio: Jakub Konečný is a research scientist at Google working on Federated Learning, an effort to decentralize machine learning. Prior to joining Google, Jakub completed his PhD at the University of Edinburgh, focusing on optimization algorithms for machine learning.

Zeyuan Allen-Zhu; How to Swing By Saddle Points: Faster Non-Convex Optimization Than SGD

Feb 13th, 2018, 4pm
LOW 105
Zeyuan Allen-Zhu, Microsoft Research

The diverse world of deep learning research has given rise to thousands of architectures for neural networks. However, to date, the underlying training algorithms for neural networks are still stochastic gradient descent (SGD) and its heuristic variants.

In this talk, we present a new stochastic algorithm that trains any smooth neural network to ε-approximate local minima using O(ε^{-3.25}) backpropagations. Before this work, the best provable result, achieved by SGD, was O(ε^{-4}).

More broadly, it finds ε-approximate local minima of any smooth nonconvex function using only O(ε^{-3.25}) stochastic gradient computations.
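
For context, an ε-approximate local minimum of a smooth nonconvex function f is commonly formalized by a small-gradient condition together with a near-positive-semidefinite Hessian condition, roughly of the form

$\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \nabla^2 f(x) \succeq -\sqrt{\epsilon}\, I,$

where the exact thresholds and constants used in this work may differ.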

Felix Ye; Estimate exponential memory decay in Hidden Markov Model and its applications

Nov 14, 2017, 4pm
PDL C-401
Felix Ye, Department of Applied Mathematics

Inference in hidden Markov models has been challenging to scale due to dependencies in the observation data. In this work, we exploit the inherent memory decay of hidden Markov models so that the forward and backward probabilities can be computed on subsequences, enabling efficient inference over long sequences of observations. We formulate the forward filtering process as a random dynamical system, in which the product of i.i.d. random matrices admits Lyapunov exponents. The rate of memory decay is then, almost surely, $\lambda_2-\lambda_1$, the gap between the top two Lyapunov exponents. We propose an efficient and accurate algorithm to numerically estimate this gap after a soft-max parametrization. The subsequence length $B$ required for a controlled error $\epsilon$ is $B=\log(\epsilon)/(\lambda_2-\lambda_1)$. We theoretically prove the validity of the algorithm and demonstrate its effectiveness with numerical examples. The method developed here can be applied to widely used algorithms such as the mini-batch stochastic gradient method. Moreover, the continuity of the Lyapunov spectrum ensures that the estimated $B$ can be reused for nearby parameters during inference.
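
As a rough illustration of the quantity driving the analysis (and not the speaker's algorithm), the Python sketch below estimates the top two Lyapunov exponents of a product of i.i.d. random matrices with the standard QR method and converts the gap into a subsequence length via the formula $B=\log(\epsilon)/(\lambda_2-\lambda_1)$ from the abstract; the Gaussian matrices are illustrative stand-ins for the HMM's forward-filtering matrices.

import numpy as np

# Minimal sketch (not the speaker's algorithm): estimate the top two Lyapunov
# exponents of a product of i.i.d. random matrices with the standard QR method,
# then read off the gap lambda_2 - lambda_1 that governs the memory decay and
# the subsequence length B = log(eps) / (lambda_2 - lambda_1).

rng = np.random.default_rng(0)
d, T = 4, 5000
Q = np.eye(d)
log_r = np.zeros(d)
for _ in range(T):
    M = rng.normal(size=(d, d)) / np.sqrt(d)    # i.i.d. random matrix
    Q, R = np.linalg.qr(M @ Q)
    log_r += np.log(np.abs(np.diag(R)))         # accumulated stretching factors

lyap = np.sort(log_r / T)[::-1]                 # lambda_1 >= lambda_2 >= ...
gap = lyap[1] - lyap[0]                         # negative: rate of memory decay
eps = 1e-6                                      # target error level
B = np.log(eps) / gap                           # required subsequence length
print(lyap[:2], B)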

John Duchi; Composite modeling and optimization, with applications to phase retrieval and nonlinear observation modeling

CORE Series
Friday, Oct 20, 2017
SMI 211, 3:30PM 
John Duchi, Stanford University (Departments of Statistics and Electrical Engineering)

TITLE: Composite modeling and optimization, with applications to phase retrieval and nonlinear observation modeling

ABSTRACT: We consider minimization of stochastic functionals that are compositions of a (potentially) non-smooth convex function h and a smooth function c. We develop two stochastic methods, a stochastic prox-linear algorithm and a stochastic (generalized) sub-gradient procedure, and prove that, under mild technical conditions, each converges to first-order stationary points of the stochastic objective. Additionally, we analyze this problem class in the context of phase retrieval and more generic nonlinear modeling problems, showing that we can solve these problems (even with faulty measurements) with extremely high probability under appropriate random measurement models. We provide substantial experiments investigating our methods, indicating the practical effectiveness of the procedures.
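
As a concrete instance of the composite structure in the phase-retrieval setting, the Python sketch below runs a plain stochastic subgradient method on f(x) = (1/m) * sum_i |(a_i^T x)^2 - b_i|, i.e. the nonsmooth convex h(t) = |t| composed with the smooth maps c_i(x) = (a_i^T x)^2 - b_i; the data generation, initialization, and step-size schedule are illustrative assumptions, and this is not the authors' implementation.

import numpy as np

# Minimal sketch (illustrative, not the authors' implementation): a stochastic
# subgradient method for the composite phase-retrieval objective
#     f(x) = (1/m) * sum_i | (a_i^T x)^2 - b_i |.

rng = np.random.default_rng(0)
d, m = 50, 1000
A = rng.normal(size=(m, d))                  # random Gaussian measurements a_i
x_true = rng.normal(size=d)
b = (A @ x_true) ** 2                        # noiseless measurements b_i

# For illustration we start near the target; in theory and in practice this
# role is played by a careful (e.g. spectral) initialization.
x = x_true + 0.3 * rng.normal(size=d)

step = 0.05                                  # illustrative step-size schedule
for _ in range(20000):
    i = rng.integers(m)
    a = A[i]
    resid = (a @ x) ** 2 - b[i]              # c_i(x)
    g = np.sign(resid) * 2 * (a @ x) * a     # subgradient of |c_i(.)| at x
    x -= step * g / (np.linalg.norm(g) + 1e-12)   # normalized subgradient step
    step *= 0.9995                           # geometric decay

# distance to x_true, up to the global sign ambiguity of phase retrieval
err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
print(err / np.linalg.norm(x_true))

The stochastic prox-linear method mentioned in the abstract instead minimizes, at each step, a convex model formed by linearizing c inside h (plus a proximal term); the subgradient version above is the simpler of the two procedures.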

BIO: John Duchi is an assistant professor of Statistics and Electrical Engineering and (by courtesy) Computer Science at Stanford University, with graduate degrees from UC Berkeley and undergraduate degrees from Stanford. His work focuses on large scale optimization problems arising out of statistical and machine learning problems, robustness and uncertain data problems, and information theoretic aspects of statistical learning. He has won a number of awards and fellowships, including a best paper award at the International Conference on Machine Learning, an NSF CAREER award, and a Sloan Fellowship in Mathematics.

Hongzhou Lin; A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization

June 6, 2017, 4pm
PDL C-401
Hongzhou Lin, Inria Grenoble

In this talk, we propose a generic approach to accelerate gradient-based optimization algorithms with quasi-Newton principles. The proposed scheme, called QuickeNing, can be applied to incremental first-order methods such as stochastic variance-reduced gradient (SVRG) or incremental surrogate optimization (MISO). It is also compatible with composite objectives, meaning that it can provide exactly sparse solutions when the objective involves a sparsity-inducing regularization. QuickeNing relies on limited-memory BFGS rules, making it appropriate for solving high-dimensional optimization problems. In addition, it enjoys a worst-case linear convergence rate for strongly convex problems. We present experimental results where QuickeNing gives significant improvements over competing methods for solving large-scale, high-dimensional machine learning problems.
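
To make the abstract's reference to limited-memory BFGS rules concrete, the Python sketch below implements the standard L-BFGS two-loop recursion, which computes a quasi-Newton direction from stored curvature pairs; this is the generic building block, not the QuickeNing scheme itself, and the curvature pairs are assumed to be supplied by the surrounding optimizer.

import numpy as np

# The standard L-BFGS two-loop recursion: given the most recent curvature pairs
# s_k = x_{k+1} - x_k and y_k = grad f(x_{k+1}) - grad f(x_k), apply the
# implicit inverse-Hessian approximation to a gradient g.  This is the generic
# limited-memory BFGS building block mentioned in the abstract, not the
# QuickeNing scheme itself.

def lbfgs_direction(g, s_list, y_list):
    q = np.array(g, dtype=float)
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    if s_list:                                             # initial scaling H_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):  # oldest pair first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return -q                                              # quasi-Newton descent direction

As the abstract notes, QuickeNing applies such rules on top of incremental first-order methods such as SVRG or MISO.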