
Yurii Nesterov; Relative smoothness condition and its application to third-order methods

CORE Series
Monday, May 21, 2018
SMI 205, 4:00PM 
Yurii Nesterov (CORE/INMA, UCL, Belgium)

TITLE:  Relative smoothness condition and its application to third-order methods

ABSTRACT:  In this talk, we show that the recently developed relative smoothness condition can be used for constructing implementable third-order methods for Unconstrained Convex Optimization. At each iteration of these methods, we need to solve an auxiliary problem of minimizing a convex multivariate polynomial, which is a sum of the third-order Taylor approximation and a regularization term. It appears that this nontrivial nonlinear optimization problem can be solved very efficiently by a gradient-type minimization method based on the relative smoothness condition. Its linear rate of convergence depends only on an absolute constant. This result opens a possibility for practical implementation of the third-order methods.
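For reference, the auxiliary subproblem has, up to notation, the following form: minimize the third-order Taylor model of the objective at the current iterate plus a fourth-power regularizer. This is a standard presentation of regularized tensor methods; the exact regularization constant used in the talk may differ.

$$
T_{x_k}(y) = f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \tfrac{1}{2}\nabla^2 f(x_k)[y-x_k]^2 + \tfrac{1}{6}\nabla^3 f(x_k)[y-x_k]^3,
\qquad
x_{k+1} \in \arg\min_{y} \Big\{ T_{x_k}(y) + \tfrac{M}{24}\,\|y - x_k\|^4 \Big\}.
$$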

BIO:  Yurii Nesterov is a professor at the Center for Operations Research and Econometrics (CORE) at the Catholic University of Louvain (UCL), Belgium. He received his Ph.D. in Applied Mathematics in 1984 from the Institute of Control Sciences, Moscow, and has worked at CORE since 1993.

His research interests are related to complexity issues and efficient methods for solving various optimization problems. His main results are in Convex Optimization (optimal methods for smooth problems, polynomial-time interior-point methods, the smoothing technique for structural optimization, complexity theory for second-order methods, and optimization methods for huge-scale problems). He is the author of 5 monographs and more than 100 refereed papers in leading optimization journals. He has received several international prizes, among which are the Dantzig Prize from SIAM and the Mathematical Programming Society (2000), the von Neumann Theory Prize from INFORMS (2009), the SIAM Outstanding Paper Award (2014), and the EURO Gold Medal from the Association of European Operational Research Societies (2016). In 2018 he also won an Advanced Grant from the European Research Council.

Mark Schmidt; Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence

May 29th, 2018, 4:00pm
SAV 264
Mark Schmidt, University of British Columbia

Abstract: Block coordinate descent (BCD) methods are widely-used for large-scale numerical optimization because of their cheap iteration costs, low memory requirements, amenability to parallelization, and ability to exploit problem structure. Three main algorithmic choices influence the performance of BCD methods: the block partitioning strategy, the block selection rule, and the block update rule. In this paper we explore all three of these building blocks and propose variations for each that can lead to significantly faster BCD methods. We (i) propose new greedy block-selection strategies that guarantee more progress per iteration than the Gauss-Southwell rule; (ii) explore practical issues like how to implement the new rules when using “variable” blocks; (iii) explore the use of message-passing to compute matrix or Newton updates efficiently on huge blocks for problems with a sparse dependency between variables; and (iv) consider optimal active manifold identification, which leads to bounds on the “active set complexity” of BCD methods and leads to superlinear convergence for certain problems with sparse solutions (and in some cases finite termination at an optimal solution). We support all of our findings with numerical results for the classic machine learning problems of least squares, logistic regression, multi-class logistic regression, label propagation, and L1-regularization.
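As a point of reference for the block-selection discussion above, here is a minimal sketch of greedy block coordinate descent with a Gauss-Southwell-Lipschitz-style selection rule over fixed blocks. The function names and the least-squares example are illustrative; the talk's new selection rules, variable blocks, and message-passing updates go well beyond this baseline.

```python
import numpy as np

def greedy_bcd(grad, lipschitz, x0, blocks, n_iters=100):
    """Greedy block coordinate descent with a Gauss-Southwell-Lipschitz-style
    selection rule over fixed blocks.

    grad(x)      -- gradient of a smooth objective
    lipschitz[i] -- Lipschitz constant of the gradient restricted to block i
    blocks       -- list of index arrays partitioning the coordinates
    """
    x = x0.copy()
    for _ in range(n_iters):
        g = grad(x)
        # score each block by the progress of one Lipschitz gradient step
        scores = [np.linalg.norm(g[b]) ** 2 / lipschitz[i]
                  for i, b in enumerate(blocks)]
        i = int(np.argmax(scores))
        x[blocks[i]] -= g[blocks[i]] / lipschitz[i]
    return x

# Toy example: least squares 0.5 * ||A x - y||^2 with two fixed blocks
rng = np.random.default_rng(0)
A, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - y)
blocks = [np.arange(0, 5), np.arange(5, 10)]
lipschitz = [np.linalg.norm(A[:, b], 2) ** 2 for b in blocks]
x_hat = greedy_bcd(grad, lipschitz, np.zeros(10), blocks)
```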

Biography: Mark Schmidt has been an assistant professor in the Department of Computer Science at the University of British Columbia since 2014, and is a Canada Research Chair and Alfred P. Sloan Fellow. His research focuses on developing faster algorithms for large-scale machine learning, with an emphasis on methods that have provable convergence rates and can be applied to structured prediction problems. From 2011 through 2013 he worked at the École Normale Supérieure in Paris on inexact and stochastic convex optimization methods. He finished his M.Sc. in 2005 at the University of Alberta working as part of the Brain Tumor Analysis Project, and his Ph.D. in 2010 at the University of British Columbia working on graphical model structure learning with L1-regularization. He has also worked at Siemens Medical Solutions on heart motion abnormality detection, with Michael Friedlander in the Scientific Computing Laboratory at the University of British Columbia on semi-stochastic optimization methods, and with Anoop Sarkar at Simon Fraser University on large-scale training of natural language models.

Aaron Sidford; Faster Algorithms for Computing the Stationary Distribution

CORE Series
Tuesday, May 8, 2018
SAV 264, 4:00PM 
Aaron Sidford, Stanford University (Management Science and Engineering)

TITLE: Faster Algorithms for Computing the Stationary Distribution

ABSTRACT: Computing the stationary distribution of a Markov chain is one of the most fundamental problems in optimization. It lies at the heart of numerous computational tasks, including computing personalized PageRank vectors, evaluating the utility of policies in Markov decision processes, and solving asymmetric diagonally dominant linear systems. Despite the ubiquity of these problems, until recently the fastest known running times for computing the stationary distribution either depended polynomially on the mixing time or the desired accuracy, or appealed to generic linear system solving machinery and therefore ran in super-quadratic time.

In this talk I will present recent results showing that the stationary distribution and related quantities can all be computed in almost linear time. I will present new iterative methods for extracting structure from directed graphs and show how they can be tailored to achieve this new running time. Moreover, I will discuss connections between this work and recent developments in solving Laplacian systems, as well as emerging trends in combining numerical and combinatorial techniques in the design of optimization algorithms.

This talk reflects joint work with Michael B. Cohen (MIT), Jonathan Kelner (MIT), Rasmus Kyng (Harvard), John Peebles (MIT), Richard Peng (Georgia Tech), Anup Rao (Adobe Research), and Adrian Vladu (Boston University).
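For context, the textbook baseline that the talk improves on is power iteration, whose running time depends on the mixing time of the chain. Below is a minimal sketch of that baseline (illustrative only, not the almost-linear-time method from the talk).

```python
import numpy as np

def stationary_distribution(P, tol=1e-10, max_iters=100000):
    """Textbook power iteration for the stationary distribution pi of a
    row-stochastic transition matrix P (pi P = pi). Its running time scales
    with the mixing time -- exactly the dependence the talk improves on."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iters):
        pi_next = pi @ P
        if np.linalg.norm(pi_next - pi, 1) < tol:
            return pi_next
        pi = pi_next
    return pi

# Example: lazy random walk on a 3-cycle; stationary distribution is uniform
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
print(stationary_distribution(P))   # ~[1/3, 1/3, 1/3]
```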

BIO: Aaron Sidford completed his Ph.D. at MIT (CSAIL), advised by Jon Kelner, and is now an Assistant Professor at Stanford (MSE). He works on fast algorithms for various optimization problems. His work has received best paper awards at SODA 2014 and FOCS 2014 and a best student paper award at FOCS 2015. In particular, he became broadly known in the community for the development of an interior-point algorithm that solves linear programs in sqrt(rank) iterations (also improving the total running time compared to previous algorithms). This had been the most prominent open problem in the field of interior-point methods since the work of Nesterov and Nemirovski in 1994.

Walid Krichene; Continuous-time dynamics for convex optimization

Feb 20th, 2018, 12:00pm
CSE 403
Walid Krichene, Google

Abstract: Many optimization algorithms can be viewed as a discretization of a continuous-time process (described using an ordinary differential equation, or a stochastic differential equation in the presence of noise). The continuous-time point of view can be useful for simplifying the analysis, drawing connections with physics, and streamlining the design of new algorithms and heuristics. We will review results from continuous-time optimization, from the simple case of gradient flow, to more recent results on accelerated methods. In particular, we give simple interpretations of acceleration, and show how these interpretations motivate heuristics (restarting and adaptive averaging) which, empirically, can significantly improve the speed of convergence. We will then focus on the stochastic case, and study the interaction between acceleration and noise, and their effect on the convergence rates. We will conclude with a brief review of how the same tools can be applied in other problems, such as approximate sampling and non-convex optimization.
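To make the continuous-time viewpoint concrete, here is a small sketch that forward-Euler discretizes gradient flow, x'(t) = -grad f(x), and the Su-Boyd-Candes ODE, x'' + (3/t) x' + grad f(x) = 0, which is often used to model Nesterov acceleration, on a toy quadratic. The step size and starting time are illustrative choices, not the talk's.

```python
import numpy as np

def gradient_flow_step(x, grad, h):
    """One forward-Euler step of gradient flow  x'(t) = -grad f(x(t))."""
    return x - h * grad(x)

def accelerated_ode_step(x, v, t, grad, h):
    """One forward-Euler step of the Su-Boyd-Candes ODE
       x''(t) + (3/t) x'(t) + grad f(x(t)) = 0,
    written as a first-order system in (x, v) with v = x'."""
    x_new = x + h * v
    v_new = v + h * (-(3.0 / t) * v - grad(x))
    return x_new, v_new

# Toy quadratic f(x) = 0.5 * x^T Q x
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x

x_gf = np.array([1.0, 1.0])
x_acc, v_acc = np.array([1.0, 1.0]), np.zeros(2)
h, t = 1e-3, 0.1          # start t away from the 3/t singularity at t = 0
for _ in range(5000):
    x_gf = gradient_flow_step(x_gf, grad, h)
    x_acc, v_acc = accelerated_ode_step(x_acc, v_acc, t, grad, h)
    t += h
print(0.5 * x_gf @ Q @ x_gf, 0.5 * x_acc @ Q @ x_acc)
```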

Bio: Walid Krichene is at Google Research, where he works on large-scale optimization and recommendation. He received his Ph.D. in EECS in 2016 from UC Berkeley, where he was advised by Alex Bayen and Peter Bartlett, a M.A. in Mathematics from U.C. Berkeley, and a M.S. in Engineering and Applied Math from the Ecole des Mines ParisTech. He received the Leon Chua Award and two outstanding GSI awards from U.C. Berkeley. His research interests include convex optimization, stochastic approximation, recommender systems, and online learning.

Jakub Konečný; Federated Learning: Privacy-Preserving Collaborative Machine Learning without Centralized Training Data

Jan 30th, 2018, 12:00pm
CSE 403
Jakub Konečný, Google

Abstract: Federated Learning is a machine learning setting where the goal is to train a high-quality centralized model while the training data remains distributed over a large number of clients, each with an unreliable and relatively slow network connection. We consider learning algorithms for this setting in which, on each round, each client independently computes an update to the current model based on its local data and communicates this update to a central server, where the client-side updates are aggregated to compute a new global model.
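Below is a toy sketch of this local-update-then-aggregate pattern: a federated-averaging-style round for a simple least-squares model. The function names and data-size weighting are illustrative assumptions, and client sampling, secure aggregation, and differential privacy are omitted.

```python
import numpy as np

def client_update(w, X, y, lr=0.1, local_steps=5):
    """One client's local update: a few full-gradient steps on its own data
    for a linear least-squares model (a stand-in for local SGD)."""
    w = w.copy()
    for _ in range(local_steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def federated_round(w_global, client_data):
    """One communication round: each client computes an update locally; the
    server aggregates the updates by a data-size-weighted average."""
    updates = [client_update(w_global, X, y) for X, y in client_data]
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    weights = sizes / sizes.sum()
    return sum(wi * ui for wi, ui in zip(weights, updates))

# Toy simulation: 3 clients holding disjoint slices of a synthetic dataset
rng = np.random.default_rng(0)
w_true = rng.standard_normal(5)
clients = []
for _ in range(3):
    X = rng.standard_normal((40, 5))
    clients.append((X, X @ w_true + 0.01 * rng.standard_normal(40)))

w = np.zeros(5)
for _ in range(20):
    w = federated_round(w, clients)
```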

In this talk, I will introduce the underlying algorithms, and present several ideas for improving the overall system in terms of communication efficiency, security, and differential privacy.

Bio: Jakub Konečný is a research scientist at Google working on Federated Learning, an effort to decentralize machine learning. Prior to joining Google, Jakub completed his Ph.D. at the University of Edinburgh, focusing on optimization algorithms for machine learning.

Zeyuan Allen-Zhu; How to Swing By Saddle Points: Faster Non-Convex Optimization Than SGD

Feb 13th, 2018, 4pm
LOW 105
Zeyuan Allen-Zhu, Microsoft Research

Abstract: The diverse world of deep learning research has given rise to thousands of architectures for neural networks. However, to date, the underlying training algorithms for neural networks are still stochastic gradient descent (SGD) and its heuristic variants.

In this talk, we present a new stochastic algorithm to train any smooth neural network to ε-approximate local minima using O(ε^{-3.25}) backpropagations. Before this work, the best provable result was O(ε^{-4}), achieved by SGD.

More broadly, it finds ε-approximate local minima of any smooth nonconvex function using only O(ε^{-3.25}) stochastic gradient computations.
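Here, "ε-approximate local minimum" is meant in the usual second-order sense. One common convention (the precise second-order tolerance in the talk may differ) requires

$$
\|\nabla f(x)\| \le \varepsilon
\qquad\text{and}\qquad
\nabla^2 f(x) \succeq -\sqrt{\varepsilon}\, I.
$$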

Felix Ye; Estimate exponential memory decay in Hidden Markov Model and its applications

Nov 14, 2017, 4pm
PDL C-401
Felix Ye, Department of Applied Mathematics

Abstract: Inference in hidden Markov models has been challenging to scale because of dependencies in the observation data. In this work, we exploit the inherent memory decay of hidden Markov models, so that the forward and backward probabilities can be computed on subsequences, enabling efficient inference over long sequences of observations. We formulate the forward filtering process in the setting of random dynamical systems, where the product of i.i.d. random matrices admits Lyapunov exponents. The rate of memory decay is, almost surely, $\lambda_2-\lambda_1$, the gap between the top two Lyapunov exponents. We propose an efficient and accurate algorithm to numerically estimate this gap after a soft-max parametrization. The subsequence length $B$ required for a controlled error $\epsilon$ is $B=\log(\epsilon)/(\lambda_2-\lambda_1)$. We theoretically prove the validity of the algorithm and demonstrate its effectiveness with numerical examples. The method developed here can be applied to widely used algorithms such as the mini-batch stochastic gradient method. Moreover, the continuity of the Lyapunov spectrum ensures that the estimated $B$ can be reused for nearby parameters during inference.
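To illustrate how the gap between the top two Lyapunov exponents of a product of i.i.d. random matrices can be estimated, here is a toy sketch using the standard QR re-orthonormalization (Benettin-style) estimator. The random-matrix model and function names are illustrative assumptions, not the paper's soft-max parametrized algorithm.

```python
import numpy as np

def sample_forward_matrix(rng, d=5):
    """One random forward-filter update matrix: diag(emission) @ T^T, with a
    random row-stochastic transition T and random emission likelihoods."""
    T = rng.random((d, d))
    T /= T.sum(axis=1, keepdims=True)
    b = rng.random(d)
    return np.diag(b) @ T.T

def top_two_lyapunov(sample_matrix, n_steps=20000, seed=0):
    """Estimate the top two Lyapunov exponents of a product of i.i.d. random
    matrices by tracking a 2-dimensional subspace and re-orthonormalizing
    with a QR decomposition at every step."""
    rng = np.random.default_rng(seed)
    d = sample_matrix(rng).shape[0]
    Q = np.eye(d)[:, :2]
    log_growth = np.zeros(2)
    for _ in range(n_steps):
        Q, R = np.linalg.qr(sample_matrix(rng) @ Q)
        log_growth += np.log(np.abs(np.diag(R)))
    return log_growth / n_steps

lam1, lam2 = top_two_lyapunov(sample_forward_matrix)
eps = 1e-6
B = np.log(eps) / (lam2 - lam1)   # subsequence length for error tolerance eps
print(f"lambda1 = {lam1:.4f}, lambda2 = {lam2:.4f}, B ~ {B:.1f}")
```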