Probability and Statistics Seminar 2025-2026
NEW [May 2021]: Our YOUTUBE CHANNEL contains videos of some recent talks.
(Starting from Spring 2026) Wednesdays 2-3 pm in KAP 414; tea is usually provided at 3:00 pm
NEW [Aug 2023]: To sign up for our MAILING LIST, email: sympa@mymaillists.usc.edu with subject: Subscribe probstat-l Firstname Lastname [besides the subject line, the email itself should be empty]
Organizers: Xiaohui Chen <xiaohuic@usc.edu> Evgeni Dimitrov <edimitro@usc.edu> Steven Heilman <stevenmheilman@gmail.com> Stanislav Minsker <minsker@usc.edu> Yizhe Zhu <yizhezhu@usc.edu>
For seminars/colloquia on other topics, see the Department of Mathematics webpage.
Older seminars: Spring 2018-Fall 2024, Fall 2018, Spring 2018, Fall 2017, Spring 2017, Fall 2016, Spring 2016, Fall 2015, Spring 2015, Fall 2014, 2013-2014
Spring 2026
Jan 14:
Jan 21: Jason Fulman (USC)
Three applications of Stein’s method
Stein’s method is a remarkable technique for proving limit theorems in problems with dependence or limited information, and moreover sometimes yields results with error terms. In this talk we use a single framework to obtain results for three examples: one about random permutations, one about random walk, and one about random matrices.
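For context, the classical normal case works as follows (standard background, not specific to the talk): $W$ has the standard normal law if and only if $\mathbb{E}[f'(W) - W f(W)] = 0$ for all sufficiently smooth $f$. Given a test function $h$, one solves the Stein equation
$$ f'(w) - w f(w) = h(w) - \mathbb{E}h(Z), \qquad Z \sim N(0,1), $$
so that $|\mathbb{E}h(W) - \mathbb{E}h(Z)| = |\mathbb{E}[f'(W) - W f(W)]|$, and the distributional comparison reduces to direct estimates on $W$, which is how explicit error terms arise.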
Jan 28:
Feb 4:
Feb 11:
Feb 18: Benjamin McKenna (Georgia Tech)
Feb 25:
Mar 4:
Mar 11:
Mar 18: [no classes]
Mar 25:
Apr 1:
Apr 8:
Apr 15:
Apr 22:
Apr 29:
Fall 2025
August 29: Simone Bombari (Institute of Science and Technology Austria)
Privacy for Free in the Overparameterized Regime
Differentially private gradient descent (DP-GD) is a popular algorithm to train deep learning models with provable guarantees on the privacy of the training data. In the last decade, the problem of understanding its performance cost with respect to standard GD has received remarkable attention from the research community, which has led to upper bounds on the excess population risk in different learning settings. However, such bounds typically degrade with over-parameterization, i.e., as the number of parameters p gets larger than the number of training samples n — a regime which is ubiquitous in current deep-learning practice. As a result, the lack of theoretical insights leaves practitioners without clear guidance, leading some to reduce the effective number of trainable parameters to improve performance, while others use larger models to achieve better results through scale. In this work, we show that in the popular random features model with quadratic loss, for any sufficiently large p, privacy can be obtained for free, i.e., with vanishing excess population risk, not only when the privacy parameter ε has constant order, but also in the strongly private setting ε = o(1). This challenges the common wisdom that over-parameterization inherently hinders performance in private learning.
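For readers unfamiliar with DP-GD, the following is a minimal numpy sketch of one full-batch step under the standard clip-and-noise recipe; the function name and the calibration of noise_mult to a privacy budget ε are illustrative assumptions, not taken from the talk.

    import numpy as np

    def dp_gd_step(theta, per_example_grads, clip_norm, noise_mult, lr, rng):
        # Illustrative sketch: clip each per-example gradient to l2 norm <= clip_norm.
        clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
                   for g in per_example_grads]
        # Privatize the summed gradient with Gaussian noise scaled to the clip norm.
        noise = rng.normal(0.0, noise_mult * clip_norm, size=theta.shape)
        noisy_mean = (np.sum(clipped, axis=0) + noise) / len(clipped)
        return theta - lr * noisy_mean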
September 5: Manuel Fernandez (USC)
Distance theorems and the smallest singular value of random matrices
In recent years, significant progress has been made in our understanding of the quantitative behavior of random matrices. One research direction of continued interest has been the estimation of the smallest singular value. As a measure of a matrix’s “invertibility”, quantitative bounds on the smallest singular value are important for a variety of tasks including establishing a circular law for a non-Hermitian random matrix and for proving stability of numerical methods. In view of the universality phenomena of random matrices, one tries to prove these estimates for more general matrix ensembles satisfying weaker assumptions.
In the geometric approach to proving smallest singular value estimates, a key ingredient is the use of a ‘distance theorem’, which is a small ball estimate for the distance between a random vector and a subspace. In this talk we will discuss a new distance theorem and its application to proving smallest singular value estimates for inhomogeneous random rectangular matrices with independent entries. We will also discuss how the recent resolution of the Slicing Conjecture, due to Klartag, Lehec, and Guan, implies smallest singular value estimates for a number of log-concave random matrix ensembles. In some cases, independent entries are no longer necessary!
The results above include joint work with Max Dabagia and with Galyna Livshyts and Stephanie Mui.
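For orientation (standard definitions, not the talk’s new results): the smallest singular value of a matrix $A$ with $n$ columns is
$$ s_{\min}(A) = \min_{\|x\|_2 = 1} \|Ax\|_2, $$
and for a square matrix one has $s_{\min}(A) \ge n^{-1/2} \min_i \mathrm{dist}(X_i, H_i)$, where $X_i$ is the $i$-th column and $H_i$ is the span of the remaining columns. A distance theorem is then a small ball estimate of the form $\mathbb{P}(\mathrm{dist}(X_i, H_i) \le t) \lesssim t + (\text{error term})$, which feeds directly into lower bounds on $s_{\min}$.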
September 12: Kabir Verchand (USC)
The dynamics of iterative algorithms with random data: Beyond first-order methods
In this talk, I will present a toolbox to analyze a broad class of iterative algorithms for high-dimensional (non)-convex optimization with random data. This class is rich enough to include general first-order methods with non-separable, Lipschitz updates (such as gradient descent and approximate message passing) as well as commonly used higher-order updates such as the proximal point method, prox-linear methods, and variants of alternating minimization and expectation maximization. For this class of algorithms, I will provide an exact, deterministic description of their dynamics as well as finite-sample guarantees bounding the deviation of the empirical iterates from their deterministic counterparts. Our techniques are based on a sequential variant of Gordon’s Gaussian comparison inequalities applied in conjunction with Bolthausen’s Gaussian conditioning technique. Joint work with Michael Celentano, Chen Cheng, and Ashwin Pananjady.
September 19: Mihai Cucuringu (UCLA)
Spectral methods for clustering signed/directed networks and heterogeneous group synchronization
Graph clustering problems typically arise in settings where there exists a discrepancy in the edge density within different parts of the graph. In this work, we consider problem instances where the underlying cluster structure arises as a consequence of a signal present on the edges or nodes, and is not driven by edge density. We first consider the problem of clustering in two families of networks, signed (with edge weights taking positive or negative values) and directed, both solvable by exploiting the spectrum of certain graph Laplacian matrices. We consider a generalized eigenvalue problem involving graph Laplacians, and provide performance guarantees under the setting of a Signed Stochastic Block Model, along with regularized versions to handle very sparse graphs (below the connectivity threshold), a regime where standard spectral methods are known to underperform. We also propose a spectral clustering algorithm for directed graphs based on a complex-valued representation of the adjacency matrix, which is able to capture the underlying cluster structures, for which the information encoded in the direction of the edges is crucial. We evaluate the proposed algorithm in terms of a cut flow imbalance-based objective function, which, for a given pair of clusters, captures the propensity of the edges to flow in a given direction. We analyze its theoretical performance on a Directed Stochastic Block Model. Finally, we discuss an extension of the classical angular synchronization problem that aims to recover unknown angles from a collection of noisy pairwise difference measurements. We consider a generalization to the heterogeneous setting where there exist k unknown groups of angles, and the measurement graph has an unknown edge-disjoint decomposition, where the subgraphs of noisy edge measurements correspond to each group. We propose a probabilistic generative model for this problem, along with a spectral algorithm that comes with theoretical guarantees in terms of robustness against sampling sparsity and noise. Time permitting, we discuss extensions to the temporal setting, and approaches based on graph neural networks.
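A minimal sketch of a complex-valued spectral method for directed graphs, assuming the Hermitian representation $H = i(A - A^\top)$ (one standard choice; the talk’s exact construction and regularization are not reproduced here):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_directed(A, k):
        # Sketch: skew-symmetrize the adjacency matrix and multiply by i,
        # so H is Hermitian and edge direction is encoded in the phase.
        H = 1j * (A - A.T)
        vals, vecs = np.linalg.eigh(H)              # real eigenvalues
        top = np.argsort(np.abs(vals))[::-1][:k]    # largest |eigenvalue| pairs
        V = vecs[:, top]
        features = np.hstack([V.real, V.imag])      # embed C^k into R^{2k}
        return KMeans(n_clusters=k, n_init=10).fit_predict(features)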
September 26: Alexander Clay (USC)
[Undergrad Focused Talk]
Modern Approaches to Card Shuffling
We will explain some exciting connections between card shuffling and probability. Card shuffling is an old subject that combines probability with modern algebra. We will explore different shuffles through group theoretic and linear-algebraic points of view. Finally, we will investigate a model for casino card shuffling machines, and present some new research in this area.
October 3: Arthur Schichl (ETH Zürich)
Non-linear Degenerate Parabolic Flow Equations and a Finer Differential Structure on Wasserstein Spaces
We define new differential structures on the Wasserstein spaces $W_p(M)$ for $p > 2$ and a Riemannian manifold $(M,g)$. We consider a very general and possibly degenerate second-order partial differential flow equation with measure-dependent coefficients to expand the notion of smooth curves and to ensure that the new differential structure is finer than the classical one. The theory allows for higher-order calculus on Wasserstein spaces and admits numerical approximations in $W_p(M)$. We prove a generalized version of the Central Limit Theorem without requiring independence. We shall also present some of its economic applications.
October 10: [no classes]
October 17: Yeganeh Alimohammadi (USC)
How to Measure Differences in Rankings: A Data-Driven Extension of the Mallows Model
Ranking problems appear across domains, from consumer preferences and product recommendations to sports performance and hiring decisions. Probabilistic ranking models such as the Mallows model provide a principled way to capture uncertainty and infer a consensus order, assuming that observed rankings are noisy versions of an underlying truth. A key modeling challenge, however, lies in choosing the distance function that measures how “far apart” two rankings are, since different distances imply different ways of penalizing ranking swaps.
In this talk, I introduce a generalized framework based on the Mallows model that learns this distance metric directly from data. I will show that the model forms an exponential-family distribution on permutations, and that its parameters (the central ranking, dispersion, and learned distance) can be estimated consistently via maximum likelihood with asymptotic normality guarantees. On the algorithmic side, I present a polynomial-time approximation scheme (PTAS) for efficient sampling and partition-function estimation.
Finally, I will discuss empirical validation on real datasets, demonstrating how learning the distance metric leads to more accurate predictions and interpretable insights about ranking behavior.
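As background, the classical Mallows model is the exponential-family distribution on permutations
$$ P_{\sigma_0, \beta}(\sigma) = \frac{1}{Z(\beta)}\, \exp\big(-\beta\, d(\sigma, \sigma_0)\big), $$
with central ranking $\sigma_0$, dispersion $\beta > 0$, and a fixed distance $d$ (commonly Kendall’s tau). The framework in this talk instead treats $d$ itself as a parameter to be learned from data.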
October 24: Alexander Tartakovsky (AGTStatConsult, Los Angeles) [Joint with CAMS Colloquium]
Optimal Sequential Tests of Multiple Composite Hypotheses for General Stochastic Models with Non-i.i.d. Observations
In many applications—such as monitoring sensor networks, detecting epidemics, or ensuring cybersecurity—it is necessary to test multiple composite hypotheses under strict error constraints. Sequential methods aim to minimize detection delay or sample size while controlling false identification risks. This talk presents a general framework for sequential multi-hypothesis testing in models with dependent and non-identically distributed observations. The resulting procedures achieve near-optimal performance in regimes of small misidentification probabilities. Applications include rapid epidemic detection (e.g., COVID-19), network intrusion monitoring, and tracking faint space objects. In particular, our methods identified regions of COVID outbreaks several days before official quarantine measures were imposed.
October 31: Yizhe Zhu (USC)
Extreme singular values of sparse random rectangular matrices
The bi-adjacency matrix of an Erdős–Rényi random bipartite graph with bounded aspect ratio is a rectangular random matrix with Bernoulli entries. Depending on the sparsity parameter $p$, its spectral behavior may either resemble that of a classical Wishart matrix or depart from this universal regime. In this talk, we study the extreme singular values at the critical density $np=c\log n$. We present the first quantitative characterization of the emergence of outlier singular values outside the Marčenko–Pastur law and determine their precise locations as functions of the largest and smallest degree vertices in the underlying random graph, which can be seen as an analogue of the Bai–Yin theorem in the sparse setting. These results uncover a clear mechanism by which combinatorial structures in sparse graphs generate spectral outliers.
Based on joint work with Ioana Dumitriu, Haixiao Wang and Zhichao Wang.
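For comparison, the classical Bai–Yin theorem in the dense i.i.d. setting: if $A$ is $n \times m$ with i.i.d. centered, unit-variance entries with finite fourth moment and $m/n \to \gamma \in (0,1]$, then almost surely
$$ \frac{s_{\max}(A)}{\sqrt{n}} \to 1 + \sqrt{\gamma}, \qquad \frac{s_{\min}(A)}{\sqrt{n}} \to 1 - \sqrt{\gamma}, $$
the edges of the Marčenko–Pastur support. The talk concerns the sparse analogue at $np = c\log n$, where degree fluctuations push singular values outside these edges.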
November 7: Yun Yang (University of Maryland, College Park)
Simulation-based Inference via Structured Score Matching
Simulation-based inference (SBI) provides an effective framework for statistical analysis when the likelihood is intractable but model simulations are available. This talk presents a unified SBI framework that leverages structured score matching for both frequentist and Bayesian inference. On the frequentist side, we develop a likelihood-free approach that combines score matching with gradient-based optimization and bootstrap procedures for parameter estimation and uncertainty quantification. On the Bayesian side, we integrate score matching with Langevin dynamics to efficiently explore complex posterior landscapes in moderate- to high-dimensional settings. In both cases, we design tailored score-matching estimators and architectural regularizations that embed the statistical structure of log-likelihood scores, which in turn improves estimation accuracy and scalability.
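As a reference point, classical (Hyvärinen) score matching fits a model $s_\theta \approx \nabla_x \log p$ by minimizing
$$ J(\theta) = \mathbb{E}_{x \sim p}\Big[\tfrac{1}{2}\|s_\theta(x)\|_2^2 + \nabla \cdot s_\theta(x)\Big], $$
which equals $\tfrac{1}{2}\,\mathbb{E}\|s_\theta(x) - \nabla_x \log p(x)\|_2^2$ up to an additive constant, so no normalizing constant is required; how this objective is structured and adapted to log-likelihood scores in the parameters is the subject of the talk.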
November 14: Bixing Qiao (USC)
A New Approach for the Continuous Time Kyle-Back Strategic Insider Equilibrium Problem
In this talk, we consider a continuous-time Kyle-Back model which is a game between an insider and a market maker. The existing literature typically focuses on constructing equilibria with a PDE approach, which requires certain Markovian structures. We characterize all equilibria through a coupled system of forward-backward SDEs. In particular, when the time duration is small, we show that the FBSDE is well-posed, and therefore the game has a unique equilibrium. Moreover, this unique equilibrium may be non-Markovian and thus not attainable via the PDE approach. We next study the set value of the game, which roughly speaking is the set of insider’s values over all equilibria and thus is by nature unique. Finally, we characterize the set value through a level set of a certain standard HJB equation.
In the second part of the talk, we apply the new approach to the Kyle-Back model with dynamic legal risk. In the presence of an active regulator and a large number of noise traders, the insider chooses a strategy that conceals his identity within a large volume of surrounding trades and concentrates on medium-sized trades. We establish an intensity-based mathematical framework for explaining the interconnections between insider trading and stealth trading. It turns out that when the number of noise traders becomes large, the price impact of the insider asymptotically vanishes, and consequently the stochastic game reduces to a deterministic optimization, which also carries implications for regulatory investigations and sanctions.
The results above include joint work with Weixuan Xia and with Jianfeng Zhang.
November 21: Xiaocong Xu (USC)
A leave-one-out approach to approximate message passing
Approximate message passing (AMP) has emerged both as a popular class of iterative algorithms and as a powerful analytic tool in a wide range of statistical estimation problems and statistical physics models. A well-established line of AMP theory proves Gaussian approximations for the empirical distributions of the AMP iterate in the high dimensional limit, under the GOE random matrix model and other rotationally invariant ensembles. This work provides a non-asymptotic, leave-one-out representation for the AMP iterate that holds under a broad class of Gaussian random matrix models with general variance profiles. In contrast to the typical AMP theory that describes the first-order behavior for the empirical distributions of the AMP iterate via a low dimensional state evolution, our leave-one-out representation yields an intrinsically high dimensional state evolution formula, which provides a second-order, non-asymptotic characterization for the possibly heterogeneous, entrywise behavior of the AMP iterate under the prescribed random matrix models. To exemplify some distinct features of our AMP theory in applications, we analyze, in the context of regularized linear estimation, the precise stochastic behavior of the Ridge estimator for independent and non-identically distributed observations whose covariates exhibit general variance profiles. We find that its finite-sample distribution is characterized via a weighted Ridge estimator in a heterogeneous Gaussian sequence model. Notably, in contrast to the i.i.d. sampling scenario, the effective noise and regularization are now full dimensional vectors determined via a high dimensional system of equations. Our leave-one-out method of proof differs significantly from the widely adopted conditioning approach for rotationally invariant ensembles, and relies instead on an inductive method that uses almost solely integration by parts and concentration techniques. This talk is based on joint work with Zhigang Bao and Qiyang Han.
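For readers new to AMP, here is a sketch of the canonical symmetric iteration $x^{t+1} = A f(x^t) - b_t f(x^{t-1})$ with Onsager coefficient $b_t = \frac{1}{n}\sum_i f'(x_i^t)$; this is textbook background, not the leave-one-out representation developed in the talk.

    import numpy as np

    def amp(A, f, fprime, x0, T):
        # Generic symmetric AMP sketch; f and fprime are the denoiser
        # and its derivative, applied entrywise.
        x, f_prev = x0.copy(), np.zeros_like(x0)
        for _ in range(T):
            fx = f(x)
            onsager = fprime(x).mean() * f_prev   # b_t * f(x^{t-1})
            x, f_prev = A @ fx - onsager, fx
        return x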
November 28: [no classes]
December 5: Zebin Wang (Harvard)
DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications
Classical probabilistic graphical models face fundamental challenges in modern data environments, which are characterized by high dimensionality, source heterogeneity, and stringent data-sharing constraints. In this work, we revisit the Ising model, a well-established member of the Markov Random Field (MRF) family, and develop a distributed framework that enables scalable and privacy-preserving representation learning from large-scale binary data with inherent low-rank structure. Our approach optimizes a non-convex surrogate loss function via bi-factored gradient descent, offering substantial computational and communication advantages over conventional convex approaches. We evaluate our algorithm on multi-institutional electronic health record (EHR) datasets from 58,248 patients across the University of Pittsburgh Medical Center (UPMC) and Mass General Brigham (MGB), demonstrating superior performance in global representation learning and downstream clinical tasks, including relationship detection, patient phenotyping, and patient clustering. These results highlight a broader potential for statistical inference in federated, high-dimensional settings while addressing the practical challenges of data complexity and multi-institutional integration.
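For concreteness, the Ising model in question is the distribution on $x \in \{-1,+1\}^p$ with
$$ p_\Theta(x) \propto \exp\Big(\sum_{j<k} \theta_{jk}\, x_j x_k\Big), $$
and the inherent low-rank structure can be pictured as a factorization $\Theta \approx U U^\top$ with $U \in \mathbb{R}^{p \times r}$, $r \ll p$, which is what makes bi-factored gradient descent natural; the paper’s exact parameterization and surrogate loss may differ from this sketch.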
Spring 2025:
January 17: Bradley Rava (University of Sydney’s Business School)
Ask for More Than Bayes Optimal: A Theory of Indecisions for Classification
January 31: Paata Ivanisvili (UCI)
February 7: Nikita Gladkov (UCLA)
February 14 : Po-Ling Loh (University of Cambridge)
Differentially private M-estimation via noisy optimization
February 21: Lijun Ding (UCSD)
Flat minima generalize for low-rank matrix recovery
Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima — those around which the loss grows slowly — appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well.
February 28: Nils Detering (Heinrich-Heine University)
joint with Math Finance Colloquium
A class of point-wise operating SPDE coefficients for HJM models
We present a dynamic model for forward curves within the Heath-Jarrow-Morton framework under the Musiela parametrization. The forward curves take values in a function space $\h$, and their dynamics follows a stochastic partial differential equation with state-dependent coefficients. In particular, the coefficients are defined through point-wise operating maps on $\h$, resulting in a locally state-dependent structure. We first explore conditions under which these point-wise operators are well defined on $\h$. Next, we determine conditions to ensure that the resulting coefficient functions satisfy local growth and Lipschitz properties, so to guarantee the existence and uniqueness of mild solutions. The proposed model captures the behavior of the entire forward curve through a single equation, yet retains remarkable simplicity. Notably, we demonstrate that certain one-dimensional projections of the model are Markovian and satisfy a one-dimensional stochastic differential equation. This connects our Hilbert-space approach to well established models for forward contracts with fixed delivery times, for which existing formulas and numerical techniques can be applied. This link allows us to examine also conditions for maintaining positivity of the solutions. As concrete examples, we analyze Hilbert-space valued variants of an exponential model and of a constant elasticity of variance model.
March 7: Pedro Abdalla Teixeira (UCI)
Covariance Estimation through Empirical Process Theory
In this talk, we revisit the classical problem of covariance estimation from the perspective of empirical process theory. In particular, we explore the relationship between various covariance estimation problems and their associated empirical processes. This talk is based on a series of works by the author in collaboration with Nikita Zhivotovskiy and Shahar Mendelson.
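The basic link is easy to state: for mean-zero vectors $X_1, \dots, X_n$ with covariance $\Sigma$, the operator-norm error of the sample covariance $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$ is itself the supremum of an empirical process,
$$ \|\hat{\Sigma} - \Sigma\|_{\mathrm{op}} = \sup_{\|v\|_2 = 1}\, \Big|\frac{1}{n}\sum_{i=1}^n \langle X_i, v\rangle^2 - \mathbb{E}\langle X, v\rangle^2\Big|, $$
so bounds on suprema of such processes translate directly into covariance estimation guarantees.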
March 28: Nikita Zhivotovskiy (UC Berkeley)
Sharper Risk Bounds for Statistical Aggregation
In this talk, we revisit classical results in the theory of statistical aggregation, focusing on the transition from global complexity to a more manageable local one. The goal of aggregation is to combine several base predictors to achieve a prediction nearly as accurate as the best one, without assumptions on the class structure or target. Though aggregation has been studied in both sequential and statistical settings, both lines of work traditionally use the same “global” complexity measure. We highlight the lesser-known PAC-Bayes localization, which enables us to prove a localized bound for the exponential weights estimator by Leung and Barron, and a deviation-optimal localized bound for Q-aggregation. Finally, we demonstrate that our improvements allow us to obtain bounds based on the number of near-optimal functions in the class, and achieve polynomial improvements in sample size in certain nonparametric situations. This is contrary to the common belief that localization doesn’t benefit nonparametric classes. Joint work with Jaouad Mourtada and Tomas Vaškevičius.
April 4: Bohan Zhou (UCSB)
The Generalized Wasserstein Barycenter Problem
The Wasserstein barycenter problem, introduced by Agueh and Carlier, has gained widespread interest in machine learning, statistics, computer graphics, and engineering. Given the prominence of the Wasserstein distance, computing its barycenter is often a foundational step in understanding the properties of probability spaces equipped with this metric. In this talk, we explore the generalized Wasserstein barycenter, allowing for negative weights. This extension naturally arises when transitioning from interpolation to extrapolation, and it has immediate implications for high-order schemes in gradient flow (by Han, Esedoglu, and Garikipati), adversarial multiclass classification (by Garcia Trillos, Jacobs, and Kim), and Wasserstein Regression (by Chen, Lin, and Müller). We will discuss the theoretical results (existence, optimality conditions, and uniqueness) of the problem, analyze special cases where all weights are positive except one or all are negative except one, and introduce new numerical algorithms applicable to both generalized and classical settings, concluding with experimental results.
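For reference, the Agueh–Carlier barycenter of $\mu_1, \dots, \mu_k$ with weights $\lambda_1, \dots, \lambda_k$ solves
$$ \min_{\mu}\; \sum_{i=1}^{k} \lambda_i\, W_2^2(\mu, \mu_i), \qquad \sum_{i=1}^{k} \lambda_i = 1, \quad \lambda_i > 0; $$
the generalized problem of this talk keeps the normalization but allows some $\lambda_i < 0$, which is what turns interpolation into extrapolation.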
April 11: Abiy Tasissa (Tufts)
From missing distances to structures: Theory, algorithms and applications
The advancement of technology has significantly enhanced our capacity to collect data. However, in many real-world applications, certain inherent limitations—such as the precision of measurement devices, environmental conditions, or operating costs—can result in missing data. In this talk, we focus on the setting where the available data consist of pairwise distances between a set of points, with the goal of estimating the configuration of the underlying points from incomplete distance measurements. This is known as the Euclidean distance geometry (EDG) problem and is central to many applications.
We first start by describing the solution when all distances are given using the classical multidimensional scaling (MDS) technique and then discuss a constructive approach to interpret the key mathematical objects in MDS. Next, we introduce a mathematical framework to address the EDG problem under two sampling models of the distance matrix: global sampling (uniform sampling of the entries of the distance matrix) and structured local sampling, where the measurements are limited to a subset of rows and columns. We discuss the conditions required for the exact recovery of the point configuration and the associated algorithms. The last part of the talk will illustrate the algorithms using synthetic and real data and discuss ongoing work.
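As a warm-up for the complete-distance case mentioned above, a minimal sketch of classical MDS, which recovers a configuration (up to rigid motion) from the full matrix of squared pairwise distances:

    import numpy as np

    def classical_mds(D2, d):
        # Double-center the squared distance matrix to obtain the Gram
        # matrix of the centered points, then read off coordinates from
        # the top-d eigenpairs.
        n = D2.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        G = -0.5 * J @ D2 @ J
        vals, vecs = np.linalg.eigh(G)
        vals, vecs = vals[::-1], vecs[:, ::-1]      # sort descending
        return vecs[:, :d] * np.sqrt(np.clip(vals[:d], 0.0, None))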
April 18: Xiaohui Chen (USC)
Recent progress in statistical and computational optimal transport barycenters
The Wasserstein barycenter plays a fundamental role in averaging measure-valued data under the framework of optimal transport (OT). However, there are tremendous challenges in computing and estimating the Wasserstein barycenter for high-dimensional distributions. In this talk, we will discuss some recent progress in advancing the statistical and computational frontiers of optimal transport barycenters. We first introduce a multimarginal Schrödinger barycenter (MSB) based on the entropy regularized multimarginal optimal transport problem that admits general-purpose fast algorithms for computation. By recognizing a proper dual geometry, we derive sharp non-asymptotic rates of convergence for estimating several key MSB quantities (cost functional, Schrödinger coupling and barycenter) from point clouds randomly sampled from the input marginal distributions. We will also consider the computation of the exact (i.e., unregularized) Wasserstein barycenter, which can be recast into a nonconvex-concave minimax optimization. By alternating between the primal Wasserstein and dual potential Sobolev optimization geometries, we introduce a linear-time and linear-space Wasserstein-Descent H-Ascent (WDHA) algorithm and prove its algorithmic convergence to a stationary point.
April 25: Mo Zhou (UCLA)
Score-based neural ordinary differential equations and normalizing flow for computing mean field control problems
Mean Field Control (MFC) provides a mathematical framework for decision-making in large-scale systems and has strong connections to modern artificial intelligence, particularly generative models. In this talk, I will introduce a novel approach that computes MFC problems using score-based neural ordinary differential equations (ODEs) and normalizing flows. Our method formulates a system of ODEs that computes both first- and second-order score functions along trajectories, transforming MFC into an unconstrained optimization problem. To improve accuracy, we introduce a regularization technique inspired by the Hamilton–Jacobi–Bellman (HJB) equations. I will show applications, including probability flow matching and Wasserstein proximal operators, explaining how this approach enhances both theoretical understanding and practical computation in control problems.
Fall 2024:
August 30: Yiyun He (UC Irvine)
Differentially Private Algorithms for Synthetic Data
September 6: Omer Tamuz (Caltech)
Asymptotic Rényi Entropies of Random Walks on Groups
We introduce asymptotic Rényi entropies as a parameterized family of invariants for random walks on groups. These invariants interpolate between various well-studied properties of the random walk, including the growth rate of the group, the Shannon entropy, and the spectral radius. They furthermore offer large deviation counterparts of the Shannon-McMillan-Breiman Theorem. We prove some basic properties of asymptotic Rényi entropies that apply to all groups, and discuss their analyticity and positivity for the free group and lamplighter groups.
Joint with Kimberly Golubeva and Minghao Pan.
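Recall that the Rényi entropy of order $\alpha \neq 1$ of a distribution $\mu$ is $H_\alpha(\mu) = \frac{1}{1-\alpha}\log \sum_x \mu(x)^\alpha$, which recovers the Shannon entropy as $\alpha \to 1$. The asymptotic invariants normalize this along the walk, plausibly as
$$ h_\alpha = \lim_{n \to \infty} \frac{1}{n}\, H_\alpha(\mu^{*n}), $$
with $\mu^{*n}$ the distribution of the walk at time $n$ (this normalization is our reading of the abstract, not a quoted definition); the extreme orders then connect to the spectral radius and to growth, matching the interpolation described above.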
September 20: Weixin Yao (UC Riverside)
October 4: Richard Y. Zhang (UIUC)
Rank Overparameterization and Global Optimality Certification for Large-Scale Low-rank Optimization
Numerous important problems across applied statistics reduce to nonconvex estimation/optimization over a low-rank matrix. In principle, these can be reliably solved to global optimality via convex relaxation, but the computational costs can become prohibitive on a large scale. In practice, it is much more common to optimize over the low-rank matrices directly, as in the Burer-Monteiro approach, but their nonconvexity can cause failure by getting stuck at a spurious local minimum. For safety-critical societal applications, such as the operation and planning of an electricity grid, our inability to reliably achieve global optimality can have significant real-world consequences.
In the first part of this talk, we investigate how overparameterizing the low-rank factorization can render its nonconvexity increasingly benign. In the smooth and strongly-convex setting, we rigorously show that, as the rank is increased, spurious local minima become increasingly rare in a step-wise fashion. In other words, rank-2 has fewer spurious local minima than rank-1, and rank-3 has fewer than rank-2, etc. Once the rank exceeds an O(1) threshold, every remaining local minimum is a global minimum, and every saddle point can be escaped. In the second part of this talk, we use the rank deficiency brought on by rank overparameterization to certify convergence to global optimality after the fact. The certification is an a posteriori guarantee that is valid under much weaker assumptions than typical “no spurious local minima” guarantees. However, rank deficiency significantly slows down the convergence of gradient descent, from a linear rate to a sublinear rate. We propose an inexpensive preconditioner that restores the convergence rate of gradient descent back to linear in the overparameterized case.
Main related papers:
https://arxiv.org/abs/2207.01789
https://arxiv.org/abs/2206.03345 (joint work with Gavin Zhang and Salar Fattahi)
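For context on the two routes contrasted in the abstract: the convex route solves a relaxation such as $\min_{X \succeq 0} f(X)$ over full $n \times n$ matrices, while the Burer-Monteiro route fixes a rank $k$ and solves the nonconvex problem
$$ \min_{U \in \mathbb{R}^{n \times k}} f(U U^\top), $$
which needs only $O(nk)$ storage but may have spurious local minima; the first part of the talk quantifies how those minima disappear as $k$ grows.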
October 11: [no classes]
October 18: Camilo Hernández (USC ISE)
The mean field Schrödinger problem: a mean field control perspective
The mean field Schrödinger problem (MFSP) is the problem of finding the most likely path of a McKean-Vlasov type particle with constrained initial and final configurations. It was first introduced by Backhoff et al. (2020), who studied its existence and long-time behavior. This talk aims to show how ideas from mean field control theory allow us to derive new interesting results on the MFSP. In particular, we study its existence, characterization, and the so-called convergence problem. The method rests upon studying suitably penalized problems and stochastic control techniques. This talk is based on a joint work with Ludovic Tangpi (Princeton).
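For orientation, the classical (non-mean-field) Schrödinger problem is the entropy minimization
$$ \inf\big\{ H(P \,|\, R) \;:\; P \circ X_0^{-1} = \mu_0,\; P \circ X_T^{-1} = \mu_T \big\} $$
over path measures $P$, where $R$ is a reference law (e.g., Wiener measure) and $H$ is relative entropy; in the mean-field version the reference dynamics is of McKean-Vlasov type, so the law of the particle enters its own drift.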
October 25: Yizhe Zhu (USC)
Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity
For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank. In this talk, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We improve the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. Joint work with Dominik Stöger (KU Eichstätt-Ingolstadt).
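A minimal numpy sketch of the non-convex approach in question: factorized gradient descent for PSD matrix sensing from measurements $y_i = \langle A_i, X^\star\rangle$. The initialization and step size are illustrative, and the talk’s guarantees attach to a specific analysis rather than this toy loop.

    import numpy as np

    def factorized_gd(As, y, n, k, lr=0.01, T=500, seed=0):
        # Minimize (1/2m) * sum_i (<A_i, U U^T> - y_i)^2 over U in R^{n x k}.
        rng = np.random.default_rng(seed)
        U = 0.01 * rng.standard_normal((n, k))    # small random init (illustrative)
        m = len(y)
        for _ in range(T):
            X = U @ U.T
            r = np.array([np.sum(A * X) for A in As]) - y   # residuals
            G = sum(ri * A for ri, A in zip(r, As)) / m     # gradient w.r.t. X
            U -= lr * (G + G.T) @ U                         # chain rule through U U^T
        return U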
November 1: Wojciech Ozanski (Florida State University)
[Joint with PDE seminar]
Instantaneous continuous loss of regularity for the SQG equation
November 8: Johannes Wiesel (CMU)
[Joint with Mathematical Finance seminar]
Bounding adapted Wasserstein metrics
The Wasserstein distance $\mathcal{W}_p$ is an important instance of an optimal transport cost. Its numerous mathematical properties as well as applications to various fields such as mathematical finance and statistics have been well studied in recent years. The adapted Wasserstein distance $\mathcal{A}\mathcal{W}_p$ extends this theory to laws of discrete time stochastic processes in their natural filtrations, making it particularly well suited for analyzing time-dependent stochastic optimization problems.
While the topological differences between $\mathcal{A}\mathcal{W}_p$ and $\mathcal{W}_p$ are well understood, their differences as metrics remain largely unexplored beyond the trivial bound $\mathcal{W}_p\lesssim \mathcal{A}\mathcal{W}_p$. This paper closes this gap by providing upper bounds of $\mathcal{A}\mathcal{W}_p$ in terms of $\mathcal{W}_p$ through investigation of the smooth adapted Wasserstein distance. Our upper bounds are explicit and are given by a sum of $\mathcal{W}_p$, Eder’s modulus of continuity and a term characterizing the tail behavior of measures. As a consequence, upper bounds on $\mathcal{W}_p$ automatically hold for $\mathcal{AW}_p$ under mild regularity assumptions on the measures considered. A particular instance of our findings is the inequality $\mathcal{A}\mathcal{W}_1\le C\sqrt{\mathcal{W}_1}$ on the set of measures that have Lipschitz kernels.
Our work also reveals how smoothing of measures affects the adapted weak topology. In fact, we find that the topology induced by the smooth adapted Wasserstein distance exhibits a non-trivial interpolation property, which we characterize explicitly: it lies in between the adapted weak topology and the weak topology, and the inclusion is governed by the decay of the smoothing parameter.
This talk is based on joint work with Jose Blanchet, Martin Larsson and Jonghwa Park.
November 15: Greta Panova (USC)
Algebra meets probability: permutons from pipe dreams via integrable probability
November 22: Gil Kur (ETH Zürich)
Note the non-standard time and location: 1pm in KAP 265
Minimum Norm Interpolation Meets The Local Theory of Banach Spaces
Minimum-norm interpolators have recently gained attention as an analyzable model to shed light on the double descent phenomenon observed for neural networks. Most of the work has focused on analyzing interpolators in Hilbert spaces, where, typically, an effectively low-rank structure of the feature covariance prevents a large bias. This work takes a first step towards establishing a general framework that connects generalization properties of the interpolators to well-known concepts from high-dimensional geometry, specifically, from the local theory of Banach spaces.
November 29: [no classes]
December 6: Elliot Paquette (McGill) [2PM! Nonstandard time!]
Random matrix theory for high dimensional optimization, and an application to scaling laws
We describe a program of analysis of stochastic gradient methods on high-dimensional random objectives. We illustrate some assumptions under which the loss curves are universal, in that they can be completely described in terms of some underlying covariance structure of the problem setup. Furthermore, we give a description of these loss curves that can be analyzed precisely.
As a motivating application, we show how this can be applied to the power-law-random-features model. This is a simple two-hyperparameter family of optimization problems, which displays 5 distinct phases of SGD loss curves; these phases are determined by the relative complexities of the target, data distribution, and whether these are ‘high-dimensional’ or not (which in context can be precisely defined). In each phase, we can also give, for a given compute budget, the optimal random-feature dimensionality.
Joint work with Courtney Paquette (McGill & Google Deepmind), Jeffrey Pennington (Google Deepmind), and Lechao Xiao (Google Deepmind).
Spring 2024:
[CANCELLED] January 12: Roman Vershynin (UC Irvine) [Undergraduate lecture]
January 19: Grigory Franguridi, 2PM, (USC Center for Economic and Social Research)
Estimation of panels with attrition and refreshment samples
January 26: Mahdi Soltanolkotabi (USC)
Foundations for feature learning via gradient descent
One of the key mysteries in modern learning is that a variety of models such as deep neural networks when trained via (stochastic) gradient descent can extract useful features and learn high quality representations directly from data simultaneously with fitting the labels. This feature learning capability is also at the forefront of the recent success of a variety of contemporary paradigms such as transformer architectures, self-supervised and transfer learning. Despite a flurry of exciting activity over the past few years, existing theoretical results are often too crude and/or pessimistic to explain feature/representation learning in practical regimes of operation or serve as a guiding principle for practitioners. Indeed, existing literature often requires unrealistic hyperparameter choices (e.g. very small step sizes, large initialization or wide models). In this talk I will focus on demystifying this feature/representation learning phenomena for a variety of problems spanning single index models, low-rank factorization, matrix reconstruction, and neural networks. Our results are based on an intriguing spectral bias phenomena for gradient descent, that puts the iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well by simultaneously finding good features/representations of the data while fitting to the labels. The proofs combine ideas from high-dimensional probability/statistics, optimization and nonlinear control to develop a precise analysis of model generalization along the trajectory of gradient descent. Time permitting, I will explain the implications of these theoretical results for more contemporary use cases including transfer learning, self-attention, prompt-tuning via transformers and simple self-supervised learning settings.
February 2: Timothy Armstrong (USC)
Robust Estimation and Inference in Panels with Interactive Fixed Effects
We consider estimation and inference for a regression coefficient in panels with interactive fixed effects (i.e., with a factor structure). We show that previously developed estimators and confidence intervals (CIs) might be heavily biased and size-distorted when some of the factors are weak. We propose estimators with improved rates of convergence and bias-aware CIs that are uniformly valid regardless of whether the factors are strong or not. Our approach applies the theory of minimax linear estimation to form a debiased estimate using a nuclear norm bound on the error of an initial estimate of the interactive fixed effects. We use the obtained estimate to construct a bias-aware CI taking into account the remaining bias due to weak factors. In Monte Carlo experiments, we find a substantial improvement over conventional approaches when factors are weak, with little cost to estimation error when factors are strong.
Paper link: Timothy B. Armstrong, Martin Weidner, Andrei Zeleneev. https://arxiv.org/abs/2210.06639
February 9: Yuehao Bai (USC)
On the Efficiency of Finely Stratified Experiments
This paper studies the efficient estimation of a large class of treatment effect parameters that arise in the analysis of experiments. Here, efficiency is understood to be with respect to a broad class of treatment assignment schemes for which the marginal probability that any unit is assigned to treatment equals a pre-specified value, e.g., one half. Importantly, we do not require that treatment status is assigned in an i.i.d. fashion, thereby accommodating complicated treatment assignment schemes that are used in practice, such as stratified block randomization and matched pairs. The class of parameters considered are those that can be expressed as the solution to a restriction on the expectation of a known function of the observed data, including possibly the pre-specified value for the marginal probability of treatment assignment. We show that this class of parameters includes, among other things, average treatment effects, quantile treatment effects, local average treatment effects as well as the counterparts to these quantities in experiments in which the unit is itself a cluster. In this setting, we establish two results. First, we derive a lower bound on the asymptotic variance of estimators of the parameter of interest in the form of a convolution theorem. Second, we show that the naïve method of moments estimator achieves this bound on the asymptotic variance quite generally if treatment is assigned using a “finely stratified” design. By a “finely stratified” design, we mean experiments in which units are divided into groups of a fixed size and a proportion within each group is assigned to treatment uniformly at random so that it respects the restriction on the marginal probability of treatment assignment. In this sense, “finely stratified” experiments lead to efficient estimators of treatment effect parameters “by design” rather than through ex post covariate adjustment.
Paper link: Yuehao Bai, Jizhou Liu, Azeem M. Shaikh, Max Tabord-Meehan
https://arxiv.org/abs/2307.15181
February 23: Tryphon Georgiou (UC Irvine)
Stochastic Control meets Non-equilibrium Thermodynamics: Fundamental limits of power generation in thermodynamic engines
Thermodynamics was born in the 19th century in quest of a way to quantify efficiency of steam engines at the dawn of the industrial age. In the time since, thermodynamics has impacted virtually all other areas in science, from chemistry and biology to the physics of black holes, and yet, progress beyond the classical quasi-static limit towards finite-time thermodynamic transitions has been slow; finite time is of the essence for non-vanishing power generation. In recent years a deeper understanding of non-equilibrium processes has been achieved based on stochastic models with degrees of freedom (state variables) that are subject to Brownian excitation that models heat baths. Within this framework we will explain energy transduction, we will give insights on how anisotropy in thermal or chemical potentials can be tapped for power generation in engineered and physical processes, and we will highlight fundamental bounds on the amount of power that can be drawn during finite-time thermodynamic transitions.
The talk is based on joint works with Rui Fu (UCI), Olga Movilla (UCI), Amir Taghvaei (UCI) and Yongxin Chen (GaTech). Research funding by AFOSR, ARO and NSF is gratefully acknowledged.
Monday February 26, 3:30PM, KAP 427: Yongtao Guan (Chinese University of Hong Kong, Shenzhen)
Group Network Hawkes Process
In this work, we study the event occurrences of individuals interacting in a network. To characterize the dynamic interactions among the individuals, we propose a group network Hawkes process (GNHP) model whose network structure is observed and fixed. In particular, we introduce a latent group structure among individuals to account for the heterogeneous user-specific characteristics. A maximum likelihood approach is proposed to simultaneously cluster individuals in the network and estimate model parameters. A fast EM algorithm is subsequently developed by utilizing the branching representation of the proposed GNHP model. Theoretical properties of the resulting estimators of group memberships and model parameters are investigated under both settings when the number of latent groups G is over-specified or correctly specified. A data-driven criterion that can consistently identify the true G under mild conditions is derived. Extensive simulation studies and an application to a data set collected from Sina Weibo are used to illustrate the effectiveness of the proposed methodology.
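As background, a Hawkes process is a point process whose conditional intensity is excited by past events; a generic network version (not the paper’s exact specification) has individual $i$’s intensity
$$ \lambda_i(t) = \mu_i + \sum_j a_{ij} \sum_{t_{jk} < t} \phi(t - t_{jk}), $$
where $\mu_i$ is a baseline rate, $a_{ij}$ encodes the observed network influence of $j$ on $i$, $t_{jk}$ are the event times of individual $j$, and $\phi$ is a decay kernel; in the GNHP model these parameters are additionally tied to latent group memberships.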
March 1: Morris Yau (MIT)
Are Neural Networks Optimal Approximation Algorithms?
In this talk, we discuss the power of neural networks to compute solutions to NP-hard optimization problems, focusing on the class of constraint satisfaction problems (Boolean SAT, Sudoku, etc.). We find there is a graph neural network architecture (OptGNN) that captures the optimal approximation algorithm for constraint satisfaction, up to complexity-theoretic assumptions, via tools in semidefinite programming. Furthermore, OptGNN can act as a convex program solver and hence output a dual (a bound) on the optimality of a combinatorial problem. Evaluating OptGNN on benchmarks in the neural combinatorial optimization literature, we find our approach is competitive with state-of-the-art unsupervised neural baselines. We discuss further connections between neural networks and computation, and point to directions for future work.
March 8: Annie Qu (UC Irvine)
A Model-Agnostic Graph Neural Network for Integrating Local and Global Information
Graph neural networks (GNNs) have achieved promising performance in a variety of graph-focused tasks. Despite their success, the two major limitations of existing GNNs are the capability of learning various-order representations and providing interpretability of such deep learning-based black-box models. To tackle these issues, we propose a novel Model-agnostic Graph Neural Network (MaGNet) framework. The proposed framework is able to extract knowledge from high-order neighbors, sequentially integrates information of various orders, and offers explanations for the learned model by identifying influential compact graph structures. In particular, MaGNet consists of two components: an estimation model for the latent representation of complex relationships under graph topology, and an interpretation model that identifies influential nodes, edges, and important node features. Theoretically, we establish the generalization error bound for MaGNet via empirical Rademacher complexity and showcase its power to represent the layer-wise neighborhood mixing. We conduct comprehensive numerical studies using both simulated data and a real-world case study on investigating the neural mechanisms of the rat hippocampus, demonstrating that the performance of MaGNet is competitive with state-of-the-art methods.
March 15: no talk [spring break]
March 22: Xiaowu Dai (UCLA)
Kernel ordinary differential equations
The ordinary differential equation (ODE) is widely used in modelling biological and physical processes in science. A new reproducing kernel-based approach is proposed for the estimation and inference of ODEs given noisy observations. The functional forms in the ODE are not assumed to be known or restricted to be linear or additive, and pairwise interactions are allowed. Sparse estimation is performed to select individual functionals and construct confidence intervals for the estimated signal trajectories. The estimation optimality and selection consistency of kernel ODE are established under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. The proposal tackles several important problems that are not yet fully addressed in the smoothing spline analysis of variance (SS-ANOVA) framework, and extends the existing methods of dynamic causal modeling.
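Schematically (our reading of the setup, with details hedged): the system is $\frac{dx_j}{dt} = F_j(x(t))$ for $j = 1, \dots, p$, with each unknown $F_j$ modeled in a reproducing kernel Hilbert space through a functional-ANOVA expansion
$$ F_j(x) = \sum_k f_{jk}(x_k) + \sum_{k < l} f_{jkl}(x_k, x_l), $$
so that main effects and pairwise interactions can be selected individually by sparse estimation.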
March 29: Kengo Kato (Cornell)
Entropic optimal transport: limit theorems and algorithms
In this talk, I will discuss my recent work on entropic optimal transport (EOT). In the first part, I will discuss limit theorems for EOT maps, dual potentials, and the Sinkhorn divergence. The key technical tool we use is a first and second-order Hadamard differentiability analysis of EOT potentials with respect to the marginals, from which the limit theorems, bootstrap consistency, and asymptotic efficiency of the empirical estimators follow. The second part concerns the entropic Gromov-Wasserstein (EGW) distance, which serves as a computationally efficient proxy for the Gromov-Wasserstein distance. By leveraging a variational representation that ties the EGW problem with a series of EOT problems, we derive stability results of EGW with respect to the auxiliary matrix, which enables us to develop efficient algorithms for solving the EGW problem. This talk is based on joint work with Ziv Goldfeld, Gabriel Rioux, and Ritwik Sadhu.
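For reference, the EOT problem regularizes optimal transport by relative entropy,
$$ \mathrm{OT}_\varepsilon(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int c\, d\pi + \varepsilon\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu), $$
and the Sinkhorn divergence removes the entropic bias via $S_\varepsilon(\mu, \nu) = \mathrm{OT}_\varepsilon(\mu, \nu) - \tfrac{1}{2}\mathrm{OT}_\varepsilon(\mu, \mu) - \tfrac{1}{2}\mathrm{OT}_\varepsilon(\nu, \nu)$; the limit theorems in the talk concern empirical estimators of such quantities and the associated dual potentials.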
April 5: Weixuan Xia (USC)
Set-Valued Stochastic Integrals and Convoluted Lévy Processes
In this talk, I will discuss set-valued Volterra-type stochastic integrals driven by Lévy processes. I will explain the definition of set-valued convoluted stochastic integrals by taking the closed decomposable hull of integral functionals over time, thereby extending classical definitions to convoluted integrals with square-integrable kernels. Two key insights include: (1) Aside from established results for set-valued Itô integrals, while set-valued stochastic integrals with respect to a finite-variation Poisson random measure are guaranteed to be integrally bounded for bounded integrands, this is not true when the random measure represents infinite variation; (2) It is a mutual effect of kernel singularity and jumps that the set-valued convoluted integrals are possibly explosive and can take extended vector values. These findings carry significant implications for the construction of set-valued fractional dynamical systems. Additionally, I will explore two classes of set-monotone processes of practical interest in economic and financial modeling.
April 12: Yosi Rinott (The Hebrew University of Jerusalem)
On the behavior of posterior probabilities with additional data: monotonicity and nonmonotonicity, asymptotic rates, log-concavity, and Turán’s inequality
Given a parametric model, a prior, and data, Bayesian statisticians quantify their belief that the true parameter is $\vartheta_0$ by its posterior probability. The starting question of this paper is whether the posterior at $\vartheta_0$ increases when the data are generated under $\vartheta_0$, and how it behaves when the data come from $\vartheta \neq \vartheta_0$. Can it decrease and then increase, so that additional data may mislead Bayesian statisticians?
For data arriving sequentially, we consider monotonicity properties of the posterior probabilities as a function of the sample size with respect to certain stochastic orders, specifically starting with likelihood ratio dominance.
When the data are generated by $\vartheta \neq \vartheta_0$, Doob’s consistency theorem says that the posterior at $\vartheta_0$ converges a.s. to zero, and therefore its expectation converges to zero. We obtain precise asymptotic rates of the latter convergence for observations from an exponential family and show that the expectation of the $\vartheta_0$-posterior under $\vartheta \neq \vartheta_0$ is eventually strictly decreasing. Finally, we show that in a number of interesting cases this expectation is a log-concave function of the sample size, and thus unimodal. In the Bernoulli case we obtain this result by developing an inequality that is related to Turán’s inequality for Legendre polynomials.
The talk is based on a joint work with Sergiu Hart.
April 19: Andrew Holbrook (UCLA, Department of Biostatistics)
Quantum Markov chain Monte Carlo(s)
I discuss how one can use quantum circuits to accelerate multiproposal MCMC and point to promising avenues of future research, including quantum HMC, quantum-accelerated nonreversible MCMC and quantum-accelerated locally-balanced MCMC.
April 26: Sui Tang (UCSB)
Learning interaction kernels in particle and agent-based systems
Interacting particle systems showcase a variety of collective behaviors and are fundamental to many scientific and engineering domains, such as the flocking of birds and the milling of fish. These systems are typically modeled using differential equations to elucidate how individual behaviors drive collective dynamics, an essential inquiry across multiple disciplines. Although recent theoretical and numerical studies have successfully replicated many qualitative collective patterns seen in nature, there is still a notable deficiency in quantitatively matching these models with empirical data.
We explore the data-driven discovery of latent interaction kernels from observed trajectory data in particle and agent-based systems. We discuss recent findings in stochastic systems where interaction kernels are derived from pairwise distances and introduce a nonparametric inference strategy using a regularized maximum likelihood estimator for precise kernel estimation. We show this approach can achieve near-optimal convergence rates, regardless of the state space dimensionality when dealing with multiple trajectory data. Additionally, we conduct error analysis related to discrete-time observations and validate our methodology through numerical experiments on models such as stochastic opinion dynamics and the Lennard-Jones potential. Moreover, we also consider microscopic models and advance our approach to estimate nonlocal interaction potentials in aggregation-diffusion equations from noisy data, using sparsity-promoting techniques. This research is conducted in collaboration with Fei Lu, Mauro Maggioni, Jose A. Carrillo, Gissell Estrada-Rodriguez, and Laszlo Mikolas.
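A standard instance of the models in question (stated generically; the talk covers stochastic and nonlocal variants): $N$ agents $x_1, \dots, x_N$ evolve by
$$ \dot{x}_i(t) = \frac{1}{N} \sum_{j=1}^{N} \phi\big(\|x_j(t) - x_i(t)\|\big)\,\big(x_j(t) - x_i(t)\big), $$
optionally with additive noise $\sigma\, dW_i(t)$, and the inference problem is to recover the unknown interaction kernel $\phi$ from observed trajectories.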