Tag: gradient

  • Fisher-Geometric Diffusion in Stochastic Gradient Descent: Optimal Rates, Oracle Complexity, and Information-Theoretic Limits

    arXiv:2603.02417v1 Announce Type: new Abstract: We develop a Fisher-geometric theory of stochastic gradient descent (SGD) in which mini-batch noise is an intrinsic, loss-induced matrix, not an exogenous scalar variance. Under exchangeable sampling, the mini-batch gradient covariance is pinned down (to leading…

  • Stochastic Gradient Variational Inference with Price’s Gradient Estimator from Bures-Wasserstein to Parameter Space

    arXiv:2602.18718v1 Announce Type: new Abstract: For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space)…

  • AI in Multiple GPUs: Gradient Accumulation & Data Parallelism

    Learn and implement gradient accumulation and data parallelism from scratch in PyTorch. First published on Towards Data Science by Lorenzo Cesconetto.

  • Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation

    arXiv:2602.15925v1 Announce Type: new Abstract: Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic…

  • Finite-Particle Rates for Regularized Stein Variational Gradient Descent

    arXiv:2602.05172v1 Announce Type: new Abstract: We derive finite-particle rates for the regularized Stein variational gradient descent (R-SVGD) algorithm introduced by He et al. (2024) that corrects the constant-order bias of the SVGD by applying a resolvent-type preconditioner to the kernelized Wasserstein gradient. For the resulting interacting $N$-particle…

  • Radon–Wasserstein Gradient Flows for Interacting-Particle Sampling in High Dimensions

    arXiv:2602.05227v1 Announce Type: new Abstract: Gradient flows of the Kullback–Leibler (KL) divergence, such as the Fokker–Planck equation and Stein Variational Gradient Descent, evolve a distribution toward a target density known only up to a normalizing constant. We introduce new gradient flows of the KL divergence with…

  • A Hitchhiker’s Guide to Poisson Gradient Estimation

    arXiv:2602.03896v1 Announce Type: new Abstract: Poisson-distributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this: Exponential Arrival Time (EAT) simulation and Gumbel-SoftMax (GSM) relaxation. We provide the first systematic comparison of these methods, along with practical…

  • The Machine Learning “Advent Calendar” Bonus 2: Gradient Descent Variants in Excel

    Gradient Descent, Momentum, RMSProp, and Adam all aim for the same minimum. They do not change the destination, only the path. Each method adds a mechanism that fixes a limitation of the previous one, making the movement faster, more stable, or more adaptive.…

  • Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

    arXiv:2512.22473v1 Announce Type: new Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed “Bayesian wind tunnels” and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training…

  • The Machine Learning “Advent Calendar” Day 21: Gradient Boosted Decision Tree Regressor in Excel

    Gradient descent in function space with decision trees. First published on Towards Data Science by Angela Shi.

  • The Machine Learning “Advent Calendar” Day 20: Gradient Boosted Linear Regression in Excel

    From Random Ensembles to Optimization: Gradient Boosting Explained. First published on Towards Data Science by Angela Shi.

  • An operator splitting analysis of Wasserstein–Fisher–Rao gradient flows

    arXiv:2511.18060v1 Announce Type: new Abstract: Wasserstein-Fisher-Rao (WFR) gradient flows have been recently proposed as a powerful sampling tool that combines the advantages of pure Wasserstein (W) and pure Fisher-Rao (FR) gradient flows. Existing algorithmic developments implicitly make use of operator splitting techniques to numerically approximate the WFR…

  • Gradient flow for deep equilibrium single-index models

    arXiv:2511.16976v1 Announce Type: cross Abstract: Deep equilibrium models (DEQs) have recently emerged as a powerful paradigm for training infinitely deep weight-tied neural networks that achieve state of the art performance across many modern machine learning tasks. Despite their practical success, theoretically understanding the gradient descent dynamics for training…

  • Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

    arXiv:2511.02258v1 Announce Type: new Abstract: This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD) for single-layer networks. Building on the seminal work of Saad and Solla, which analyzed the deterministic (ballistic) scaling limits of SGD corresponding to the gradient flow of…

  • A Visual Guide to Tuning Gradient Boosted Trees

    My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the triplet, I’ll visually explore gradient boosted trees! There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this I’m going…

  • Stochastic Gradients under Nuisances

    arXiv:2508.20326v1 Announce Type: new Abstract: Stochastic gradient optimization is the dominant learning paradigm for a variety of scenarios, from classical supervised learning to modern self-supervised learning. We consider stochastic gradient algorithms for learning problems whose objectives rely on unknown nuisance parameters, and establish non-asymptotic convergence guarantees. Our results show that, while…

  • Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression

    arXiv:2506.02336v1 Announce Type: new Abstract: We study gradient descent (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in $\widetilde{\mathcal{O}}(\kappa)$ steps with $\kappa$ being the condition…

  • Prototyping Gradient Descent in Machine Learning

    Mathematical theorem and credit transaction prediction using stochastic / batch GD. First published on Towards Data Science by Kuriko Iwai.

  • Gradient-Free Sequential Bayesian Experimental Design via Interacting Particle Systems

    arXiv:2504.13320v1 Announce Type: new Abstract: We introduce a gradient-free framework for Bayesian Optimal Experimental Design (BOED) in sequential settings, aimed at complex systems where gradient information is unavailable. Our method combines Ensemble Kalman Inversion (EKI) for design optimization with the Affine-Invariant Langevin Dynamics (ALDI) sampler for…

  • Gradient-based Sample Selection for Faster Bayesian Optimization

    arXiv:2504.07742v1 Announce Type: new Abstract: Bayesian optimization (BO) is an effective technique for black-box optimization. However, its applicability is typically limited to moderate-budget problems due to the cubic complexity in computing the Gaussian process (GP) surrogate model. In large-budget scenarios, directly employing the standard GP model faces significant…

  • Smoothed Distance Kernels for MMDs and Applications in Wasserstein Gradient Flows

    arXiv:2504.07820v1 Announce Type: new Abstract: Negative distance kernels $K(x,y) := -\|x-y\|$ were used in the definition of maximum mean discrepancies (MMDs) in statistics and lead to favorable numerical results in various applications. In particular, so-called slicing techniques for handling high-dimensional kernel summations profit…

  • Accelerated Stein Variational Gradient Flow

    arXiv:2503.23462v1 Announce Type: new Abstract: Stein variational gradient descent (SVGD) is a kernel-based particle method for sampling from a target distribution, e.g., in generative modeling and Bayesian inference. SVGD does not require estimating the gradient of the log-density, which is called score estimation. In practice, SVGD can be slow compared…

  • A stochastic gradient descent algorithm with random search directions

    arXiv:2503.19942v1 Announce Type: new Abstract: Stochastic coordinate descent algorithms are efficient methods in which each iterate is obtained by fixing most coordinates at their values from the current iteration, and approximately minimizing the objective with respect to the remaining coordinates. However, this approach is usually restricted…

  • Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

    arXiv:2503.04047v1 Announce Type: new Abstract: Recently, gradient-based discrete sampling has emerged as a highly efficient, general-purpose solver for various combinatorial optimization (CO) problems, achieving performance comparable to or surpassing the popular data-driven approaches. However, we identify a critical issue in these methods, which we term “wandering in contours”.…

  • Gradient-free stochastic optimization for additive models

    arXiv:2503.02131v1 Announce Type: new Abstract: We address the problem of zero-order optimization from noisy observations for an objective function satisfying the Polyak-Łojasiewicz or the strong convexity condition. Additionally, we assume that the objective function has an additive structure and satisfies a higher-order smoothness property, characterized by the Hölder family…

  • Weighted quantization using MMD: From mean field to mean shift via gradient flows

    arXiv:2502.10600v1 Announce Type: new Abstract: Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a finite weighted mixture of Dirac measures that best approximates…

  • Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

    arXiv:2501.09137v1 Announce Type: cross Abstract: We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training…

  • LightGBM: The Fastest Option of Gradient Boosting

    Learn how to implement a fast and effective gradient boosting model using Python. Published on Towards Data Science by Gustavo R Santos.

  • Gradient-Based Non-Linear Inverse Learning

    arXiv:2412.16794v1 Announce Type: new Abstract: We study statistical inverse learning in the context of nonlinear inverse problems under random design. Specifically, we address a class of nonlinear problems by employing gradient descent (GD) and stochastic gradient descent (SGD) with mini-batching, both using constant step sizes. Our analysis derives convergence rates for…

  • From Point to probabilistic gradient boosting for claim frequency and severity prediction

    arXiv:2412.14916v1 Announce Type: new Abstract: Gradient boosting for decision tree algorithms are increasingly used in actuarial applications as they show superior predictive performance over traditional generalized linear models. Many improvements and sophistications to the first gradient boosting machine algorithm exist. We present in…

  • Projected gradient methods for nonconvex and stochastic optimization: new complexities and auto-conditioned stepsizes

    arXiv:2412.14291v1 Announce Type: cross Abstract: We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the “vanilla” PG method, achieving the…

  • Explicit and data-Efficient Encoding via Gradient Flow

    arXiv:2412.00864v1 Announce Type: new Abstract: The autoencoder model typically uses an encoder to map data to a lower dimensional latent space and a decoder to reconstruct it. However, relying on an encoder for inversion can lead to suboptimal representations, particularly limiting in physical sciences where precision is key.…