Category: aimlds

  • Non-Stationary Functional Bilevel Optimization

    arXiv:2601.15363v1 Announce Type: new Abstract: Functional bilevel optimization (FBO) provides a powerful framework for hierarchical learning in function spaces, yet current methods are limited to static offline settings and perform suboptimally in online, non-stationary scenarios. We propose SmoothFBO, the first algorithm for non-stationary FBO with both theoretical guarantees and practical scalability.…

  • Low-Dimensional Adaptation of Rectified Flow: A New Perspective through the Lens of Diffusion and Stochastic Localization

    arXiv:2601.15500v1 Announce Type: new Abstract: In recent years, Rectified flow (RF) has gained considerable popularity largely due to its generation efficiency and state-of-the-art performance. In this paper, we investigate the degree to which RF automatically adapts to the intrinsic…

  • On damage of interpolation to adversarial robustness in regression

    arXiv:2601.16070v1 Announce Type: new Abstract: Deep neural networks (DNNs) typically involve a large number of parameters and are trained to achieve zero or near-zero training error. Despite such interpolation, they often exhibit strong generalization performance on unseen data, a phenomenon that has motivated extensive theoretical investigations.…

  • Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

    arXiv:2601.16120v1 Announce Type: new Abstract: Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and…

  • Evaluating Multi-Step LLM-Generated Content: Why Customer Journeys Require Structural Metrics

    How to evaluate goal-oriented content designed to build engagement and deliver business results, and why structure matters. The post appeared first on Towards Data Science. Diana Schneider

  • Why SaaS Product Management Is the Best Domain for Data-Driven Professionals in 2026

    How I use analytics, automation, and AI to build better SaaS. The post appeared first on Towards Data Science. Yassin Zehar

  • Stop Writing Messy Boolean Masks: 10 Elegant Ways to Filter Pandas DataFrames

    Master the art of readable, high-performance data selection using .query(), .isin(), and advanced vectorized logic. The post appeared first on Towards Data Science. Ibrahim Salami

  • What Other Industries Can Learn from Healthcare’s Knowledge Graphs

    How shared meaning, evidence, and standards create durable semantic infrastructure. The post appeared first on Towards Data Science. Steve Hedden

  • Meta Flow Maps enable scalable reward alignment

    arXiv:2601.14430v1 Announce Type: new Abstract: Controlling generative models is computationally expensive. This is because optimal alignment with a reward function–whether via inference-time steering or fine-tuning–requires estimating the value function. This task demands access to the conditional posterior $p_{1|t}(x_1|x_t)$, the distribution of clean data $x_1$ consistent with an intermediate…

  • Large Data Limits of Laplace Learning for Gaussian Measure Data in Infinite Dimensions

    arXiv:2601.14515v1 Announce Type: new Abstract: Laplace learning is a semi-supervised method, a solution for finding missing labels from a partially labeled dataset utilizing the geometry given by the unlabeled data points. The method minimizes a Dirichlet energy defined on a (discrete) graph…
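
    To make the Dirichlet-energy idea concrete, here is a minimal sketch on a small finite graph. It is not the paper's infinite-dimensional setting. With the labeled nodes held fixed, the energy minimizer is harmonic: each unlabeled value equals the weighted average of its neighbors. The function name `laplace_learning`, the toy graph, and the iteration count are illustrative assumptions.

```python
def laplace_learning(weights, labels, iterations=500):
    """Harmonic label propagation on a weighted graph (illustrative sketch).

    weights: dict mapping node -> {neighbor: positive edge weight} (symmetric).
    labels:  dict mapping labeled node -> known real-valued label.
    Returns a dict with a value for every node.
    """
    # Unlabeled nodes start at 0.0; labeled nodes keep their given value.
    values = {node: labels.get(node, 0.0) for node in weights}
    for _ in range(iterations):
        for node, nbrs in weights.items():
            if node in labels:
                continue  # labeled values are constraints and are never updated
            total = sum(nbrs.values())
            # Gauss-Seidel step: replace the value by the weighted neighbor average.
            values[node] = sum(w * values[n] for n, w in nbrs.items()) / total
    return values


# Hypothetical path graph a - b - c - d with unit weights; only the endpoints are labeled.
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
result = laplace_learning(graph, {"a": 0.0, "d": 1.0})
```

    On this path graph the unlabeled values interpolate linearly between the two labeled endpoints, which is what a harmonic function on a path does.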

  • Communication-Efficient Federated Risk Difference Estimation for Time-to-Event Clinical Outcomes

    arXiv:2601.14609v1 Announce Type: new Abstract: Privacy-preserving model co-training in medical research is often hindered by server-dependent architectures incompatible with protected hospital data systems and by the predominant focus on relative effect measures (hazard ratios) which lack clinical interpretability for absolute survival risk assessment. We propose FedRD,…

  • Semi-Supervised Mixture Models under the Concept of Missing at Random with Margin Confidence and Aranda Ordaz Function

    arXiv:2601.14631v1 Announce Type: new Abstract: This paper presents a semi-supervised learning framework for Gaussian mixture modelling under a Missing at Random (MAR) mechanism. The method explicitly parameterizes the missingness mechanism by modelling the probability of missingness as a…

  • Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers

    arXiv:2601.15014v1 Announce Type: new Abstract: We study in-context learning for nonparametric regression with $\alpha$-Hölder smooth regression functions, for some $\alpha>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal…

  • Google Trends is Misleading You: How to Do Machine Learning with Google Trends Data

    Google Trends is one of the most widely used tools for analysing human behaviour at scale. Journalists use it. Data scientists use it. Entire papers are built on it. But there is a fundamental property of Google Trends data that makes…

  • If You Want to Become a Data Scientist in 2026, Do This

    Learn from my mistakes and fast track your data science career. The post appeared first on Towards Data Science. Egor Howell

  • Building a Self-Healing Data Pipeline That Fixes Its Own Python Errors

    How I built a self-healing pipeline that automatically fixes bad CSVs, schema changes, and weird delimiters. The post appeared first on Towards Data Science. Benjamin Nweke

  • A Case for the T-statistic

    And how it compares to the run-of-the-mill z-score. The post appeared first on Towards Data Science. Aniruddha Karajgi
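
    As a rough illustration of the comparison this teaser mentions, the sketch below computes a one-sample z-statistic (population standard deviation assumed known) next to a one-sample t-statistic (standard deviation estimated from the sample with the n-1 denominator). The sample values, `mu0`, and `sigma` are made-up assumptions, and the article itself may frame the comparison differently.

```python
import math
import statistics


def z_statistic(sample, mu0, sigma):
    """One-sample z-statistic: assumes the population std `sigma` is known."""
    n = len(sample)
    return (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))


def t_statistic(sample, mu0):
    """One-sample t-statistic: estimates the std from the sample (n-1 denominator)."""
    n = len(sample)
    s = statistics.stdev(sample)
    return (statistics.fmean(sample) - mu0) / (s / math.sqrt(n))


# Hypothetical data: eight observations tested against a null mean of 4.0.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_statistic(sample, mu0=4.0, sigma=2.0)
t = t_statistic(sample, mu0=4.0)
```

    The two statistics differ only in the standard deviation they plug in. The t-statistic is then compared against a Student t distribution with n-1 degrees of freedom rather than the standard normal.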

  • Gradient-based Active Learning with Gaussian Processes for Global Sensitivity Analysis

    arXiv:2601.11790v1 Announce Type: new Abstract: Global sensitivity analysis of complex numerical simulators is often limited by the small number of model evaluations that can be afforded. In such settings, surrogate models built from a limited set of simulations can substantially reduce the computational burden, provided…

  • A Kernel Approach for Semi-implicit Variational Inference

    arXiv:2601.12023v1 Announce Type: new Abstract: Semi-implicit variational inference (SIVI) enhances the expressiveness of variational families through hierarchical semi-implicit distributions, but the intractability of their densities makes standard ELBO-based optimization biased. Recent score-matching approaches to SIVI (SIVI-SM) address this issue via a minimax formulation, at the expense of an…

  • On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

    arXiv:2601.12238v1 Announce Type: new Abstract: While momentum-based acceleration has been studied extensively in deterministic optimization problems, its behavior in nonstationary environments, where the data distribution and optimal parameters drift over time, remains underexplored. We analyze the tracking performance of Stochastic Gradient Descent…

  • A Theory of Diversity for Random Matrices with Applications to In-Context Learning of Schrödinger Equations

    arXiv:2601.12587v1 Announce Type: new Abstract: We address the following question: given a collection $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}\}$ of independent $d \times d$ random matrices drawn from a common distribution $\mathbb{P}$, what is the probability that the centralizer of $\{\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}\}$…

  • Approximate full conformal prediction in RKHS

    arXiv:2601.13102v1 Announce Type: new Abstract: Full conformal prediction is a framework that implicitly formulates distribution-free confidence prediction regions for a wide range of estimators. However, a classical limitation of the full conformal framework is the computation of the confidence prediction regions, which is usually impossible since it requires training…

  • Does Calendar-Based Time-Intelligence Change Custom Logic?

    Let’s look at calculating the moving average over time. The post appeared first on Towards Data Science. Salvatore Cagliari

  • How to Perform Large Code Refactors in Cursor

    Learn how to perform code refactoring with LLMs. The post appeared first on Towards Data Science. Eivind Kjosbakken

  • You Probably Don’t Need a Vector Database for Your RAG — Yet

    NumPy or scikit-learn might meet all your retrieval needs. The post appeared first on Towards Data Science. Thomas Reid
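
    The idea behind this teaser, retrieval without a dedicated vector database, can be sketched as a brute-force cosine-similarity search. The article points to NumPy or scikit-learn. This sketch deliberately uses only the Python standard library so it stays self-contained, and the three-dimensional "embeddings", document ids, and the `top_k` helper are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Score every document against the query and return the k best ids."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy embeddings; real ones would come from an embedding model.
documents = {
    "doc_cats": [1.0, 0.0, 0.0],
    "doc_dogs": [0.8, 0.6, 0.0],
    "doc_tax": [0.0, 0.0, 1.0],
}
hits = top_k([0.9, 0.1, 0.0], documents, k=2)
```

    A linear scan like this costs one similarity computation per stored document, which is often acceptable for small corpora. Approximate indexes become relevant as the collection grows.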

  • Bridging the Gap Between Research and Readability with Marco Hening Tallarico

    Diluting complex research, spotting silent data leaks, and why the best way to learn is often backwards. The post appeared first on Towards Data Science. TDS Editors

  • Using Local LLMs to Discover High-Performance Algorithms

    How I used open-source models to explore new frontiers in efficient code generation, using my MacBook and local LLMs. The post appeared first on Towards Data Science. Stefano Bosisio

  • Time Series Isn’t Enough: How Graph Neural Networks Change Demand Forecasting

    Why modeling SKUs as a network reveals what traditional forecasts miss. The post appeared first on Towards Data Science. Partha Sarkar

  • Mass Distribution versus Density Distribution in the Context of Clustering

    arXiv:2601.10759v1 Announce Type: new Abstract: This paper investigates two fundamental descriptors of data, i.e., density distribution versus mass distribution, in the context of clustering. Density distribution has been the de facto descriptor of data distribution since the introduction of statistics. We show that density distribution…

  • Memorize Early, Then Query: Inlier-Memorization-Guided Active Outlier Detection

    arXiv:2601.10993v1 Announce Type: new Abstract: Outlier detection (OD) aims to identify abnormal instances, known as outliers or anomalies, by learning typical patterns of normal data, or inliers. Performing OD under an unsupervised regime (without any information about anomalous instances in the training data) is challenging. A recently observed phenomenon,…

  • Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach

    arXiv:2601.11016v1 Announce Type: new Abstract: In this paper, we introduce a framework for contextual distributionally robust optimization (DRO) that considers the causal and continuous structure of the underlying distribution by developing interpretable and tractable decision rules that prescribe decisions using covariates.…

  • Split-and-Conquer: Distributed Factor Modeling for High-Dimensional Matrix-Variate Time Series

    arXiv:2601.11091v1 Announce Type: new Abstract: In this paper, we propose a distributed framework for reducing the dimensionality of high-dimensional, large-scale, heterogeneous matrix-variate time series data using a factor model. The data are first partitioned column-wise (or row-wise) and allocated to node servers, where each node estimates…

  • Fine Tuning a Simulation-Driven Estimator

    arXiv:2504.04480v2 Announce Type: cross Abstract: Many industries now deploy high-fidelity simulators (digital twins) to represent physical systems, yet their parameters must be calibrated to match the true system. This motivated the construction of simulation-driven parameter estimators, built by generating synthetic observations for sampled parameter values and learning a supervised mapping…

  • Weekly Entering & Transitioning – Thread 19 Jan, 2026 – 26 Jan, 2026

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos), traditional education (e.g. schools, degrees, electives), alternative education (e.g.…

  • The Hidden Opportunity in AI Workflow Automation with n8n for Low-Tech Companies

    How to use n8n with multimodal AI and optimisation tools to help companies with low data maturity accelerate their digital transformation. The post appeared first on Towards Data Science. Samir Saci…

  • Why Healthcare Leads in Knowledge Graphs

    How science, regulation, collaboration, and public funding shaped the world’s most mature semantic infrastructure. The post appeared first on Towards Data Science. Steve Hedden

  • Data Poisoning in Machine Learning: Why and How People Manipulate Training Data

    Do you know where your data has been? The post appeared first on Towards Data Science. Stephanie Kirmer

  • A Geometric Method to Spot Hallucinations Without an LLM Judge

    Imagine a flock of birds in flight. There’s no leader. No central command. Each bird aligns with its neighbors, matching direction, adjusting speed, maintaining coherence through purely local coordination. The result is global order emerging from local consistency. Now imagine one bird flying with the same…

  • Maximum-Efficiency Coding Setup

    Learn how to be a more efficient programmer. The post appeared first on Towards Data Science. Eivind Kjosbakken

  • Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels

    Why your final LLM layer is OOMing and how to fix it with a custom Triton kernel. The post appeared first on Towards Data Science. Ryan Pégoud

  • From RGB to Lab: Addressing Color Artifacts in AI Image Compositing

    A multi-tier approach to segmentation, color correction, and domain-specific enhancement. The post appeared first on Towards Data Science. Eric Chung

  • The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling

    Acquisitions, venture, and an increasingly competitive landscape all point to a market ceiling. The post appeared first on Towards Data Science. Hugo Lu

  • TDS Newsletter: Is It Time to Revisit RAG?

    Let’s make sense of the current state of retrieval-augmented generation. The post appeared first on Towards Data Science. TDS Editors

  • Accelerated Regularized Wasserstein Proximal Sampling Algorithms

    arXiv:2601.09848v1 Announce Type: new Abstract: We consider sampling from a Gibbs distribution by evolving a finite number of particles using a particular score estimator rather than Brownian motion. To accelerate the particles, we consider a second-order score-based ODE, similar to Nesterov acceleration. In contrast to traditional kernel density score…

  • CROCS: A Two-Stage Clustering Framework for Behaviour-Centric Consumer Segmentation with Smart Meter Data

    arXiv:2601.10494v1 Announce Type: new Abstract: With grid operators confronting rising uncertainty from renewable integration and a broader push toward electrification, Demand-Side Management (DSM), particularly Demand Response (DR), has attracted significant attention as a cost-effective mechanism for balancing modern electricity systems.…

  • Coarsening Causal DAG Models

    arXiv:2601.10531v1 Announce Type: new Abstract: Directed acyclic graphical (DAG) models are a powerful tool for representing causal relationships among jointly distributed random variables, especially concerning data from across different experimental settings. However, it is not always practical or desirable to estimate a causal model at the granularity of given features in…

  • Parametric RDT approach to computational gap of symmetric binary perceptron

    arXiv:2601.10628v1 Announce Type: new Abstract: We study potential presence of statistical-computational gaps (SCG) in symmetric binary perceptrons (SBP) via a parametric utilization of fully lifted random duality theory (fl-RDT) [96]. A structural change from decreasingly to arbitrarily ordered $c$-sequence (a key fl-RDT parametric component) is…

  • Classification Imbalance as Transfer Learning

    arXiv:2601.10630v1 Announce Type: new Abstract: Classification imbalance arises when one class is much rarer than the other. We frame this setting as transfer learning under label (prior) shift between an imbalanced source distribution induced by the observed data and a balanced target distribution under which performance is evaluated. Within this…
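
    The label (prior) shift framing can be illustrated with the textbook posterior correction: reweight each class probability by the ratio of target to source class priors, then renormalize. This is a generic sketch of that standard adjustment, not the estimator proposed in the paper, whose abstract is truncated here. The class names, the probabilities, and the `adjust_for_label_shift` helper are assumptions.

```python
def adjust_for_label_shift(posterior, source_prior, target_prior):
    """Re-express a source-trained posterior under different class priors.

    Under label shift p(x | y) is unchanged, so
    p_target(y | x) is proportional to p_source(y | x) * target_prior[y] / source_prior[y].
    """
    weighted = {
        label: posterior[label] * target_prior[label] / source_prior[label]
        for label in posterior
    }
    total = sum(weighted.values())
    return {label: w / total for label, w in weighted.items()}


# Hypothetical example: a model trained on 10% / 90% data, evaluated under balanced priors.
adjusted = adjust_for_label_shift(
    posterior={"rare": 0.2, "common": 0.8},
    source_prior={"rare": 0.1, "common": 0.9},
    target_prior={"rare": 0.5, "common": 0.5},
)
```

    In this example the rare class moves from below one half to above it after the adjustment, so a decision rule thresholding at one half would flip.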

  • When Shapley Values Break: A Guide to Robust Model Explainability

    Shapley Values are one of the most common methods for explainability, yet they can be misleading. Discover how to overcome these limitations to achieve better insights. The post appeared first on Towards Data Science. Alon…

  • How to Run Coding Agents in Parallel

    Get the most out of Claude Code. The post appeared first on Towards Data Science. Eivind Kjosbakken

  • The 2026 Goal Tracker: How I Built a Data-Driven Vision Board Using Python, Streamlit, and Neon

    Designing a centralized system to track daily habits and long-term goals. The post appeared first on Towards Data Science. Sabrine Bendimerad

  • Do You Smell That? Hidden Technical Debt in AI Development

    Why speed without standards creates fragile AI products. The post appeared first on Towards Data Science. Erika Gomes-Gonçalves

  • Tail-Sensitive KL and Rényi Convergence of Unadjusted Hamiltonian Monte Carlo via One-Shot Couplings

    arXiv:2601.09019v1 Announce Type: new Abstract: Hamiltonian Monte Carlo (HMC) algorithms are among the most widely used sampling methods in high dimensional settings, yet their convergence properties are poorly understood in divergences that quantify relative density mismatch, such as Kullback-Leibler (KL) and Rényi…

  • Horseshoe Mixtures-of-Experts (HS-MoE)

    arXiv:2601.09043v1 Announce Type: new Abstract: Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior’s adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the…

  • MLCBART: Multilabel Classification with Bayesian Additive Regression Trees

    arXiv:2601.08964v1 Announce Type: cross Abstract: Multilabel Classification (MLC) deals with the simultaneous classification of multiple binary labels. The task is challenging because, not only may there be arbitrarily different and complex relationships between predictor variables and each label, but associations among labels may exist even after accounting…

  • SCaLE: Switching Cost aware Learning and Exploration

    arXiv:2601.09042v1 Announce Type: cross Abstract: This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the…

  • Efficient Clustering in Stochastic Bandits

    arXiv:2601.09162v1 Announce Type: cross Abstract: We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the…

  • Why Human-Centered Data Analytics Matters More Than Ever

    From optimizing metrics to designing meaning: putting people back into data-driven decisions. The post appeared first on Towards Data Science. Rashi Desai

  • What Is a Knowledge Graph — and Why It Matters

    How structured knowledge became healthcare’s quiet advantage. The post appeared first on Towards Data Science. Steve Hedden

  • Glitches in the Attention Matrix

    A history of Transformer artifacts and the latest research on how to fix them. The post appeared first on Towards Data Science. Jonathan Williford

  • Topic Modeling Techniques for 2026: Seeded Modeling, LLM Integration, and Data Summaries

    Seeded topic modeling, integration with LLMs, and training on summarized data are the fresh parts of the NLP toolkit. The post appeared first on Towards Data Science. Petr Koráb

  • Decentralized Online Convex Optimization with Unknown Feedback Delays

    arXiv:2601.07901v1 Announce Type: new Abstract: Decentralized online convex optimization (D-OCO), where multiple agents within a network collaboratively learn optimal decisions in real-time, arises naturally in applications such as federated learning, sensor networks, and multi-agent control. In this paper, we study D-OCO under unknown, time- and agent-varying feedback delays.…

  • A Statistical Assessment of Amortized Inference Under Signal-to-Noise Variation and Distribution Shift

    arXiv:2601.07944v1 Announce Type: new Abstract: Since the turn of the century, approximate Bayesian inference has steadily evolved as new computational techniques have been incorporated to handle increasingly complex and large-scale predictive problems. The recent success of deep neural networks and foundation models has…

  • Towards A Unified PAC-Bayesian Framework for Norm-based Generalization Bounds

    arXiv:2601.08100v1 Announce Type: new Abstract: Understanding the generalization behavior of deep neural networks remains a fundamental challenge in modern statistical learning theory. Among existing approaches, PAC-Bayesian norm-based bounds have demonstrated particular promise due to their data-dependent nature and their ability to capture algorithmic and geometric properties…

  • Structural Dimension Reduction in Bayesian Networks

    arXiv:2601.08236v1 Announce Type: new Abstract: This work introduces a novel technique, named structural dimension reduction, to collapse a Bayesian network onto a minimum and localized one while ensuring that probabilistic inferences between the original and reduced networks remain consistent. To this end, we propose a new combinatorial structure in…

  • Robust low-rank estimation with multiple binary responses using pairwise AUC loss

    arXiv:2601.08618v1 Announce Type: new Abstract: Multiple binary responses arise in many modern data-analytic problems. Although fitting separate logistic regressions for each response is computationally attractive, it ignores shared structure and can be statistically inefficient, especially in high-dimensional and class-imbalanced regimes. Low-rank models offer a…

  • An introduction to AWS Bedrock

    The how, why, what and where of Amazon’s LLM access layer. The post appeared first on Towards Data Science. Thomas Reid

  • From ‘Dataslows’ to Dataflows: The Gen2 Performance Revolution in Microsoft Fabric

    Dataflows were (rightly?) considered “the slowest and least performant option” for ingesting data into Power BI/Microsoft Fabric. However, things are changing rapidly and the latest Dataflow enhancements change how we play the game.…

  • Under the Uzès Sun: When Historical Data Reveals the Climate Change

    Longer summers, milder winters: analysis of temperature trends in Uzès, France, year after year. The post appeared first on Towards Data Science. Marc Polizzi

  • Why Your ML Model Works in Training But Fails in Production

    Hard lessons from building production ML systems where data leaks, defaults lie, populations shift, and time does not behave the way we expect. The post appeared first on Towards Data Science. Sudheer Singamsetty…

  • How to Maximize Claude Code Effectiveness

    Learn how to get the most out of agentic coding. The post appeared first on Towards Data Science. Eivind Kjosbakken

  • Physics-informed Gaussian Process Regression in Solving Eigenvalue Problem of Linear Operators

    arXiv:2601.06462v1 Announce Type: new Abstract: Applying Physics-Informed Gaussian Process Regression to the eigenvalue problem $(\mathcal{L}-\lambda)u = 0$ poses a fundamental challenge, where the null source term results in a trivial predictive mean and a degenerate marginal likelihood. Drawing inspiration from system identification, we construct…

  • Inference-Time Alignment for Diffusion Models via Doob’s Matching

    arXiv:2601.06514v1 Announce Type: new Abstract: Inference-time alignment for diffusion models aims to adapt a pre-trained diffusion model toward a target distribution without retraining the base score network, thereby preserving the generative capacity of the base model while enforcing desired properties at the inference time. A central mechanism…

  • Dimension-reduced outcome-weighted learning for estimating individualized treatment regimes in observational studies

    Dimension-reduced outcome-weighted learning for estimating individualized treatment regimes in observational studies arXiv:2601.06782v1 Announce Type: new Abstract: Individualized treatment regimes (ITRs) aim to improve clinical outcomes by assigning treatment based on patient-specific characteristics. However, existing methods often struggle with high-dimensional covariates, limiting accuracy, interpretability, and real-world applicability. We propose a novel sufficient dimension reduction approach that…

  • Constrained Density Estimation via Optimal Transport

    Constrained Density Estimation via Optimal Transport arXiv:2601.06830v1 Announce Type: new Abstract: A novel framework for density estimation under expectation constraints is proposed. The framework minimizes the Wasserstein distance between the estimated density and a prior, subject to the constraints that the expected value of a set of functions adopts or exceeds given values. The framework…

  • The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks

    The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks arXiv:2601.06961v1 Announce Type: new Abstract: The success of deep neural networks largely depends on the statistical structure of the training data. While learning dynamics and generalization on isotropic data are well-established, the impact of pronounced anisotropy on these crucial…

  • How AI Can Become Your Personal Language Tutor

    How AI Can Become Your Personal Language Tutor How I used n8n to build AI study partners for learning Mandarin: vocabulary, listening, and pronunciation correction. The post How AI Can Become Your Personal Language Tutor appeared first on Towards Data Science. Samir Saci Go to original source

  • Why 90% Accuracy in Text-to-SQL is 100% Useless

    Why 90% Accuracy in Text-to-SQL is 100% Useless The eternal promise of self-service analytics The post Why 90% Accuracy in Text-to-SQL is 100% Useless appeared first on Towards Data Science. Gary Zavaleta Go to original source

  • When Does Adding Fancy RAG Features Work?

    When Does Adding Fancy RAG Features Work? Looking at the performance of different pipelines The post When Does Adding Fancy RAG Features Work? appeared first on Towards Data Science. Ida Silfverskiöld Go to original source

  • Optimizing Data Transfer in Batched AI/ML Inference Workloads

    Optimizing Data Transfer in Batched AI/ML Inference Workloads A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 2 The post Optimizing Data Transfer in Batched AI/ML Inference Workloads appeared first on Towards Data Science. Chaim Rand Go to original source

  • Machine learning assisted state prediction of misspecified linear dynamical system via modal reduction

    Machine learning assisted state prediction of misspecified linear dynamical system via modal reduction arXiv:2601.05297v1 Announce Type: new Abstract: Accurate prediction of structural dynamics is imperative for preserving digital twin fidelity throughout operational lifetimes. Parametric models with fixed nominal parameters often omit critical physical effects due to simplifications in geometry, material behavior, damping, or boundary conditions,…

  • A Bayesian Generative Modeling Approach for Arbitrary Conditional Inference

    A Bayesian Generative Modeling Approach for Arbitrary Conditional Inference arXiv:2601.05355v1 Announce Type: new Abstract: Modern data analysis increasingly requires flexible conditional inference P(X_B | X_A) where (X_A, X_B) is an arbitrary partition of observed variable X. Existing conditional inference methods lack this flexibility as they are tied to a fixed conditioning structure and cannot perform…

  • A brief note on learning problem with global perspectives

    A brief note on learning problem with global perspectives arXiv:2601.05441v1 Announce Type: new Abstract: This brief note considers the problem of learning with dynamic-optimizing principal-agent setting, in which the agents are allowed to have global perspectives about the learning process, i.e., the ability to view things according to their relative importances or in their true…

  • Multi-task Modeling for Engineering Applications with Sparse Data

    Multi-task Modeling for Engineering Applications with Sparse Data arXiv:2601.05910v1 Announce Type: new Abstract: Modern engineering and scientific workflows often require simultaneous predictions across related tasks and fidelity levels, where high-fidelity data is scarce and expensive, while low-fidelity data is more abundant. This paper introduces a Multi-Task Gaussian Processes (MTGP) framework tailored for engineering systems characterized…

  • Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem

    Detecting Stochasticity in Discrete Signals via Nonparametric Excursion Theorem arXiv:2601.06009v1 Announce Type: new Abstract: We develop a practical framework for distinguishing diffusive stochastic processes from deterministic signals using only a single discrete time series. Our approach is based on classical excursion and crossing theorems for continuous semimartingales, which correlate the number $N_\varepsilon$ of excursions of magnitude…

  • Weekly Entering & Transitioning – Thread 12 Jan, 2026 – 19 Jan, 2026

    Weekly Entering & Transitioning – Thread 12 Jan, 2026 – 19 Jan, 2026 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • Automatic Prompt Optimization for Multimodal Vision Agents: A Self-Driving Car Example

    Automatic Prompt Optimization for Multimodal Vision Agents: A Self-Driving Car Example Walkthrough using open-source prompt optimization algorithms in Python to improve the accuracy of an autonomous vehicle safety agent running on OpenAI’s GPT 5.2 The post Automatic Prompt Optimization for Multimodal Vision Agents: A Self-Driving Car Example appeared first on Towards Data Science. Vincent Koc Go to…

  • How to Leverage Slash Commands to Code Effectively

    How to Leverage Slash Commands to Code Effectively Learn how I utilize slash commands to be a more efficient engineer The post How to Leverage Slash Commands to Code Effectively appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

  • Federated Learning, Part 1: The Basics of Training Models Where the Data Lives

    Federated Learning, Part 1: The Basics of Training Models Where the Data Lives Understanding the foundations of federated learning The post Federated Learning, Part 1: The Basics of Training Models Where the Data Lives appeared first on Towards Data Science. Parul Pandey Go to original source

  • Beyond the Flat Table: Building an Enterprise-Grade Financial Model in Power BI

    Beyond the Flat Table: Building an Enterprise-Grade Financial Model in Power BI A step-by-step journey through data transformation, star schema modeling, and DAX variance analysis with lessons learned along the way. The post Beyond the Flat Table: Building an Enterprise-Grade Financial Model in Power BI appeared first on Towards Data Science. Ibrahim Salami Go to original source

  • How LLMs Handle Infinite Context With Finite Memory

    How LLMs Handle Infinite Context With Finite Memory Achieving infinite context with 114× less memory The post How LLMs Handle Infinite Context With Finite Memory appeared first on Towards Data Science. Moulik Gupta Go to original source

  • Data Science Spotlight: Selected Problems from Advent of Code 2025

    Data Science Spotlight: Selected Problems from Advent of Code 2025 Hands-on walkthroughs of problems and solution approaches that power real‑world data science use cases The post Data Science Spotlight: Selected Problems from Advent of Code 2025 appeared first on Towards Data Science. Chinmay Kakatkar Go to original source

  • Mastering Non-Linear Data: A Guide to Scikit-Learn’s SplineTransformer

    Mastering Non-Linear Data: A Guide to Scikit-Learn’s SplineTransformer Forget stiff lines and wild polynomials. Discover why Splines are the “Goldilocks” of feature engineering, offering the perfect balance of flexibility and discipline for non-linear data using Scikit-Learn’s SplineTransformer. The post Mastering Non-Linear Data: A Guide to Scikit-Learn’s SplineTransformer appeared first on Towards Data Science. Gustavo Santos…

  • Teaching a Neural Network the Mandelbrot Set

    Teaching a Neural Network the Mandelbrot Set And why Fourier features change everything The post Teaching a Neural Network the Mandelbrot Set appeared first on Towards Data Science. Carlos Redondo Go to original source

  • ROOFS: RObust biOmarker Feature Selection

    ROOFS: RObust biOmarker Feature Selection arXiv:2601.05151v1 Announce Type: new Abstract: Feature selection (FS) is essential for biomarker discovery and in the analysis of biomedical datasets. However, challenges such as high-dimensional feature space, low sample size, multicollinearity, and missing values make FS non-trivial. Moreover, FS performances vary across datasets and predictive tasks. We propose roofs, a…

  • CAOS: Conformal Aggregation of One-Shot Predictors

    CAOS: Conformal Aggregation of One-Shot Predictors arXiv:2601.05219v1 Announce Type: new Abstract: One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and…

  • Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data

    Stochastic Deep Learning: A Probabilistic Framework for Modeling Uncertainty in Structured Temporal Data arXiv:2601.05227v1 Announce Type: new Abstract: I propose a novel framework that integrates stochastic differential equations (SDEs) with deep generative models to improve uncertainty quantification in machine learning applications involving structured and temporal data. This approach, termed Stochastic Latent Differential Inference (SLDI), embeds…

  • Learning Multinomial Logits in $O(n \log n)$ time

    Learning Multinomial Logits in $O(n \log n)$ time arXiv:2601.04423v1 Announce Type: cross Abstract: A Multinomial Logit (MNL) model is composed of a finite universe of items $[n]=\{1,\dots, n\}$, each assigned a positive weight. A query specifies an admissible subset — called a slate — and the model chooses one item from that slate with probability…

  • Aligned explanations in neural networks

    Aligned explanations in neural networks arXiv:2601.04378v1 Announce Type: cross Abstract: Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model’s prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be…