Category: aimlds

Data Catalog Tool – Sanity Check

Data Catalog Tool – Sanity Check submitted by /u/FirCoat Go to original source

February 23, 2026
Roast my AB test analysis [A]

Roast my AB test analysis [A] I have just finished up a sample analysis on an AB test dummy dataset, and would love feedback. The dataset is from Udacity’s AB Testing course. It tracks data on two landing page variations, treatment and control, with mean conversion rate as the defining metric. In my analysis, I…

February 23, 2026
The Reality of Vibe Coding: AI Agents and the Security Debt Crisis

The Reality of Vibe Coding: AI Agents and the Security Debt Crisis Why optimizing for speed over safety is leaving applications vulnerable, and how to fix it. The post The Reality of Vibe Coding: AI Agents and the Security Debt Crisis appeared first on Towards Data Science. Reya Vir Go to original source

February 23, 2026
Architecting GPUaaS for Enterprise AI On-Prem

Architecting GPUaaS for Enterprise AI On-Prem Multi-tenancy, scheduling, and cost modeling on Kubernetes The post Architecting GPUaaS for Enterprise AI On-Prem appeared first on Towards Data Science. Joe Sasson Go to original source

February 22, 2026
Donkeys, Not Unicorns

Donkeys, Not Unicorns The New Rules of Entrepreneurship in the Era of Commoditized Magic The post Donkeys, Not Unicorns appeared first on Towards Data Science. Yariv Adan Go to original source

February 21, 2026
An End-to-End Guide to Beautifying Your Open-Source Repo with Agentic AI

An End-to-End Guide to Beautifying Your Open-Source Repo with Agentic AI The guide to automated improvement of scientific and industrial repositories using open-source AI agents The post An End-to-End Guide to Beautifying Your Open-Source Repo with Agentic AI appeared first on Towards Data Science. Nikolay Nikitin Go to original source

February 21, 2026
From Monolith to Contract-Driven Data Mesh

From Monolith to Contract-Driven Data Mesh A pragmatic journey using website analytics as a real-world example The post From Monolith to Contract-Driven Data Mesh appeared first on Towards Data Science. Corné POTGIETER Go to original source

February 21, 2026
Beyond Procedure: Substantive Fairness in Conformal Prediction

Beyond Procedure: Substantive Fairness in Conformal Prediction arXiv:2602.16794v1 Announce Type: new Abstract: Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness, the equity of downstream…

February 20, 2026
Poisson-MNL Bandit: Nearly Optimal Dynamic Joint Assortment and Pricing with Decision-Dependent Customer Arrivals

Poisson-MNL Bandit: Nearly Optimal Dynamic Joint Assortment and Pricing with Decision-Dependent Customer Arrivals arXiv:2602.16923v1 Announce Type: new Abstract: We study dynamic joint assortment and pricing where a seller updates decisions at regular accounting/operating intervals to maximize the cumulative per-period revenue over a horizon $T$. In many settings, assortment and prices affect not only what an…

February 20, 2026
Anti-causal domain generalization: Leveraging unlabeled data

Anti-causal domain generalization: Leveraging unlabeled data arXiv:2602.17187v1 Announce Type: new Abstract: The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we…

February 20, 2026
Semi-Supervised Learning on Graphs using Graph Neural Networks

Semi-Supervised Learning on Graphs using Graph Neural Networks arXiv:2602.17115v1 Announce Type: new Abstract: Graph neural networks (GNNs) work remarkably well in semi-supervised node regression, yet a rigorous theory explaining when and why they succeed remains lacking. To address this gap, we study an aggregate-and-readout model that encompasses several common message passing architectures: node features are…

February 20, 2026
MGD: Moment Guided Diffusion for Maximum Entropy Generation

MGD: Moment Guided Diffusion for Maximum Entropy Generation arXiv:2602.17211v1 Announce Type: new Abstract: Generating samples from limited information is a fundamental problem across scientific domains. Classical maximum entropy methods provide principled uncertainty quantification from moment constraints but require sampling via MCMC or Langevin dynamics, which typically exhibit exponential slowdown in high dimensions. In contrast, generative…

February 20, 2026
The Missing Curriculum: Essential Concepts For Data Scientists in the Age of AI Coding Agents

The Missing Curriculum: Essential Concepts For Data Scientists in the Age of AI Coding Agents AI can write the code, but you have to steer the ship. Master the knowledge to keep you relevant in the age of AI. The post The Missing Curriculum: Essential Concepts For Data Scientists in the Age of AI Coding…

February 20, 2026
Understanding the Chi-Square Test Beyond the Formula

Understanding the Chi-Square Test Beyond the Formula How categorical data becomes statistical evidence. The post Understanding the Chi-Square Test Beyond the Formula appeared first on Towards Data Science. Nikhil Dasari Go to original source

February 20, 2026
AlpamayoR1: Large Causal Reasoning Models for Autonomous Driving

AlpamayoR1: Large Causal Reasoning Models for Autonomous Driving All you need to know about Chain of Causation reasoning and the current state of Autonomous Driving! The post AlpamayoR1: Large Causal Reasoning Models for Autonomous Driving appeared first on Towards Data Science. Ryan Pégoud Go to original source

February 20, 2026
AI in Multiple GPUs: How GPUs Communicate

AI in Multiple GPUs: How GPUs Communicate A deep dive into the hardware infrastructure that enables multi-GPU communication for AI workloads The post AI in Multiple GPUs: How GPUs Communicate appeared first on Towards Data Science. Lorenzo Cesconetto Go to original source

February 20, 2026
Generalized Leverage Score for Scalable Assessment of Privacy Vulnerability

Generalized Leverage Score for Scalable Assessment of Privacy Vulnerability arXiv:2602.15919v1 Announce Type: new Abstract: Can the privacy vulnerability of individual data points be assessed without retraining models or explicitly simulating attacks? We answer affirmatively by showing that exposure to membership inference attack (MIA) is fundamentally governed by a data point’s influence on the learned model.…

February 19, 2026
Including Node Textual Metadata in Laplacian-constrained Gaussian Graphical Models

Including Node Textual Metadata in Laplacian-constrained Gaussian Graphical Models arXiv:2602.15920v1 Announce Type: new Abstract: This paper addresses graph learning in Gaussian Graphical Models (GGMs). In this context, data matrices often come with auxiliary metadata (e.g., textual descriptions associated with each node) that is usually ignored in traditional graph estimation processes. To fill this gap, we…

February 19, 2026
Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation

Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation arXiv:2602.15925v1 Announce Type: new Abstract: Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic…

February 19, 2026
Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models

Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models arXiv:2602.16061v1 Announce Type: new Abstract: Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard…

February 19, 2026
Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis arXiv:2602.16131v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via…

February 19, 2026
Can AI Solve Failures in Your Supply Chain?

Can AI Solve Failures in Your Supply Chain? When your warehouse and transportation teams blame each other for late deliveries, who’s right? We can ask an agent connected to the data to settle the debate. The post Can AI Solve Failures in Your Supply Chain? appeared first on Towards Data Science. Samir Saci Go to original…

February 19, 2026
Building Cost-Efficient Agentic RAG on Long-Text Documents in SQL Tables

Building Cost-Efficient Agentic RAG on Long-Text Documents in SQL Tables Designing a hybrid SQL + vector retrieval system without schema changes, data migration, or performance trade-offs The post Building Cost-Efficient Agentic RAG on Long-Text Documents in SQL Tables appeared first on Towards Data Science. Partha Sarkar Go to original source

February 19, 2026
Why Every Analytics Engineer Needs to Understand Data Architecture

Why Every Analytics Engineer Needs to Understand Data Architecture Get the data architecture right, and everything else becomes easier. I know it sounds simple, but in reality, little nuances in designing your data architecture may have costly implications. This article provides a crash course on the architectures that shape your daily decisions – from relational…

February 19, 2026
Agentic AI for Modern Deep Learning Experimentation

Agentic AI for Modern Deep Learning Experimentation Stop babysitting training runs. Start shipping research. Autonomous experiment management built for/by deep learning engineers. The post Agentic AI for Modern Deep Learning Experimentation appeared first on Towards Data Science. Sam Black Go to original source

February 19, 2026
Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs

Mixture-of-Experts under Finite-Rate Gating: Communication–Generalization Trade-offs arXiv:2602.15091v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize…

February 18, 2026
Universal priors: solving empirical Bayes via Bayesian inference and pretraining

Universal priors: solving empirical Bayes via Bayesian inference and pretraining arXiv:2602.15136v1 Announce Type: new Abstract: We theoretically justify the recent empirical finding of [Teh et al., 2025] that a transformer pretrained on synthetically generated data achieves strong performance on empirical Bayes (EB) problems. We take an indirect approach to this question: rather than analyzing the…

February 18, 2026
Functional Central Limit Theorem for Stochastic Gradient Descent

Functional Central Limit Theorem for Stochastic Gradient Descent arXiv:2602.15538v1 Announce Type: new Abstract: We study the asymptotic shape of the trajectory of the stochastic gradient descent algorithm applied to a convex objective function. Under mild regularity assumptions, we prove a functional central limit theorem for the properly rescaled trajectory. Our result characterizes the long-term fluctuations…

February 18, 2026
Sparse Additive Model Pruning for Order-Based Causal Structure Learning

Sparse Additive Model Pruning for Order-Based Causal Structure Learning arXiv:2602.15306v1 Announce Type: new Abstract: Causal structure learning, also known as causal discovery, aims to estimate causal relationships between variables as a form of a causal directed acyclic graph (DAG) from observational data. One of the major frameworks is the order-based approach that first estimates a…

February 18, 2026
Near-Optimal Sample Complexity for Online Constrained MDPs

Near-Optimal Sample Complexity for Online Constrained MDPs arXiv:2602.15076v1 Announce Type: cross Abstract: Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often suffer…

February 18, 2026
Advance Planning for AI Project Evaluation

Advance Planning for AI Project Evaluation The work to do before the work begins The post Advance Planning for AI Project Evaluation appeared first on Towards Data Science. Stephanie Kirmer Go to original source

February 18, 2026
Use OpenClaw to Make a Personal AI Assistant

Use OpenClaw to Make a Personal AI Assistant Learn how to set up OpenClaw as a personalized AI agent The post Use OpenClaw to Make a Personal AI Assistant appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

February 18, 2026
Building a LangGraph Agent from Scratch

Building a LangGraph Agent from Scratch Everything you need to know to get started The post Building a LangGraph Agent from Scratch appeared first on Towards Data Science. Vyacheslav Efimov Go to original source

February 18, 2026
Iron Triangles: Powerful Tools for Analyzing Trade-Offs in AI Product Development

Iron Triangles: Powerful Tools for Analyzing Trade-Offs in AI Product Development Conceptual overview and practical guidance The post Iron Triangles: Powerful Tools for Analyzing Trade-Offs in AI Product Development appeared first on Towards Data Science. Chinmay Kakatkar Go to original source

February 18, 2026
Nonparametric Distribution Regression Re-calibration

Nonparametric Distribution Regression Re-calibration arXiv:2602.13362v1 Announce Type: new Abstract: A key challenge in probabilistic regression is ensuring that predictive distributions accurately reflect true empirical uncertainty. Minimizing overall prediction error often encourages models to prioritize informativeness over calibration, producing narrow but overconfident predictions. However, in safety-critical settings, trustworthy uncertainty estimates are often more valuable than narrow…

February 17, 2026
Metabolic cost of information processing in Poisson variational autoencoders

Metabolic cost of information processing in Poisson variational autoencoders arXiv:2602.13421v1 Announce Type: new Abstract: Computation in biological systems is fundamentally energy-constrained, yet standard theories of computation treat energy as freely available. Here, we argue that variational free energy minimization under a Poisson assumption offers a principled path toward an energy-aware theory of computation. Our key…

February 17, 2026
Locally Private Parametric Methods for Change-Point Detection

Locally Private Parametric Methods for Change-Point Detection arXiv:2602.13619v1 Announce Type: new Abstract: We study parametric change-point detection, where the goal is to identify distributional changes in time series, under local differential privacy. In the non-private setting, we derive improved finite-sample accuracy guarantees for a change-point detection algorithm based on the generalized log-likelihood ratio test, via…

February 17, 2026
A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization

A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization arXiv:2602.13942v1 Announce Type: new Abstract: In the era of large language models (LLMs), fine-tuning pretrained models has become ubiquitous. Yet the theoretical underpinning remains an open question. A central question is why only a few epochs of fine-tuning are typically sufficient to achieve…

February 17, 2026
Quantifying Normality: Convergence Rate to Gaussian Limit for Stochastic Approximation and Unadjusted OU Algorithm

Quantifying Normality: Convergence Rate to Gaussian Limit for Stochastic Approximation and Unadjusted OU Algorithm arXiv:2602.13906v1 Announce Type: new Abstract: Stochastic approximation (SA) is a method for finding the root of an operator perturbed by noise. There is a rich literature establishing the asymptotic normality of rescaled SA iterates under fairly mild conditions. However, these asymptotic…

February 17, 2026
The Strangest Bottleneck in Modern LLMs

The Strangest Bottleneck in Modern LLMs Why insanely fast GPUs still can’t make LLMs feel instant The post The Strangest Bottleneck in Modern LLMs appeared first on Towards Data Science. Moulik Gupta Go to original source

February 17, 2026
Linear Regression with Unknown Truncation Beyond Gaussian Features

Linear Regression with Unknown Truncation Beyond Gaussian Features arXiv:2602.12534v1 Announce Type: new Abstract: In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and…

February 16, 2026
A Regularization-Sharpness Tradeoff for Linear Interpolators

A Regularization-Sharpness Tradeoff for Linear Interpolators arXiv:2602.12680v1 Announce Type: new Abstract: The rule of thumb regarding the relationship between the bias-variance tradeoff and model size plays a key role in classical machine learning, but is now well-known to break down in the overparameterized setting as per the double descent curve. In particular, minimum-norm interpolating estimators…

February 16, 2026
Blessings of Multiple Good Arms in Multi-Objective Linear Bandits

Blessings of Multiple Good Arms in Multi-Objective Linear Bandits arXiv:2602.12901v1 Announce Type: new Abstract: The multi-objective bandit setting has traditionally been regarded as more complex than the single-objective case, as multiple objectives must be optimized simultaneously. In contrast to this prevailing view, we demonstrate that when multiple good arms exist for multiple objectives,…

February 16, 2026
TFTF: Training-Free Targeted Flow for Conditional Sampling

TFTF: Training-Free Targeted Flow for Conditional Sampling arXiv:2602.12932v1 Announce Type: new Abstract: We propose a training-free conditional sampling method for flow matching models based on importance sampling. Because a naive application of importance sampling suffers from weight degeneracy in high-dimensional settings, we modify and incorporate a resampling technique in sequential Monte Carlo (SMC) during intermediate…

February 16, 2026
Annealing in variational inference mitigates mode collapse: A theoretical study on Gaussian mixtures

Annealing in variational inference mitigates mode collapse: A theoretical study on Gaussian mixtures arXiv:2602.12923v1 Announce Type: new Abstract: Mode collapse, the failure to capture one or more modes when targeting a multimodal distribution, is a central challenge in modern variational inference. In this work, we provide a mathematical analysis of annealing-based strategies for mitigating…

February 16, 2026
Weekly Entering & Transitioning – Thread 16 Feb, 2026 – 23 Feb, 2026

Weekly Entering & Transitioning – Thread 16 Feb, 2026 – 23 Feb, 2026 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

February 16, 2026
LLMs for data pipelines without losing control (API → DuckDB in ~10 mins)

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) Hey folks, I’ve been doing data engineering long enough to believe that “real” pipelines meant writing every parser by hand, dealing with pagination myself, and debugging nested JSON until it finally stopped exploding. I’ve also been pretty skeptical of the “just prompt…

February 16, 2026
Best technique for training models on a sample of data?

Best technique for training models on a sample of data? Due to memory limits on my work computer I’m unable to train machine learning models on our entire analysis dataset. Given my data is highly imbalanced I’m under-sampling from the majority class of the binary outcome. What is the proper method to train ML models…

February 16, 2026
What differentiates a high impact analytics function from one that just produces dashboards?

What differentiates a high impact analytics function from one that just produces dashboards? I’m curious to hear from folks who’ve worked inside or alongside analytics teams. In your experience, what actually separates analytics groups that influence business decisions from those that mostly deliver reporting? submitted by /u/Proof_Wrap_2150 Go to original source

February 16, 2026
Where do you see HR/People Analytics evolving over the next 5 years?

Where do you see HR/People Analytics evolving over the next 5 years? Curious how practitioners see the field shifting, particularly around: AI integration, predictive workforce modeling, skills-based org design, ethical boundaries, data ownership changes, and HR decision automation. What capabilities do you think will define leading functions going forward? submitted by /u/Proof_Wrap_2150 Go…

February 16, 2026
A beginner’s guide to Tmux: a multitasking superpower for your terminal

A beginner’s guide to Tmux: a multitasking superpower for your terminal One of the new things I’ve come across recently, while researching command-line-based coding assistants, is the mention and use of a tool I hadn’t heard of before. That tool is called Tmux, which stands for Terminal Multiplexer. In the simplest possible terms, Tmux allows you…

February 16, 2026
Your First 90 Days as a Data Scientist

Your First 90 Days as a Data Scientist A practical onboarding checklist for building trust, business fluency, and data intuition The post Your First 90 Days as a Data Scientist appeared first on Towards Data Science. Yu Dong Go to original source

February 15, 2026
The Evolving Role of the ML Engineer

The Evolving Role of the ML Engineer Stephanie Kirmer on the $200 billion investment bubble, how AI companies can rebuild trust, and how her day-to-day work changed with the rise of LLMs. The post The Evolving Role of the ML Engineer appeared first on Towards Data Science. TDS Editors Go to original source

February 14, 2026
AI in Multiple GPUs: Point-to-Point and Collective Operations

AI in Multiple GPUs: Point-to-Point and Collective Operations Learn PyTorch distributed operations for multi-GPU AI workloads The post AI in Multiple GPUs: Point-to-Point and Collective Operations appeared first on Towards Data Science. Lorenzo Cesconetto Go to original source

February 14, 2026
The Cost of Learning under Multiple Change Points

The Cost of Learning under Multiple Change Points arXiv:2602.11406v1 Announce Type: new Abstract: We consider an online learning problem in environments with multiple change points. In contrast to the single change point problem that is widely studied using classical “high confidence” detection schemes, the multiple change point environment presents new learning-theoretic and algorithmic challenges. Specifically,…

February 13, 2026
Amortised and provably-robust simulation-based inference

Amortised and provably-robust simulation-based inference arXiv:2602.11325v1 Announce Type: new Abstract: Complex simulator-based models are now routinely used to perform inference across the sciences and engineering, but existing inference methods are often unable to account for outliers and other extreme values in data which occur due to faulty measurement instruments or human error. In this paper,…

February 13, 2026
Provable Offline Reinforcement Learning for Structured Cyclic MDPs

Provable Offline Reinforcement Learning for Structured Cyclic MDPs arXiv:2602.11679v1 Announce Type: new Abstract: We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing a policy at any stage shifts the state distributions…

February 13, 2026
Estimation of instrument and noise parameters for inverse problem based on prior diffusion model

Estimation of instrument and noise parameters for inverse problem based on prior diffusion model arXiv:2602.11711v1 Announce Type: new Abstract: This article addresses the issue of estimating observation parameters (response and error parameters) in inverse problems. The focus is on cases where regularization is introduced in a Bayesian framework and the prior is modeled by a…

February 13, 2026
PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers

PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers arXiv:2602.11722v1 Announce Type: new Abstract: Classical PAC generalization bounds on the prediction risk of a classifier are insufficient to provide theoretical guarantees on fairness when the goal is to learn models balancing predictive risk and fairness constraints. We propose a PAC-Bayesian framework for deriving generalization…

February 13, 2026
How to Leverage Explainable AI for Better Business Decisions

How to Leverage Explainable AI for Better Business Decisions Moving beyond the black box to turn complex model outputs into actionable organizational strategies. The post How to Leverage Explainable AI for Better Business Decisions appeared first on Towards Data Science. Rodrigo Almeida Go to original source

February 13, 2026
AI in Multiple GPUs: Understanding the Host and Device Paradigm

AI in Multiple GPUs: Understanding the Host and Device Paradigm Learn how CPU and GPUs interact in the host-device paradigm The post AI in Multiple GPUs: Understanding the Host and Device Paradigm appeared first on Towards Data Science. Lorenzo Cesconetto Go to original source

February 13, 2026
When LLMs get significantly worse: A statistical approach to detect model degradations

When LLMs get significantly worse: A statistical approach to detect model degradations arXiv:2602.10144v1 Announce Type: new Abstract: Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods and others without accuracy guarantees like quantization. In all of these cases it is crucial to…

February 12, 2026
Dissecting Performative Prediction: A Comprehensive Survey

Dissecting Performative Prediction: A Comprehensive Survey arXiv:2602.10176v1 Announce Type: new Abstract: The field of performative prediction had its beginnings in 2020 with the seminal paper “Performative Prediction” by Perdomo et al., which established a novel machine learning setup where the deployment of a predictive model causes a distribution shift in the environment, which in turn…

February 12, 2026
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning

Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning arXiv:2602.10273v1 Announce Type: new Abstract: Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $\pi_\alpha(y\mid x)\propto p_\theta(y\mid…

February 12, 2026
Causal Effect Estimation with Learned Instrument Representations

Causal Effect Estimation with Learned Instrument Representations arXiv:2602.10370v1 Announce Type: new Abstract: Instrumental variable (IV) methods mitigate bias from unobserved confounding in observational causal inference but rely on the availability of a valid instrument, which can often be difficult or infeasible to identify in practice. In this paper, we propose a representation learning approach that…

February 12, 2026
Generalized Robust Adaptive-Bandwidth Multi-View Manifold Learning in High Dimensions with Noise

Generalized Robust Adaptive-Bandwidth Multi-View Manifold Learning in High Dimensions with Noise arXiv:2602.10530v1 Announce Type: new Abstract: Multiview datasets are common in scientific and engineering applications, yet existing fusion methods offer limited theoretical guarantees, particularly in the presence of heterogeneous and high-dimensional noise. We propose Generalized Robust Adaptive-Bandwidth Multiview Diffusion Maps (GRAB-MDM), a new kernel-based diffusion…

February 12, 2026
Building an AI Agent to Detect and Handle Anomalies in Time-Series Data

Building an AI Agent to Detect and Handle Anomalies in Time-Series Data Combining statistical detection with agentic decision-making The post Building an AI Agent to Detect and Handle Anomalies in Time-Series Data appeared first on Towards Data Science. MADHURA RAUT Go to original source

February 12, 2026
Not All RecSys Problems Are Created Equal

Not All RecSys Problems Are Created Equal How baseline strength, churn, and subjectivity determine complexity The post Not All RecSys Problems Are Created Equal appeared first on Towards Data Science. Diogo Leitão Go to original source

February 12, 2026
Minimum Distance Summaries for Robust Neural Posterior Estimation

Minimum Distance Summaries for Robust Neural Posterior Estimation arXiv:2602.09161v1 Announce Type: new Abstract: Simulation-based inference (SBI) enables amortized Bayesian inference by first training a neural posterior estimator (NPE) on prior-simulator pairs, typically through low-dimensional summary statistics, which can then be cheaply reused for fast inference by querying it on new test observations. Because NPE is…

February 11, 2026
Persistent Entropy as a Detector of Phase Transitions

Persistent Entropy as a Detector of Phase Transitions arXiv:2602.09058v1 Announce Type: new Abstract: Persistent entropy (PE) is an information-theoretic summary statistic of persistence barcodes that has been widely used to detect regime changes in complex systems. Despite its empirical success, a general theoretical understanding of when and why persistent entropy reliably detects phase transitions has…

February 11, 2026
Quantifying Epistemic Uncertainty in Diffusion Models

Quantifying Epistemic Uncertainty in Diffusion Models arXiv:2602.09170v1 Announce Type: new Abstract: To ensure high quality outputs, it is important to quantify the epistemic uncertainty of diffusion models. Existing methods are often unreliable because they mix epistemic and aleatoric uncertainty. We introduce a method based on Fisher information that explicitly isolates epistemic variance, producing more reliable plausibility…

February 11, 2026
Mutual Information Collapse Explains Disentanglement Failure in $\beta$-VAEs

Mutual Information Collapse Explains Disentanglement Failure in $\beta$-VAEs arXiv:2602.09277v1 Announce Type: new Abstract: The $\beta$-VAE is a foundational framework for unsupervised disentanglement, using $\beta$ to regulate the trade-off between latent factorization and reconstruction fidelity. Empirically, however, disentanglement performance exhibits a pervasive non-monotonic trend: benchmarks such as MIG and SAP typically peak at intermediate $\beta$ and…

February 11, 2026
The Critical Horizon: Inspection Design Principles for Multi-Stage Operations and Deep Reasoning

The Critical Horizon: Inspection Design Principles for Multi-Stage Operations and Deep Reasoning arXiv:2602.09394v1 Announce Type: new Abstract: Manufacturing lines, service journeys, supply chains, and AI reasoning chains share a common challenge: attributing a terminal outcome to the intermediate stage that caused it. We establish an information-theoretic barrier to this credit assignment problem: the signal connecting…

February 11, 2026
How to Model The Expected Value of Marketing Campaigns

How to Model The Expected Value of Marketing Campaigns The approach that takes companies to the next level of data maturity The post How to Model The Expected Value of Marketing Campaigns appeared first on Towards Data Science. Rodrigo Almeida Go to original source

February 11, 2026
Implementing the Snake Game in Python

Implementing the Snake Game in Python An easy step-by-step guide to building the snake game from scratch The post Implementing the Snake Game in Python appeared first on Towards Data Science. Mahnoor Javed Go to original source

February 11, 2026
How to Personalize Claude Code

How to Personalize Claude Code Learn how to get more out of Claude Code by giving it access to more information. The post How to Personalize Claude Code appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

February 11, 2026
Fast and Robust Likelihood-Guided Diffusion Posterior Sampling with Amortized Variational Inference

Fast and Robust Likelihood-Guided Diffusion Posterior Sampling with Amortized Variational Inference arXiv:2602.07102v1 Announce Type: new Abstract: Zero-shot diffusion posterior sampling offers a flexible framework for inverse problems by accommodating arbitrary degradation operators at test time, but incurs high computational cost due to repeated likelihood-guided updates. In contrast, previous amortized diffusion approaches enable fast inference by…

February 10, 2026
Discrete Adjoint Matching

Discrete Adjoint Matching arXiv:2602.07132v1 Announce Type: new Abstract: Computation methods for solving entropy-regularized reward optimization — a class of problems widely used for fine-tuning generative models — have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to…

February 10, 2026
Flow-Based Conformal Predictive Distributions

Flow-Based Conformal Predictive Distributions arXiv:2602.07633v1 Announce Type: new Abstract: Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate…

February 10, 2026
Scalable Mean-Field Variational Inference via Preconditioned Primal-Dual Optimization

Scalable Mean-Field Variational Inference via Preconditioned Primal-Dual Optimization arXiv:2602.07632v1 Announce Type: new Abstract: In this work, we investigate the large-scale mean-field variational inference (MFVI) problem from a mini-batch primal-dual perspective. By reformulating MFVI as a constrained finite-sum problem, we develop a novel primal-dual algorithm based on an augmented Lagrangian formulation, termed primal-dual variational inference (PD-VI).…

February 10, 2026
On Generation in Metric Spaces

On Generation in Metric Spaces arXiv:2602.07710v1 Announce Type: new Abstract: We study generation in separable metric instance spaces. We extend the language generation framework from Kleinberg and Mullainathan [2024] beyond countable domains by defining novelty through metric separation and allowing asymmetric novelty parameters for the adversary and the generator. We introduce the $(\varepsilon,\varepsilon')$-closure dimension, a…

February 10, 2026
The Machine Learning Lessons I’ve Learned Last Month

The Machine Learning Lessons I’ve Learned Last Month Delayed January: deadlines, downtimes, and flow times The post The Machine Learning Lessons I’ve Learned Last Month appeared first on Towards Data Science. Pascal Janetzky Go to original source

February 10, 2026
The Death of the “Everything Prompt”: Google’s Move Toward Structured AI

The Death of the “Everything Prompt”: Google’s Move Toward Structured AI How the new Interactions API enables deep-reasoning, stateful, agentic workflows. The post The Death of the “Everything Prompt”: Google’s Move Toward Structured AI appeared first on Towards Data Science. Thomas Reid Go to original source

February 10, 2026
Deep networks learn to parse uniform-depth context-free languages from local statistics

Deep networks learn to parse uniform-depth context-free languages from local statistics arXiv:2602.06065v1 Announce Type: new Abstract: Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text…

February 9, 2026
Algebraic Robustness Verification of Neural Networks

Algebraic Robustness Verification of Neural Networks arXiv:2602.06105v1 Announce Type: new Abstract: We formulate formal robustness verification of neural networks as an algebraic optimization problem. We leverage the Euclidean Distance (ED) degree, which is the generic number of complex critical points of the distance minimization problem to a classifier’s decision boundary, as an architecture-dependent measure of…

February 9, 2026
Inheritance Between Feedforward and Convolutional Networks via Model Projection

Inheritance Between Feedforward and Convolutional Networks via Model Projection arXiv:2602.06245v1 Announce Type: new Abstract: Techniques for feedforward networks (FFNs) and convolutional networks (CNNs) are frequently reused across families, but the relationship between the underlying model classes is rarely made explicit. We introduce a unified node-level formalization with tensor-valued activations and show that generalized feedforward networks…

February 9, 2026
High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory

High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory arXiv:2602.06320v1 Announce Type: new Abstract: Modern machine learning models are typically trained via multi-pass stochastic gradient descent (SGD) with small batch sizes, and understanding their dynamics in high dimensions is of great interest. However, an analytical framework for describing the high-dimensional asymptotic behavior of multi-pass…

February 9, 2026
Time-uniform conformal and PAC prediction

Time-uniform conformal and PAC prediction arXiv:2602.06297v1 Announce Type: new Abstract: Given that machine learning algorithms are increasingly being deployed to aid in high stakes decision-making, uncertainty quantification methods that wrap around these black box models such as conformal prediction have received much attention in recent years. In sequential settings, where data are observed/generated in a…

February 9, 2026
Weekly Entering & Transitioning – Thread 09 Feb, 2026 – 16 Feb, 2026

Weekly Entering & Transitioning – Thread 09 Feb, 2026 – 16 Feb, 2026 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

February 9, 2026
How I scraped 5.3 million jobs (including 5,335 data science jobs)

How I scraped 5.3 million jobs (including 5,335 data science jobs) Background During my PhD in Data Science at Stanford, I got sick and tired of ghost jobs & 3rd party offshore agencies on LinkedIn & Indeed. So I wrote a script that fetches jobs from 30k+ company websites’ career pages and uses GPT4o-mini to…

February 9, 2026
Thoughts about going from Senior data scientist at company A to Senior Data Analyst at Company B

Thoughts about going from Senior data scientist at company A to Senior Data Analyst at Company B The senior data analyst at company B is significantly higher pay ($50k/year more) and scope seems to be bigger with more ownership. What kind of setback (if any) does losing the data scientist title have? submitted by /u/StatGoddess…

February 9, 2026
Retraining strategy with evolving classes + imbalanced labels?

Retraining strategy with evolving classes + imbalanced labels? Hi all — I’m looking for advice on the best retraining strategy for a multi-class classifier in a setting where the label space can evolve. Right now I have about 6 labels, but I don’t know how many will show up over time, and some labels appear…

February 9, 2026
Finding myself disillusioned with the quality of discussion in this sub

Finding myself disillusioned with the quality of discussion in this sub I see multiple highly-upvoted comments per day saying things like “LLMs aren’t AI,” demonstrating a complete misunderstanding of the technical definitions of these terms. Or worse, comments that say “this stuff isn’t AI, AI is like *insert sci-fi reference*.” And this is just comments…

February 9, 2026
What I Am Doing to Stay Relevant as a Senior Analytics Consultant in 2026

What I Am Doing to Stay Relevant as a Senior Analytics Consultant in 2026 Learn how to work with AI, while strengthening your unique human skills that technology cannot replace The post What I Am Doing to Stay Relevant as a Senior Analytics Consultant in 2026 appeared first on Towards Data Science. Rashi Desai Go…

February 8, 2026
Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently

Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently The real value lies in writing clearer code and using your tools right The post Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently appeared first on Towards Data Science. Mike Huls Go to original source

February 7, 2026
Prompt Fidelity: Measuring How Much of Your Intent an AI Agent Actually Executes

Prompt Fidelity: Measuring How Much of Your Intent an AI Agent Actually Executes How much of your AI agent’s output is real data versus confident guesswork? The post Prompt Fidelity: Measuring How Much of Your Intent an AI Agent Actually Executes appeared first on Towards Data Science. James Barney Go to original source

February 7, 2026
TDS Newsletter: Vibe Coding Is Great. Until It’s Not.

TDS Newsletter: Vibe Coding Is Great. Until It’s Not. Sorting through the good, bad, and ambiguous aspects of vibe coding The post TDS Newsletter: Vibe Coding Is Great. Until It’s Not. appeared first on Towards Data Science. TDS Editors Go to original source

February 7, 2026
Total Variation Rates for Riemannian Flow Matching

Total Variation Rates for Riemannian Flow Matching arXiv:2602.05174v1 Announce Type: new Abstract: Riemannian flow matching (RFM) extends flow-based generative modeling to data supported on manifolds by learning a time-dependent tangent vector field whose flow-ODE transports a simple base distribution to the data law. We develop a nonasymptotic Total Variation (TV) convergence analysis for RFM samplers…

February 6, 2026
Finite-Particle Rates for Regularized Stein Variational Gradient Descent

Finite-Particle Rates for Regularized Stein Variational Gradient Descent arXiv:2602.05172v1 Announce Type: new Abstract: We derive finite-particle rates for the regularized Stein variational gradient descent (R-SVGD) algorithm introduced by He et al. (2024) that corrects the constant-order bias of the SVGD by applying a resolvent-type preconditioner to the kernelized Wasserstein gradient. For the resulting interacting $N$-particle…

February 6, 2026
Logarithmic-time Schedules for Scaling Language Models with Momentum

Logarithmic-time Schedules for Scaling Language Models with Momentum arXiv:2602.05298v1 Announce Type: new Abstract: In practice, the hyperparameters $(\beta_1, \beta_2)$ and weight-decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of…

February 6, 2026