Category: aimlds

  • TDS Newsletter: December Must-Reads on GraphRAG, Data Contracts, and More

    Don’t miss our most popular articles of the previous month. First published on Towards Data Science. By TDS Editors.

  • Beyond Prompting: The Power of Context Engineering

    Using ACE to create self-improving LLM workflows and structured playbooks. First published on Towards Data Science. By Mariya Mansurova.

  • Retrieval for Time-Series: How Looking Back Improves Forecasts

    Why Retrieval Helps in Time Series Forecasting. We all know how it goes: Time-series data is tricky. Traditional forecasting models are unprepared for incidents like sudden market crashes, black swan events, or rare weather patterns. Even large fancy models like Chronos sometimes struggle because they haven’t dealt…

  • How to Improve the Performance of Visual Anomaly Detection Models

    Apply the best methods from academia to get the most out of practical applications. First published on Towards Data Science. By Aimira Baitieva.

  • Faster Is Not Always Better: Choosing the Right PostgreSQL Insert Strategy in Python (+Benchmarks)

    PostgreSQL is fast. Whether your Python code can or should keep up depends on context. This article compares and benchmarks various insert strategies, focusing not on micro-benchmarks but on trade-offs between safety, abstraction, and throughput, and choosing the right tool…

  • On the Identifiability of Regime-Switching Models with Multi-Lag Dependencies

    arXiv:2601.03325v1 (Announce Type: new). Abstract: Identifiability is central to the interpretability of deep latent variable models, ensuring parameterisations are uniquely determined by the data-generating distribution. However, it remains underexplored for deep regime-switching time series. We develop a general theoretical framework for multi-lag Regime-Switching Models (RSMs), encompassing…

  • Microeconomic Foundations of Multi-Agent Learning

    arXiv:2601.03451v1 (Announce Type: new). Abstract: Modern AI systems increasingly operate inside markets and institutions where data, behavior, and incentives are endogenous. This paper develops an economic foundation for multi-agent learning by studying a principal-agent interaction in a Markov decision process with strategic externalities, where both the principal and the agent…

  • Online Learning with Limited Information in the Sliding Window Model

    arXiv:2601.03533v1 (Announce Type: new). Abstract: Motivated by recent work on the experts problem in the streaming model, we consider the experts problem in the sliding window model. The sliding window model is a well-studied model that captures applications such as traffic monitoring, epidemic tracking, and…

  • A Theoretical and Empirical Taxonomy of Imbalance in Binary Classification

    arXiv:2601.04149v1 (Announce Type: new). Abstract: Class imbalance significantly degrades classification performance, yet its effects are rarely analyzed from a unified theoretical perspective. We propose a principled framework based on three fundamental scales: the imbalance coefficient $\eta$, the sample–dimension ratio $\kappa$, and the intrinsic separability $\Delta$.…

  • A path to natural language through tokenisation and transformers

    arXiv:2601.03368v1 (Announce Type: cross). Abstract: Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf’s and Heaps’ laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation schemes used in contemporary transformer models. In this note,…

  • HNSW at Scale: Why Your RAG System Gets Worse as the Vector Database Grows

    How approximate vector search silently degrades recall, and what to do about it. First published on Towards Data Science. By Partha Sarkar.

  • I Evaluated Half a Million Credit Records with Federated Learning. Here’s What I Found

    Why privacy breaks fairness at small scale, and how collaboration fixes both without sharing a single record. First published on Towards Data Science. By Arjun Kaarat.

  • Probabilistic Multi-Variant Reasoning: Turning Fluent LLM Answers Into Weighted Options

    Human-guided AI collaboration. First published on Towards Data Science. By alan nekhom.

  • Why Supply Chain is the Best Domain for Data Scientists in 2026 (And How to Learn It)

    My take after 10 years in Supply Chain on why this can be an excellent playground for data scientists who want to see their skills valued. The post Why Supply Chain is the Best Domain for Data Scientists in…

  • Mitigating Long-Tailed Anomaly Score Distributions with Importance-Weighted Loss

    arXiv:2601.02440v1 (Announce Type: new). Abstract: Anomaly detection is crucial in industrial applications for identifying rare and unseen patterns to ensure system reliability. Traditional models, trained on a single class of normal data, struggle with real-world distributions where normal data exhibit diverse patterns, leading to class imbalance and…

  • Fast Conformal Prediction using Conditional Interquantile Intervals

    arXiv:2601.02769v1 (Announce Type: new). Abstract: We introduce Conformal Interquantile Regression (CIR), a conformal regression method that efficiently constructs near-minimal prediction intervals with guaranteed coverage. CIR leverages black-box machine learning models to estimate outcome distributions through interquantile ranges, transforming these estimates into compact prediction intervals while achieving approximate conditional…

  • Self-Supervised Learning from Noisy and Incomplete Data

    arXiv:2601.03244v1 (Announce Type: new). Abstract: Many important problems in science and engineering involve inferring a signal from noisy and/or incomplete observations, where the observation process is known. Historically, this problem has been tackled using hand-crafted regularization (e.g., sparsity, total-variation) to obtain meaningful estimates. Recent data-driven methods often offer…

  • Detecting and Mitigating Treatment Leakage in Text-Based Causal Inference: Distillation and Sensitivity Analysis

    arXiv:2601.02400v1 (Announce Type: cross). Abstract: Text-based causal inference increasingly employs textual data as proxies for unobserved confounders, yet this approach introduces a previously undertheorized source of bias: treatment leakage. Treatment leakage occurs when text intended to capture confounding information also contains signals…

  • First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data

    arXiv:2601.02523v1 (Announce Type: cross). Abstract: Artificial intelligence has advanced rapidly through large neural networks trained on massive datasets using thousands of GPUs or TPUs. Such training can occupy entire data centers for weeks and requires enormous computational and energy resources. Yet the optimization algorithms behind…

  • Measuring What Matters with NeMo Agent Toolkit

    A practical guide to observability, evaluations, and model comparisons. First published on Towards Data Science. By Mariya Mansurova.

  • The Best Data Scientists Are Always Learning

    Part 2: Avoiding burnout, learning strategies and the superpower of solitude. First published on Towards Data Science. By Jarom Hulet.

  • How to Optimize Your AI Coding Agent Context

    Make your coding agents more efficient. First published on Towards Data Science. By Eivind Kjosbakken.

  • GliNER2: Extracting Structured Information from Text

    From unstructured text to structured Knowledge Graphs. First published on Towards Data Science. By Tomaz Bratanic.

  • Beyond Demand Estimation: Consumer Surplus Evaluation via Cumulative Propensity Weights

    arXiv:2601.01029v1 (Announce Type: new). Abstract: This paper develops a practical framework for using observational data to audit the consumer surplus effects of AI-driven decisions, specifically in targeted pricing and algorithmic lending. Traditional approaches first estimate demand functions and then integrate to compute consumer surplus, but…

  • Fibonacci-Driven Recursive Ensembles: Algorithms, Convergence, and Learning Dynamics

    arXiv:2601.01055v1 (Announce Type: new). Abstract: This paper develops the algorithmic and dynamical foundations of recursive ensemble learning driven by Fibonacci-type update flows. In contrast with classical boosting (Freund and Schapire, 1997; Friedman, 2001), where the ensemble evolves through first-order additive updates, we study second-order recursive architectures in…

  • Neural Networks on Symmetric Spaces of Noncompact Type

    arXiv:2601.01097v1 (Announce Type: new). Abstract: Recent works have demonstrated promising performances of neural networks on hyperbolic spaces and symmetric positive definite (SPD) manifolds. These spaces belong to a family of Riemannian manifolds referred to as symmetric spaces of noncompact type. In this paper, we propose a novel…

  • Conformal Blindness: A Note on $A$-Cryptic change-points

    arXiv:2601.01147v1 (Announce Type: new). Abstract: Conformal Test Martingales (CTMs) are a standard method within the Conformal Prediction framework for testing the crucial assumption of data exchangeability by monitoring deviations from uniformity in the p-value sequence. Although exchangeability implies uniform p-values, the converse does not hold. This raises the…

  • Evidence Slopes and Effective Dimension in Singular Linear Models

    arXiv:2601.01238v1 (Announce Type: new). Abstract: Bayesian model selection commonly relies on Laplace approximation or the Bayesian Information Criterion (BIC), which assume that the effective model dimension equals the number of parameters. Singular learning theory replaces this assumption with the real log canonical threshold (RLCT), an effective…

  • Feature Detection, Part 3: Harris Corner Detection

    Finding the most informative points in images. First published on Towards Data Science. By Vyacheslav Efimov.

  • Ray: Distributed Computing for All, Part 1

    From single to multi-core on your local PC and beyond. First published on Towards Data Science. By Thomas Reid.

  • Stop Blaming the Data: A Better Way to Handle Covariance Shift

    Instead of using shift as an excuse for poor performance, use Inverse Probability Weighting to estimate how your model should perform in the new environment. First published on Towards Data Science.…

  • YOLOv1 Loss Function Walkthrough: Regression for All

    An explanation of how YOLOv1 measures the correctness of its object detection and classification predictions. First published on Towards Data Science. By Muhammad Ardi.

  • Active learning for data-driven reduced models of parametric differential systems with Bayesian operator inference

    arXiv:2601.00038v1 (Announce Type: new). Abstract: This work develops an active learning framework to intelligently enrich data-driven reduced-order models (ROMs) of parametric dynamical systems, which can serve as the foundation of virtual assets in a digital twin. Data-driven ROMs are explainable, computationally…

  • Detecting Unobserved Confounders: A Kernelized Regression Approach

    arXiv:2601.00200v1 (Announce Type: new). Abstract: Detecting unobserved confounders is crucial for reliable causal inference in observational studies. Existing methods require either linearity assumptions or multiple heterogeneous environments, limiting applicability to nonlinear single-environment settings. To bridge this gap, we propose Kernel Regression Confounder Detection (KRCD), a novel method for…

  • Generative Conditional Missing Imputation Networks

    arXiv:2601.00517v1 (Announce Type: new). Abstract: In this study, we introduce a sophisticated generative conditional strategy designed to impute missing values within datasets, an area of considerable importance in statistical analysis. Specifically, we initially elucidate the theoretical underpinnings of the Generative Conditional Missing Imputation Networks (GCMI), demonstrating its robust properties in…

  • Deep learning estimation of the spectral density of functional time series on large domains

    arXiv:2601.00284v1 (Announce Type: cross). Abstract: We derive an estimator of the spectral density of a functional time series that is the output of a multilayer perceptron neural network. The estimator is motivated by difficulties with the computation of existing spectral density…

  • Identification and Estimation under Multiple Versions of Treatment: Mixture-of-Experts Approach

    arXiv:2601.00287v1 (Announce Type: cross). Abstract: The Stable Unit Treatment Value Assumption (SUTVA) includes the condition that there are no multiple versions of treatment in causal inference. Though we could not control the implementation of treatment in observational studies, multiple versions may exist in the treatment.…

  • Weekly Entering & Transitioning – Thread 05 Jan, 2026 – 12 Jan, 2026

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

  • [Official] 2025 End of Year Salary Sharing thread

    This is the official thread for sharing your current salaries (or recent offers). See last year’s Salary Sharing thread here. Please only post salaries/offers if you’re including hard numbers, but feel free to use a throwaway account if you’re concerned about anonymity. You can also generalize some…

  • Learning Python by doing projects: What does that even mean?

    I’m learning Python and considering this approach: choose a real dataset, frame a question I want to answer, then work toward it step by step by breaking it into small tasks and researching each step as needed. For those of you who are already comfortable…

  • Tips for standing out in this market?

    Hey all, I just finished my master’s in data science last month and I want to see what it takes to break into a mid-level DS role. I haven’t had a chance to sterilize my resume yet (2 young kids and a lot of recent travel), but…

  • Which class should I take to help me get a job?

    I’m in my final semester of my MS program and am deciding between Spatial and Non-Parametric statistics. I feel like spatial is less common but would make me stand out more for jobs specifically looking for spatial, whereas NP would be more common but…

  • Prompt Engineering vs RAG for Editing Resumes

    Running a code-free comparison in Azure. First published on Towards Data Science. By Robert Etter.

  • How to Filter for Dates, Including or Excluding Future Dates, in Semantic Models

    It is common to have either planning data or the previous year’s data displayed beyond today’s date. But future data can be confusing. How can I add a Slicer to show or hide future data? Let’s see how to do it. The…

  • Optimizing Data Transfer in AI/ML Workloads

    A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems. First published on Towards Data Science. By Chaim Rand.

  • How to Keep MCPs Useful in Agentic Pipelines

    Check the tools your LLM uses before replacing it with just a more powerful model. First published on Towards Data Science. By Roman S.

  • Drift Detection in Robust Machine Learning Systems

    A prerequisite for long-term success of machine learning systems. First published on Towards Data Science. By Morris Stallmann.

  • Off-Beat Careers That Are the Future Of Data

    The unconventional career paths you need to explore. First published on Towards Data Science. By Rashi Desai.

  • The Real Challenge in Data Storytelling: Getting Buy-In for Simplicity

    What happens when your clear dashboard meets stakeholders who want everything on one screen. First published on Towards Data Science. By Benjamin Nweke.

  • EDA in Public (Part 3): RFM Analysis for Customer Segmentation in Pandas

    How to build, score, and interpret RFM segments step by step. First published on Towards Data Science. By Ibrahim Salami.

  • Deep Reinforcement Learning: The Actor-Critic Method

    Robot friends collaborate to learn to fly a drone. First published on Towards Data Science. By Vedant Jumle.

  • Energy-Tweedie: Score meets Score, Energy meets Energy

    arXiv:2512.23818v1 (Announce Type: new). Abstract: Denoising and score estimation have long been known to be linked via the classical Tweedie’s formula. In this work, we first extend the latter to a wider range of distributions often called “energy models” and denoted elliptical distributions in this work. Next, we…

  • Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting

    arXiv:2512.23805v1 (Announce Type: new). Abstract: Fitted Q-evaluation (FQE) is a central method for off-policy evaluation in reinforcement learning, but it generally requires Bellman completeness: that the hypothesis class is closed under the evaluation Bellman operator. This requirement is challenging because enlarging the hypothesis class can worsen…

  • Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration

    arXiv:2512.23927v1 (Announce Type: new). Abstract: Fitted Q-iteration (FQI) and its entropy-regularized variant, soft FQI, are central tools for value-based model-free offline reinforcement learning, but can behave poorly under function approximation and distribution shift. In the entropy-regularized setting, we show that the soft Bellman operator is locally…

  • Implicit geometric regularization in flow matching via density weighted Stein operators

    arXiv:2512.23956v1 (Announce Type: new). Abstract: Flow Matching (FM) has emerged as a powerful paradigm for continuous normalizing flows, yet standard FM implicitly performs an unweighted $L^2$ regression over the entire ambient space. In high dimensions, this leads to a fundamental inefficiency: the vast majority…

  • Constructive Approximation of Random Process via Stochastic Interpolation Neural Network Operators

    arXiv:2512.24106v1 (Announce Type: new). Abstract: In this paper, we construct a class of stochastic interpolation neural network operators (SINNOs) with random coefficients activated by sigmoidal functions. We establish their boundedness, interpolation accuracy, and approximation capabilities in the mean square sense, in probability, as well…

  • Production-Ready LLMs Made Simple with the NeMo Agent Toolkit

    From simple chat to multi-agent reasoning and real-time REST APIs. First published on Towards Data Science. By Mariya Mansurova.

  • What Advent of Code Has Taught Me About Data Science

    Five key learnings that I discovered during a programming challenge and how they apply to data science. First published on Towards Data Science. By Jasper Schroeder.

  • Chunk Size as an Experimental Variable in RAG Systems

    Understanding retrieval in RAG systems by experimenting with different chunk sizes. First published on Towards Data Science. By Sarah Schürch.

  • The Machine Learning “Advent Calendar” Bonus 2: Gradient Descent Variants in Excel

    Gradient Descent, Momentum, RMSProp, and Adam all aim for the same minimum. They do not change the destination, only the path. Each method adds a mechanism that fixes a limitation of the previous one, making the movement faster, more stable, or more adaptive.…

  • Overcoming Nonsmoothness and Control Chattering in Nonconvex Optimal Control Problems

    With some hints for good numerics. First published on Towards Data Science. By Willem Esterhuizen.

  • The Machine Learning “Advent Calendar” Bonus 1: AUC in Excel

    AUC measures how well a model ranks positives above negatives, independent of any chosen threshold. First published on Towards Data Science. By angela shi.

  • Agents Under the Curve (AUC)

    Towards understanding if your agentic solution is actually better. First published on Towards Data Science. By Lambert Leong.

  • A review of NMF, PLSA, LBA, EMA, and LCA with a focus on the identifiability issue

    arXiv:2512.22282v1 (Announce Type: new). Abstract: Across fields such as machine learning, social science, and geography, considerable attention has been given to models that factorize a nonnegative matrix into the product of two or three matrices, subject to nonnegative or row-sum-to-1…

  • A General Weighting Theory for Ensemble Learning: Beyond Variance Reduction via Spectral and Geometric Structure

    arXiv:2512.22286v1 (Announce Type: new). Abstract: Ensemble learning is traditionally justified as a variance-reduction strategy, explaining its strong performance for unstable predictors such as decision trees. This explanation, however, does not account for ensembles constructed from intrinsically stable estimators, including smoothing splines,…

  • On Fibonacci Ensembles: An Alternative Approach to Ensemble Learning Inspired by the Timeless Architecture of the Golden Ratio

    arXiv:2512.22284v1 (Announce Type: new). Abstract: Nature rarely reveals her secrets bluntly, yet in the Fibonacci sequence she grants us a glimpse of her quiet architecture of growth, harmony, and recursive stability (Koshy, 2001; Livio, 2002). From spiral galaxies to…

  • Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

    arXiv:2512.22473v1 (Announce Type: new). Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed “Bayesian wind tunnels” and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training…

  • Likelihood-Preserving Embeddings for Statistical Inference

    arXiv:2512.22638v1 (Announce Type: new). Abstract: Modern machine learning embeddings provide powerful compression of high-dimensional data, yet they typically destroy the geometric structure required for classical likelihood-based statistical inference. This paper develops a rigorous theory of likelihood-preserving embeddings: learned representations that can replace raw data in likelihood-based workflows such as hypothesis testing,…

  • Machine Learning vs AI Engineer: What Are the Differences?

    One of the most confusing questions in tech right now is: What is the difference between an AI engineer and a machine learning engineer? Both are six-figure jobs, but if you choose the wrong one, you could waste months of your career learning the wrong skills…

  • How to Facilitate Effective AI Programming

    How to ensure your coding agent has the same context as you. First published on Towards Data Science. By Eivind Kjosbakken.

  • Implementing Vibe Proving with Reinforcement Learning

    How to make LLMs reason with verifiable, step-by-step logic (Part 2). First published on Towards Data Science. By Jacopo Tagliabue.

  • An approach to Fisher-Rao metric for infinite dimensional non-parametric information geometry

    arXiv:2512.21451v1 (Announce Type: new). Abstract: Being infinite dimensional, non-parametric information geometry has long faced an “intractability barrier” due to the fact that the Fisher-Rao metric is now a functional incurring difficulties in defining its inverse. This paper introduces a novel framework to resolve the…

  • Residual Prior Diffusion: A Probabilistic Framework Integrating Coarse Latent Priors with Diffusion Models

    arXiv:2512.21593v1 (Announce Type: new). Abstract: Diffusion models have become a central tool in deep generative modeling, but standard formulations rely on a single network and a single diffusion schedule to transform a simple prior, typically a standard normal distribution, into the target…

  • Tilt Matching for Scalable Sampling and Fine-Tuning

    arXiv:2512.21829v1 (Announce Type: new). Abstract: We propose a simple, scalable algorithm for using stochastic interpolants to sample from unnormalized densities and for fine-tuning generative models. The approach, Tilt Matching, arises from a dynamical equation relating the flow matching velocity to one targeting the same distribution tilted by a…

  • Automated Pollen Recognition in Optical and Holographic Microscopy Images

    arXiv:2512.08589v1 (Announce Type: cross). Abstract: This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L…

  • Thermodynamic Characterizations of Singular Bayesian Models: Specific Heat, Susceptibility, and Entropy Flow in Posterior Geometry

    Thermodynamic Characterizations of Singular Bayesian Models: Specific Heat, Susceptibility, and Entropy Flow in Posterior Geometry arXiv:2512.21411v1 Announce Type: cross Abstract: Singular learning theory (SLT) (Watanabe, 2009, 2018) provides a rigorous asymptotic framework for Bayesian models with non-identifiable parameterizations, yet the statistical meaning of its second-order invariant, the singular fluctuation, has remained unclear. In this work, we show…

  • Weekly Entering & Transitioning – Thread 29 Dec, 2025 – 05 Jan, 2026

    Weekly Entering & Transitioning – Thread 29 Dec, 2025 – 05 Jan, 2026 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • Modern Git-aware File Tree and global search/replace in Jupyter

    Modern Git-aware File Tree and global search/replace in Jupyter I have used JupyterLab for years, but its file browser lacks some important features, such as a tree view and awareness of Git status. I tried some of the older third-party extensions, but none of them meet the modern demands that most editors and IDEs (like VS Code) already satisfy, so…

  • What skills did you learn on the job this past year?

    What skills did you learn on the job this past year? What skills did you actually learn on the job this past year? Not from self-study or online courses, but through live hands-on training or genuinely challenging assignments. My hunch is that learning opportunities have declined recently, with many companies leaning on “you own your…

  • Are some people really as busy as they really look?

    Are some people really as busy as they really look? There is someone I have to work with, and we both work remotely. I’m a data scientist and he is a product manager. This person always appears to be busy. His Slack status is either in a huddle or in a meeting. He is probably…

  • PhD microbiologist pivoting to GCC data analytics. Is a master’s needed or portfolio and projects sufficient?

    PhD microbiologist pivoting to GCC data analytics. Is a master’s needed or portfolio and projects sufficient? I am finishing a wet-lab microbiology PhD. Over the last year I realised that I prefer data work. I use R, Excel and command line regularly and want to move toward analytics roles in industry rather than academic biology.…

  • Breaking the Hardware Barrier: Software FP8 for Older GPUs

    Breaking the Hardware Barrier: Software FP8 for Older GPUs Deep learning workloads are increasingly memory-bound, with GPU cores sitting idle while waiting for data transfers. FP8 precision solves this on newer hardware, but what about the millions of RTX 30 and 20 series GPUs already deployed? Feather demonstrates that software-based FP8 emulation through bitwise packing…

  • Hugging Face Transformers in Action: Learning How To Leverage AI for NLP

    Hugging Face Transformers in Action: Learning How To Leverage AI for NLP A practical guide to Hugging Face Transformers and to how you can analyze your resumé sentiment in seconds with AI The post Hugging Face Transformers in Action: Learning How To Leverage AI for NLP appeared first on Towards Data Science. Gustavo Santos Go…

  • Exploring TabPFN: A Foundation Model Built for Tabular Data

    Exploring TabPFN: A Foundation Model Built for Tabular Data Understanding the architecture, training pipeline and implementing TabPFN in practice The post Exploring TabPFN: A Foundation Model Built for Tabular Data appeared first on Towards Data Science. Parul Pandey Go to original source

  • How IntelliNode Automates Complex Workflows with Vibe Agents

    How IntelliNode Automates Complex Workflows with Vibe Agents Many AI systems focus on isolated tasks or simple prompt engineering. This approach allowed us to build interesting applications from a single prompt, but we are starting to hit a limit. Simple prompting falls short when we tackle complex AI tasks that require multiple stages or enterprise…

  • Think Your Python Code Is Slow? Stop Guessing and Start Measuring

    Think Your Python Code Is Slow? Stop Guessing and Start Measuring A hands-on tour of using cProfile + SnakeViz to find (and fix) the “hot” paths in your code. The post Think Your Python Code Is Slow? Stop Guessing and Start Measuring appeared first on Towards Data Science. Thomas Reid Go to original source

  • How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard

    How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard A step-by-step guide from weather API ETL to dashboard on Databricks The post How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard appeared first on Towards Data Science. Gustavo Santos Go to…

  • Keeping Probabilities Honest: The Jacobian Adjustment

    Keeping Probabilities Honest: The Jacobian Adjustment An intuitive explanation of transforming random variables correctly. The post Keeping Probabilities Honest: The Jacobian Adjustment appeared first on Towards Data Science. Aniruddha Karajgi Go to original source

  • Why MAP and MRR Fail for Search Ranking (and What to Use Instead)

    Why MAP and MRR Fail for Search Ranking (and What to Use Instead) MAP and MRR look intuitive, but they quietly break ranking evaluation. Here’s why these metrics mislead—and how better alternatives fix it. The post Why MAP and MRR Fail for Search Ranking (and What to Use Instead) appeared first on Towards Data Science.…

  • Fast and Exact Least Absolute Deviations Line Fitting via Piecewise Affine Lower-Bounding

    Fast and Exact Least Absolute Deviations Line Fitting via Piecewise Affine Lower-Bounding arXiv:2512.20682v1 Announce Type: new Abstract: Least-absolute-deviations (LAD) line fitting is robust to outliers but computationally more involved than least squares regression. Although the literature includes linear and near-linear time algorithms for the LAD line fitting problem, these methods are difficult to implement and,…

  • Diffusion Models in Simulation-Based Inference: A Tutorial Review

    Diffusion Models in Simulation-Based Inference: A Tutorial Review arXiv:2512.20685v1 Announce Type: new Abstract: Diffusion models have recently emerged as powerful learners for simulation-based inference (SBI), enabling fast and accurate estimation of latent parameters from simulated and real data. Their score-based formulation offers a flexible way to learn conditional or joint distributions over parameters and observations,…

  • Weighted MCC: A Robust Measure of Multiclass Classifier Performance for Observations with Individual Weights

    Weighted MCC: A Robust Measure of Multiclass Classifier Performance for Observations with Individual Weights arXiv:2512.20811v1 Announce Type: new Abstract: Several performance measures are used to evaluate binary and multiclass classification tasks. But individual observations may often have distinct weights, and none of these measures are sensitive to such varying weights. We propose a new weighted…

  • Enhancing diffusion models with Gaussianization preprocessing

    Enhancing diffusion models with Gaussianization preprocessing arXiv:2512.21020v1 Announce Type: new Abstract: Diffusion models are a class of generative models that have demonstrated remarkable success in tasks such as image generation. However, one of the bottlenecks of these models is slow sampling due to the delay before the onset of trajectory bifurcation, at which point substantial…

  • Learning from Neighbors with PHIBP: Predicting Infectious Disease Dynamics in Data-Sparse Environments

    Learning from Neighbors with PHIBP: Predicting Infectious Disease Dynamics in Data-Sparse Environments arXiv:2512.21005v1 Announce Type: new Abstract: Modeling sparse count data, which arise across numerous scientific fields, presents significant statistical challenges. This chapter addresses these challenges in the context of infectious disease prediction, with a focus on predicting outbreaks in geographic regions that have historically…

  • The Machine Learning “Advent Calendar” Day 24: Transformers for Text in Excel

    The Machine Learning “Advent Calendar” Day 24: Transformers for Text in Excel An intuitive, step-by-step look at how Transformers use self-attention to turn static word embeddings into contextual representations, illustrated with simple examples and an Excel-friendly walkthrough. The post The Machine Learning “Advent Calendar” Day 24: Transformers for Text in Excel appeared first on Towards…

  • Is Your Model Time-Blind? The Case for Cyclical Feature Encoding

    Is Your Model Time-Blind? The Case for Cyclical Feature Encoding How cyclical encoding improves machine learning prediction The post Is Your Model Time-Blind? The Case for Cyclical Feature Encoding appeared first on Towards Data Science. Gustavo Santos Go to original source

  • 4 Techniques to Optimize AI Coding Efficiency

    4 Techniques to Optimize AI Coding Efficiency Learn how to code more effectively using AI The post 4 Techniques to Optimize AI Coding Efficiency appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

  • Bonferroni vs. Benjamini-Hochberg: Choosing Your P-Value Correction

    Bonferroni vs. Benjamini-Hochberg: Choosing Your P-Value Correction Multiple hypothesis testing, P-values, and Monte Carlo The post Bonferroni vs. Benjamini-Hochberg: Choosing Your P-Value Correction appeared first on Towards Data Science. Marco Hening Tallarico Go to original source

  • Robust Causal Directionality Inference in Quantum Inference under MNAR Observation and High-Dimensional Noise

    Robust Causal Directionality Inference in Quantum Inference under MNAR Observation and High-Dimensional Noise arXiv:2512.19746v1 Announce Type: new Abstract: In quantum mechanics, observation actively shapes the system, paralleling the statistical notion of Missing Not At Random (MNAR). This study introduces a unified framework for robust causal directionality inference in quantum engineering, determining whether relations are system→observation,…

  • Quasiprobabilistic Density Ratio Estimation with a Reverse Engineered Classification Loss Function

    Quasiprobabilistic Density Ratio Estimation with a Reverse Engineered Classification Loss Function arXiv:2512.19913v1 Announce Type: new Abstract: We consider a generalization of the classifier-based density-ratio estimation task to a quasiprobabilistic setting where probability densities can be negative. The problem with most loss functions used for this task is that they implicitly define a relationship between the…