Category: aimlds

Semiparametric KSD test: unifying score and distance-based approaches for goodness-of-fit testing

arXiv:2512.20007v1 Announce Type: new Abstract: Goodness-of-fit (GoF) tests are fundamental for assessing model adequacy. Score-based tests are appealing because they require fitting the model only once under the null. However, extending them to powerful nonparametric alternatives is difficult due to the lack of suitable…

December 24, 2025
Gaussian Process Assisted Meta-learning for Image Classification and Object Detection Models

arXiv:2512.20021v1 Announce Type: new Abstract: Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may not be…

December 24, 2025
Generative Bayesian Hyperparameter Tuning

arXiv:2512.20051v1 Announce Type: new Abstract: Hyper-parameter selection is a central practical problem in modern machine learning, governing regularization strength, model capacity, and robustness choices. Cross-validation is often computationally prohibitive at scale, while fully Bayesian hyper-parameter learning can be difficult due to the cost of posterior sampling. We develop a generative…

December 24, 2025
The Machine Learning “Advent Calendar” Day 23: CNN in Excel

A step-by-step 1D CNN for text, built in Excel, where every filter, weight, and decision is fully visible. The post appeared first on Towards Data Science. Author: angela shi.

December 24, 2025
How Agents Plan Tasks with To-Do Lists

Understanding the process behind agentic planning and task management in LangChain. The post appeared first on Towards Data Science. Author: Kenneth Leung.

December 24, 2025
Stop Retraining Blindly: Use PSI to Build a Smarter Monitoring Pipeline

A data scientist’s guide to population stability index (PSI). The post appeared first on Towards Data Science. Author: Gustavo Santos.

December 24, 2025
Synergy in Clicks: Harsanyi Dividends for E-Commerce

A brief overview of the math behind the Harsanyi Dividend and a real-world application in Streamlit. The post appeared first on Towards Data Science. Author: Jacob Ingle.

December 24, 2025
Sampling from multimodal distributions with warm starts: Non-asymptotic bounds for the Reweighted Annealed Leap-Point Sampler

arXiv:2512.17977v1 Announce Type: new Abstract: Sampling from multimodal distributions is a central challenge in Bayesian inference and machine learning. In light of hardness results for sampling (classical MCMC methods, even with tempering, can suffer from exponential mixing times)…

December 23, 2025
Causal Inference as Distribution Adaptation: Optimizing ATE Risk under Propensity Uncertainty

arXiv:2512.18083v1 Announce Type: new Abstract: Standard approaches to causal inference, such as Outcome Regression and Inverse Probability Weighted Regression Adjustment (IPWRA), are typically derived through the lens of missing data imputation and identification theory. In this work, we unify these methods from a Machine…

December 23, 2025
Unsupervised Feature Selection via Robust Autoencoder and Adaptive Graph Learning

arXiv:2512.18720v1 Announce Type: new Abstract: Effective feature selection is essential for high-dimensional data analysis and machine learning. Unsupervised feature selection (UFS) aims to simultaneously cluster data and identify the most discriminative features. Most existing UFS methods linearly project features into a pseudo-label space for clustering,…

December 23, 2025
On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction

arXiv:2512.18971v1 Announce Type: new Abstract: Identifying low-dimensional sufficient structures in nonlinear sufficient dimension reduction (SDR) has long been a fundamental yet challenging problem. Most existing methods lack theoretical guarantees of exhaustiveness in identifying lower dimensional structures, either at the population level or at the sample…

December 23, 2025
Cluster-Based Generalized Additive Models Informed by Random Fourier Features

arXiv:2512.19373v1 Announce Type: new Abstract: Explainable machine learning aims to strike a balance between prediction accuracy and model transparency, particularly in settings where black-box predictive models, such as deep neural networks or kernel-based methods, achieve strong empirical performance but remain difficult to interpret. This work introduces…

December 23, 2025
The Machine Learning “Advent Calendar” Day 22: Embeddings in Excel

Understanding text embeddings through simple models and Excel. The post appeared first on Towards Data Science. Author: angela shi.

December 23, 2025
The Machine Learning “Advent Calendar” Day 21: Gradient Boosted Decision Tree Regressor in Excel

Gradient descent in function space with decision trees. The post appeared first on Towards Data Science. Author: angela shi.

December 23, 2025
The Machine Learning “Advent Calendar” Day 20: Gradient Boosted Linear Regression in Excel

From Random Ensembles to Optimization: Gradient Boosting Explained. The post appeared first on Towards Data Science. Author: angela shi.

December 23, 2025
ChatLLM Presents a Streamlined Solution to Addressing the Real Bottleneck in AI

For the last couple of years, a lot of the conversation around AI has revolved around a single, deceptively simple question: Which model is the best? But the next question was always, the best for what? The best for reasoning? Writing? Coding? Or…

December 23, 2025
The Geometry of Laziness: What Angles Reveal About AI Hallucinations

A story about failing forward, spheres you can’t visualize, and why sometimes the math knows things before we do. The post appeared first on Towards Data Science. Author: Javier Marin.

December 23, 2025
Disentangled representations via score-based variational autoencoders

arXiv:2512.17127v1 Announce Type: new Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of…

December 22, 2025
Sharp Structure-Agnostic Lower Bounds for General Functional Estimation

arXiv:2512.17341v1 Announce Type: new Abstract: The design of efficient nonparametric estimators has long been a central problem in statistics, machine learning, and decision making. Classical optimal procedures often rely on strong structural assumptions, which can be misspecified in practice and complicate deployment. This limitation has sparked growing…

December 22, 2025
Generative modeling of conditional probability distributions on the level-sets of collective variables

arXiv:2512.17374v1 Announce Type: new Abstract: Given a probability distribution $\mu$ in $\mathbb{R}^d$ represented by data, we study in this paper the generative modeling of its conditional probability distributions on the level-sets of a collective variable $\xi: \mathbb{R}^d \rightarrow \mathbb{R}^k$, where $1 \le k…

December 22, 2025
Fast and Robust: Computationally Efficient Covariance Estimation for Sub-Weibull Vectors

arXiv:2512.17632v1 Announce Type: new Abstract: High-dimensional covariance estimation is notoriously sensitive to outliers. While statistically optimal estimators exist for general heavy-tailed distributions, they often rely on computationally expensive techniques like semidefinite programming or iterative M-estimation ($O(d^3)$). In this work, we target the specific regime of…

December 22, 2025
Perfect reconstruction of sparse signals using nonconvexity control and one-step RSB message passing

arXiv:2512.17426v1 Announce Type: new Abstract: We consider sparse signal reconstruction via minimization of the smoothly clipped absolute deviation (SCAD) penalty, and develop one-step replica-symmetry-breaking (1RSB) extensions of approximate message passing (AMP), termed 1RSB-AMP. Starting from the 1RSB formulation of belief propagation, we…

December 22, 2025
Weekly Entering & Transitioning – Thread 22 Dec, 2025 – 29 Dec, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

December 22, 2025
workforce moving to oversee

My company is investing more and more in its overseas workforce, mostly in India. For every one job posted in the U.S., there are about ten in India. Is my company an exception, or is this happening everywhere? Submitted by /u/Alarmed-Reporter-230.

December 22, 2025
A memory effecient TF-IDF project in Python to vectorize datasets large than RAM

Redesigned at the C++ level, this library can easily process datasets of around 100GB and beyond with as little as 4GB of memory. It does have its constraints, but the outputs are comparable to sklearn’s output. Project: fasttfidf. Submitted by /u/mrnerdy59.

December 22, 2025
New Data Science Team Lead struggling with aggressive PM on timelines and model expectations

I’m a data scientist who was recently promoted to be a data science team lead. Overall I enjoy the role, but I’m running into a recurring challenge with a very aggressive product manager (also a leader) that I’m not sure how…

December 22, 2025
How complex are your experiment setups?

Are you all also just running t-tests, or are yours more complex? How often do you run complex setups? I think my org wrongly runs only t-tests and does not understand the downsides of defaulting to them. Submitted by /u/ds_contractor.

December 22, 2025
How to Do Evals on a Bloated RAG Pipeline

Comparing metrics across datasets and models. The post appeared first on Towards Data Science. Author: Ida Silfverskiöld.

December 22, 2025
Tools for Your LLM: a Deep Dive into MCP

MCP is a key enabler for turning your LLM into an agent by providing it with tools to retrieve real-time information or perform actions. In this deep dive we cover how MCP works, when to use it, and what to watch out for.

December 22, 2025
Understanding the Generative AI User

What do regular technology users think (and know) about AI? The post appeared first on Towards Data Science. Author: Stephanie Kirmer.

December 21, 2025
EDA in Public (Part 2): Product Deep Dive & Time-Series Analysis in Pandas

Learn how to analyze product performance, extract time-series features, and uncover key seasonal trends in your sales data. The post appeared first on Towards Data Science. Author: Ibrahim Salami.

December 21, 2025
The Machine Learning “Advent Calendar” Day 19: Bagging in Excel

Understanding ensemble learning from first principles in Excel. The post appeared first on Towards Data Science. Author: angela shi.

December 20, 2025
Agentic AI Swarm Optimization using Artificial Bee Colonization (ABC)

Using Agentic AI prompts with the Artificial Bee Colony algorithm to enhance unsupervised clustering and optimization workflows. The post appeared first on Towards Data Science. Author: Gal Arav.

December 20, 2025
How I Optimized My Leaf Raking Strategy Using Linear Programming

From a weekend chore to a fun application of valuable operations research principles. The post appeared first on Towards Data Science. Author: Josiah DeValois.

December 20, 2025
Six Lessons Learned Building RAG Systems in Production

Best practices for data quality, retrieval design, and evaluation in production RAG systems. The post appeared first on Towards Data Science. Author: Sabrine Bendimerad.

December 20, 2025
2025 Must-Reads: Agents, Python, LLMs, and More

Don’t miss our most popular articles of the past year! The post appeared first on Towards Data Science. Author: TDS Editors.

December 20, 2025
BayesSum: Bayesian Quadrature in Discrete Spaces

arXiv:2512.16105v1 Announce Type: new Abstract: This paper addresses the challenging computational problem of estimating intractable expectations over discrete domains. Existing approaches, including Monte Carlo and Russian Roulette estimators, are consistent but often require a large number of samples to achieve accurate results. We propose a novel estimator, BayesSum, which…

December 19, 2025
DAG Learning from Zero-Inflated Count Data Using Continuous Optimization

arXiv:2512.16233v1 Announce Type: new Abstract: We address network structure learning from zero-inflated count data by casting each node as a zero-inflated generalized linear model and optimizing a smooth, score-based objective under a directed acyclic graph constraint. Our Zero-Inflated Continuous Optimization (ZICO) approach uses node-wise likelihoods with…

December 19, 2025
Advantages and limitations in the use of transfer learning for individual treatment effects in causal machine learning

arXiv:2512.16489v1 Announce Type: new Abstract: Generalizing causal knowledge across diverse environments is challenging, especially when estimates from large-scale datasets must be applied to smaller or systematically different contexts, where external validity is critical. Model-based estimators of individual treatment…

December 19, 2025
Riemannian Stochastic Interpolants for Amorphous Particle Systems

arXiv:2512.16607v1 Announce Type: new Abstract: Modern generative models hold great promise for accelerating diverse tasks involving the simulation of physical systems, but they must be adapted to the specific constraints of each domain. Significant progress has been made for biomolecules and crystalline materials. Here, we address amorphous materials…

December 19, 2025
On The Hidden Biases of Flow Matching Samplers

arXiv:2512.16768v1 Announce Type: new Abstract: We study the implicit bias of flow matching (FM) samplers via the lens of empirical flow matching. Although population FM may produce gradient-field velocities resembling optimal transport (OT), we show that the empirical FM minimizer is almost never a gradient field, even…

December 19, 2025
The Machine Learning “Advent Calendar” Day 18: Neural Network Classifier in Excel

Understanding forward propagation and backpropagation through explicit formulas. The post appeared first on Towards Data Science. Author: angela shi.

December 19, 2025
4 Ways to Supercharge Your Data Science Workflow with Google AI Studio

With concrete examples of using AI Studio Build mode to learn faster, prototype smarter, communicate clearer, and automate quicker. The post appeared first on Towards Data Science. Author: Shuai Guo.

December 19, 2025
The Subset Sum Problem Solved in Linear Time for Dense Enough Inputs

An optimal solution to the well-known NP-complete problem, when the input values are close enough to each other. The post appeared first on Towards Data Science. Author: Tigran Hayrapetyan.

December 19, 2025
Generating Artwork in Python Inspired by Hirst’s Million-Dollar Spots Painting

Using Python to generate art. The post appeared first on Towards Data Science. Author: Mahnoor Javed.

December 19, 2025
Online Partitioned Local Depth for semi-supervised applications

arXiv:2512.15436v1 Announce Type: new Abstract: We introduce an extension of the partitioned local depth (PaLD) algorithm that is adapted to online applications such as semi-supervised prediction. The new algorithm we present, online PaLD, is well-suited to situations where it is possible to pre-compute a cohesion network from…

December 18, 2025
A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point

arXiv:2512.15606v1 Announce Type: new Abstract: Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for…

December 18, 2025
High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations

arXiv:2512.15684v1 Announce Type: new Abstract: Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this…

December 18, 2025
Model inference for ranking from pairwise comparisons

arXiv:2512.15269v1 Announce Type: cross Abstract: We consider the problem of ranking objects from noisy pairwise comparisons, for example, ranking tennis players from the outcomes of matches. We follow a standard approach to this problem and assume that each object has an unobserved strength and that the outcome of…

December 18, 2025
A Bayesian latent class reinforcement learning framework to capture adaptive, feedback-driven travel behaviour

arXiv:2512.14713v1 Announce Type: cross Abstract: Many travel decisions involve a degree of experience formation, where individuals learn their preferences over time. At the same time, there is extensive scope for heterogeneity across individual travellers, both in their underlying preferences and in how…

December 18, 2025
A Practical Toolkit for Time Series Anomaly Detection, Using Python

Here’s how to detect point anomalies within each series, and identify anomalous signals across the whole bank. The post appeared first on Towards Data Science. Author: Piero Paialunga.

December 18, 2025
The Machine Learning “Advent Calendar” Day 17: Neural Network Regressor in Excel

Neural networks often feel like black boxes. In this article, we build a neural network regressor from scratch using only Excel formulas. By making every step explicit, from forward propagation to backpropagation, we show how a neural network learns to approximate non-linear functions…

December 18, 2025
Production-Grade Observability for AI Agents: A Minimal-Code, Configuration-First Approach

LLM-as-a-Judge, regression testing, and end-to-end traceability of multi-agent LLM systems. The post appeared first on Towards Data Science. Author: Partha Sarkar.

December 18, 2025
3 Techniques to Effectively Utilize AI Agents for Coding

Learn how to be an effective engineer with coding agents. The post appeared first on Towards Data Science. Author: Eivind Kjosbakken.

December 18, 2025
Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics

arXiv:2512.13997v1 Announce Type: new Abstract: Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power.…

December 17, 2025
One Permutation Is All You Need: Fast, Reliable Variable Importance and Model Stress-Testing

arXiv:2512.13892v1 Announce Type: new Abstract: Reliable estimation of feature contributions in machine learning models is essential for trust, transparency and regulatory compliance, especially when models are proprietary or otherwise operate as black boxes. While permutation-based methods are a standard tool for this…

December 17, 2025
On the Hardness of Conditional Independence Testing In Practice

arXiv:2512.14000v1 Announce Type: new Abstract: Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample…

December 17, 2025
Weighted Conformal Prediction Provides Adaptive and Valid Mask-Conditional Coverage for General Missing Data Mechanisms

arXiv:2512.14221v1 Announce Type: new Abstract: Conformal prediction (CP) offers a principled framework for uncertainty quantification, but it fails to guarantee coverage when faced with missing covariates. In addressing the heterogeneity induced by various missing patterns, Mask-Conditional Valid (MCV) Coverage has emerged…

December 17, 2025
Improving the Accuracy of Amortized Model Comparison with Self-Consistency

arXiv:2512.14308v1 Announce Type: new Abstract: Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (generative scope…

December 17, 2025
When (Not) to Use Vector DB

When indexing hurts more than it helps: how we realized our RAG use case needed a key-value store, not a vector database. The post appeared first on Towards Data Science. Author: Uri Peled.

December 17, 2025
Separate Numbers and Text in One Column Using Power Query

An Excel sheet with a column containing numbers and text? What a mess! The post appeared first on Towards Data Science. Author: Salvatore Cagliari.

December 17, 2025
The Machine Learning “Advent Calendar” Day 16: Kernel Trick in Excel

Kernel SVM often feels abstract, with kernels, dual formulations, and support vectors. In this article, we take a different path. Starting from Kernel Density Estimation, we build Kernel SVM step by step as a sum of local bells, weighted and selected by hinge loss,…

December 17, 2025
Lessons Learned After 8 Years of Machine Learning

Deep work, over-identification, sports, and blogging. The post appeared first on Towards Data Science. Author: Pascal Janetzky.

December 17, 2025
Interval Fisher’s Discriminant Analysis and Visualisation

arXiv:2512.11945v1 Announce Type: new Abstract: In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher’s Discriminant Analysis to interval-valued data, using Moore’s…

December 16, 2025
Hellinger loss function for Generative Adversarial Networks

arXiv:2512.12267v1 Announce Type: new Abstract: We propose Hellinger-type loss functions for training Generative Adversarial Networks (GANs), motivated by the boundedness, symmetry, and robustness properties of the Hellinger distance. We define an adversarial objective based on this divergence and study its statistical properties within a general parametric framework. We…

December 16, 2025
Co-Hub Node Based Multiview Graph Learning with Theoretical Guarantees

arXiv:2512.12435v1 Announce Type: new Abstract: Identifying the graphical structure underlying the observed multivariate data is essential in numerous applications. Current methodologies are predominantly confined to deducing a singular graph under the presumption that the observed data are uniform. However, many contexts involve heterogeneous datasets that feature…

December 16, 2025
Towards a pretrained deep learning estimator of the Linfoot informational correlation

arXiv:2512.12358v1 Announce Type: new Abstract: We develop a supervised deep-learning approach to estimate mutual information between two continuous random variables. As labels, we use the Linfoot informational correlation, a transformation of mutual information that has many important properties. Our method is based on ground…

December 16, 2025
Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data

arXiv:2512.12442v1 Announce Type: new Abstract: Almost all scientific data have uncertainties originating from different sources. Gaussian process regression (GPR) models are a natural way to model data with Gaussian-distributed uncertainties. GPR also has the benefit of reducing I/O bandwidth and storage requirements for large scientific simulations.…

December 16, 2025
The Machine Learning “Advent Calendar” Day 15: SVM in Excel

Instead of starting with margins and geometry, this article builds the Support Vector Machine step by step from familiar models. By changing the loss function and reusing regularization, SVM appears naturally as a linear classifier trained by optimization. This perspective unifies logistic regression, SVM, and…

December 16, 2025
6 Technical Skills That Make You a Senior Data Scientist

Beyond writing code, these are the design-level decisions, trade-offs, and habits that quietly separate senior data scientists from everyone else. The post appeared first on Towards Data Science. Author: Piero Paialunga.

December 16, 2025
Geospatial exploratory data analysis with GeoPandas and DuckDB

Geospatial exploratory data analysis with GeoPandas and DuckDB In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through…

December 16, 2025
Lessons Learned from Upgrading to LangChain 1.0 in Production

Lessons Learned from Upgrading to LangChain 1.0 in Production What worked, what broke, and why I did it. The post Lessons Learned from Upgrading to LangChain 1.0 in Production appeared first on Towards Data Science. Clara Chong

December 16, 2025
STARK denoises spatial transcriptomics images via adaptive regularization

STARK denoises spatial transcriptomics images via adaptive regularization arXiv:2512.10994v1 Announce Type: new Abstract: We present an approach to denoising spatial transcriptomics images that is particularly effective for uncovering cell identities in the regime of ultra-low sequencing depths, and also allows for interpolation of gene expression. The method — Spatial Transcriptomics via Adaptive Regularization and Kernels…

December 15, 2025
An Efficient Variant of One-Class SVM with Lifelong Online Learning Guarantees

An Efficient Variant of One-Class SVM with Lifelong Online Learning Guarantees arXiv:2512.11052v1 Announce Type: new Abstract: We study outlier (a.k.a., anomaly) detection for single-pass non-stationary streaming data. In the well-studied offline or batch outlier detection problem, traditional methods such as kernel One-Class SVM (OCSVM) are both computationally heavy and prone to large false-negative (Type II)…

December 15, 2025
Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

Provable Recovery of Locally Important Signed Features and Interactions from Random Forest arXiv:2512.11081v1 Announce Type: new Abstract: Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are…

December 15, 2025
TPV: Parameter Perturbations Through the Lens of Test Prediction Variance

TPV: Parameter Perturbations Through the Lens of Test Prediction Variance arXiv:2512.11089v1 Announce Type: new Abstract: We identify test prediction variance (TPV) — the first-order sensitivity of model outputs to parameter perturbations around a trained solution — as a unifying quantity that links several classical observations about generalization in deep networks. TPV is a fully label-free…

December 15, 2025
Data-Driven Model Reduction using WeldNet: Windowed Encoders for Learning Dynamics

Data-Driven Model Reduction using WeldNet: Windowed Encoders for Learning Dynamics arXiv:2512.11090v1 Announce Type: new Abstract: Many problems in science and engineering involve time-dependent, high dimensional datasets arising from complex physical processes, which are costly to simulate. In this work, we propose WeldNet: Windowed Encoders for Learning Dynamics, a data-driven nonlinear model reduction framework to build…

December 15, 2025
Weekly Entering & Transitioning – Thread 15 Dec, 2025 – 22 Dec, 2025

Weekly Entering & Transitioning – Thread 15 Dec, 2025 – 22 Dec, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

December 15, 2025
I got three offers from a two month job search – here’s what I wish I knew earlier

I got three offers from a two month job search – here’s what I wish I knew earlier There’s a lot of doom and gloom on reddit and elsewhere about the current state of the job market. And yes, it’s bad. But reading all these stories of people going months and years without getting a…

December 15, 2025
Has anyone tried training models on raw discussions instead of curated datasets?

Has anyone tried training models on raw discussions instead of curated datasets? I’ve always followed the usual advice when training models: clean the data, normalize everything, remove noise, structure it nicely. Recently I tried something different. Instead of polished datasets, I fed models long, messy discussion threads, real conversations, people arguing, correcting themselves, misunderstanding…

December 15, 2025
While 72% of Executives Back AI, Public Trust Is Tanking

While 72% of Executives Back AI, Public Trust Is Tanking submitted by /u/disforwork

December 15, 2025
Gemini Deep Research: Autonomous Intelligence for Enterprise Research

Gemini Deep Research: Autonomous Intelligence for Enterprise Research submitted by /u/WarChampion90

December 15, 2025
The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel

The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel Softmax Regression is simply Logistic Regression extended to multiple classes. By computing one linear score per class and normalizing them with Softmax, we obtain multiclass probabilities without changing the core logic. The loss, the gradients, and the optimization remain the same. Only the number…

December 15, 2025
The Skills That Bridge Technical Work and Business Impact

The Skills That Bridge Technical Work and Business Impact In the Author Spotlight series, TDS Editors chat with members of our community about their career path in data science and AI, their writing, and their sources of inspiration. Today, we’re thrilled to share our conversation with Maria Mouschoutzi. Maria is a Data Analyst and Project…

December 15, 2025
Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case

Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case Introduction If you work in data science, data engineering, or as a frontend/backend developer, you deal with JSON. For professionals, it’s basically only death, taxes, and JSON parsing that are inevitable. The issue is that parsing JSON is often a serious pain. Whether you are…

December 15, 2025
The Machine Learning “Advent Calendar” Day 13: LASSO and Ridge Regression in Excel

The Machine Learning “Advent Calendar” Day 13: LASSO and Ridge Regression in Excel Ridge and Lasso regression are often perceived as more complex versions of linear regression. In reality, the prediction model remains exactly the same. What changes is the training objective. By adding a penalty on the coefficients, regularization forces the model to choose…

December 14, 2025
How to Increase Coding Iteration Speed

How to Increase Coding Iteration Speed Learn how to become a more efficient programmer with local testing. The post How to Increase Coding Iteration Speed appeared first on Towards Data Science. Eivind Kjosbakken

December 14, 2025
NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating

NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating This one little trick can bring about enhanced training stability, the use of larger learning rates, and improved scaling properties. The post NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating appeared first on Towards Data Science. Sean Moran

December 14, 2025
The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel

The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel In this article, we rebuild Logistic Regression step by step directly in Excel. Starting from a binary dataset, we explore why linear regression struggles as a classifier, how the logistic function fixes these issues, and how log-loss naturally appears from the likelihood. With a…

December 13, 2025
Decentralized Computation: The Hidden Principle Behind Deep Learning

Decentralized Computation: The Hidden Principle Behind Deep Learning Most breakthroughs in deep learning — from simple neural networks to large language models — are built upon a principle that is much older than AI itself: decentralization. Instead of relying on a powerful “central planner” coordinating and commanding the behaviors of other components, modern deep-learning-based AI…

December 13, 2025
EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas

EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas Hey everyone! Welcome to the start of a major data journey that I’m calling “EDA in Public.” For those who know me, I believe the best way to learn anything is to tackle a real-world problem and share the entire messy process — including mistakes, victories,…

December 13, 2025
Spectral Community Detection in Clinical Knowledge Graphs

Spectral Community Detection in Clinical Knowledge Graphs Introduction How do we identify latent groups of patients in a large cohort? How can we find similarities among patients that go beyond the well-known comorbidity clusters associated with specific diseases? And more importantly, how can we extract quantitative signals that can be analyzed, compared, and reused across…

December 13, 2025
LxCIM: a new rank-based binary classifier performance metric invariant to local exchange of classes

LxCIM: a new rank-based binary classifier performance metric invariant to local exchange of classes arXiv:2512.10053v1 Announce Type: new Abstract: Binary classification is one of the oldest, most prevalent, and studied problems in machine learning. However, the metrics used to evaluate model performance have received comparatively little attention. The area under the receiver operating characteristic curve…

December 12, 2025
The Interplay of Statistics and Noisy Optimization: Learning Linear Predictors with Random Data Weights

The Interplay of Statistics and Noisy Optimization: Learning Linear Predictors with Random Data Weights arXiv:2512.10188v1 Announce Type: new Abstract: We analyze gradient descent with randomly weighted data points in a linear regression model, under a generic weighting distribution. This includes various forms of stochastic gradient descent, importance sampling, but also extends to weighting distributions with…

December 12, 2025
Diffusion differentiable resampling

Diffusion differentiable resampling arXiv:2512.10401v1 Announce Type: new Abstract: This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). We propose a new informative resampling method that is instantly pathwise differentiable, based on an ensemble score diffusion model. We prove that our diffusion resampling method provides a consistent estimate…

December 12, 2025
Error Analysis of Generalized Langevin Equations with Approximated Memory Kernels

Error Analysis of Generalized Langevin Equations with Approximated Memory Kernels arXiv:2512.10256v1 Announce Type: new Abstract: We analyze prediction error in stochastic dynamical systems with memory, focusing on generalized Langevin equations (GLEs) formulated as stochastic Volterra equations. We establish that, under a strongly convex potential, trajectory discrepancies decay at a rate determined by the decay of…

December 12, 2025
Supervised Learning of Random Neural Architectures Structured by Latent Random Fields on Compact Boundaryless Multiply-Connected Manifolds

Supervised Learning of Random Neural Architectures Structured by Latent Random Fields on Compact Boundaryless Multiply-Connected Manifolds arXiv:2512.10407v1 Announce Type: new Abstract: This paper introduces a new probabilistic framework for supervised learning in neural systems. It is designed to model complex, uncertain systems whose random outputs are strongly non-Gaussian given deterministic inputs. The architecture itself is…

December 12, 2025
The Machine Learning “Advent Calendar” Day 11: Linear Regression in Excel

The Machine Learning “Advent Calendar” Day 11: Linear Regression in Excel Linear Regression looks simple, but it introduces the core ideas of modern machine learning: loss functions, optimization, gradients, scaling, and interpretation. In this article, we rebuild Linear Regression in Excel, compare the closed-form solution with Gradient Descent, and see how the coefficients evolve step…

December 12, 2025
Drawing Shapes with the Python Turtle Module

Drawing Shapes with the Python Turtle Module A step-by-step tutorial that explores the Python Turtle Module. The post Drawing Shapes with the Python Turtle Module appeared first on Towards Data Science. Mahnoor Javed

December 12, 2025
7 Pandas Performance Tricks Every Data Scientist Should Know

7 Pandas Performance Tricks Every Data Scientist Should Know What I’ve learned about making Pandas faster after too many slow notebooks and frozen sessions. The post 7 Pandas Performance Tricks Every Data Scientist Should Know appeared first on Towards Data Science. Benjamin Nweke

December 12, 2025