Category: aimlds

  • Semiparametric KSD test: unifying score and distance-based approaches for goodness-of-fit testing

    arXiv:2512.20007v1 Announce Type: new Abstract: Goodness-of-fit (GoF) tests are fundamental for assessing model adequacy. Score-based tests are appealing because they require fitting the model only once under the null. However, extending them to powerful nonparametric alternatives is difficult due to the lack of suitable…

  • Gaussian Process Assisted Meta-learning for Image Classification and Object Detection Models

    arXiv:2512.20021v1 Announce Type: new Abstract: Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may not be…

  • Generative Bayesian Hyperparameter Tuning

    arXiv:2512.20051v1 Announce Type: new Abstract: Hyper-parameter selection is a central practical problem in modern machine learning, governing regularization strength, model capacity, and robustness choices. Cross-validation is often computationally prohibitive at scale, while fully Bayesian hyper-parameter learning can be difficult due to the cost of posterior sampling. We develop a generative…

  • The Machine Learning “Advent Calendar” Day 23: CNN in Excel

    A step-by-step 1D CNN for text, built in Excel, where every filter, weight, and decision is fully visible. The post The Machine Learning “Advent Calendar” Day 23: CNN in Excel appeared first on Towards Data Science. angela shi

  • How Agents Plan Tasks with To-Do Lists

    Understanding the process behind agentic planning and task management in LangChain. The post How Agents Plan Tasks with To-Do Lists appeared first on Towards Data Science. Kenneth Leung

  • Stop Retraining Blindly: Use PSI to Build a Smarter Monitoring Pipeline

    A data scientist’s guide to population stability index (PSI). The post Stop Retraining Blindly: Use PSI to Build a Smarter Monitoring Pipeline appeared first on Towards Data Science. Gustavo Santos

  • Synergy in Clicks: Harsanyi Dividends for E-Commerce

    A brief overview of the math behind the Harsanyi Dividend and a real-world application in Streamlit. The post Synergy in Clicks: Harsanyi Dividends for E-Commerce appeared first on Towards Data Science. Jacob Ingle

  • Sampling from multimodal distributions with warm starts: Non-asymptotic bounds for the Reweighted Annealed Leap-Point Sampler

    arXiv:2512.17977v1 Announce Type: new Abstract: Sampling from multimodal distributions is a central challenge in Bayesian inference and machine learning. In light of hardness results for sampling — classical MCMC methods, even with tempering, can suffer from exponential mixing times —…

  • Causal Inference as Distribution Adaptation: Optimizing ATE Risk under Propensity Uncertainty

    arXiv:2512.18083v1 Announce Type: new Abstract: Standard approaches to causal inference, such as Outcome Regression and Inverse Probability Weighted Regression Adjustment (IPWRA), are typically derived through the lens of missing data imputation and identification theory. In this work, we unify these methods from a Machine…

  • Unsupervised Feature Selection via Robust Autoencoder and Adaptive Graph Learning

    arXiv:2512.18720v1 Announce Type: new Abstract: Effective feature selection is essential for high-dimensional data analysis and machine learning. Unsupervised feature selection (UFS) aims to simultaneously cluster data and identify the most discriminative features. Most existing UFS methods linearly project features into a pseudo-label space for clustering,…

  • On Conditional Stochastic Interpolation for Generative Nonlinear Sufficient Dimension Reduction

    arXiv:2512.18971v1 Announce Type: new Abstract: Identifying low-dimensional sufficient structures in nonlinear sufficient dimension reduction (SDR) has long been a fundamental yet challenging problem. Most existing methods lack theoretical guarantees of exhaustiveness in identifying lower dimensional structures, either at the population level or at the sample…

  • Cluster-Based Generalized Additive Models Informed by Random Fourier Features

    arXiv:2512.19373v1 Announce Type: new Abstract: Explainable machine learning aims to strike a balance between prediction accuracy and model transparency, particularly in settings where black-box predictive models, such as deep neural networks or kernel-based methods, achieve strong empirical performance but remain difficult to interpret. This work introduces…

  • The Machine Learning “Advent Calendar” Day 22: Embeddings in Excel

    Understanding text embeddings through simple models and Excel. The post The Machine Learning “Advent Calendar” Day 22: Embeddings in Excel appeared first on Towards Data Science. angela shi

  • The Machine Learning “Advent Calendar” Day 21: Gradient Boosted Decision Tree Regressor in Excel

    Gradient descent in function space with decision trees. The post The Machine Learning “Advent Calendar” Day 21: Gradient Boosted Decision Tree Regressor in Excel appeared first on Towards Data Science. angela shi

  • The Machine Learning “Advent Calendar” Day 20: Gradient Boosted Linear Regression in Excel

    From Random Ensembles to Optimization: Gradient Boosting Explained. The post The Machine Learning “Advent Calendar” Day 20: Gradient Boosted Linear Regression in Excel appeared first on Towards Data Science. angela shi

  • ChatLLM Presents a Streamlined Solution to Addressing the Real Bottleneck in AI

    For the last couple of years, a lot of the conversation around AI has revolved around a single, deceptively simple question: Which model is the best? But the next question was always, the best for what? The best for reasoning? Writing? Coding? Or…

  • The Geometry of Laziness: What Angles Reveal About AI Hallucinations

    A story about failing forward, spheres you can’t visualize, and why sometimes the math knows things before we do. The post The Geometry of Laziness: What Angles Reveal About AI Hallucinations appeared first on Towards Data Science. Javier Marin

  • Disentangled representations via score-based variational autoencoders

    arXiv:2512.17127v1 Announce Type: new Abstract: We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of…

  • Sharp Structure-Agnostic Lower Bounds for General Functional Estimation

    arXiv:2512.17341v1 Announce Type: new Abstract: The design of efficient nonparametric estimators has long been a central problem in statistics, machine learning, and decision making. Classical optimal procedures often rely on strong structural assumptions, which can be misspecified in practice and complicate deployment. This limitation has sparked growing…

  • Generative modeling of conditional probability distributions on the level-sets of collective variables

    arXiv:2512.17374v1 Announce Type: new Abstract: Given a probability distribution $\mu$ in $\mathbb{R}^d$ represented by data, we study in this paper the generative modeling of its conditional probability distributions on the level-sets of a collective variable $\xi: \mathbb{R}^d \rightarrow \mathbb{R}^k$, where $1 \le k…

  • Fast and Robust: Computationally Efficient Covariance Estimation for Sub-Weibull Vectors

    arXiv:2512.17632v1 Announce Type: new Abstract: High-dimensional covariance estimation is notoriously sensitive to outliers. While statistically optimal estimators exist for general heavy-tailed distributions, they often rely on computationally expensive techniques like semidefinite programming or iterative M-estimation ($O(d^3)$). In this work, we target the specific regime of…

  • Perfect reconstruction of sparse signals using nonconvexity control and one-step RSB message passing

    arXiv:2512.17426v1 Announce Type: new Abstract: We consider sparse signal reconstruction via minimization of the smoothly clipped absolute deviation (SCAD) penalty, and develop one-step replica-symmetry-breaking (1RSB) extensions of approximate message passing (AMP), termed 1RSB-AMP. Starting from the 1RSB formulation of belief propagation, we…

  • Weekly Entering & Transitioning – Thread 22 Dec, 2025 – 29 Dec, 2025

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • workforce moving to oversee

    My company is investing more and more in its overseas workforce, mostly in India. For every one job posted in the U.S., there are about ten in India. Is my company an exception, or is this happening everywhere? submitted by /u/Alarmed-Reporter-230

  • A memory effecient TF-IDF project in Python to vectorize datasets large than RAM

    Redesigned at the C++ level, this library can easily process datasets of around 100GB and beyond with as little as 4GB of memory. It does have its constraints, but the outputs are comparable to sklearn’s output. fasttfidf submitted by /u/mrnerdy59

  • New Data Science Team Lead struggling with aggressive PM on timelines and model expectations

    I’m a data scientist who was recently promoted to be a data science team lead. Overall I enjoy the role, but I’m running into a recurring challenge with a very aggressive product manager (also a leader) that I’m not sure how…

  • How complex are your experiment setups?

    Are you all also just running t tests or are yours more complex? How often do you run complex setups? I think my org wrongly only runs t tests and are not understanding of the downfalls of defaulting to those. submitted by /u/ds_contractor

  • How to Do Evals on a Bloated RAG Pipeline

    Comparing metrics across datasets and models. The post How to Do Evals on a Bloated RAG Pipeline appeared first on Towards Data Science. Ida Silfverskiöld

  • Tools for Your LLM: a Deep Dive into MCP

    MCP is a key enabler into turning your LLM into an agent by providing it with tools to retrieve real-time information or perform actions. In this deep dive we cover how MCP works, when to use it, and what to watch out for. The post Tools…

  • Understanding the Generative AI User

    What do regular technology users think (and know) about AI? The post Understanding the Generative AI User appeared first on Towards Data Science. Stephanie Kirmer

  • EDA in Public (Part 2): Product Deep Dive & Time-Series Analysis in Pandas

    Learn how to analyze product performance, extract time-series features, and uncover key seasonal trends in your sales data. The post EDA in Public (Part 2): Product Deep Dive & Time-Series Analysis in Pandas appeared first on Towards Data Science. Ibrahim Salami

  • The Machine Learning “Advent Calendar” Day 19: Bagging in Excel

    Understanding ensemble learning from first principles in Excel. The post The Machine Learning “Advent Calendar” Day 19: Bagging in Excel appeared first on Towards Data Science. angela shi

  • Agentic AI Swarm Optimization using Artificial Bee Colonization (ABC)

    Using Agentic AI prompts with the Artificial Bee Colony algorithm to enhance unsupervised clustering and optimization workflows. The post Agentic AI Swarm Optimization using Artificial Bee Colonization (ABC) appeared first on Towards Data Science. Gal Arav

  • How I Optimized My Leaf Raking Strategy Using Linear Programming

    From a weekend chore to a fun application of valuable operations research principles. The post How I Optimized My Leaf Raking Strategy Using Linear Programming appeared first on Towards Data Science. Josiah DeValois

  • Six Lessons Learned Building RAG Systems in Production

    Best practices for data quality, retrieval design, and evaluation in production RAG systems. The post Six Lessons Learned Building RAG Systems in Production appeared first on Towards Data Science. Sabrine Bendimerad

  • 2025 Must-Reads: Agents, Python, LLMs, and More

    Don’t miss our most popular articles of the past year! The post 2025 Must-Reads: Agents, Python, LLMs, and More appeared first on Towards Data Science. TDS Editors

  • BayesSum: Bayesian Quadrature in Discrete Spaces

    arXiv:2512.16105v1 Announce Type: new Abstract: This paper addresses the challenging computational problem of estimating intractable expectations over discrete domains. Existing approaches, including Monte Carlo and Russian Roulette estimators, are consistent but often require a large number of samples to achieve accurate results. We propose a novel estimator, BayesSum, which…

  • DAG Learning from Zero-Inflated Count Data Using Continuous Optimization

    arXiv:2512.16233v1 Announce Type: new Abstract: We address network structure learning from zero-inflated count data by casting each node as a zero-inflated generalized linear model and optimizing a smooth, score-based objective under a directed acyclic graph constraint. Our Zero-Inflated Continuous Optimization (ZICO) approach uses node-wise likelihoods with…

  • Advantages and limitations in the use of transfer learning for individual treatment effects in causal machine learning

    arXiv:2512.16489v1 Announce Type: new Abstract: Generalizing causal knowledge across diverse environments is challenging, especially when estimates from large-scale datasets must be applied to smaller or systematically different contexts, where external validity is critical. Model-based estimators of individual treatment…

  • Riemannian Stochastic Interpolants for Amorphous Particle Systems

    arXiv:2512.16607v1 Announce Type: new Abstract: Modern generative models hold great promise for accelerating diverse tasks involving the simulation of physical systems, but they must be adapted to the specific constraints of each domain. Significant progress has been made for biomolecules and crystalline materials. Here, we address amorphous materials…

  • On The Hidden Biases of Flow Matching Samplers

    arXiv:2512.16768v1 Announce Type: new Abstract: We study the implicit bias of flow matching (FM) samplers via the lens of empirical flow matching. Although population FM may produce gradient-field velocities resembling optimal transport (OT), we show that the empirical FM minimizer is almost never a gradient field, even…

  • The Machine Learning “Advent Calendar” Day 18: Neural Network Classifier in Excel

    Understanding forward propagation and backpropagation through explicit formulas. The post The Machine Learning “Advent Calendar” Day 18: Neural Network Classifier in Excel appeared first on Towards Data Science. angela shi

  • 4 Ways to Supercharge Your Data Science Workflow with Google AI Studio

    With concrete examples of using AI Studio Build mode to learn faster, prototype smarter, communicate clearer, and automate quicker. The post 4 Ways to Supercharge Your Data Science Workflow with Google AI Studio appeared first on Towards Data Science. Shuai Guo

  • The Subset Sum Problem Solved in Linear Time for Dense Enough Inputs

    An optimal solution to the well-known NP-complete problem, when the input values are close enough to each other. The post The Subset Sum Problem Solved in Linear Time for Dense Enough Inputs appeared first on Towards Data Science. Tigran Hayrapetyan

  • Generating Artwork in Python Inspired by Hirst’s Million-Dollar Spots Painting

    Using Python to generate art. The post Generating Artwork in Python Inspired by Hirst’s Million-Dollar Spots Painting appeared first on Towards Data Science. Mahnoor Javed

  • Online Partitioned Local Depth for semi-supervised applications

    arXiv:2512.15436v1 Announce Type: new Abstract: We introduce an extension of the partitioned local depth (PaLD) algorithm that is adapted to online applications such as semi-supervised prediction. The new algorithm we present, online PaLD, is well-suited to situations where it is possible to pre-compute a cohesion network from…

  • A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point

    arXiv:2512.15606v1 Announce Type: new Abstract: Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for…

  • High-Dimensional Partial Least Squares: Spectral Analysis and Fundamental Limitations

    arXiv:2512.15684v1 Announce Type: new Abstract: Partial Least Squares (PLS) is a widely used method for data integration, designed to extract latent components shared across paired high-dimensional datasets. Despite decades of practical success, a precise theoretical understanding of its behavior in high-dimensional regimes remains limited. In this…

  • Model inference for ranking from pairwise comparisons

    arXiv:2512.15269v1 Announce Type: cross Abstract: We consider the problem of ranking objects from noisy pairwise comparisons, for example, ranking tennis players from the outcomes of matches. We follow a standard approach to this problem and assume that each object has an unobserved strength and that the outcome of…

  • A Bayesian latent class reinforcement learning framework to capture adaptive, feedback-driven travel behaviour

    arXiv:2512.14713v1 Announce Type: cross Abstract: Many travel decisions involve a degree of experience formation, where individuals learn their preferences over time. At the same time, there is extensive scope for heterogeneity across individual travellers, both in their underlying preferences and in how…

  • A Practical Toolkit for Time Series Anomaly Detection, Using Python

    Here’s how to detect point anomalies within each series, and identify anomalous signals across the whole bank. The post A Practical Toolkit for Time Series Anomaly Detection, Using Python appeared first on Towards Data Science. Piero Paialunga

  • The Machine Learning “Advent Calendar” Day 17: Neural Network Regressor in Excel

    Neural networks often feel like black boxes. In this article, we build a neural network regressor from scratch using only Excel formulas. By making every step explicit, from forward propagation to backpropagation, we show how a neural network learns to approximate non-linear functions…

  • Production-Grade Observability for AI Agents: A Minimal-Code, Configuration-First Approach

    LLM-as-a-Judge, regression testing, and end-to-end traceability of multi-agent LLM systems. The post Production-Grade Observability for AI Agents: A Minimal-Code, Configuration-First Approach appeared first on Towards Data Science. Partha Sarkar

  • 3 Techniques to Effectively Utilize AI Agents for Coding

    Learn how to be an effective engineer with coding agents. The post 3 Techniques to Effectively Utilize AI Agents for Coding appeared first on Towards Data Science. Eivind Kjosbakken

  • Maximum Mean Discrepancy with Unequal Sample Sizes via Generalized U-Statistics

    arXiv:2512.13997v1 Announce Type: new Abstract: Existing two-sample testing techniques, particularly those based on choosing a kernel for the Maximum Mean Discrepancy (MMD), often assume equal sample sizes from the two distributions. Applying these methods in practice can require discarding valuable data, unnecessarily reducing test power.…

  • One Permutation Is All You Need: Fast, Reliable Variable Importance and Model Stress-Testing

    arXiv:2512.13892v1 Announce Type: new Abstract: Reliable estimation of feature contributions in machine learning models is essential for trust, transparency and regulatory compliance, especially when models are proprietary or otherwise operate as black boxes. While permutation-based methods are a standard tool for this…

  • On the Hardness of Conditional Independence Testing In Practice

    arXiv:2512.14000v1 Announce Type: new Abstract: Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample…

  • Weighted Conformal Prediction Provides Adaptive and Valid Mask-Conditional Coverage for General Missing Data Mechanisms

    arXiv:2512.14221v1 Announce Type: new Abstract: Conformal prediction (CP) offers a principled framework for uncertainty quantification, but it fails to guarantee coverage when faced with missing covariates. In addressing the heterogeneity induced by various missing patterns, Mask-Conditional Valid (MCV) Coverage has emerged…

  • Improving the Accuracy of Amortized Model Comparison with Self-Consistency

    arXiv:2512.14308v1 Announce Type: new Abstract: Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (generative scope…

  • When (Not) to Use Vector DB

    When indexing hurts more than it helps: how we realized our RAG use case needed a key-value store, not a vector database. The post When (Not) to Use Vector DB appeared first on Towards Data Science. Uri Peled

  • Separate Numbers and Text in One Column Using Power Query

    An Excel sheet with a column containing numbers and text? What a mess! The post Separate Numbers and Text in One Column Using Power Query appeared first on Towards Data Science. Salvatore Cagliari

  • The Machine Learning “Advent Calendar” Day 16: Kernel Trick in Excel

    Kernel SVM often feels abstract, with kernels, dual formulations, and support vectors. In this article, we take a different path. Starting from Kernel Density Estimation, we build Kernel SVM step by step as a sum of local bells, weighted and selected by hinge loss,…

  • Lessons Learned After 8 Years of Machine Learning

    Deep work, over-identification, sports, and blogging. The post Lessons Learned After 8 Years of Machine Learning appeared first on Towards Data Science. Pascal Janetzky

  • Interval Fisher’s Discriminant Analysis and Visualisation

    arXiv:2512.11945v1 Announce Type: new Abstract: In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher’s Discriminant Analysis to interval-valued data, using Moore’s…

  • Hellinger loss function for Generative Adversarial Networks

    arXiv:2512.12267v1 Announce Type: new Abstract: We propose Hellinger-type loss functions for training Generative Adversarial Networks (GANs), motivated by the boundedness, symmetry, and robustness properties of the Hellinger distance. We define an adversarial objective based on this divergence and study its statistical properties within a general parametric framework. We…

  • Co-Hub Node Based Multiview Graph Learning with Theoretical Guarantees

    arXiv:2512.12435v1 Announce Type: new Abstract: Identifying the graphical structure underlying the observed multivariate data is essential in numerous applications. Current methodologies are predominantly confined to deducing a singular graph under the presumption that the observed data are uniform. However, many contexts involve heterogeneous datasets that feature…

  • Towards a pretrained deep learning estimator of the Linfoot informational correlation

    arXiv:2512.12358v1 Announce Type: new Abstract: We develop a supervised deep-learning approach to estimate mutual information between two continuous random variables. As labels, we use the Linfoot informational correlation, a transformation of mutual information that has many important properties. Our method is based on ground…

  • Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data

    arXiv:2512.12442v1 Announce Type: new Abstract: Almost all scientific data have uncertainties originating from different sources. Gaussian process regression (GPR) models are a natural way to model data with Gaussian-distributed uncertainties. GPR also has the benefit of reducing I/O bandwidth and storage requirements for large scientific simulations.…

  • The Machine Learning “Advent Calendar” Day 15: SVM in Excel

    Instead of starting with margins and geometry, this article builds the Support Vector Machine step by step from familiar models. By changing the loss function and reusing regularization, SVM appears naturally as a linear classifier trained by optimization. This perspective unifies logistic regression, SVM, and…

  • 6 Technical Skills That Make You a Senior Data Scientist

    Beyond writing code, these are the design-level decisions, trade-offs, and habits that quietly separate senior data scientists from everyone else. The post 6 Technical Skills That Make You a Senior Data Scientist appeared first on Towards Data Science. Piero Paialunga

  • Geospatial exploratory data analysis with GeoPandas and DuckDB

    In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through…

  • Lessons Learned from Upgrading to LangChain 1.0 in Production

    What worked, what broke, and why I did it. The post Lessons Learned from Upgrading to LangChain 1.0 in Production appeared first on Towards Data Science. Clara Chong

  • STARK denoises spatial transcriptomics images via adaptive regularization

    STARK denoises spatial transcriptomics images via adaptive regularization arXiv:2512.10994v1 Announce Type: new Abstract: We present an approach to denoising spatial transcriptomics images that is particularly effective for uncovering cell identities in the regime of ultra-low sequencing depths, and also allows for interpolation of gene expression. The method — Spatial Transcriptomics via Adaptive Regularization and Kernels…

  • An Efficient Variant of One-Class SVM with Lifelong Online Learning Guarantees

    An Efficient Variant of One-Class SVM with Lifelong Online Learning Guarantees arXiv:2512.11052v1 Announce Type: new Abstract: We study outlier (a.k.a., anomaly) detection for single-pass non-stationary streaming data. In the well-studied offline or batch outlier detection problem, traditional methods such as kernel One-Class SVM (OCSVM) are both computationally heavy and prone to large false-negative (Type II)…

  • Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

    Provable Recovery of Locally Important Signed Features and Interactions from Random Forest arXiv:2512.11081v1 Announce Type: new Abstract: Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are…

  • TPV: Parameter Perturbations Through the Lens of Test Prediction Variance

    TPV: Parameter Perturbations Through the Lens of Test Prediction Variance arXiv:2512.11089v1 Announce Type: new Abstract: We identify test prediction variance (TPV) — the first-order sensitivity of model outputs to parameter perturbations around a trained solution — as a unifying quantity that links several classical observations about generalization in deep networks. TPV is a fully label-free…

  • Data-Driven Model Reduction using WeldNet: Windowed Encoders for Learning Dynamics

    Data-Driven Model Reduction using WeldNet: Windowed Encoders for Learning Dynamics arXiv:2512.11090v1 Announce Type: new Abstract: Many problems in science and engineering involve time-dependent, high dimensional datasets arising from complex physical processes, which are costly to simulate. In this work, we propose WeldNet: Windowed Encoders for Learning Dynamics, a data-driven nonlinear model reduction framework to build…

  • Weekly Entering & Transitioning – Thread 15 Dec, 2025 – 22 Dec, 2025

    Weekly Entering & Transitioning – Thread 15 Dec, 2025 – 22 Dec, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

  • I got three offers from a two month job search – here’s what I wish I knew earlier

    I got three offers from a two month job search – here’s what I wish I knew earlier There’s a lot of doom and gloom on Reddit and elsewhere about the current state of the job market. And yes, it’s bad. But reading all these stories of people going months and years without getting a…

  • Has anyone tried training models on raw discussions instead of curated datasets?

    Has anyone tried training models on raw discussions instead of curated datasets? I’ve always followed the usual advice when training models: clean the data, normalize everything, remove noise, structure it nicely. Recently I tried something different. Instead of polished datasets, I fed models long, messy discussion threads: real conversations, people arguing, correcting themselves, misunderstanding…

  • While 72% of Executives Back AI, Public Trust Is Tanking

    While 72% of Executives Back AI, Public Trust Is Tanking submitted by /u/disforwork Go to original source

  • Gemini Deep Research: Autonomous Intelligence for Enterprise Research

    Gemini Deep Research: Autonomous Intelligence for Enterprise Research submitted by /u/WarChampion90 Go to original source

  • The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel

    The Machine Learning “Advent Calendar” Day 14: Softmax Regression in Excel Softmax Regression is simply Logistic Regression extended to multiple classes. By computing one linear score per class and normalizing them with Softmax, we obtain multiclass probabilities without changing the core logic. The loss, the gradients, and the optimization remain the same. Only the number…

  • The Skills That Bridge Technical Work and Business Impact

    The Skills That Bridge Technical Work and Business Impact In the Author Spotlight series, TDS Editors chat with members of our community about their career path in data science and AI, their writing, and their sources of inspiration. Today, we’re thrilled to share our conversation with Maria Mouschoutzi. Maria is a Data Analyst and Project…

  • Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case

    Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case If you work in data science, data engineering, or as a frontend/backend developer, you deal with JSON. For professionals, it’s basically only death, taxes, and JSON parsing that are inevitable. The issue is that parsing JSON is often a serious pain. Whether you are…

  • The Machine Learning “Advent Calendar” Day 13: LASSO and Ridge Regression in Excel

    The Machine Learning “Advent Calendar” Day 13: LASSO and Ridge Regression in Excel Ridge and Lasso regression are often perceived as more complex versions of linear regression. In reality, the prediction model remains exactly the same. What changes is the training objective. By adding a penalty on the coefficients, regularization forces the model to choose…

  • How to Increase Coding Iteration Speed

    How to Increase Coding Iteration Speed Learn how to become a more efficient programmer with local testing The post How to Increase Coding Iteration Speed appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

  • NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating

    NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating This one little trick can bring about enhanced training stability, the use of larger learning rates and improved scaling properties The post NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating appeared first on Towards Data Science. Sean Moran Go to original…

  • The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel

    The Machine Learning “Advent Calendar” Day 12: Logistic Regression in Excel In this article, we rebuild Logistic Regression step by step directly in Excel. Starting from a binary dataset, we explore why linear regression struggles as a classifier, how the logistic function fixes these issues, and how log-loss naturally appears from the likelihood. With a…

  • Decentralized Computation: The Hidden Principle Behind Deep Learning

    Decentralized Computation: The Hidden Principle Behind Deep Learning Most breakthroughs in deep learning — from simple neural networks to large language models — are built upon a principle that is much older than AI itself: decentralization. Instead of relying on a powerful “central planner” coordinating and commanding the behaviors of other components, modern deep-learning-based AI…

  • EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas

    EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas Hey everyone! Welcome to the start of a major data journey that I’m calling “EDA in Public.” For those who know me, I believe the best way to learn anything is to tackle a real-world problem and share the entire messy process — including mistakes, victories,…

  • Spectral Community Detection in Clinical Knowledge Graphs

    Spectral Community Detection in Clinical Knowledge Graphs How do we identify latent groups of patients in a large cohort? How can we find similarities among patients that go beyond the well-known comorbidity clusters associated with specific diseases? And more importantly, how can we extract quantitative signals that can be analyzed, compared, and reused across…

  • LxCIM: a new rank-based binary classifier performance metric invariant to local exchange of classes

    LxCIM: a new rank-based binary classifier performance metric invariant to local exchange of classes arXiv:2512.10053v1 Announce Type: new Abstract: Binary classification is one of the oldest, most prevalent, and studied problems in machine learning. However, the metrics used to evaluate model performance have received comparatively little attention. The area under the receiver operating characteristic curve…

  • The Interplay of Statistics and Noisy Optimization: Learning Linear Predictors with Random Data Weights

    The Interplay of Statistics and Noisy Optimization: Learning Linear Predictors with Random Data Weights arXiv:2512.10188v1 Announce Type: new Abstract: We analyze gradient descent with randomly weighted data points in a linear regression model, under a generic weighting distribution. This includes various forms of stochastic gradient descent, importance sampling, but also extends to weighting distributions with…

  • Diffusion differentiable resampling

    Diffusion differentiable resampling arXiv:2512.10401v1 Announce Type: new Abstract: This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). We propose a new informative resampling method that is instantly pathwise differentiable, based on an ensemble score diffusion model. We prove that our diffusion resampling method provides a consistent estimate…

  • Error Analysis of Generalized Langevin Equations with Approximated Memory Kernels

    Error Analysis of Generalized Langevin Equations with Approximated Memory Kernels arXiv:2512.10256v1 Announce Type: new Abstract: We analyze prediction error in stochastic dynamical systems with memory, focusing on generalized Langevin equations (GLEs) formulated as stochastic Volterra equations. We establish that, under a strongly convex potential, trajectory discrepancies decay at a rate determined by the decay of…

  • Supervised Learning of Random Neural Architectures Structured by Latent Random Fields on Compact Boundaryless Multiply-Connected Manifolds

    Supervised Learning of Random Neural Architectures Structured by Latent Random Fields on Compact Boundaryless Multiply-Connected Manifolds arXiv:2512.10407v1 Announce Type: new Abstract: This paper introduces a new probabilistic framework for supervised learning in neural systems. It is designed to model complex, uncertain systems whose random outputs are strongly non-Gaussian given deterministic inputs. The architecture itself is…

  • The Machine Learning “Advent Calendar” Day 11: Linear Regression in Excel

    The Machine Learning “Advent Calendar” Day 11: Linear Regression in Excel Linear Regression looks simple, but it introduces the core ideas of modern machine learning: loss functions, optimization, gradients, scaling, and interpretation. In this article, we rebuild Linear Regression in Excel, compare the closed-form solution with Gradient Descent, and see how the coefficients evolve step…

  • Drawing Shapes with the Python Turtle Module

    Drawing Shapes with the Python Turtle Module A step-by-step tutorial that explores the Python Turtle Module The post Drawing Shapes with the Python Turtle Module appeared first on Towards Data Science. Mahnoor Javed Go to original source

  • 7 Pandas Performance Tricks Every Data Scientist Should Know

    7 Pandas Performance Tricks Every Data Scientist Should Know What I’ve learned about making Pandas faster after too many slow notebooks and frozen sessions The post 7 Pandas Performance Tricks Every Data Scientist Should Know appeared first on Towards Data Science. Benjamin Nweke Go to original source