Category: aimlds

The Machine Learning and Deep Learning “Advent Calendar” Series: The Blueprint

Opening the black box of ML models, step by step, directly in Excel. Appeared first on Towards Data Science. By Angela Shi.

December 1, 2025
The Greedy Boruta Algorithm: Faster Feature Selection Without Sacrificing Recall

A modification to the Boruta algorithm that dramatically reduces computation while maintaining high sensitivity. Appeared first on Towards Data Science. By Nicolas Vana.

December 1, 2025
Metric Deception: When Your Best KPIs Hide Your Worst Failures

The most dangerous KPIs aren’t broken; they’re the ones trusted long after they’ve lost their meaning. Appeared first on Towards Data Science. By Shafeeq Ur Rahaman.

November 30, 2025
How to Scale Your LLM Usage

Learn how to increase LLM usage to achieve increased productivity. Appeared first on Towards Data Science. By Eivind Kjosbakken.

November 30, 2025
Data Science in 2026: Is It Still Worth It?

An honest view from a 10-year AI Engineer. Appeared first on Towards Data Science. By Sabrine Bendimerad.

November 29, 2025
Why We’ve Been Optimizing the Wrong Thing in LLMs for Years

The simple shift in training that unlocks foresight, faster inference, and better reasoning. Appeared first on Towards Data Science. By Moulik Gupta.

November 29, 2025
The Product Health Score: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation

How product, growth and engineering teams can converge on a single signal for better incident management. Appeared first on…

November 29, 2025
TDS Newsletter: November Must-Reads on GraphRAG, ML Projects, LLM-Powered Time-Series Analysis, and More

Don’t miss our most-read stories of the past month. Appeared first on Towards Data Science. By TDS Editors.

November 28, 2025
Neural Networks Are Blurry, Symbolic Systems Are Fragmented. Sparse Autoencoders Help Us Combine Them.

Neural and symbolic models compress the world in fundamentally different ways, and Sparse Autoencoders (SAEs) offer a bridge to connect them. Appeared first on Towards…

November 28, 2025
Water Cooler Small Talk, Ep. 10: So, What About the AI Bubble?

Have we all been tricked into believing in an impossible, extremely expensive future? Appeared first on Towards Data Science. By Maria Mouschoutzi.

November 28, 2025
Everyday Decisions are Noisier Than You Think — Here’s How AI Can Help Fix That

From insurance premiums to courtrooms: the impact of noise. Appeared first on Towards Data Science. By Sean Moran.

November 28, 2025
Implementing the Rock Paper Scissors Game in Python

A beginner-friendly Python tutorial using conditionals and the random module. Appeared first on Towards Data Science. By Mahnoor Javed.

November 28, 2025
When Features Beat Noise: A Feature Selection Technique Through Noise-Based Hypothesis Testing

arXiv:2511.20851v1. Announce Type: new. Abstract: Feature selection has remained a daunting challenge in machine learning and artificial intelligence, where increasingly complex, high-dimensional datasets demand principled strategies for isolating the most informative predictors. Despite widespread adoption, many established techniques suffer from notable limitations; some…

November 27, 2025
Deep Learning as a Convex Paradigm of Computation: Minimizing Circuit Size with ResNets

arXiv:2511.20888v1. Announce Type: new. Abstract: This paper argues that DNNs implement a computational Occam’s razor, finding the ‘simplest’ algorithm that fits the data, and that this could explain their incredible and wide-ranging success over more traditional statistical methods. We start…

November 27, 2025
Geometric Calibration and Neutral Zones for Uncertainty-Aware Multi-Class Classification

arXiv:2511.20960v1. Announce Type: new. Abstract: Modern artificial intelligence systems make critical decisions yet often fail silently when uncertain. We develop a geometric framework for post-hoc calibration of neural network probability outputs, treating probability vectors as points on the $(c-1)$-dimensional probability simplex equipped with the Fisher–Rao metric.…

November 27, 2025
Nonconvex Penalized LAD Estimation in Partial Linear Models with DNNs: Asymptotic Analysis and Proximal Algorithms

arXiv:2511.21115v1. Announce Type: new. Abstract: This paper investigates the partial linear model by Least Absolute Deviation (LAD) regression. We parameterize the nonparametric term using Deep Neural Networks (DNNs) and formulate a penalized LAD problem for estimation. Specifically, our model exhibits…

November 27, 2025
Maxitive Donsker-Varadhan Formulation for Possibilistic Variational Inference

arXiv:2511.21223v1. Announce Type: new. Abstract: Variational inference (VI) is a cornerstone of modern Bayesian learning, enabling approximate inference in complex models that would otherwise be intractable. However, its formulation depends on expectations and divergences defined through high-dimensional integrals, often rendering analytical treatment impossible and necessitating heavy reliance on…

November 27, 2025
I Cleaned a Messy CSV File Using Pandas. Here’s the Exact Process I Follow Every Time.

Stop guessing at data cleaning. Use this repeatable 5-step Python workflow to diagnose and fix the most common data flaws. Appeared first on Towards…

November 27, 2025
RISAT’s Silent Promise: Decoding Disasters with Synthetic Aperture Radar

The high-resolution physics turning microwave echoes into real-time flood intelligence. Appeared first on Towards Data Science. By Aakash Goswami.

November 27, 2025
How I Use AI to Convince Companies to Adopt Sustainability

Discover how Claude can act as a Supply Chain Sustainability Analyst and guide companies toward greener, more efficient inventory management. Appeared first on Towards Data Science. By Samir Saci.

November 27, 2025
FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

arXiv:2511.19476v1. Announce Type: new. Abstract: Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on…

November 26, 2025
Optimization and Regularization Under Arbitrary Objectives

arXiv:2511.19628v1. Announce Type: new. Abstract: This study investigates the limitations of applying Markov Chain Monte Carlo (MCMC) methods to arbitrary objective functions, focusing on a two-block MCMC framework which alternates between Metropolis-Hastings and Gibbs sampling. While such approaches are often considered advantageous for enabling data-driven regularization, we show that…

November 26, 2025
Clustering Approaches for Mixed-Type Data: A Comparative Study

arXiv:2511.19755v1. Announce Type: new. Abstract: Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study presents the state-of-the-art of these approaches and compares…

November 26, 2025
A Fully Probabilistic Tensor Network for Regularized Volterra System Identification

arXiv:2511.20457v1. Announce Type: new. Abstract: Modeling nonlinear systems with Volterra series is challenging because the number of kernel coefficients grows exponentially with the model order. This work introduces Bayesian Tensor Network Volterra kernel machines (BTN-V), extending the Bayesian Tensor Network framework to Volterra system identification.…

November 26, 2025
Generative Modeling with Manifold Percolation

arXiv:2511.20503v1. Announce Type: new. Abstract: Generative modeling is typically framed as learning mapping rules, but from an observer’s perspective without access to these rules, the task manifests as disentangling the geometric support from the probability distribution. We propose that Continuum Percolation is uniquely suited for this support analysis, as the…

November 26, 2025
Why CrewAI’s Manager-Worker Architecture Fails — and How to Fix It

A real-world analysis of why CrewAI’s hierarchical orchestration misfires, and a practical fix you can implement today. Appeared first on Towards Data Science. By Partha Sarkar.

November 26, 2025
How to Implement Three Use Cases for the New Calendar-Based Time Intelligence

Starting with the September 2025 release of Power BI, Microsoft introduced the new Calendar-based Time Intelligence feature. Let’s see what can be done by implementing three use cases. The future looks very interesting with this new feature.…

November 26, 2025
Ten Lessons of Building LLM Applications for Engineers

Practical field notes on workflows, structure, and evaluation from two years of building with engineering domain experts. Appeared first on Towards Data Science. By Shuai Guo.

November 26, 2025
How to Create Professional Articles with LaTeX in Cursor

Learn how to rapidly create professional articles and presentations with LaTeX in Cursor. Appeared first on Towards Data Science. By Eivind Kjosbakken.

November 26, 2025
Quantum Fourier Transform Based Kernel for Solar Irrandiance Forecasting

arXiv:2511.17698v1. Announce Type: new. Abstract: This study proposes a Quantum Fourier Transform (QFT)-enhanced quantum kernel for short-term time-series forecasting. Each signal is windowed, amplitude-encoded, transformed by a QFT, then passed through a protective rotation layer to avoid the QFT/QFT adjoint cancellation; the resulting kernel is used…

November 25, 2025
Prequential posteriors

arXiv:2511.17721v1. Announce Type: new. Abstract: Data assimilation is a fundamental task in updating forecasting models upon observing new data, with applications ranging from weather prediction to online reinforcement learning. Deep generative forecasting models (DGFMs) have shown excellent performance in these areas, but assimilating data into such models is challenging due to their intractable…

November 25, 2025
Variational Estimators for Node Popularity Models

arXiv:2511.17783v1. Announce Type: new. Abstract: Node popularity is recognized as a key factor in modeling real-world networks, capturing heterogeneity in connectivity across communities. This concept is equally important in bipartite networks, where nodes in different partitions may exhibit varying popularity patterns, motivating models such as the Two-Way Node Popularity…

November 25, 2025
An operator splitting analysis of Wasserstein–Fisher–Rao gradient flows

arXiv:2511.18060v1. Announce Type: new. Abstract: Wasserstein-Fisher-Rao (WFR) gradient flows have been recently proposed as a powerful sampling tool that combines the advantages of pure Wasserstein (W) and pure Fisher-Rao (FR) gradient flows. Existing algorithmic developments implicitly make use of operator splitting techniques to numerically approximate the WFR…

November 25, 2025
Conformal Prediction for Compositional Data

arXiv:2511.18141v1. Announce Type: new. Abstract: In this work, we propose a set of conformal prediction procedures tailored to compositional responses, where outcomes are proportions that must be positive and sum to one. Building on Dirichlet regression, we introduce a split conformal approach based on quantile residuals and a highest-density region…

November 25, 2025
How to Implement Randomization with the Python Random Module

Let’s generate randomness in our code’s outputs. Appeared first on Towards Data Science. By Mahnoor Javed.

November 25, 2025
Struggling with Data Science? 5 Common Beginner Mistakes

Avoid these mistakes to fast-track your data science career. Appeared first on Towards Data Science. By Egor Howell.

November 25, 2025
A Hands-On Guide to Anthropic’s New Structured Output Capabilities

A developer’s guide to perfect JSON and typed outputs from Claude Sonnet 4.5 and Opus 4.1. Appeared first on Towards Data Science. By Thomas Reid.

November 25, 2025
LLM-as-a-Judge: What It Is, Why It Works, and How to Use It to Evaluate AI Models

A step-by-step guide to building AI quality control using large language models. Appeared first on Towards Data Science. By Piero Paialunga.

November 25, 2025
BITS for GAPS: Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates

arXiv:2511.16815v1. Announce Type: new. Abstract: We introduce the Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates (BITS for GAPS) framework to emulate latent components in hybrid physical systems. BITS for GAPS supports serial hybrid modeling, where known physics governs part of the system and…

November 24, 2025
Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition

arXiv:2511.16796v1. Announce Type: cross. Abstract: Penalty-based methods have become popular for solving bilevel optimization (BLO) problems, thanks to their effective first-order nature. However, they often require inner-loop iterations to solve the lower-level (LL) problem and small outer-loop step sizes to handle the increased smoothness…

November 24, 2025
Diffusion-Inversion-Net (DIN): An End-to-End Direct Probabilistic Framework for Characterizing Hydraulic Conductivities and Quantifying Uncertainty

arXiv:2511.16926v1. Announce Type: cross. Abstract: We propose the Diffusion-Inversion-Net (DIN) framework for inverse modeling of groundwater flow and solute transport processes. DIN utilizes an offline-trained Denoising Diffusion Probabilistic Model (DDPM) as a powerful prior learner, which flexibly incorporates sparse, multi-source observational…

November 24, 2025
Gradient flow for deep equilibrium single-index models

arXiv:2511.16976v1. Announce Type: cross. Abstract: Deep equilibrium models (DEQs) have recently emerged as a powerful paradigm for training infinitely deep weight-tied neural networks that achieve state of the art performance across many modern machine learning tasks. Despite their practical success, theoretically understanding the gradient descent dynamics for training…

November 24, 2025
DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing

arXiv:2511.17038v1. Announce Type: cross. Abstract: From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely…

November 24, 2025
Weekly Entering & Transitioning – Thread 24 Nov, 2025 – 01 Dec, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

November 24, 2025
Are LeetCode heavy Interviews becoming the norm for DS Modeling roles?

I’ve been actively searching for DS Modeling roles again, and wow the landscape has changed a lot since the last time I was on the market. It seems like leetcode style interviews have become way more common. I’ve already failed or barely passed several…

November 24, 2025
How long should I stay at my first job if I dislike the city?

I recently just got my bachelors from Berkeley in data science, and I recently started a new job in Boston. I’m super grateful for this job opportunity because I applied to probably 1k jobs and this was the only good offer…

November 24, 2025
Indeed’s Job Report Shows 13% YoY Drop in Data & Analytics Roles

“Roles like business analyst, data analyst, data scientist, and BI developer are drawing large talent pools that outpace the number of job postings, creating a fiercely competitive market.” Do you agree with these findings? Are data & analytics roles the hardest-hit in…

November 24, 2025
Will there be a discount for Physical O’Reilly Media books?

Will there be a discount for Physical O’Reilly Media books? Hello. Not sure if this is the best place to post this question so let me know. Does anyone know if there will be some Black Friday discount for Physical O’Reilly Media books somewhere? I…

November 24, 2025
Learning Triton One Kernel at a Time: Softmax

All you need to know about a fast, readable and PyTorch-ready softmax kernel. Appeared first on Towards Data Science. By Ryan Pégoud.

November 24, 2025
Your Next ‘Large’ Language Model Might Not Be Large After All

A 27M-parameter model just outperformed giants like DeepSeek R1, o3-mini, and Claude 3.7 on reasoning tasks. Appeared first on Towards Data Science. By Moulik Gupta.

November 24, 2025
Empirical Mode Decomposition: The Most Intuitive Way to Decompose Complex Signals and Time Series

A step-by-step breakdown of empirical mode decomposition to help you extract patterns from time series. Appeared first on Towards Data Science. By Sabrine Bendimerad.

November 23, 2025
Overfitting vs. Underfitting: Making Sense of the Bias-Variance Trade-Off

The best models live in the sweet spot: generalizing well, learning enough, but not too much. Appeared first on Towards Data Science. By Frida Karvouni.

November 23, 2025
Modern DataFrames in Python: A Hands-On Tutorial with Polars and DuckDB

How I learned to handle growing datasets without slowing down my entire workflow. Appeared first on Towards Data Science. By Benjamin Nweke.

November 22, 2025
How To Build a Graph-Based Recommendation Engine Using EDG and Neo4j

Use a shared taxonomy to connect RDF and property graphs, and power smarter recommendations with inferencing. Appeared first on Towards Data Science. By Steve Hedden.

November 22, 2025
Natural Language Visualization and the Future of Data Analysis and Presentation

Will conversational interaction replace SQL queries, KPI reports, and dashboards? Appeared first on Towards Data Science. By Michal Szudejko.

November 22, 2025
Generative AI Will Redesign Cars, But Not the Way Automakers Think

Traditional manufacturers are using revolutionary technology for incremental optimization instead of fundamental re-imagination. Appeared first on Towards Data Science. By Nishant Arora.

November 22, 2025
TDS Newsletter: How to Build Robust Data and AI Systems

Many practitioners like to jump headfirst into the nitty-gritty details of implementing AI-powered tools. We get it: tinkering your way into a solution can sometimes save you time, and it’s often a fun way to go about learning. As the articles we’re highlighting this week show,…

November 22, 2025
Atlas Gaussian processes on restricted domains and point clouds

arXiv:2511.15822v1. Announce Type: new. Abstract: In real-world applications, data often reside in restricted domains with unknown boundaries, or as high-dimensional point clouds lying on a lower-dimensional, nontrivial, unknown manifold. Traditional Gaussian Processes (GPs) struggle to capture the underlying geometry in such settings. Some existing methods assume…

November 21, 2025
Angular Graph Fractional Fourier Transform: Theory and Application

arXiv:2511.16111v1. Announce Type: new. Abstract: Graph spectral representations are fundamental in graph signal processing, offering a rigorous framework for analyzing and processing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the classical graph Fourier transform (GFT) with a fractional-order parameter, enabling flexible spectral analysis while preserving…

November 21, 2025
Time dependent loss reweighting for flow matching and diffusion models is theoretically justified

arXiv:2511.16599v1. Announce Type: new. Abstract: This brief note clarifies that, in Generator Matching (which subsumes a large family of flow matching and diffusion models over continuous, manifold, and discrete spaces), both the Bregman divergence loss and the linear parameterization of the generator…

November 21, 2025
Spectral Identifiability for Interpretable Probe Geometry

arXiv:2511.16288v1. Announce Type: new. Abstract: Linear probes are widely used to interpret and evaluate neural representations, yet their reliability remains unclear, as probes may appear accurate in some regimes but collapse unpredictably in others. We uncover a spectral mechanism behind this phenomenon and formalize it as the Spectral Identifiability…

November 21, 2025
Rate-optimal community detection near the KS threshold via node-robust algorithms

arXiv:2511.16613v1. Announce Type: new. Abstract: We study community detection in the \emph{symmetric $k$-stochastic block model}, where $n$ nodes are evenly partitioned into $k$ clusters with intra- and inter-cluster connection probabilities $p$ and $q$, respectively. Our main result is a polynomial-time algorithm that achieves the minimax-optimal…

November 21, 2025
How to Use Gemini 3 Pro Efficiently

Learn the pros and cons of Gemini 3 Pro, from testing with both coding and console usage. Appeared first on Towards Data Science. By Eivind Kjosbakken.

November 21, 2025
Data Visualization Explained (Part 5): Visualizing Time-Series Data in Python (Matplotlib, Plotly, and Altair)

An explanation of time-series visualization, including in-depth code examples in Matplotlib, Plotly, and Altair. Appeared first on Towards Data Science. By Murtaza Ali.

November 21, 2025
How Relevance Models Foreshadowed Transformers for NLP

Tracing the history of LLM attention: standing on the shoulders of giants. Appeared first on Towards Data Science. By Sean Moran.

November 21, 2025
Why I’m Making the Switch to marimo Notebooks

A fresh way to think about computational notebooks. Appeared first on Towards Data Science. By Parul Pandey.

November 21, 2025
Convex Clustering Redefined: Robust Learning with the Median of Means Estimator

arXiv:2511.14784v1. Announce Type: new. Abstract: Clustering approaches that utilize convex loss functions have recently attracted growing interest in the formation of compact data clusters. Although classical methods like k-means and its wide family of variants are still widely used, all of them require the…

November 20, 2025
Implicit Bias of the JKO Scheme

arXiv:2511.14827v1. Announce Type: new. Abstract: Wasserstein gradient flow provides a general framework for minimizing an energy functional $J$ over the space of probability measures on a Riemannian manifold $(M,g)$. Its canonical time-discretization, the Jordan-Kinderlehrer-Otto (JKO) scheme, produces for any step size $\eta>0$ a sequence of probability distributions $\rho_k^\eta$ that…

November 20, 2025
Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

arXiv:2511.15120v1. Announce Type: new. Abstract: In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the…

November 20, 2025
Latent space analysis and generalization to out-of-distribution data

arXiv:2511.15010v1. Announce Type: new. Abstract: Understanding the relationships between data points in the latent decision space derived by the deep learning system is critical to evaluating and interpreting the performance of the system on real world data. Detecting \textit{out-of-distribution} (OOD) data for deep learning systems continues to…

November 20, 2025
Beyond Uncertainty Sets: Leveraging Optimal Transport to Extend Conformal Predictive Distribution to Multivariate Settings

arXiv:2511.15146v1. Announce Type: new. Abstract: Conformal prediction (CP) constructs uncertainty sets for model outputs with finite-sample coverage guarantees. A candidate output is included in the prediction set if its non-conformity score is not considered extreme relative to the scores observed on…

November 20, 2025
How to Perform Agentic Information Retrieval

How to Perform Agentic Information Retrieval Learn how to utilize AI agents to find information in your document corpus The post How to Perform Agentic Information Retrieval appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

November 20, 2025
Developing Human Sexuality in the Age of AI

Developing Human Sexuality in the Age of AI How we learn is changing with generative AI — what does that mean for sex education, consent, and responsibility? The post Developing Human Sexuality in the Age of AI appeared first on Towards Data Science. Stephanie Kirmer Go to original source

November 20, 2025
PyTorch Tutorial for Beginners: Build a Multiple Regression Model from Scratch

PyTorch Tutorial for Beginners: Build a Multiple Regression Model from Scratch Hands-on PyTorch: Building a 3-layer neural network for multiple regression The post PyTorch Tutorial for Beginners: Build a Multiple Regression Model from Scratch appeared first on Towards Data Science. Gustavo Santos Go to original source

November 20, 2025
Making Smarter Bets: Towards a Winning AI Strategy with Probabilistic Thinking

Making Smarter Bets: Towards a Winning AI Strategy with Probabilistic Thinking Practical guidance on identifying opportunities, managing product portfolios, and overcoming behavioral biases The post Making Smarter Bets: Towards a Winning AI Strategy with Probabilistic Thinking appeared first on Towards Data Science. Chinmay Kakatkar Go to original source

November 20, 2025
Uncertainty-Calibrated Prediction of Randomly-Timed Biomarker Trajectories with Conformal Bands

Uncertainty-Calibrated Prediction of Randomly-Timed Biomarker Trajectories with Conformal Bands arXiv:2511.13911v1 Announce Type: new Abstract: Despite recent progress in predicting biomarker trajectories from real clinical data, uncertainty in the predictions poses high-stakes risks (e.g., misdiagnosis) that limit their clinical deployment. To enable safe and reliable use of such predictions in healthcare, we introduce a conformal method…

November 19, 2025
Knowledge vs. Experience: Asymptotic Limits of Impatience in Edge Tenants

Knowledge vs. Experience: Asymptotic Limits of Impatience in Edge Tenants arXiv:2511.13763v1 Announce Type: new Abstract: We study how two information feeds, a closed-form Markov estimator of residual sojourn and an online trained actor-critic, affect reneging and jockeying in a dual M/M/1 system. Analytically, for unequal service rates and total-time patience, we show that total wait…

November 19, 2025
Empirical Likelihood for Random Forests and Ensembles

Empirical Likelihood for Random Forests and Ensembles arXiv:2511.13934v1 Announce Type: new Abstract: We develop an empirical likelihood (EL) framework for random forests and related ensemble methods, providing a likelihood-based approach to quantify their statistical uncertainty. Exploiting the incomplete $U$-statistic structure inherent in ensemble predictions, we construct an EL statistic that is asymptotically chi-squared when subsampling…

November 19, 2025
Splat Regression Models

Splat Regression Models arXiv:2511.14042v1 Announce Type: new Abstract: We introduce a highly expressive class of function approximators called Splat Regression Models. Model outputs are mixtures of heterogeneous and anisotropic bump functions, termed splats, each weighted by an output vector. The power of splat modeling lies in its ability to locally adjust the scale and direction…

November 19, 2025
SCOPE: Spectral Concentration by Distributionally Robust Joint Covariance-Precision Estimation

SCOPE: Spectral Concentration by Distributionally Robust Joint Covariance-Precision Estimation arXiv:2511.14146v1 Announce Type: new Abstract: We propose a distributionally robust formulation for simultaneously estimating the covariance matrix and the precision matrix of a random vector. The proposed model minimizes the worst-case weighted sum of the Frobenius loss of the covariance estimator and Stein’s loss of the precision…

November 19, 2025
How to Build an Over-Engineered Retrieval System

How to Build an Over-Engineered Retrieval System Which is actually how some people do it The post How to Build an Over-Engineered Retrieval System appeared first on Towards Data Science. Ida Silfverskiöld Go to original source

November 19, 2025
Why LLMs Aren’t a One-Size-Fits-All Solution for Enterprises

Why LLMs Aren’t a One-Size-Fits-All Solution for Enterprises LLMs are a seamless way to find value in your unstructured data, but the truth is, there is so much more value hidden within your structured data. This post explores what LLMs are (and aren’t) optimized for and how the industry is approaching AI over structured business…

November 19, 2025
How Deep Feature Embeddings and Euclidean Similarity Power Automatic Plant Leaf Recognition

How Deep Feature Embeddings and Euclidean Similarity Power Automatic Plant Leaf Recognition Introduction: Automatic plant leaf detection is a remarkable innovation in computer vision and machine learning, enabling the identification of plant species by examining a photograph of the leaves. Deep learning is applied to extract meaningful features from an image of leaves and convert…

November 19, 2025
Introducing Google’s File Search Tool

Introducing Google’s File Search Tool The search giant fires its latest salvo against traditional RAG processing. The post Introducing Google’s File Search Tool appeared first on Towards Data Science. Thomas Reid Go to original source

November 19, 2025
Generalized Inequality-based Approach for Probabilistic WCET Estimation

Generalized Inequality-based Approach for Probabilistic WCET Estimation arXiv:2511.11682v1 Announce Type: new Abstract: Estimating the probabilistic Worst-Case Execution Time (pWCET) is essential for ensuring the timing correctness of real-time applications, such as in robot IoT systems and autonomous driving systems. While methods based on Extreme Value Theory (EVT) can provide tight bounds, they suffer from model…

November 18, 2025
FreDN: Spectral Disentanglement for Time Series Forecasting via Learnable Frequency Decomposition

FreDN: Spectral Disentanglement for Time Series Forecasting via Learnable Frequency Decomposition arXiv:2511.11817v1 Announce Type: new Abstract: Time series forecasting is essential in a wide range of real world applications. Recently, frequency-domain methods have attracted increasing interest for their ability to capture global dependencies. However, when applied to non-stationary time series, these methods encounter the spectral…

November 18, 2025
PCA recovery thresholds in low-rank matrix inference with sparse noise

PCA recovery thresholds in low-rank matrix inference with sparse noise arXiv:2511.11927v1 Announce Type: new Abstract: We study the high-dimensional inference of a rank-one signal corrupted by sparse noise. The noise is modelled as the adjacency matrix of a weighted undirected graph with finite average connectivity in the large size limit. Using the replica method from…

November 18, 2025
Bayesian–AI Fusion for Epidemiological Decision Making: Calibrated Risk, Honest Uncertainty, and Hyperparameter Intelligence

Bayesian–AI Fusion for Epidemiological Decision Making: Calibrated Risk, Honest Uncertainty, and Hyperparameter Intelligence arXiv:2511.11983v1 Announce Type: new Abstract: Modern epidemiological analytics increasingly use machine learning models that offer strong prediction but often lack calibrated uncertainty. Bayesian methods provide principled uncertainty quantification, yet are viewed as difficult to integrate with contemporary AI workflows. This paper proposes…

November 18, 2025
PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning

PCA++: How Uniformity Induces Robustness to Background Noise in Contrastive Learning arXiv:2511.12278v1 Announce Type: new Abstract: High-dimensional data often contain low-dimensional signals obscured by structured background noise, which limits the effectiveness of standard PCA. Motivated by contrastive learning, we address the problem of recovering shared signal subspaces from positive pairs, paired observations sharing the same…

November 18, 2025
Understanding Convolutional Neural Networks (CNNs) Through Excel

Understanding Convolutional Neural Networks (CNNs) Through Excel Deep learning is often seen as a black box. We know that it learns from data, but we rarely stop to ask how it truly learns. What if we could open that box and watch each step happen right before our eyes? With Excel, we can do exactly…

November 18, 2025
Javascript Fatigue: HTMX Is All You Need to Build ChatGPT — Part 2

Javascript Fatigue: HTMX Is All You Need to Build ChatGPT — Part 2 In part 1, we showed how we could leverage HTMX to add interactivity to our HTML elements. In other words, Javascript without Javascript. To illustrate that, we began building a simple chat that would return a simulated LLM response. In this article,…

November 18, 2025
Introducing ShaTS: A Shapley-Based Method for Time-Series Models

Introducing ShaTS: A Shapley-Based Method for Time-Series Models Why you should not explain your time-series data with tabular Shapley methods The post Introducing ShaTS: A Shapley-Based Method for Time-Series Models appeared first on Towards Data Science. Manuel Franco de la Peña Go to original source

November 18, 2025
The Absolute Beginner’s Guide to Pandas DataFrames

The Absolute Beginner’s Guide to Pandas DataFrames Learn how to initialize dataframes from dictionaries, lists, and NumPy arrays The post The Absolute Beginner’s Guide to Pandas DataFrames appeared first on Towards Data Science. Ibrahim Salami Go to original source

November 18, 2025
Javascript Fatigue: HTMX is all you need to build ChatGPT — Part 1

Javascript Fatigue: HTMX is all you need to build ChatGPT — Part 1 Building a chatbot (almost) without Javascript, only with Python and HTML. The post Javascript Fatigue: HTMX is all you need to build ChatGPT — Part 1 appeared first on Towards Data Science. Benjamin Etienne Go to original source

November 18, 2025
Neural Local Wasserstein Regression

Neural Local Wasserstein Regression arXiv:2511.10824v1 Announce Type: new Abstract: We study the estimation problem of distribution-on-distribution regression, where both predictors and responses are probability measures. Existing approaches typically rely on a global optimal transport map or tangent-space linearization, which can be restrictive in approximation capacity and distort geometry in multivariate underlying domains. In this paper,…

November 17, 2025
Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data

Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data arXiv:2511.10919v1 Announce Type: new Abstract: Positive-Unlabeled (PU) learning presents unique challenges due to the lack of explicitly labeled negative samples, particularly in high-stakes domains such as fraud detection and medical diagnosis. To address data scarcity and privacy constraints, we propose a novel transfer learning with…

November 17, 2025
Drift Estimation for Diffusion Processes Using Neural Networks Based on Discretely Observed Independent Paths

Drift Estimation for Diffusion Processes Using Neural Networks Based on Discretely Observed Independent Paths arXiv:2511.11161v1 Announce Type: new Abstract: This paper addresses the nonparametric estimation of the drift function over a compact domain for a time-homogeneous diffusion process, based on high-frequency discrete observations from $N$ independent trajectories. We propose a neural network-based estimator and derive…

November 17, 2025
Decomposing Direct and Indirect Biases in Linear Models under Demographic Parity Constraint

Decomposing Direct and Indirect Biases in Linear Models under Demographic Parity Constraint arXiv:2511.11294v1 Announce Type: new Abstract: Linear models are widely used in high-stakes decision-making due to their simplicity and interpretability. Yet when fairness constraints such as demographic parity are introduced, their effects on model coefficients, and thus on how predictive bias is distributed across…

November 17, 2025
Bayesian Evaluation of Large Language Model Behavior

Bayesian Evaluation of Large Language Model Behavior arXiv:2511.10661v1 Announce Type: cross Abstract: It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts…

November 17, 2025
Weekly Entering & Transitioning – Thread 17 Nov, 2025 – 24 Nov, 2025

Weekly Entering & Transitioning – Thread 17 Nov, 2025 – 24 Nov, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

November 17, 2025