Category: aimlds

Hierarchical Variable Importance with Statistical Control for Medical Data-Based Prediction

Hierarchical Variable Importance with Statistical Control for Medical Data-Based Prediction arXiv:2508.08724v1 Announce Type: new Abstract: Recent advances in machine learning have greatly expanded the repertoire of predictive methods for medical imaging. However, the interpretability of complex models remains a challenge, which limits their utility in medical applications. Recently, model-agnostic methods have been proposed to measure…

August 13, 2025
Bio-Inspired Artificial Neural Networks based on Predictive Coding

Bio-Inspired Artificial Neural Networks based on Predictive Coding arXiv:2508.08762v1 Announce Type: new Abstract: Backpropagation (BP) of errors is the backbone training algorithm for artificial neural networks (ANNs). It updates network weights through gradient descent to minimize a loss function representing the mismatch between predictions and desired outputs. BP uses the chain rule to propagate the…

August 13, 2025
Reducing Time to Value for Data Science Projects: Part 4

Reducing Time to Value for Data Science Projects: Part 4 Embrace your inner software developer The post Reducing Time to Value for Data Science Projects: Part 4 appeared first on Towards Data Science. Kristopher McGlinchey Go to original source

August 13, 2025
Model Predictive Control Basics

Model Predictive Control Basics A hands-on tutorial with Python and CasADi The post Model Predictive Control Basics appeared first on Towards Data Science. Willem Esterhuizen Go to original source

August 13, 2025
Coconut: A Framework for Latent Reasoning in LLMs

Coconut: A Framework for Latent Reasoning in LLMs Explaining Coconut (Training Large Language Models to Reason in a Continuous Latent Space) in simple terms The post Coconut: A Framework for Latent Reasoning in LLMs appeared first on Towards Data Science. Youssef Farag Go to original source

August 13, 2025
A Refined Training Recipe for Fine-Grained Visual Classification

A Refined Training Recipe for Fine-Grained Visual Classification How FGVC aims to recognize images belonging to multiple subordinate categories of a super-category The post A Refined Training Recipe for Fine-Grained Visual Classification appeared first on Towards Data Science. Ahmed Belgacem Go to original source

August 13, 2025
Fine-Tune Your Topic Modeling Workflow with BERTopic

Fine-Tune Your Topic Modeling Workflow with BERTopic Learn how to fine-tune BERTopic settings for more focused, reproducible, and interpretable results The post Fine-Tune Your Topic Modeling Workflow with BERTopic appeared first on Towards Data Science. Tiffany Chen Go to original source

August 13, 2025
Federated Online Learning for Heterogeneous Multisource Streaming Data

Federated Online Learning for Heterogeneous Multisource Streaming Data arXiv:2508.06652v1 Announce Type: new Abstract: Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. Most existing federated learning methods focus on “static” datasets. However, in many real-world applications, data arrive continuously over time, forming streaming datasets. This introduces additional…

August 12, 2025
MOCA-HESP: Meta High-dimensional Bayesian Optimization for Combinatorial and Mixed Spaces via Hyper-ellipsoid Partitioning

MOCA-HESP: Meta High-dimensional Bayesian Optimization for Combinatorial and Mixed Spaces via Hyper-ellipsoid Partitioning arXiv:2508.06847v1 Announce Type: new Abstract: High-dimensional Bayesian Optimization (BO) has attracted significant attention in recent research. However, existing methods have mainly focused on optimizing in continuous domains, while combinatorial (ordinal and categorical) and mixed domains still remain challenging. In this paper, we…

August 12, 2025
Membership Inference Attacks with False Discovery Rate Control

Membership Inference Attacks with False Discovery Rate Control arXiv:2508.07066v1 Announce Type: new Abstract: Recent studies have shown that deep learning models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. To analyze and study these vulnerabilities, various MIA methods have…

August 12, 2025
Statistical Inference for Autoencoder-based Anomaly Detection after Representation Learning-based Domain Adaptation

Statistical Inference for Autoencoder-based Anomaly Detection after Representation Learning-based Domain Adaptation arXiv:2508.07049v1 Announce Type: new Abstract: Anomaly detection (AD) plays a vital role across a wide range of domains, but its performance might deteriorate when applied to target domains with limited data. Domain Adaptation (DA) offers a solution by transferring knowledge from a related source…

August 12, 2025
Stochastic dynamics learning with state-space systems

Stochastic dynamics learning with state-space systems arXiv:2508.07876v1 Announce Type: new Abstract: This work advances the theoretical foundations of reservoir computing (RC) by providing a unified treatment of fading memory and the echo state property (ESP) in both deterministic and stochastic settings. We investigate state-space systems, a central model class in time series learning, and establish…

August 12, 2025
Estimating from No Data: Deriving a Continuous Score from Categories

Estimating from No Data: Deriving a Continuous Score from Categories A walk-through of, and the maths behind, using low-capacity networks to acquire fine-grained scoring when only categorical labelling is available for training. We use it to predict the severity of an infection on a scale based on information on just rough outcomes in previous cases.…

August 12, 2025
Introducing Google’s LangExtract tool

Introducing Google’s LangExtract tool Do RAG without doing RAG with this powerful new NLP and data extraction library The post Introducing Google’s LangExtract tool appeared first on Towards Data Science. Thomas Reid Go to original source

August 12, 2025
From Genes to Neural Networks: Understanding and Building NEAT (Neuro-Evolution of Augmenting Topologies) from Scratch

From Genes to Neural Networks: Understanding and Building NEAT (Neuro-Evolution of Augmenting Topologies) from Scratch Practical Neuroevolution: Reproducing NEAT’s Innovations and Code Walkthrough The post From Genes to Neural Networks: Understanding and Building NEAT (Neuro-Evolution of Augmenting Topologies) from Scratch appeared first on Towards Data Science. Carlos Redondo Go to original source

August 12, 2025
LangGraph + SciPy: Building an AI That Reads Documentation and Makes Decisions

LangGraph + SciPy: Building an AI That Reads Documentation and Makes Decisions Stop guessing your statistical test. Let this AI do it for you. The post LangGraph + SciPy: Building an AI That Reads Documentation and Makes Decisions appeared first on Towards Data Science. Gustavo Santos Go to original source

August 12, 2025
Random Walk Learning and the Pac-Man Attack

Random Walk Learning and the Pac-Man Attack arXiv:2508.05663v1 Announce Type: new Abstract: Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an…

August 11, 2025
Stochastic Trace Optimization of Parameter Dependent Matrices Based on Statistical Learning Theory

Stochastic Trace Optimization of Parameter Dependent Matrices Based on Statistical Learning Theory arXiv:2508.05764v1 Announce Type: new Abstract: We consider matrices $\boldsymbol{A}(\boldsymbol{\theta})\in\mathbb{R}^{m\times m}$ that depend, possibly nonlinearly, on a parameter $\boldsymbol{\theta}$ from a compact parameter space $\Theta$. We present a Monte Carlo estimator for minimizing $\text{trace}(\boldsymbol{A}(\boldsymbol{\theta}))$ over all $\boldsymbol{\theta}\in\Theta$, and determine the sampling amount so that…

August 11, 2025
Reduction Techniques for Survival Analysis

Reduction Techniques for Survival Analysis arXiv:2508.05715v1 Announce Type: new Abstract: In this work, we discuss what we refer to as reduction techniques for survival analysis, that is, techniques that “reduce” a survival task to a more common regression or classification task, without ignoring the specifics of survival data. Such techniques particularly facilitate machine learning-based survival…

August 11, 2025
Lightweight Auto-bidding based on Traffic Prediction in Live Advertising

Lightweight Auto-bidding based on Traffic Prediction in Live Advertising arXiv:2508.06069v1 Announce Type: new Abstract: Internet live streaming is widely used in online entertainment and e-commerce, where live advertising is an important marketing tool for anchors. An advertising campaign hopes to maximize the effect (such as conversions) under constraints (such as budget and cost-per-click). The mainstream…

August 11, 2025
Decorrelated feature importance from local sample weighting

Decorrelated feature importance from local sample weighting arXiv:2508.06337v1 Announce Type: new Abstract: Feature importance (FI) statistics provide a prominent and valuable method of insight into the decision process of machine learning (ML) models, but their effectiveness has well-known limitations when correlation is present among the features in the training data. In this case, the FI…

August 11, 2025
Weekly Entering & Transitioning – Thread 11 Aug, 2025 – 18 Aug, 2025

Weekly Entering & Transitioning – Thread 11 Aug, 2025 – 18 Aug, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

August 11, 2025
Catch-22: Learning R through “hands on” Projects

Catch-22: Learning R through “hands on” Projects I often get told “learn data science by doing hands-on projects” and then I get all fired up and motivated to learn, and then I open up R…. And then I stare at a blank screen because I don’t know the syntax from memory. And then I tell…

August 11, 2025
AI isn’t taking your job. Executives are.

AI isn’t taking your job. Executives are. If AI is ready to replace developers, why aren’t developers replacing themselves with AI and just taking it easy at work? I’m a Director at my company. I’m in the meetings and helping set up the tools that cost people their jobs. Here’s how they work: Claude AI…

August 11, 2025
Burnout, disillusionment, and imposter syndrome after 1 year in DS. Am I just an API monkey? Reality check needed.

Burnout, disillusionment, and imposter syndrome after 1 year in DS. Am I just an API monkey? Reality check needed. Hey folks, I am about a year into my first data science job. It took roughly a year and more than 400 applications to land it, so the idea of another long search is scary. Early…

August 11, 2025
Business focused data science

Business focused data science As a microbiology researcher, I’m far away from the business world. I do more -omics and growth curves and molecular techniques, but I want to move away from biology. I believe the bridge that can help me do that is data. I have got experience with R and excel. I’m looking…

August 11, 2025
How to Design Machine Learning Experiments — the Right Way

How to Design Machine Learning Experiments — the Right Way The key to successful ML projects isn’t always more resources The post How to Design Machine Learning Experiments — the Right Way appeared first on Towards Data Science. TDS Editors Go to original source

August 9, 2025
How to Write Insightful Technical Articles

How to Write Insightful Technical Articles Learn how to write informative technical articles The post How to Write Insightful Technical Articles appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

August 9, 2025
Generating Structured Outputs from LLMs

Generating Structured Outputs from LLMs An overview of popular techniques to confine LLMs’ output to a predefined schema The post Generating Structured Outputs from LLMs appeared first on Towards Data Science. Ibrahim Habib Go to original source

August 9, 2025
Demystifying Cosine Similarity

Demystifying Cosine Similarity Mathematical intuition and practical considerations for NLP scenarios The post Demystifying Cosine Similarity appeared first on Towards Data Science. Chinmay Kakatkar Go to original source

August 9, 2025
Differentially Private Model-X Knockoffs via Johnson-Lindenstrauss Transform

Differentially Private Model-X Knockoffs via Johnson-Lindenstrauss Transform arXiv:2508.04800v1 Announce Type: new Abstract: We introduce a novel privatization framework for high-dimensional controlled variable selection. Our framework enables rigorous False Discovery Rate (FDR) control under differential privacy constraints. While the Model-X knockoff procedure provides FDR guarantees by constructing provably exchangeable “negative control” features, existing privacy mechanisms like…

August 8, 2025
The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models

The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models arXiv:2508.04884v1 Announce Type: new Abstract: In this work, we study the problem of choosing the discretisation schedule for sampling from masked discrete diffusion models in terms of the information geometry of the induced probability path. Specifically, we show that the optimal schedule under the Fisher-Rao…

August 8, 2025
L1-Regularized Functional Support Vector Machine

L1-Regularized Functional Support Vector Machine arXiv:2508.05567v1 Announce Type: new Abstract: In functional data analysis, binary classification with one functional covariate has been extensively studied. We aim to fill in the gap of considering multivariate functional covariates in classification. In particular, we propose an $L_1$-regularized functional support vector machine for binary classification. An accompanying algorithm is…

August 8, 2025
High-Dimensional Differentially Private Quantile Regression: Distributed Estimation and Statistical Inference

High-Dimensional Differentially Private Quantile Regression: Distributed Estimation and Statistical Inference arXiv:2508.05212v1 Announce Type: new Abstract: With the development of big data and machine learning, privacy concerns have become increasingly critical, especially when handling heterogeneous datasets containing sensitive personal information. Differential privacy provides a rigorous framework for safeguarding individual privacy while enabling meaningful statistical analysis. In…

August 8, 2025
High-Order Error Bounds for Markovian LSA with Richardson-Romberg Extrapolation

High-Order Error Bounds for Markovian LSA with Richardson-Romberg Extrapolation arXiv:2508.05570v1 Announce Type: new Abstract: In this paper, we study the bias and high-order error bounds of the Linear Stochastic Approximation (LSA) algorithm with Polyak-Ruppert (PR) averaging under Markovian noise. We focus on the version of the algorithm with constant step size $\alpha$ and propose a…

August 8, 2025
Time Series Forecasting Made Simple (Part 3.2): A Deep Dive into LOESS-Based Smoothing

Time Series Forecasting Made Simple (Part 3.2): A Deep Dive into LOESS-Based Smoothing Explore how STL uses LOESS smoothing to extract trend and seasonal components. The post Time Series Forecasting Made Simple (Part 3.2): A Deep Dive into LOESS-Based Smoothing appeared first on Towards Data Science. Nikhil Dasari Go to original source

August 8, 2025
Finding Golden Examples: A Smarter Approach to In-Context Learning

Finding Golden Examples: A Smarter Approach to In-Context Learning From random example selection to systematic AuPair generation — how to make your LLM prompts actually work The post Finding Golden Examples: A Smarter Approach to In-Context Learning appeared first on Towards Data Science. Sudheer Singh Go to original source

August 8, 2025
Agentic AI: On Evaluations

Agentic AI: On Evaluations Metrics to track for RAG and agents, plus the frameworks that help The post Agentic AI: On Evaluations appeared first on Towards Data Science. Ida Silfverskiöld Go to original source

August 8, 2025
The Channel-Wise Attention | Squeeze and Excitation

The Channel-Wise Attention | Squeeze and Excitation Applying the Squeeze and Excitation module on ResNeXt using PyTorch The post The Channel-Wise Attention | Squeeze and Excitation appeared first on Towards Data Science. Muhammad Ardi Go to original source

August 8, 2025
Reliable Programmatic Weak Supervision with Confidence Intervals for Label Probabilities

Reliable Programmatic Weak Supervision with Confidence Intervals for Label Probabilities arXiv:2508.03896v1 Announce Type: new Abstract: The accurate labeling of datasets is often both costly and time-consuming. Given an unlabeled dataset, programmatic weak supervision obtains probabilistic predictions for the labels by leveraging multiple weak labeling functions (LFs) that provide rough guesses for labels. Weak LFs commonly…

August 7, 2025
Reinforcement Learning in MDPs with Information-Ordered Policies

Reinforcement Learning in MDPs with Information-Ordered Policies arXiv:2508.03904v1 Announce Type: new Abstract: We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, $\pi' \leq \pi$ if data collected under $\pi$ can be used to estimate the performance of $\pi'$,…

August 7, 2025
Deep Neural Network-Driven Adaptive Filtering

Deep Neural Network-Driven Adaptive Filtering arXiv:2508.04258v1 Announce Type: new Abstract: This paper proposes a deep neural network (DNN)-driven framework to address the longstanding generalization challenge in adaptive filtering (AF). In contrast to traditional AF frameworks that emphasize explicit cost function design, the proposed framework shifts the paradigm toward direct gradient acquisition. The DNN, functioning as…

August 7, 2025
Negative binomial regression and inference using a pre-trained transformer

Negative binomial regression and inference using a pre-trained transformer arXiv:2508.04111v1 Announce Type: new Abstract: Negative binomial regression is essential for analyzing over-dispersed count data in comparative studies, but parameter estimation becomes computationally challenging in large screens requiring millions of comparisons. We investigate using a pre-trained transformer to produce estimates of negative binomial regression parameters…

August 7, 2025
The Relative Instability of Model Comparison with Cross-validation

The Relative Instability of Model Comparison with Cross-validation arXiv:2508.04409v1 Announce Type: new Abstract: Existing work has shown that cross-validation (CV) can be used to provide an asymptotic confidence interval for the test error of a stable machine learning algorithm, and existing stability results for many popular algorithms can be applied to derive positive instances where…

August 7, 2025
The MCP Security Survival Guide: Best Practices, Pitfalls, and Real-World Lessons

The MCP Security Survival Guide: Best Practices, Pitfalls, and Real-World Lessons Unless you’re someone who lives and breathes cybersecurity, chances are you didn’t think much about authentication, network exposure, or what happens if someone else finds your server. This guide isn’t here to kill the excitement—it’s here to help you use MCP without opening the…

August 7, 2025
How I Won the “Mostly AI” Synthetic Data Challenge

How I Won the “Mostly AI” Synthetic Data Challenge A deep dive into how post-processing can supercharge synthetic data generation The post How I Won the “Mostly AI” Synthetic Data Challenge appeared first on Towards Data Science. Daniel Gärber Go to original source

August 7, 2025
The Machine, the Expert, and the Common Folks

The Machine, the Expert, and the Common Folks A look at noise, consistency and broken legs The post The Machine, the Expert, and the Common Folks appeared first on Towards Data Science. Lars Nørtoft Reiter Go to original source

August 7, 2025
InfiniBand vs RoCEv2: Choosing the Right Network for Large-Scale AI

InfiniBand vs RoCEv2: Choosing the Right Network for Large-Scale AI Learn how InfiniBand and RoCEv2 enable high-speed GPU communication The post InfiniBand vs RoCEv2: Choosing the Right Network for Large-Scale AI appeared first on Towards Data Science. Shireesh Kumar Singh Go to original source

August 7, 2025
A Dual Optimization View to Empirical Risk Minimization with f-Divergence Regularization

A Dual Optimization View to Empirical Risk Minimization with f-Divergence Regularization arXiv:2508.03314v1 Announce Type: new Abstract: The dual formulation of empirical risk minimization with f-divergence regularization (ERM-fDR) is introduced. The solution of the dual optimization problem to the ERM-fDR is connected to the notion of normalization function introduced as an implicit function. This dual approach…

August 6, 2025
Hedging with memory: shallow and deep learning with signatures

Hedging with memory: shallow and deep learning with signatures arXiv:2508.02759v1 Announce Type: new Abstract: We investigate the use of path signatures in a machine learning context for hedging exotic derivatives under non-Markovian stochastic volatility models. In a deep learning setting, we use signatures as features in feedforward neural networks and show that they outperform LSTMs…

August 6, 2025
Supervised Dynamic Dimension Reduction with Deep Neural Network

Supervised Dynamic Dimension Reduction with Deep Neural Network arXiv:2508.03546v1 Announce Type: new Abstract: This paper studies the problem of dimension reduction, tailored to improving time series forecasting with high-dimensional predictors. We propose a novel Supervised Deep Dynamic Principal component analysis (SDDP) framework that incorporates the target variable and lagged observations into the factor extraction process.…

August 6, 2025
Likelihood Matching for Diffusion Models

Likelihood Matching for Diffusion Models arXiv:2508.03636v1 Announce Type: new Abstract: We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, a quasi-likelihood is considered…

August 6, 2025
Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws arXiv:2508.03688v1 Announce Type: new Abstract: We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta_j}, \boldsymbol{x}\rangle\right), \boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$,…

August 6, 2025
Context Engineering — A Comprehensive Hands-On Tutorial with DSPy

Context Engineering — A Comprehensive Hands-On Tutorial with DSPy Let’s dissect the art and science of context engineering, one module at a time! The post Context Engineering — A Comprehensive Hands-On Tutorial with DSPy appeared first on Towards Data Science. Avishek Biswas Go to original source

August 6, 2025
Things I Wish I Had Known Before Starting ML

Things I Wish I Had Known Before Starting ML Part 2: Guardrails, research code, reading The post Things I Wish I Had Known Before Starting ML appeared first on Towards Data Science. Pascal Janetzky Go to original source

August 6, 2025
How a Research Lab Made Entirely of LLM Agents Developed Molecules That Can Block a Virus

How a Research Lab Made Entirely of LLM Agents Developed Molecules That Can Block a Virus Welcome to the 21st century by the hand of large language models and reasoning AI agents The post How a Research Lab Made Entirely of LLM Agents Developed Molecules That Can Block a Virus appeared first on Towards Data…

August 6, 2025
Stellar Flare Detection and Prediction Using Clustering and Machine Learning

Stellar Flare Detection and Prediction Using Clustering and Machine Learning Combining unsupervised clustering with supervised learning to detect and predict stellar flares The post Stellar Flare Detection and Prediction Using Clustering and Machine Learning appeared first on Towards Data Science. Diksha Sen Chaudhury Go to original source

August 6, 2025
Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 3)

Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 3) Let’s observe the matter on the atomic level The post Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 3) appeared first on Towards Data Science. Dmitrii Eliuseev Go to original source

August 6, 2025
Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling

Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling arXiv:2508.01217v1 Announce Type: new Abstract: Deep learning has revolutionized modern data science. However, how to accurately quantify the uncertainty of predictions from large-scale deep neural networks (DNNs) remains an unresolved issue. To address this issue, we introduce a novel post-processing approach. This approach feeds the output…

August 5, 2025
Inequalities for Optimization of Classification Algorithms: A Perspective Motivated by Diagnostic Testing

Inequalities for Optimization of Classification Algorithms: A Perspective Motivated by Diagnostic Testing arXiv:2508.01065v1 Announce Type: new Abstract: Motivated by canonical problems in medical diagnostics, we propose and study properties of an objective function that uniformly bounds uncertainties in quantities of interest extracted from classifiers and related data analysis tools. We begin by adopting a set-theoretic…

August 5, 2025
Flow IV: Counterfactual Inference In Nonseparable Outcome Models Using Instrumental Variables

Flow IV: Counterfactual Inference In Nonseparable Outcome Models Using Instrumental Variables arXiv:2508.01321v1 Announce Type: new Abstract: To reach human level intelligence, learning algorithms need to incorporate causal reasoning. But identifying causality, and particularly counterfactual reasoning, remains an elusive task. In this paper, we make progress on this task by utilizing instrumental variables (IVs). IVs are…

August 5, 2025
Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: “One Map, Many Trials” in Satellite-Driven Poverty Analysis

Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: “One Map, Many Trials” in Satellite-Driven Poverty Analysis arXiv:2508.01341v1 Announce Type: new Abstract: Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can…

August 5, 2025
Efficient optimization of expensive black-box simulators via marginal means, with application to neutrino detector design

Efficient optimization of expensive black-box simulators via marginal means, with application to neutrino detector design arXiv:2508.01834v1 Announce Type: new Abstract: With advances in scientific computing, computer experiments are increasingly used for optimizing complex systems. However, for modern applications, e.g., the optimization of nuclear physics detectors, each experiment run can require hundreds of CPU hours, making…

August 5, 2025
From Data Scientist IC to Manager: One Year In

From Data Scientist IC to Manager: One Year In Three pillars that shaped my first year in data science management - prioritization, empowerment, and recognition The post From Data Scientist IC to Manager: One Year In appeared first on Towards Data Science. Yu Dong Go to original source

August 5, 2025
Introducing Server-Sent Events in Python

Introducing Server-Sent Events in Python A simpler path to coding real-time web applications. The post Introducing Server-Sent Events in Python appeared first on Towards Data Science. Thomas Reid Go to original source

August 5, 2025
On Adding a Start Value to a Waterfall Chart in Power BI

On Adding a Start Value to a Waterfall Chart in Power BI A waterfall chart can be a powerful tool for conveying information. But it has some limitations. The post On Adding a Start Value to a Waterfall Chart in Power BI appeared first on Towards Data Science. Salvatore Cagliari Go to original source

August 5, 2025
Hands-On with Agents SDK: Multi-Agent Collaboration

Hands-On with Agents SDK: Multi-Agent Collaboration Explore the handoff and agents-as-tools patterns, their use cases, and how to customize them using OpenAI Agents SDK and Streamlit. The post Hands-On with Agents SDK: Multi-Agent Collaboration appeared first on Towards Data Science. Iqbal Rahmadhan Go to original source

August 5, 2025
Does the Code Work or Not?

Does the Code Work or Not? A common misconception about the working state of code in data, AI or software engineering fields. The post Does the Code Work or Not? appeared first on Towards Data Science. Marina Tosic Go to original source

August 5, 2025
funOCLUST: Clustering Functional Data with Outliers

funOCLUST: Clustering Functional Data with Outliers arXiv:2508.00110v1 Announce Type: new Abstract: Functional data present unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers. An extension of the OCLUST algorithm to the functional setting is proposed to address these issues. The approach leverages the OCLUST framework, creating a robust method to…

August 4, 2025
Sinusoidal Approximation Theorem for Kolmogorov-Arnold Networks

Sinusoidal Approximation Theorem for Kolmogorov-Arnold Networks arXiv:2508.00247v1 Announce Type: new Abstract: The Kolmogorov-Arnold representation theorem states that any continuous multivariable function can be exactly represented as a finite superposition of continuous single variable functions. Subsequent simplifications of this representation involve expressing these functions as parameterized sums of a smaller number of unique monotonic functions. These…

August 4, 2025
DO-EM: Density Operator Expectation Maximization

DO-EM: Density Operator Expectation Maximization arXiv:2507.22786v1 Announce Type: cross Abstract: Density operators, quantum generalizations of probability distributions, are gaining prominence in machine learning due to their foundational role in quantum computing. Generative modeling based on density operator models (DOMs) is an emerging field, but existing training algorithms — such as those for the Quantum Boltzmann…

August 4, 2025
Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting

Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting arXiv:2508.00040v1 Announce Type: cross Abstract: This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology integrates regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to…

August 4, 2025
AdapDISCOM: An Adaptive Sparse Regression Method for High-Dimensional Multimodal Data With Block-Wise Missingness and Measurement Errors

AdapDISCOM: An Adaptive Sparse Regression Method for High-Dimensional Multimodal Data With Block-Wise Missingness and Measurement Errors arXiv:2508.00120v1 Announce Type: cross Abstract: Multimodal high-dimensional data are increasingly prevalent in biomedical research, yet they are often compromised by block-wise missingness and measurement errors, posing significant challenges for statistical inference and prediction. We propose AdapDISCOM, a novel adaptive…

August 4, 2025
Personal projects and skill set

Personal projects and skill set Hi everyone, I was just wondering how you guys specify the skills you acquired from personal projects in your CV. I’m in the midst of a pretty large project: an end-to-end pipeline for predicting real-time win probabilities in a game. This includes a lot…

August 4, 2025
Weekly Entering & Transitioning – Thread 04 Aug, 2025 – 11 Aug, 2025

Weekly Entering & Transitioning – Thread 04 Aug, 2025 – 11 Aug, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

August 4, 2025
Built this out of pure laziness for all my Feature engineering/model training jobs

Built this out of pure laziness for all my Feature engineering/model training jobs Built this out of pure laziness. A lightweight Telegram bot that lets me: – Get Databricks job alerts – Check today’s status – Repair failed runs – Pause/reschedule. All from my phone. No laptop. No dashboard. Just / commands. submitted by…

August 4, 2025
Is there a term for internal processing vs data that needs to be stakeholder/customer facing?

Is there a term for internal processing vs data that needs to be stakeholder/customer facing? For example, I had my physical credit card stolen. I was trying to get information from the CC company about when the card was used so that the local PD could check security cameras. (We thought it was a particular person…

August 4, 2025
Hi! I am a junior dev and need advice regarding fraud/risk scoring (not credit) on my rules-based fraud detection system.

Hi! I am a junior dev and need advice regarding fraud/risk scoring (not credit) on my rules-based fraud detection system. So our team has developed a rules-based fraud detection system. Now we have received a new requirement that we have to score every transaction as how much risky or if flagged as fraud how…

August 4, 2025
Mastering NLP with spaCy – Part 2

Mastering NLP with spaCy – Part 2 POS tagging, dependency parser and named entity recognition. The post Mastering NLP with spaCy – Part 2 appeared first on Towards Data Science. Marcello Politi Go to original source

August 2, 2025
How Computers “See” Molecules

How Computers “See” Molecules Generative Molecular Design (Part 1): common molecular representations in data science. The post How Computers “See” Molecules appeared first on Towards Data Science. Tianyuan Zheng Go to original source

August 2, 2025
“I think of analysts as data wizards who help their product teams solve problems”

“I think of analysts as data wizards who help their product teams solve problems” Mariya Mansurova explains how hands-on learning, agentic AI, and engineering habits shape her writing and work. The post “I think of analysts as data wizards who help their product teams solve problems” appeared first on Towards Data Science. TDS Editors Go…

August 2, 2025
When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems

When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems Models don’t just fail with noise; they fail in silence, by narrowing their attention to the point of fragility. The post When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems appeared first on Towards Data Science. Mahe Jabeen Abdul Go…

August 2, 2025
A Smoothing Newton Method for Rank-one Matrix Recovery

A Smoothing Newton Method for Rank-one Matrix Recovery arXiv:2507.23017v1 Announce Type: new Abstract: We consider the phase retrieval problem, which involves recovering a rank-one positive semidefinite matrix from rank-one measurements. A recently proposed algorithm based on Bures-Wasserstein gradient descent (BWGD) exhibits superlinear convergence, but it is unstable, and existing theory can only prove local linear…

August 1, 2025
Optimal Transport Learning: Balancing Value Optimization and Fairness in Individualized Treatment Rules

Optimal Transport Learning: Balancing Value Optimization and Fairness in Individualized Treatment Rules arXiv:2507.23349v1 Announce Type: new Abstract: Individualized treatment rules (ITRs) have gained significant attention due to their wide-ranging applications in fields such as precision medicine, ridesharing, and advertising recommendations. However, when ITRs are influenced by sensitive attributes such as race, gender, or age, they…

August 1, 2025
DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction

DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction arXiv:2507.23736v1 Announce Type: new Abstract: Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes. However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and…

August 1, 2025
Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing

Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing arXiv:2507.23767v1 Announce Type: new Abstract: A novel approach is presented for identifying distinct signatures of performing acts in the secondary ticket resale market by analyzing dynamic pricing distributions. Using a newly curated, time series dataset from the SeatGeek API, we model ticket pricing distributions as…

August 1, 2025
Formal Bayesian Transfer Learning via the Total Risk Prior

Formal Bayesian Transfer Learning via the Total Risk Prior arXiv:2507.23768v1 Announce Type: new Abstract: In analyses with severe data-limitations, augmenting the target dataset with information from ancillary datasets in the application domain, called source datasets, can lead to significantly improved statistical procedures. However, existing methods for this transfer learning struggle to deal with situations where…

August 1, 2025
FastSAM for Image Segmentation Tasks — Explained Simply

FastSAM for Image Segmentation Tasks — Explained Simply Image segmentation is a popular task in computer vision, with the goal of partitioning an input image into multiple regions, where each region represents a separate object. Several classic approaches from the past involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While…

August 1, 2025
How to Benchmark LLMs – ARC AGI 3

How to Benchmark LLMs – ARC AGI 3 Learn how LLMs are benchmarked, and try out the newly released ARC AGI 3 The post How to Benchmark LLMs – ARC AGI 3 appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

August 1, 2025
LLMs and Mental Health

LLMs and Mental Health Are LLMs good or bad for our mental health? It’s more complicated than that. The post LLMs and Mental Health appeared first on Towards Data Science. Stephanie Kirmer Go to original source

August 1, 2025
The ONLY Data Science Roadmap You Need to Get a Job

The ONLY Data Science Roadmap You Need to Get a Job Are you looking to become a data scientist and don’t know where to start? In this article, I want to provide you with a straightforward, no-nonsense learning roadmap that you can follow to break into the industry. By the end, you’ll finally have a clear…

August 1, 2025
Simulating Posterior Bayesian Neural Networks with Dependent Weights

Simulating Posterior Bayesian Neural Networks with Dependent Weights arXiv:2507.22095v1 Announce Type: new Abstract: In this paper we consider posterior Bayesian fully connected and feedforward deep neural networks with dependent weights. Particularly, if the likelihood is Gaussian, we identify the distribution of the wide width limit and provide an algorithm to sample from the network. In…

July 31, 2025
Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration

Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration arXiv:2507.22170v1 Announce Type: new Abstract: Modern data analysis increasingly requires identifying shared latent structure across multiple high-dimensional datasets. A commonly used model assumes that the data matrices are noisy observations of low-rank matrices with a shared singular subspace. In this case, two…

July 31, 2025
LVM-GP: Uncertainty-Aware PDE Solver via coupling latent variable model and Gaussian process

LVM-GP: Uncertainty-Aware PDE Solver via coupling latent variable model and Gaussian process arXiv:2507.22493v1 Announce Type: new Abstract: We propose a novel probabilistic framework, termed LVM-GP, for uncertainty quantification in solving forward and inverse partial differential equations (PDEs) with noisy data. The core idea is to construct a stochastic mapping from the input to a high-dimensional…

July 31, 2025
Subgrid BoostCNN: Efficient Boosting of Convolutional Networks via Gradient-Guided Feature Selection

Subgrid BoostCNN: Efficient Boosting of Convolutional Networks via Gradient-Guided Feature Selection arXiv:2507.22842v1 Announce Type: new Abstract: Convolutional Neural Networks (CNNs) have achieved remarkable success across a wide range of machine learning tasks by leveraging hierarchical feature learning through deep architectures. However, the large number of layers and millions of parameters often make CNNs computationally expensive…

July 31, 2025
A Unified Analysis of Generalization and Sample Complexity for Semi-Supervised Domain Adaptation

A Unified Analysis of Generalization and Sample Complexity for Semi-Supervised Domain Adaptation arXiv:2507.22632v1 Announce Type: new Abstract: Domain adaptation seeks to leverage the abundant label information in a source domain to improve classification performance in a target domain with limited labels. While the field has seen extensive methodological development, its theoretical foundations remain relatively underexplored.…

July 31, 2025
The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix

The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix Retraining is easy; knowing when not to is the real challenge. In machine learning, performance drops are rarely about stale weights; they’re about misunderstood signals. The post The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix appeared first on Towards Data Science.…

July 31, 2025
Confusion Matrix Made Simple: Accuracy, Precision, Recall & F1-Score

Confusion Matrix Made Simple: Accuracy, Precision, Recall & F1-Score How to evaluate classification models and understand which metric matters the most. The post Confusion Matrix Made Simple: Accuracy, Precision, Recall & F1-Score appeared first on Towards Data Science. Nikhil Dasari Go to original source

July 31, 2025
What Is Data Literacy in 2025? It’s Not What You Think

What Is Data Literacy in 2025? It’s Not What You Think In today’s fast-paced, distraction-heavy world, data literacy isn’t just about understanding charts or analyzing numbers—it’s about context, clarity, and human connection. With attention spans shrinking and AI-generated insights flooding our screens, even highly skilled professionals can behave like data novices. The real challenge isn’t…

July 31, 2025
Automated Testing: A Software Engineering Concept Data Scientists Must Know To Succeed

Automated Testing: A Software Engineering Concept Data Scientists Must Know To Succeed Why you should read this article Most data scientists whip up a Jupyter Notebook, play around in some cells, and then maintain entire data processing and model training pipelines in the same notebook. The code is tested once when the notebook was first…

July 31, 2025