Category: aimlds

  • Hierarchical Variable Importance with Statistical Control for Medical Data-Based Prediction

    arXiv:2508.08724v1. Announce Type: new. Abstract: Recent advances in machine learning have greatly expanded the repertoire of predictive methods for medical imaging. However, the interpretability of complex models remains a challenge, which limits their utility in medical applications. Recently, model-agnostic methods have been proposed to measure…

  • Bio-Inspired Artificial Neural Networks based on Predictive Coding

    arXiv:2508.08762v1. Announce Type: new. Abstract: Backpropagation (BP) of errors is the backbone training algorithm for artificial neural networks (ANNs). It updates network weights through gradient descent to minimize a loss function representing the mismatch between predictions and desired outputs. BP uses the chain rule to propagate the…

  • Reducing Time to Value for Data Science Projects: Part 4

    Embrace your inner software developer. Source: Towards Data Science, Kristopher McGlinchey.

  • Model Predictive Control Basics

    A hands-on tutorial with Python and CasADi. Source: Towards Data Science, Willem Esterhuizen.

  • Coconut: A Framework for Latent Reasoning in LLMs

    Explaining Coconut (Training Large Language Models to Reason in a Continuous Latent Space) in simple terms. Source: Towards Data Science, Youssef Farag.

  • A Refined Training Recipe for Fine-Grained Visual Classification

    How FGVC aims to recognize images belonging to multiple subordinate categories of a super-category. Source: Towards Data Science, Ahmed Belgacem.

  • Fine-Tune Your Topic Modeling Workflow with BERTopic

    Learn how to fine-tune BERTopic settings for more focused, reproducible, and interpretable results. Source: Towards Data Science, Tiffany Chen.

  • Federated Online Learning for Heterogeneous Multisource Streaming Data

    arXiv:2508.06652v1. Announce Type: new. Abstract: Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. Most existing federated learning methods focus on “static” datasets. However, in many real-world applications, data arrive continuously over time, forming streaming datasets. This introduces additional…

  • MOCA-HESP: Meta High-dimensional Bayesian Optimization for Combinatorial and Mixed Spaces via Hyper-ellipsoid Partitioning

    arXiv:2508.06847v1. Announce Type: new. Abstract: High-dimensional Bayesian Optimization (BO) has attracted significant attention in recent research. However, existing methods have mainly focused on optimizing in continuous domains, while combinatorial (ordinal and categorical) and mixed domains still remain challenging. In this paper, we…

  • Membership Inference Attacks with False Discovery Rate Control

    arXiv:2508.07066v1. Announce Type: new. Abstract: Recent studies have shown that deep learning models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. To analyze and study these vulnerabilities, various MIA methods have…

  • Statistical Inference for Autoencoder-based Anomaly Detection after Representation Learning-based Domain Adaptation

    arXiv:2508.07049v1. Announce Type: new. Abstract: Anomaly detection (AD) plays a vital role across a wide range of domains, but its performance might deteriorate when applied to target domains with limited data. Domain Adaptation (DA) offers a solution by transferring knowledge from a related source…

  • Stochastic dynamics learning with state-space systems

    arXiv:2508.07876v1. Announce Type: new. Abstract: This work advances the theoretical foundations of reservoir computing (RC) by providing a unified treatment of fading memory and the echo state property (ESP) in both deterministic and stochastic settings. We investigate state-space systems, a central model class in time series learning, and establish…

  • Estimating from No Data: Deriving a Continuous Score from Categories

    A walk-through of, and the maths behind, using low-capacity networks to obtain fine-grained scores when only categorical labels are available for training. We use it to predict the severity of an infection on a scale, based only on rough outcomes in previous cases.…

  • Introducing Google’s LangExtract tool

    Do RAG without doing RAG with this powerful new NLP and data extraction library. Source: Towards Data Science, Thomas Reid.

  • From Genes to Neural Networks: Understanding and Building NEAT (Neuro-Evolution of Augmenting Topologies) from Scratch

    Practical Neuroevolution: Reproducing NEAT’s Innovations and Code Walkthrough. Source: Towards Data Science, Carlos Redondo.

  • LangGraph + SciPy: Building an AI That Reads Documentation and Makes Decisions

    Stop guessing your statistical test. Let this AI do it for you. Source: Towards Data Science, Gustavo Santos.

  • Random Walk Learning and the Pac-Man Attack

    arXiv:2508.05663v1. Announce Type: new. Abstract: Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an…

  • Stochastic Trace Optimization of Parameter Dependent Matrices Based on Statistical Learning Theory

    arXiv:2508.05764v1. Announce Type: new. Abstract: We consider matrices $\boldsymbol{A}(\boldsymbol{\theta})\in\mathbb{R}^{m\times m}$ that depend, possibly nonlinearly, on a parameter $\boldsymbol{\theta}$ from a compact parameter space $\Theta$. We present a Monte Carlo estimator for minimizing $\text{trace}(\boldsymbol{A}(\boldsymbol{\theta}))$ over all $\boldsymbol{\theta}\in\Theta$, and determine the sampling amount so that…

  • Reduction Techniques for Survival Analysis

    arXiv:2508.05715v1. Announce Type: new. Abstract: In this work, we discuss what we refer to as reduction techniques for survival analysis, that is, techniques that “reduce” a survival task to a more common regression or classification task, without ignoring the specifics of survival data. Such techniques particularly facilitate machine learning-based survival…

  • Lightweight Auto-bidding based on Traffic Prediction in Live Advertising

    arXiv:2508.06069v1. Announce Type: new. Abstract: Internet live streaming is widely used in online entertainment and e-commerce, where live advertising is an important marketing tool for anchors. An advertising campaign hopes to maximize the effect (such as conversions) under constraints (such as budget and cost-per-click). The mainstream…

  • Decorrelated feature importance from local sample weighting

    arXiv:2508.06337v1. Announce Type: new. Abstract: Feature importance (FI) statistics provide a prominent and valuable method of insight into the decision process of machine learning (ML) models, but their effectiveness has well-known limitations when correlation is present among the features in the training data. In this case, the FI…

  • Weekly Entering & Transitioning – Thread 11 Aug, 2025 – 18 Aug, 2025

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

  • Catch-22: Learning R through “hands on” Projects

    I often get told “learn data science by doing hands-on projects” and then I get all fired up and motivated to learn, and then I open up R…. And then I stare at a blank screen because I don’t know the syntax from memory. And then I tell…

  • AI isn’t taking your job. Executives are.

    If AI is ready to replace developers, why aren’t developers replacing themselves with AI and just taking it easy at work? I’m a Director at my company. I’m in the meetings and helping set up the tools that cost people their jobs. Here’s how they work: Claude AI…

  • Burnout, disillusionment, and imposter syndrome after 1 year in DS. Am I just an API monkey? Reality check needed.

    Hey folks, I am about a year into my first data science job. It took roughly a year and more than 400 applications to land it, so the idea of another long search is scary. Early…

  • Business focused data science

    As a microbiology researcher, I’m far away from the business world. I do more -omics and growth curves and molecular techniques, but I want to move away from biology. I believe the bridge that can help me do that is data. I have got experience with R and Excel. I’m looking…

  • How to Design Machine Learning Experiments — the Right Way

    The key to successful ML projects isn’t always more resources. Source: Towards Data Science, TDS Editors.

  • How to Write Insightful Technical Articles

    Learn how to write informative technical articles. Source: Towards Data Science, Eivind Kjosbakken.

  • Generating Structured Outputs from LLMs

    An overview of popular techniques to confine LLMs’ output to a predefined schema. Source: Towards Data Science, Ibrahim Habib.

  • Demystifying Cosine Similarity

    Mathematical intuition and practical considerations for NLP scenarios. Source: Towards Data Science, Chinmay Kakatkar.

  • Differentially Private Model-X Knockoffs via Johnson-Lindenstrauss Transform

    arXiv:2508.04800v1. Announce Type: new. Abstract: We introduce a novel privatization framework for high-dimensional controlled variable selection. Our framework enables rigorous False Discovery Rate (FDR) control under differential privacy constraints. While the Model-X knockoff procedure provides FDR guarantees by constructing provably exchangeable “negative control” features, existing privacy mechanisms like…

  • The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models

    arXiv:2508.04884v1. Announce Type: new. Abstract: In this work, we study the problem of choosing the discretisation schedule for sampling from masked discrete diffusion models in terms of the information geometry of the induced probability path. Specifically, we show that the optimal schedule under the Fisher-Rao…

  • L1-Regularized Functional Support Vector Machine

    arXiv:2508.05567v1. Announce Type: new. Abstract: In functional data analysis, binary classification with one functional covariate has been extensively studied. We aim to fill in the gap of considering multivariate functional covariates in classification. In particular, we propose an $L_1$-regularized functional support vector machine for binary classification. An accompanying algorithm is…

  • High-Dimensional Differentially Private Quantile Regression: Distributed Estimation and Statistical Inference

    arXiv:2508.05212v1. Announce Type: new. Abstract: With the development of big data and machine learning, privacy concerns have become increasingly critical, especially when handling heterogeneous datasets containing sensitive personal information. Differential privacy provides a rigorous framework for safeguarding individual privacy while enabling meaningful statistical analysis. In…

  • High-Order Error Bounds for Markovian LSA with Richardson-Romberg Extrapolation

    arXiv:2508.05570v1. Announce Type: new. Abstract: In this paper, we study the bias and high-order error bounds of the Linear Stochastic Approximation (LSA) algorithm with Polyak-Ruppert (PR) averaging under Markovian noise. We focus on the version of the algorithm with constant step size $\alpha$ and propose a…

  • Time Series Forecasting Made Simple (Part 3.2): A Deep Dive into LOESS-Based Smoothing

    Explore how STL uses LOESS smoothing to extract trend and seasonal components. Source: Towards Data Science, Nikhil Dasari.

  • Finding Golden Examples: A Smarter Approach to In-Context Learning

    From random example selection to systematic AuPair generation: how to make your LLM prompts actually work. Source: Towards Data Science, Sudheer Singh.

  • Agentic AI: On Evaluations

    Metrics to track for RAG and agents, plus the frameworks that help. Source: Towards Data Science, Ida Silfverskiöld.

  • The Channel-Wise Attention | Squeeze and Excitation

    Applying the Squeeze and Excitation module on ResNeXt using PyTorch. Source: Towards Data Science, Muhammad Ardi.

  • Reliable Programmatic Weak Supervision with Confidence Intervals for Label Probabilities

    arXiv:2508.03896v1. Announce Type: new. Abstract: The accurate labeling of datasets is often both costly and time-consuming. Given an unlabeled dataset, programmatic weak supervision obtains probabilistic predictions for the labels by leveraging multiple weak labeling functions (LFs) that provide rough guesses for labels. Weak LFs commonly…

  • Reinforcement Learning in MDPs with Information-Ordered Policies

    arXiv:2508.03904v1. Announce Type: new. Abstract: We propose an epoch-based reinforcement learning algorithm for infinite-horizon average-cost Markov decision processes (MDPs) that leverages a partial order over a policy class. In this structure, $\pi' \leq \pi$ if data collected under $\pi$ can be used to estimate the performance of $\pi'$,…

  • Deep Neural Network-Driven Adaptive Filtering

    arXiv:2508.04258v1. Announce Type: new. Abstract: This paper proposes a deep neural network (DNN)-driven framework to address the longstanding generalization challenge in adaptive filtering (AF). In contrast to traditional AF frameworks that emphasize explicit cost function design, the proposed framework shifts the paradigm toward direct gradient acquisition. The DNN, functioning as…

  • Negative binomial regression and inference using a pre-trained transformer

    arXiv:2508.04111v1. Announce Type: new. Abstract: Negative binomial regression is essential for analyzing over-dispersed count data in comparative studies, but parameter estimation becomes computationally challenging in large screens requiring millions of comparisons. We investigate using a pre-trained transformer to produce estimates of negative binomial regression parameters…

  • The Relative Instability of Model Comparison with Cross-validation

    arXiv:2508.04409v1. Announce Type: new. Abstract: Existing work has shown that cross-validation (CV) can be used to provide an asymptotic confidence interval for the test error of a stable machine learning algorithm, and existing stability results for many popular algorithms can be applied to derive positive instances where…

  • The MCP Security Survival Guide: Best Practices, Pitfalls, and Real-World Lessons

    Unless you’re someone who lives and breathes cybersecurity, chances are you didn’t think much about authentication, network exposure, or what happens if someone else finds your server. This guide isn’t here to kill the excitement; it’s here to help you use MCP without opening the…

  • How I Won the “Mostly AI” Synthetic Data Challenge

    A deep dive into how post-processing can supercharge synthetic data generation. Source: Towards Data Science, Daniel Gärber.

  • The Machine, the Expert, and the Common Folks

    A look at noise, consistency and broken legs. Source: Towards Data Science, Lars Nørtoft Reiter.

  • InfiniBand vs RoCEv2: Choosing the Right Network for Large-Scale AI

    Learn how InfiniBand and RoCEv2 enable high-speed GPU communication. Source: Towards Data Science, Shireesh Kumar Singh.

  • A Dual Optimization View to Empirical Risk Minimization with f-Divergence Regularization

    arXiv:2508.03314v1. Announce Type: new. Abstract: The dual formulation of empirical risk minimization with f-divergence regularization (ERM-fDR) is introduced. The solution of the dual optimization problem to the ERM-fDR is connected to the notion of normalization function introduced as an implicit function. This dual approach…

  • Hedging with memory: shallow and deep learning with signatures

    arXiv:2508.02759v1. Announce Type: new. Abstract: We investigate the use of path signatures in a machine learning context for hedging exotic derivatives under non-Markovian stochastic volatility models. In a deep learning setting, we use signatures as features in feedforward neural networks and show that they outperform LSTMs…

  • Supervised Dynamic Dimension Reduction with Deep Neural Network

    arXiv:2508.03546v1. Announce Type: new. Abstract: This paper studies the problem of dimension reduction, tailored to improving time series forecasting with high-dimensional predictors. We propose a novel Supervised Deep Dynamic Principal component analysis (SDDP) framework that incorporates the target variable and lagged observations into the factor extraction process.…

  • Likelihood Matching for Diffusion Models

    arXiv:2508.03636v1. Announce Type: new. Abstract: We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, a quasi-likelihood is considered…

  • Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

    arXiv:2508.03688v1. Announce Type: new. Abstract: We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r}\lambda_j \sigma\left(\langle \boldsymbol{\theta_j}, \boldsymbol{x}\rangle\right), \boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$,…

  • Context Engineering — A Comprehensive Hands-On Tutorial with DSPy

    Let’s dissect the art and science of context engineering, one module at a time! Source: Towards Data Science, Avishek Biswas.

  • Things I Wish I Had Known Before Starting ML

    Part 2: Guardrails, research code, reading. Source: Towards Data Science, Pascal Janetzky.

  • How a Research Lab Made Entirely of LLM Agents Developed Molecules That Can Block a Virus

    Welcome to the 21st century by the hand of large language models and reasoning AI agents. Source: Towards Data…

  • Stellar Flare Detection and Prediction Using Clustering and Machine Learning

    Combining unsupervised clustering with supervised learning to detect and predict stellar flares. Source: Towards Data Science, Diksha Sen Chaudhury.

  • Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 3)

    Let’s observe the matter on the atomic level. Source: Towards Data Science, Dmitrii Eliuseev.

  • Uncertainty Quantification for Large-Scale Deep Networks via Post-StoNet Modeling

    arXiv:2508.01217v1. Announce Type: new. Abstract: Deep learning has revolutionized modern data science. However, how to accurately quantify the uncertainty of predictions from large-scale deep neural networks (DNNs) remains an unresolved issue. To address this issue, we introduce a novel post-processing approach. This approach feeds the output…

  • Inequalities for Optimization of Classification Algorithms: A Perspective Motivated by Diagnostic Testing

    arXiv:2508.01065v1. Announce Type: new. Abstract: Motivated by canonical problems in medical diagnostics, we propose and study properties of an objective function that uniformly bounds uncertainties in quantities of interest extracted from classifiers and related data analysis tools. We begin by adopting a set-theoretic…

  • Flow IV: Counterfactual Inference In Nonseparable Outcome Models Using Instrumental Variables

    arXiv:2508.01321v1. Announce Type: new. Abstract: To reach human level intelligence, learning algorithms need to incorporate causal reasoning. But identifying causality, and particularly counterfactual reasoning, remains an elusive task. In this paper, we make progress on this task by utilizing instrumental variables (IVs). IVs are…

  • Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: “One Map, Many Trials” in Satellite-Driven Poverty Analysis

    arXiv:2508.01341v1. Announce Type: new. Abstract: Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can…

  • Efficient optimization of expensive black-box simulators via marginal means, with application to neutrino detector design

    arXiv:2508.01834v1. Announce Type: new. Abstract: With advances in scientific computing, computer experiments are increasingly used for optimizing complex systems. However, for modern applications, e.g., the optimization of nuclear physics detectors, each experiment run can require hundreds of CPU hours, making…

  • From Data Scientist IC to Manager: One Year In

    Three pillars that shaped my first year in data science management: prioritization, empowerment, and recognition. Source: Towards Data Science, Yu Dong.

  • Introducing Server-Sent Events in Python

    A simpler path to coding real-time web applications. Source: Towards Data Science, Thomas Reid.

  • On Adding a Start Value to a Waterfall Chart in Power BI

    A waterfall chart can be a powerful tool for conveying information. But it has some limitations. Source: Towards Data Science, Salvatore Cagliari.

  • Hands-On with Agents SDK: Multi-Agent Collaboration

    Explore the handoff and agents-as-tools patterns, their use cases, and how to customize them using OpenAI Agents SDK and Streamlit. Source: Towards Data Science, Iqbal Rahmadhan.

  • Does the Code Work or Not? 

    A common misconception about the working state of code in data, AI or software engineering fields. Source: Towards Data Science, Marina Tosic.

  • funOCLUST: Clustering Functional Data with Outliers

    arXiv:2508.00110v1. Announce Type: new. Abstract: Functional data present unique challenges for clustering due to their infinite-dimensional nature and potential sensitivity to outliers. An extension of the OCLUST algorithm to the functional setting is proposed to address these issues. The approach leverages the OCLUST framework, creating a robust method to…

  • Sinusoidal Approximation Theorem for Kolmogorov-Arnold Networks

    arXiv:2508.00247v1. Announce Type: new. Abstract: The Kolmogorov-Arnold representation theorem states that any continuous multivariable function can be exactly represented as a finite superposition of continuous single variable functions. Subsequent simplifications of this representation involve expressing these functions as parameterized sums of a smaller number of unique monotonic functions. These…

  • DO-EM: Density Operator Expectation Maximization

    arXiv:2507.22786v1. Announce Type: cross. Abstract: Density operators, quantum generalizations of probability distributions, are gaining prominence in machine learning due to their foundational role in quantum computing. Generative modeling based on density operator models (DOMs) is an emerging field, but existing training algorithms, such as those for the Quantum Boltzmann…

  • Regime-Aware Conditional Neural Processes with Multi-Criteria Decision Support for Operational Electricity Price Forecasting

    arXiv:2508.00040v1. Announce Type: cross. Abstract: This work integrates Bayesian regime detection with conditional neural processes for 24-hour electricity price prediction in the German market. Our methodology integrates regime detection using a disentangled sticky hierarchical Dirichlet process hidden Markov model (DS-HDP-HMM) applied to…

  • AdapDISCOM: An Adaptive Sparse Regression Method for High-Dimensional Multimodal Data With Block-Wise Missingness and Measurement Errors

    arXiv:2508.00120v1. Announce Type: cross. Abstract: Multimodal high-dimensional data are increasingly prevalent in biomedical research, yet they are often compromised by block-wise missingness and measurement errors, posing significant challenges for statistical inference and prediction. We propose AdapDISCOM, a novel adaptive…

  • Personal projects and skill set

    Personal projects and skill set Hi everyone, I was just wondering how you guys list skills acquired from personal projects on your CV. I’m in the midst of a pretty large project – an end-to-end pipeline for predicting real-time win probabilities in a game. This includes a lot…

  • Weekly Entering & Transitioning – Thread 04 Aug, 2025 – 11 Aug, 2025

    Weekly Entering & Transitioning – Thread 04 Aug, 2025 – 11 Aug, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • Built this out of pure laziness for all my Feature engineering/model training jobs

    Built this out of pure laziness for all my Feature engineering/model training jobs Built this out of pure laziness A lightweight Telegram bot that lets me: – Get Databricks job alerts – Check today’s status – Repair failed runs – Pause/reschedule. All from my phone. No laptop. No dashboard. Just / commands. submitted by…

  • Is there a term for internal processing vs data that needs to be stakeholder/customer facing?

    Is there a term for internal processing vs data that needs to be stakeholder/customer facing? For example, I had my physical credit card stolen. I was trying to get information from the CC company about when the card was used so that the local PD could check security cameras. (We thought it was a particular person…

  • Hi! I am a junior dev and need advice regarding fraud/risk scoring (not credit) on my rules-based fraud detection system.

    Hi! I am a junior dev and need advice regarding fraud/risk scoring (not credit) on my rules-based fraud detection system. So our team has developed a rules-based fraud detection system. Now we have received a new requirement that we have to score every transaction by how risky it is, or if flagged as fraud how…

  • Mastering NLP with spaCy – Part 2

    Mastering NLP with spaCy – Part 2 POS tagging, dependency parser and named entity recognition. The post Mastering NLP with spaCy – Part 2 appeared first on Towards Data Science. Marcello Politi Go to original source

  • How Computers “See” Molecules

    How Computers “See” Molecules Generative Molecular Design (Part 1): common molecular representations in data science. The post How Computers “See” Molecules appeared first on Towards Data Science. Tianyuan Zheng Go to original source

  • “I think of analysts as data wizards who help their product teams solve problems”

    “I think of analysts as data wizards who help their product teams solve problems” Mariya Mansurova explains how hands-on learning, agentic AI, and engineering habits shape her writing and work. The post “I think of analysts as data wizards who help their product teams solve problems” appeared first on Towards Data Science. TDS Editors Go…

  • When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems

    When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems Models don’t just fail with noise; they fail in silence, by narrowing their attention to the point of fragility. The post When Models Stop Listening: How Feature Collapse Quietly Erodes Machine Learning Systems appeared first on Towards Data Science. Mahe Jabeen Abdul Go…

  • A Smoothing Newton Method for Rank-one Matrix Recovery

    A Smoothing Newton Method for Rank-one Matrix Recovery arXiv:2507.23017v1 Announce Type: new Abstract: We consider the phase retrieval problem, which involves recovering a rank-one positive semidefinite matrix from rank-one measurements. A recently proposed algorithm based on Bures-Wasserstein gradient descent (BWGD) exhibits superlinear convergence, but it is unstable, and existing theory can only prove local linear…

  • Optimal Transport Learning: Balancing Value Optimization and Fairness in Individualized Treatment Rules

    Optimal Transport Learning: Balancing Value Optimization and Fairness in Individualized Treatment Rules arXiv:2507.23349v1 Announce Type: new Abstract: Individualized treatment rules (ITRs) have gained significant attention due to their wide-ranging applications in fields such as precision medicine, ridesharing, and advertising recommendations. However, when ITRs are influenced by sensitive attributes such as race, gender, or age, they…

  • DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction

    DICOM De-Identification via Hybrid AI and Rule-Based Framework for Scalable, Uncertainty-Aware Redaction arXiv:2507.23736v1 Announce Type: new Abstract: Access to medical imaging and associated text data has the potential to drive major advances in healthcare research and patient outcomes. However, the presence of Protected Health Information (PHI) and Personally Identifiable Information (PII) in Digital Imaging and…

  • Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing

    Scaled Beta Models and Feature Dilution for Dynamic Ticket Pricing arXiv:2507.23767v1 Announce Type: new Abstract: A novel approach is presented for identifying distinct signatures of performing acts in the secondary ticket resale market by analyzing dynamic pricing distributions. Using a newly curated, time series dataset from the SeatGeek API, we model ticket pricing distributions as…

  • Formal Bayesian Transfer Learning via the Total Risk Prior

    Formal Bayesian Transfer Learning via the Total Risk Prior arXiv:2507.23768v1 Announce Type: new Abstract: In analyses with severe data-limitations, augmenting the target dataset with information from ancillary datasets in the application domain, called source datasets, can lead to significantly improved statistical procedures. However, existing methods for this transfer learning struggle to deal with situations where…

  • FastSAM  for Image Segmentation Tasks — Explained Simply

    FastSAM  for Image Segmentation Tasks — Explained Simply Image segmentation is a popular task in computer vision, with the goal of partitioning an input image into multiple regions, where each region represents a separate object. Several classic approaches from the past involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While…

  • How to Benchmark LLMs – ARC AGI 3

    How to Benchmark LLMs – ARC AGI 3 Learn how LLMs are benchmarked, and try out the newly released ARC AGI 3 The post How to Benchmark LLMs – ARC AGI 3 appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

  • LLMs and Mental Health

    LLMs and Mental Health Are LLMs good or bad for our mental health? It’s more complicated than that. The post LLMs and Mental Health appeared first on Towards Data Science. Stephanie Kirmer Go to original source

  • The ONLY Data Science Roadmap You Need to Get a Job

    The ONLY Data Science Roadmap You Need to Get a Job Are you looking to become a data scientist and don’t know where to start? In this article, I want to provide you with a straightforward, no-nonsense learning roadmap that you can follow to break into the industry. By the end, you’ll finally have a clear…

  • Simulating Posterior Bayesian Neural Networks with Dependent Weights

    Simulating Posterior Bayesian Neural Networks with Dependent Weights arXiv:2507.22095v1 Announce Type: new Abstract: In this paper we consider posterior Bayesian fully connected and feedforward deep neural networks with dependent weights. Particularly, if the likelihood is Gaussian, we identify the distribution of the wide width limit and provide an algorithm to sample from the network. In…

  • Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration

    Stacked SVD or SVD stacked? A Random Matrix Theory perspective on data integration arXiv:2507.22170v1 Announce Type: new Abstract: Modern data analysis increasingly requires identifying shared latent structure across multiple high-dimensional datasets. A commonly used model assumes that the data matrices are noisy observations of low-rank matrices with a shared singular subspace. In this case, two…

  • LVM-GP: Uncertainty-Aware PDE Solver via coupling latent variable model and Gaussian process

    LVM-GP: Uncertainty-Aware PDE Solver via coupling latent variable model and Gaussian process arXiv:2507.22493v1 Announce Type: new Abstract: We propose a novel probabilistic framework, termed LVM-GP, for uncertainty quantification in solving forward and inverse partial differential equations (PDEs) with noisy data. The core idea is to construct a stochastic mapping from the input to a high-dimensional…

  • Subgrid BoostCNN: Efficient Boosting of Convolutional Networks via Gradient-Guided Feature Selection

    Subgrid BoostCNN: Efficient Boosting of Convolutional Networks via Gradient-Guided Feature Selection arXiv:2507.22842v1 Announce Type: new Abstract: Convolutional Neural Networks (CNNs) have achieved remarkable success across a wide range of machine learning tasks by leveraging hierarchical feature learning through deep architectures. However, the large number of layers and millions of parameters often make CNNs computationally expensive…

  • A Unified Analysis of Generalization and Sample Complexity for Semi-Supervised Domain Adaptation

    A Unified Analysis of Generalization and Sample Complexity for Semi-Supervised Domain Adaptation arXiv:2507.22632v1 Announce Type: new Abstract: Domain adaptation seeks to leverage the abundant label information in a source domain to improve classification performance in a target domain with limited labels. While the field has seen extensive methodological development, its theoretical foundations remain relatively underexplored.…

  • The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix

    The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix Retraining is easy; knowing when not to is the real challenge. In machine learning, performance drops are rarely about stale weights; they’re about misunderstood signals. The post The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix appeared first on Towards Data Science.…

  • Confusion Matrix Made Simple: Accuracy, Precision, Recall & F1-Score

    Confusion Matrix Made Simple: Accuracy, Precision, Recall & F1-Score How to evaluate classification models and understand which metric matters the most. The post Confusion Matrix Made Simple: Accuracy, Precision, Recall & F1-Score appeared first on Towards Data Science. Nikhil Dasari Go to original source

  • What Is Data Literacy in 2025? It’s Not What You Think

    What Is Data Literacy in 2025? It’s Not What You Think In today’s fast-paced, distraction-heavy world, data literacy isn’t just about understanding charts or analyzing numbers—it’s about context, clarity, and human connection. With attention spans shrinking and AI-generated insights flooding our screens, even highly skilled professionals can behave like data novices. The real challenge isn’t…

  • Automated Testing: A Software Engineering Concept Data Scientists Must Know To Succeed

    Automated Testing: A Software Engineering Concept Data Scientists Must Know To Succeed Why you should read this article Most data scientists whip up a Jupyter Notebook, play around in some cells, and then maintain entire data processing and model training pipelines in the same notebook. The code is tested once when the notebook was first…