Category: aimlds

Sparse minimum Redundancy Maximum Relevance for feature selection

arXiv:2508.18901v1 Announce Type: new Abstract: We propose a feature screening method that integrates both feature-feature and feature-target relationships. Inactive features are identified via a penalized minimum Redundancy Maximum Relevance (mRMR) procedure, which is the continuous version of the classic mRMR penalized by a non-convex regularizer, and where…

August 27, 2025
Echoes of the past: A unified perspective on fading memory and echo states

arXiv:2508.19145v1 Announce Type: new Abstract: Recurrent neural networks (RNNs) have become increasingly popular in information processing tasks involving time series and temporal data. A fundamental property of RNNs is their ability to create reliable input/output responses, often linked to how the network…

August 27, 2025
How to Develop Powerful Internal LLM Benchmarks

Learn how to compare LLMs using your own internal benchmark. The post How to Develop Powerful Internal LLM Benchmarks appeared first on Towards Data Science. Eivind Kjosbakken

August 27, 2025
Plato’s Cave and the Shadows of Data

On truth, illusion, and the limits of what data can reveal. The post Plato’s Cave and the Shadows of Data appeared first on Towards Data Science. Pol Marin

August 27, 2025
Using Google’s LangExtract and Gemma for Structured Data Extraction

Extracting structured information effectively and accurately from long unstructured text with LangExtract and LLMs. The post Using Google’s LangExtract and Gemma for Structured Data Extraction appeared first on Towards Data Science. Kenneth Leung

August 27, 2025
Positional Embeddings in Transformers: A Math Guide to RoPE & ALiBi

Learn APE, RoPE, and ALiBi positional embeddings for GPT: intuitions, math, PyTorch code, and experiments on TinyStories. The post Positional Embeddings in Transformers: A Math Guide to RoPE & ALiBi appeared first on Towards Data Science. Sathya Krishnan Suresh

August 27, 2025
Google’s URL Context Grounding: Another Nail in RAG’s Coffin?

Google’s hot streak in AI-related releases continues unabated. Just a few days ago, it released a new tool for Gemini called URL context grounding. URL context grounding can be used stand-alone or combined with Google search grounding to conduct deep dives into internet content. What is…

August 27, 2025
GraphPPD: Posterior Predictive Modelling for Graph-Level Inference

arXiv:2508.16995v1 Announce Type: new Abstract: Accurate modelling and quantification of predictive uncertainty is crucial in deep learning since it allows a model to make safer decisions when the data is ambiguous and facilitates the users’ understanding of the model’s confidence in its predictions. Along with the tremendously increasing…

August 26, 2025
Limitations of refinement methods for weak to strong generalization

arXiv:2508.17018v1 Announce Type: new Abstract: Standard techniques for aligning large language models (LLMs) utilize human-produced data, which could limit the capability of any aligned LLM to human level. Label refinement and weak training have emerged as promising strategies to address this superalignment problem. In this work,…

August 26, 2025
CP4SBI: Local Conformal Calibration of Credible Sets in Simulation-Based Inference

arXiv:2508.17077v1 Announce Type: new Abstract: Current experimental scientists have been increasingly relying on simulation-based inference (SBI) to invert complex non-linear models with intractable likelihoods. However, posterior approximations obtained with SBI are often miscalibrated, causing credible regions to undercover true parameters. We develop $\texttt{CP4SBI}$, a model-agnostic…

August 26, 2025
Neural Stochastic Differential Equations on Compact State-Spaces

arXiv:2508.17090v1 Announce Type: new Abstract: Many modern probabilistic models rely on SDEs, but their adoption is hampered by instability, poor inductive bias outside bounded domains, and reliance on restrictive dynamics or training tricks. While recent work constrains SDEs to compact spaces using reflected dynamics, these approaches lack continuous…

August 26, 2025
Rao Differential Privacy

arXiv:2508.17135v1 Announce Type: new Abstract: Differential privacy (DP) has recently emerged as a definition of privacy to release private estimates. DP calibrates noise to be on the order of an individual's contribution. Due to this calibration, a private estimate obscures any individual while preserving the utility of the estimate. Since the…

August 26, 2025
LLM Monitoring and Observability: Hands-on with Langfuse

Learn the fundamentals of LLM monitoring and observability, from tracing to evaluation and setting up a dashboard using Langfuse. The post LLM Monitoring and Observability: Hands-on with Langfuse appeared first on Towards Data Science. Ahmad Talal Riaz

August 26, 2025
Why Your Prompts Don’t Belong in Git

The hidden cost of storing prompts in your source code. The post Why Your Prompts Don’t Belong in Git appeared first on Towards Data Science. Giorgos Myrianthous

August 26, 2025
How to Benchmark Classical Machine Learning Workloads on Google Cloud

Harnessing CPUs for Practical, Cost-Effective Machine Learning. The post How to Benchmark Classical Machine Learning Workloads on Google Cloud appeared first on Towards Data Science. Ehssan Khan

August 26, 2025
Why Science Must Embrace Co-Creation with Generative AI to Break Current Research Barriers

An Open Letter to the Scientific Community. The post Why Science Must Embrace Co-Creation with Generative AI to Break Current Research Barriers appeared first on Towards Data Science. Ugo Pradère

August 26, 2025
Systematic LLM Prompt Engineering Using DSPy Optimization

This article is a journey into the fascinating and rapidly evolving science of LLM prompt iteration, which is a fundamental part of Large Language Model Operations (LLMOps). We’ll use the example of generating customer service responses with a real-world dataset to show how both generator and LLM-judge prompts…

August 26, 2025
Interpretable Kernels

arXiv:2508.15932v1 Announce Type: new Abstract: The use of kernels for nonlinear prediction is widespread in machine learning. They have been popularized in support vector machines and used in kernel ridge regression, amongst others. Kernel methods share three aspects. First, instead of the original matrix of predictor variables or features, each observation is mapped…

August 25, 2025
Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning

arXiv:2508.16027v1 Announce Type: new Abstract: Transformers have demonstrated exceptional performance across a wide range of domains. While their ability to perform reinforcement learning in-context has been established both theoretically and empirically, their behavior in non-stationary environments remains less understood. In this study, we address this gap…

August 25, 2025
A Sharp KL-Convergence Analysis for Diffusion Models under Minimal Assumptions

arXiv:2508.16306v1 Announce Type: new Abstract: Diffusion-based generative models have emerged as highly effective methods for synthesizing high-quality samples. Recent works have focused on analyzing the convergence of their generation process with minimal assumptions, either through reverse SDEs or Probability Flow ODEs. The best known guarantees,…

August 25, 2025
Deep Intrinsic Coregionalization Multi-Output Gaussian Process Surrogate with Active Learning

arXiv:2508.16434v1 Announce Type: new Abstract: Deep Gaussian Processes (DGPs) are powerful surrogate models known for their flexibility and ability to capture complex functions. However, extending them to multi-output settings remains challenging due to the need for efficient dependency modeling. We propose the Deep Intrinsic Coregionalization…

August 25, 2025
Underdamped Langevin MCMC with third order convergence

arXiv:2508.16485v1 Announce Type: new Abstract: In this paper, we propose a new numerical method for the underdamped Langevin diffusion (ULD) and present a non-asymptotic analysis of its sampling error in the 2-Wasserstein distance when the $d$-dimensional target distribution $p(x) \propto e^{-f(x)}$ is strongly log-concave and has varying degrees of…

August 25, 2025
Weekly Entering & Transitioning – Thread 25 Aug, 2025 – 01 Sep, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

August 25, 2025
Day to day work at lead/principal data scientist

Hi, I have 9 years of experience in ML/DL. I have been looking for a lead/principal DS role. Can you tell me what expectations you face in the role? Data science knowledge? MLOps knowledge? Team management? submitted by /u/sourabharsh

August 25, 2025
Google’s new Research: Measuring the environmental impact of delivering AI at Google Scale

Google has released an important research paper measuring the impact of AI on the environment, estimating the carbon emissions, water use, and energy consumption of running a prompt on Gemini. Surprisingly, the numbers have been quite low…

August 25, 2025
Generating passages similar in style to a set of 9 examples (Question)

Hello everyone, I hope I can find some guidance here for a project in generative AI. I have a set of 9 short passages from a TOEFL-like English test. I need to generate more passages that match the style of the example set.…

August 25, 2025
NVIDIA new paper: Small Language Models are the Future of Agentic AI

NVIDIA has just published a paper claiming SLMs (small language models) are the future of agentic AI. The authors give a number of reasons for this view, an important one being that SLMs are cheap. Agentic AI requires just a tiny…

August 25, 2025
Is Google’s Reveal of Gemini’s Impact Progress or Greenwashing?

On the surface, Google’s numbers sound reassuringly small, but the more closely you look, the more complicated the story becomes. The post Is Google’s Reveal of Gemini’s Impact Progress or Greenwashing? appeared first on Towards Data Science. Kasper Groes Albin Ludvigsen

August 23, 2025
Three Essential Hyperparameter Tuning Techniques for Better Machine Learning Models

Learn how to optimize your ML models for better results. The post Three Essential Hyperparameter Tuning Techniques for Better Machine Learning Models appeared first on Towards Data Science. Rukshan Pramoditha

August 23, 2025
Cracking the Density Code: Why MAF Flows Where KDE Stalls

Learn why autoregressive flows are the superior density estimation tool for high-dimensional data. The post Cracking the Density Code: Why MAF Flows Where KDE Stalls appeared first on Towards Data Science. Zackary Nay

August 23, 2025
Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning

arXiv:2508.15084v1 Announce Type: new Abstract: This paper introduces a novel kernel-based formulation of the Equalized Odds (EO) criterion, denoted as $EO_k$, for fair representation learning (FRL) in supervised settings. The central goal of FRL is to mitigate discrimination regarding a sensitive attribute $S$…

August 22, 2025
Bayesian Inference and Learning in Nonlinear Dynamical Systems: A Framework for Incorporating Explicit and Implicit Prior Knowledge

arXiv:2508.15345v1 Announce Type: new Abstract: Accuracy and generalization capabilities are key objectives when learning dynamical system models. To obtain such models from limited data, current works exploit prior knowledge and assumptions about the system. However, the fusion of…

August 22, 2025
Bayesian Optimization with Expected Improvement: No Regret and the Choice of Incumbent

arXiv:2508.15674v1 Announce Type: new Abstract: Expected improvement (EI) is one of the most widely used acquisition functions in Bayesian optimization (BO). Despite its proven empirical success in applications, the cumulative regret upper bound of EI remains an open question. In this paper, we…

August 22, 2025
Tree-like Pairwise Interaction Networks

arXiv:2508.15678v1 Announce Type: new Abstract: Modeling feature interactions in tabular data remains a key challenge in predictive modeling, for example, as used for insurance pricing. This paper proposes the Tree-like Pairwise Interaction Network (PIN), a novel neural network architecture that explicitly captures pairwise feature interactions through a shared feed-forward neural network…

August 22, 2025
Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI

arXiv:2508.14936v1 Announce Type: cross Abstract: Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies…

August 22, 2025
How to Perform Comprehensive Large Scale LLM Validation

Learn how to validate large scale LLM applications. The post How to Perform Comprehensive Large Scale LLM Validation appeared first on Towards Data Science. Eivind Kjosbakken

August 22, 2025
What If I Had AI in 2020: Rent The Runway Dynamic Pricing Model

Ever wondered how different things might have been if ChatGPT had existed at the start of Covid? Especially for data scientists who had to update their forecast models? The post What If I Had AI in 2020: Rent The Runway Dynamic Pricing…

August 22, 2025
Where Hurricanes Hit Hardest: A County-Level Analysis with Python

Use Python, GeoPandas, Tropycal, and Plotly Express to map the number of hurricane encounters per county over the past 50 years. The post Where Hurricanes Hit Hardest: A County-Level Analysis with Python appeared first on Towards Data Science. Lee Vaughan

August 22, 2025
Designing Trustworthy ML Models: Alan & Aida Discover Monotonicity in Machine Learning

Accuracy alone doesn’t guarantee trustworthiness. Monotonicity ensures predictions align with common sense and business rules. The post Designing Trustworthy ML Models: Alan & Aida Discover Monotonicity in Machine Learning appeared first on Towards Data Science. Mehdi Mohammadi

August 22, 2025
How We Reduced LLM Costs by 90% with 5 Lines of Code

When clean code hides inefficiencies: what we learned from fixing a few lines of code and saving 90% in LLM cost. The post How We Reduced LLM Costs by 90% with 5 Lines of Code appeared first on Towards Data Science. Uri Peled

August 22, 2025
Comparing Model-agnostic Feature Selection Methods through Relative Efficiency

arXiv:2508.14268v1 Announce Type: new Abstract: Feature selection and importance estimation in a model-agnostic setting is an ongoing challenge of significant interest. Wrapper methods are commonly used because they are typically model-agnostic, even though they are computationally intensive. In this paper, we focus on feature selection methods related…

August 21, 2025
Evaluation and Optimization of Leave-one-out Cross-validation for the Lasso

arXiv:2508.14368v1 Announce Type: new Abstract: I develop an algorithm to produce the piecewise quadratic that computes leave-one-out cross-validation for the lasso as a function of its hyperparameter. The algorithm can be used to find exact hyperparameters that optimize leave-one-out cross-validation either globally or locally, and its…

August 21, 2025
The C-index Multiverse

arXiv:2508.14821v1 Announce Type: new Abstract: Quantifying out-of-sample discrimination performance for time-to-event outcomes is a fundamental step for model evaluation and selection in the context of predictive modelling. The concordance index, or C-index, is a widely used metric for this purpose, particularly with the growing development of machine learning methods. Beyond differences between…

August 21, 2025
Noise Robust One-Class Intrusion Detection on Dynamic Graphs

arXiv:2508.14192v1 Announce Type: cross Abstract: In the domain of network intrusion detection, robustness against contaminated and noisy data inputs remains a critical challenge. This study introduces a probabilistic version of the Temporal Graph Network Support Vector Data Description (TGN-SVDD) model, designed to enhance detection accuracy in the…

August 21, 2025
Optimal Subspace Embeddings: Resolving Nelson-Nguyen Conjecture Up to Sub-Polylogarithmic Factors

arXiv:2508.14234v1 Announce Type: cross Abstract: We give a proof of the conjecture of Nelson and Nguyen [FOCS 2013] on the optimal dimension and sparsity of oblivious subspace embeddings, up to sub-polylogarithmic factors: For any $n \geq d$ and $\epsilon \geq d^{-O(1)}$, there is a random $\tilde O(d/\epsilon^2) \times$…

August 21, 2025
Everything You Need to Know About the New Power BI Storage Mode

50 Shades of Direct Lake. The post Everything You Need to Know About the New Power BI Storage Mode appeared first on Towards Data Science. Nikola Ilic

August 21, 2025
AI Agents for Supply Chain Optimisation: Production Planning

How to integrate an optimisation algorithm in a FastAPI microservice and connect it with an AI workflow to automate production planning. The post AI Agents for Supply Chain Optimisation: Production Planning appeared first on Towards Data Science. Samir Saci

August 21, 2025
My Most Valuable Lesson as an Aspiring Data Analyst

What my internship taught me about the power of collaboration in data analysis. The post My Most Valuable Lesson as an Aspiring Data Analyst appeared first on Towards Data Science. Benjamin Nweke

August 21, 2025
Smarter Model Tuning: An AI Agent with LangGraph + Streamlit That Boosts ML Performance

Automating model tuning in Python with Gemini, LangGraph, and Streamlit for regression and classification improvements. The post Smarter Model Tuning: An AI Agent with LangGraph + Streamlit That Boosts ML Performance appeared first on Towards Data Science. Gustavo Santos

August 21, 2025
“Where’s Marta?”: How We Removed Uncertainty From AI Reasoning

A primer on overcoming LLM limitations with formal verification. The post “Where’s Marta?”: How We Removed Uncertainty From AI Reasoning appeared first on Towards Data Science. Jacopo Tagliabue

August 21, 2025
Preference Models assume Proportional Hazards of Utilities

arXiv:2508.13189v1 Announce Type: new Abstract: Approaches for estimating preferences from human annotated data typically involve inducing a distribution over a ranked list of choices such as the Plackett-Luce model. Indeed, modern AI alignment tools such as Reward Modelling and Direct Preference Optimization are based on the statistical assumptions…

August 20, 2025
Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation

arXiv:2508.13313v1 Announce Type: new Abstract: Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations. Recent advances in generative modeling have inspired new approaches to DA in high-dimensional nonlinear settings, especially the ensemble score filter (EnSF). However,…

August 20, 2025
Structural Foundations for Leading Digit Laws: Beyond Probabilistic Mixtures

arXiv:2508.13237v1 Announce Type: new Abstract: This article presents a modern deterministic framework for the study of leading significant digit distributions in numerical data. Rather than relying on traditional probabilistic or mixture-based explanations, we demonstrate that the observed frequencies of leading digits are determined by the underlying…

August 20, 2025
Smooth Flow Matching

arXiv:2508.13831v1 Announce Type: new Abstract: Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and…

August 20, 2025
Online Conformal Selection with Accept-to-Reject Changes

arXiv:2508.13838v1 Announce Type: new Abstract: Selecting a subset of promising candidates from a large pool is crucial across various scientific and real-world applications. Conformal selection offers a distribution-free and model-agnostic framework for candidate selection with uncertainty quantification. While effective in offline settings, its application to online scenarios, where data…

August 20, 2025
Building a Modern Dashboard with Python and Tkinter

Create polished GUIs and data dashboards with this versatile library. The post Building a Modern Dashboard with Python and Tkinter appeared first on Towards Data Science. Thomas Reid

August 20, 2025
Mastering NLP with spaCy – Part 3

Rule-based matching for information extraction. The post Mastering NLP with spaCy – Part 3 appeared first on Towards Data Science. Marcello Politi

August 20, 2025
Help Your Model Learn the True Signal

An algorithm-agnostic approach inspired by Cook’s distance. The post Help Your Model Learn the True Signal appeared first on Towards Data Science. Mena Wang

August 20, 2025
Capturing and Deploying PyTorch Models with torch.export

A demonstration of PyTorch’s exciting new export feature on a HuggingFace model. The post Capturing and Deploying PyTorch Models with torch.export appeared first on Towards Data Science. Chaim Rand

August 20, 2025
Advanced Prompt Engineering for Data Science Projects

Part 2: Prompt Engineering for Features, Modeling, and Evaluation. The post Advanced Prompt Engineering for Data Science Projects appeared first on Towards Data Science. Sara Nobrega

August 20, 2025
BaMANI: Bayesian Multi-Algorithm causal Network Inference

arXiv:2508.11741v1 Announce Type: new Abstract: Improved computational power has enabled different disciplines to predict causal relationships among modeled variables using Bayesian network inference. While many alternative algorithms have been proposed to improve the efficiency and reliability of network prediction, the predicted causal networks reflect the generative process but also…

August 19, 2025
Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings

arXiv:2508.11847v1 Announce Type: new Abstract: We propose a method for evaluating the robustness of a widely used LLM ranking system, the Bradley–Terry ranking system, to dropping a worst-case very small fraction of evaluation data. Our approach is computationally fast and…

August 19, 2025
Robust Data Fusion via Subsampling

arXiv:2508.12048v1 Announce Type: new Abstract: Data fusion and transfer learning are rapidly growing fields that enhance model performance for a target population by leveraging other related data sources or tasks. The challenges lie in the various potential heterogeneities between the target and external data, as well as various practical concerns…

August 19, 2025
An Introduction to Sliced Optimal Transport

arXiv:2508.12519v1 Announce Type: new Abstract: Sliced Optimal Transport (SOT) is a rapidly developing branch of optimal transport (OT) that exploits the tractability of one-dimensional OT problems. By combining tools from OT, integral geometry, and computational statistics, SOT enables fast and scalable computation of distances, barycenters, and kernels for probability…

August 19, 2025
On computing and the complexity of computing higher-order $U$-statistics, exactly

arXiv:2508.12627v1 Announce Type: new Abstract: Higher-order $U$-statistics abound in fields such as statistics, machine learning, and computer science, but are known to be highly time-consuming to compute in practice. Despite their widespread appearance, a comprehensive study of their computational complexity is surprisingly lacking. This paper…

August 19, 2025
Can LangExtract Turn Messy Clinical Notes into Structured Data?

Turning raw clinical notes into structured entities with LLMs. The post Can LangExtract Turn Messy Clinical Notes into Structured Data? appeared first on Towards Data Science. Parul Pandey

August 19, 2025
Modular Arithmetic in Data Science

Modular arithmetic is a mathematical system where numbers cycle back to the beginning after reaching a value called the modulus. The system is often referred to as “clock arithmetic” due to its similarity to how analog 12-hour clocks represent time. This article provides a conceptual overview of modular arithmetic and…

August 19, 2025
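
To make the "clock arithmetic" idea from the Modular Arithmetic in Data Science entry above concrete, here is a minimal Python sketch. The function name `clock_add` and the example values are illustrative assumptions and are not taken from the article itself.

```python
def clock_add(hour: int, offset: int, modulus: int = 12) -> int:
    """Add `offset` hours on a clock face numbered 1..modulus (hypothetical helper)."""
    # Shift to a 0-based position, wrap around with %, then shift back
    # so that position 0 is displayed as `modulus` (12 on an analog clock).
    return (hour - 1 + offset) % modulus + 1


# 10 o'clock plus 5 hours wraps past 12 and lands on 3.
print(clock_add(10, 5))   # 3
# Going backwards also wraps: 4 hours before 3 o'clock is 11.
print(clock_add(3, -4))   # 11
# The plain remainder: 17 is congruent to 5 modulo 12.
print(17 % 12)            # 5
```

In Python, `%` returns a non-negative result for a positive modulus, which is why the backwards case works without extra handling; other languages may return a negative remainder here.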
Maximizing AI/ML Model Performance with PyTorch Compilation

Since its inception in PyTorch 2.0 in March 2023, the evolution of torch.compile has been one of the most exciting things to follow. Given that PyTorch’s popularity was due to its “Pythonic” nature, its ease of use, and its line-by-line (a.k.a., eager) execution, the success of a just-in-time (JIT) graph…

August 19, 2025
How to Correctly Apply Limits on the Result in DAX (and SQL)

What if the output of a measure mustn’t be above a specific limit? How can we ensure that the total is calculated correctly? This piece is about correctly calculating and summarizing such output. The post How to Correctly Apply Limits on the Result…

August 19, 2025
How to Create Powerful LLM Applications with Context Engineering

Improve your LLM by optimizing its context. The post How to Create Powerful LLM Applications with Context Engineering appeared first on Towards Data Science. Eivind Kjosbakken

August 19, 2025
Non-asymptotic convergence bound of conditional diffusion models

arXiv:2508.10944v1 Announce Type: new Abstract: Learning and generating various types of data based on conditional diffusion models has been a research hotspot in recent years. Although conditional diffusion models have made considerable progress in improving acceleration algorithms and enhancing generation quality, the lack of non-asymptotic properties has hindered…

August 18, 2025
Counterfactual Survival Q Learning for Longitudinal Randomized Trials via Buckley James Boosting

arXiv:2508.11060v1 Announce Type: new Abstract: We propose a Buckley James (BJ) Boost Q learning framework for estimating optimal dynamic treatment regimes under right censored survival data, tailored for longitudinal randomized clinical trial settings. The method integrates accelerated failure time models with iterative boosting…

August 18, 2025
Uniform convergence for Gaussian kernel ridge regression

arXiv:2508.11274v1 Announce Type: new Abstract: This paper establishes the first polynomial convergence rates for Gaussian kernel ridge regression (KRR) with a fixed hyperparameter in both the uniform and the $L^{2}$-norm. The uniform convergence result closes a gap in the theoretical understanding of KRR with the Gaussian kernel, where…

August 18, 2025
ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization arXiv:2508.11551v1 Announce Type: new Abstract: Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a…

August 18, 2025
Nonparametric learning of stochastic differential equations from sparse and noisy data

Nonparametric learning of stochastic differential equations from sparse and noisy data arXiv:2508.11597v1 Announce Type: new Abstract: The paper proposes a systematic framework for building data-driven stochastic differential equation (SDE) models from sparse, noisy observations. Unlike traditional parametric approaches, which assume a known functional form for the drift, our goal here is to learn the entire…

August 18, 2025
Weekly Entering & Transitioning – Thread 18 Aug, 2025 – 25 Aug, 2025

Weekly Entering & Transitioning – Thread 18 Aug, 2025 – 25 Aug, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

August 18, 2025
Dijkstra defeated: New Shortest Path Algorithm revealed

Dijkstra defeated: New Shortest Path Algorithm revealed Dijkstra's algorithm, the go-to shortest-path algorithm (time complexity O(m + n log n) with a Fibonacci heap), has now been outperformed by a new algorithm from a top Chinese university that looks like a hybrid of the Bellman-Ford and Dijkstra algorithms. Paper: https://arxiv.org/abs/2504.17033 Algorithm explained with example: https://youtu.be/rXFtoXzZTF8?si=OiB6luMslndUbTrz submitted by /u/Technical-Love-8479 [link] [comments] /u/Technical-Love-8479 Go to…

August 18, 2025
Curious to know about people who switched from DS to DE or SWE or Solutions Architect

Curious to know about people who switched from DS to DE or SWE or Solutions Architect Hello, I was just curious to know about people who have switched from DS to DE or SWE or Solutions Architect. If you have done it, what was your rationale behind doing it, what pushed or motivated you for…

August 18, 2025
R-Zero : Self-Evolving Reasoning LLM from Zero Data

R-Zero : Self-Evolving Reasoning LLM from Zero Data R-Zero by Tencent introduces a way to train LLMs without any labelled data and aims at self-improving AI without human intervention. It works on a principle similar to that of GANs: it involves a Challenger and a Solver, where one generates questions and the other solves them. Paper: https://arxiv.org/abs/2508.05004?ref=mackenziemorehead.com Video…

August 18, 2025
How different is “Senior Data Analyst” from “Data Scientist”?

How different is “Senior Data Analyst” from “Data Scientist”? I often see Senior DA roles that seem focused on using R/Python for analysis (vs. Excel and Power BI), but don’t have any insight into the day-to-day of these roles. At the senior level, how different is Data Analyst from Data Scientist? submitted by /u/empirical-sadboy [link]…

August 18, 2025
Prediction-Powered Inference with Inverse Probability Weighting

Prediction-Powered Inference with Inverse Probability Weighting arXiv:2508.10149v1 Announce Type: new Abstract: Prediction-powered inference (PPI) is a recent framework for valid statistical inference with partially labeled data, combining model-based predictions on a large unlabeled set with bias correction from a smaller labeled subset. We show that PPI can be extended to handle informative labeling by replacing…

August 15, 2025
Mo’ Memory, Mo’ Problems: Stream-Native Machine Unlearning

Mo’ Memory, Mo’ Problems: Stream-Native Machine Unlearning arXiv:2508.10193v1 Announce Type: new Abstract: Machine unlearning work assumes a static, i.i.d. training environment that doesn’t truly exist. Modern ML pipelines need to learn, unlearn, and predict continuously on production streams of data. We translate the notion of the batch unlearning scenario to the online setting using notions…

August 15, 2025
Dimension-Free Bounds for Generalized First-Order Methods via Gaussian Coupling

Dimension-Free Bounds for Generalized First-Order Methods via Gaussian Coupling arXiv:2508.10782v1 Announce Type: new Abstract: We establish non-asymptotic bounds on the finite-sample behavior of generalized first-order iterative algorithms — including gradient-based optimization methods and approximate message passing (AMP) — with Gaussian data matrices and full-memory, non-separable nonlinearities. The central result constructs an explicit coupling between the…

August 15, 2025
Conic Formulations of Transport Metrics for Unbalanced Measure Networks and Hypernetworks

Conic Formulations of Transport Metrics for Unbalanced Measure Networks and Hypernetworks arXiv:2508.10888v1 Announce Type: new Abstract: The Gromov-Wasserstein (GW) variant of optimal transport, designed to compare probability densities defined over distinct metric spaces, has emerged as an important tool for the analysis of data with complex structure, such as ensembles of point clouds or networks.…

August 15, 2025
An Iterative Algorithm for Differentially Private $k$-PCA with Adaptive Noise

An Iterative Algorithm for Differentially Private $k$-PCA with Adaptive Noise arXiv:2508.10879v1 Announce Type: new Abstract: Given $n$ i.i.d. random matrices $A_i \in \mathbb{R}^{d \times d}$ that share a common expectation $\Sigma$, the objective of Differentially Private Stochastic PCA is to identify a subspace of dimension $k$ that captures the largest variance directions of $\Sigma$, while…

August 15, 2025
LangGraph 101: Let’s Build A Deep Research Agent

LangGraph 101: Let’s Build A Deep Research Agent Learn LangGraph fundamentals from Google’s open-source full-stack implementation The post LangGraph 101: Let’s Build A Deep Research Agent appeared first on Towards Data Science. Shuai Guo Go to original source

August 15, 2025
What Does “Following Best Practices” Mean in the Age of AI?

What Does “Following Best Practices” Mean in the Age of AI? How data and ML practitioners should navigate a rapidly changing landscape The post What Does “Following Best Practices” Mean in the Age of AI? appeared first on Towards Data Science. TDS Editors Go to original source

August 15, 2025
“My biggest lesson was realizing that domain expertise matters more than algorithmic complexity.”

“My biggest lesson was realizing that domain expertise matters more than algorithmic complexity.” Claudia Ng reflects on real-world ML lessons, mentoring newcomers, and her journey from corporate ML to freelance AI. The post “My biggest lesson was realizing that domain expertise matters more than algorithmic complexity.” appeared first on Towards Data Science. TDS Editors Go…

August 15, 2025
Distributional Sensitivity Analysis: Enabling Differentiability in Sample-Based Inference

Distributional Sensitivity Analysis: Enabling Differentiability in Sample-Based Inference arXiv:2508.09347v1 Announce Type: new Abstract: We present two analytical formulae for estimating the sensitivity — namely, the gradient or Jacobian — at given realizations of an arbitrary-dimensional random vector with respect to its distributional parameters. The first formula interprets this sensitivity as partial derivatives of the inverse…

August 14, 2025
A pseudo-inverse of a line graph

A pseudo-inverse of a line graph arXiv:2508.09412v1 Announce Type: new Abstract: Line graphs are an alternative representation of graphs where each edge of the original (root) graph becomes a vertex. However, not all graphs have a corresponding root graph, hence the transformation from graphs to line graphs is not invertible. We investigate the case when…

August 14, 2025
Scalable h-adaptive probabilistic solver for time-independent and time-dependent systems

Scalable h-adaptive probabilistic solver for time-independent and time-dependent systems arXiv:2508.09623v1 Announce Type: new Abstract: Solving partial differential equations (PDEs) within the framework of probabilistic numerics offers a principled approach to quantifying epistemic uncertainty arising from discretization. By leveraging Gaussian process regression and imposing the governing PDE as a constraint at a finite set of collocation…

August 14, 2025
Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA

Structured Kernel Regression VAE: A Computationally Efficient Surrogate for GP-VAEs in ICA arXiv:2508.09721v1 Announce Type: new Abstract: The interpretability of generative models is considered a key factor in demonstrating their effectiveness and controllability. The generated data are believed to be determined by latent variables that are not directly observable. Therefore, disentangling, decoupling, decomposing, causal inference,…

August 14, 2025
Objective Soups: Multilingual Multi-Task Modeling for Speech Processing

Objective Soups: Multilingual Multi-Task Modeling for Speech Processing arXiv:2508.09228v1 Announce Type: cross Abstract: Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks like speech recognition and translation. While multi-objective optimization (MOO) aims to align gradient updates, its effectiveness diminishes as the number of tasks grows, making…

August 14, 2025
How to Use LLMs for Powerful Automatic Evaluations

How to Use LLMs for Powerful Automatic Evaluations A beginner-friendly introduction to LLM-as-a-Judge The post How to Use LLMs for Powerful Automatic Evaluations appeared first on Towards Data Science. Eivind Kjosbakken Go to original source

August 14, 2025
Data Mesh Diaries: Realities from Early Adopters

Data Mesh Diaries: Realities from Early Adopters Early-adopter realities gathered from real data mesh implementations The post Data Mesh Diaries: Realities from Early Adopters appeared first on Towards Data Science. Corné POTGIETER Go to original source

August 14, 2025
Tips for Setting Expectations in AI Projects

Tips for Setting Expectations in AI Projects If you want your AI project to succeed, mastering expectation management comes first. When working with AI projects, uncertainty isn’t just a side effect; it can make or break the entire initiative. Most people impacted by AI projects don’t fully understand how AI works, or that errors are…

August 14, 2025
A Bird’s-Eye View of Linear Algebra: Why Is Matrix Multiplication Like That?

A Bird’s-Eye View of Linear Algebra: Why Is Matrix Multiplication Like That? Since the way we manipulate high-dimensional vectors is primarily matrix multiplication, it isn’t a stretch to say it is the bedrock of the modern AI revolution. The post A Bird’s-Eye View of Linear Algebra: Why Is Matrix Multiplication Like That? appeared first on…

August 14, 2025
On Experiments

On Experiments arXiv:2508.08288v1 Announce Type: new Abstract: The scientific process is a means for turning the results of experiments into knowledge about the world in which we live. Much research effort has been directed toward automating this process. To do this, one needs to formulate the scientific process in a precise mathematical language. This paper…

August 13, 2025
Projection-based multifidelity linear regression for data-scarce applications

Projection-based multifidelity linear regression for data-scarce applications arXiv:2508.08517v1 Announce Type: new Abstract: Surrogate modeling for systems with high-dimensional quantities of interest remains challenging, particularly when training data are costly to acquire. This work develops multifidelity methods for multiple-input multiple-output linear regression targeting data-limited applications with high-dimensional outputs. Multifidelity methods integrate many inexpensive low-fidelity model evaluations…

August 13, 2025
In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality

In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality arXiv:2508.08673v1 Announce Type: new Abstract: This paper investigates the expected excess risk of In-Context Learning (ICL) for multiclass classification. We model each task as a sequence of labeled prompt samples and a query input, where a pre-trained model estimates the conditional class probabilities of…

August 13, 2025