Category: aimlds

Model-free algorithms for fast node clustering in SBM type graphs and application to social role inference in animals

arXiv:2509.15989v1 Announce Type: new Abstract: We propose a novel family of model-free algorithms for node clustering and parameter inference in graphs generated from the Stochastic Block Model (SBM), a fundamental framework in community detection. Drawing inspiration from…

September 22, 2025
What is a good matching of probability measures? A counterfactual lens on transport maps

arXiv:2509.16027v1 Announce Type: new Abstract: Coupling probability measures lies at the core of many problems in statistics and machine learning, from domain adaptation to transfer learning and causal inference. Yet, even when restricted to deterministic transports, such couplings are not identifiable:…

September 22, 2025
Weekly Entering & Transitioning – Thread 22 Sep, 2025 – 29 Sep, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

September 22, 2025
Need input from mid-career Data Scientists (2-5 year range)

I am a DS with 2 YOE (plus about 6 co-ops). I’m looking for feedback from folks who have specifically transitioned out of early career and into the mid-career phase. (Unfortunately I don’t have any in my immediate network.) Context: I’m coming up to 2 years in my role and have…

September 22, 2025
Is it due to the tech recession?

We know that in many companies Data Scientists are Product Analytics / Data Analysts. I thought it was because MLEs had absorbed the duties of DSs, but I have noticed that this may not be exactly the case. There are basically three distinct roles: Data Analyst / Product…

September 22, 2025
What’s the right thing to say to the salary expectations question?

I usually come across two types of scenarios here, and I am not sure what the best way to deal with them is. I ask for a range and they give me a range. Should I just say I’m okay with the range? But what if I make…

September 22, 2025
Updated based on subreddit feedback. Applying for mid-senior based roles. Thank you

Submitted by /u/StormyT

September 22, 2025
Data Visualization Explained: What It Is and Why It Matters

A brief introduction to data visualization and its importance in today’s technological landscape. The post Data Visualization Explained: What It Is and Why It Matters appeared first on Towards Data Science. Murtaza Ali

September 22, 2025
Python Can Now Call Mojo

Boost your runtimes with lightning-fast Mojo code. The post Python Can Now Call Mojo appeared first on Towards Data Science. Thomas Reid

September 22, 2025
Building LLM Apps That Can See, Think, and Integrate: Using o3 with Multimodal Input and Structured Output

A hands-on example of building a time-series anomaly detection system entirely through visualization and prompting. The post Building LLM Apps That Can See, Think, and Integrate: Using o3 with Multimodal Input and Structured Output appeared first on Towards…

September 21, 2025
The SyncNet Research Paper, Clearly Explained

A Deep Dive into “Out of Time: Automated Lip Sync in the Wild”. The post The SyncNet Research Paper, Clearly Explained appeared first on Towards Data Science. Aman Agrawal

September 21, 2025
Deploying a PICO Extractor in Five Steps

Lessons learned deploying a domain-specific NER model. The post Deploying a PICO Extractor in Five Steps appeared first on Towards Data Science. Elena Jolkver

September 20, 2025
An Interactive Guide to 4 Fundamental Computer Vision Tasks Using Transformers

An overview of 4 fundamental computer vision tasks (image classification, image segmentation, image captioning, and visual question answering) with transformer models. Compare ViT, DETR, BLIP, and ViLT performance interactively, with a practical Streamlit app implementation guide. The post An Interactive Guide to…

September 20, 2025
How to Select the 5 Most Relevant Documents for AI Search

Improve the document retrieval step of your RAG pipeline. The post How to Select the 5 Most Relevant Documents for AI Search appeared first on Towards Data Science. Eivind Kjosbakken

September 20, 2025
Towards universal property prediction in Cartesian space: TACE is all you need

arXiv:2509.14961v1 Announce Type: new Abstract: Machine learning has revolutionized atomistic simulations and materials science, yet current approaches often depend on spherical-harmonic representations. Here we introduce the Tensor Atomic Cluster Expansion and Tensor Moment Potential, the first unified framework formulated entirely in Cartesian space…

September 19, 2025
Learning Rate Should Scale Inversely with High-Order Data Moments in High-Dimensional Online Independent Component Analysis

arXiv:2509.15127v1 Announce Type: new Abstract: We investigate the impact of high-order moments on the learning dynamics of an online Independent Component Analysis (ICA) algorithm under a high-dimensional data model composed of a weighted sum of two non-Gaussian random variables. This…

September 19, 2025
Benefits of Online Tilted Empirical Risk Minimization: A Case Study of Outlier Detection and Robust Regression

arXiv:2509.15141v1 Announce Type: new Abstract: Empirical Risk Minimization (ERM) is a foundational framework for supervised learning but primarily optimizes average-case performance, often neglecting fairness and robustness considerations. Tilted Empirical Risk Minimization (TERM) extends ERM by introducing an exponential tilt…

September 19, 2025
Next-Depth Lookahead Tree

arXiv:2509.15143v1 Announce Type: new Abstract: This paper proposes the Next-Depth Lookahead Tree (NDLT), a single-tree model designed to improve performance by evaluating node splits not only at the node being optimized but also by evaluating the quality of the next depth level. Jaeho Lee, Kangjin Kim, Gyeong Taek Lee

September 19, 2025
Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models

arXiv:2509.15152v1 Announce Type: new Abstract: We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the…

September 19, 2025
TDS Newsletter: How to Make Smarter Business Decisions with AI

Research agents, budget planners, and more. The post TDS Newsletter: How to Make Smarter Business Decisions with AI appeared first on Towards Data Science. TDS Editors

September 19, 2025
How I Built and Deployed an App in 2 days with Lovable, Supabase, and Netlify

All ideas can be turned into action in a matter of time now. The post How I Built and Deployed an App in 2 days with Lovable, Supabase, and Netlify appeared first on Towards Data Science. Soner Yıldırım

September 19, 2025
From Python to JavaScript: A Playbook for Data Analytics in n8n with Code Node Examples

Learn the basics of JavaScript through tiny n8n Code node snippets for sales data analytics. The post From Python to JavaScript: A Playbook for Data Analytics in n8n with Code Node Examples appeared first on Towards Data Science. Samir Saci…

September 19, 2025
Rapid Prototyping of Chatbots with Streamlit and Chainlit

End-to-end demos, comparison of pros and cons, and practical recommendations. The post Rapid Prototyping of Chatbots with Streamlit and Chainlit appeared first on Towards Data Science. Chinmay Kakatkar

September 19, 2025
From Amnesia to Awareness: Giving Retrieval-Only Chatbots Memory

Achieve natural multi-turn conversations without sacrificing content control. The post From Amnesia to Awareness: Giving Retrieval-Only Chatbots Memory appeared first on Towards Data Science. Nicole Ren

September 19, 2025
On the Rate of Gaussian Approximation for Linear Regression Problems

arXiv:2509.14039v1 Announce Type: new Abstract: In this paper, we consider the problem of Gaussian approximation for the online linear regression task. We derive the corresponding rates for the setting of a constant learning rate and study the explicit dependence of the convergence rate upon the…

September 18, 2025
Field of View Enhanced Signal Dependent Binauralization with Mixture of Experts Framework for Continuous Source Motion

arXiv:2509.13548v1 Announce Type: cross Abstract: We propose a novel mixture of experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress…

September 18, 2025
Imputation-Powered Inference

arXiv:2509.13778v1 Announce Type: cross Abstract: Modern multi-modal and multi-site data frequently suffer from blockwise missingness, where subsets of features are missing for groups of individuals, creating complex patterns that challenge standard inference methods. Existing approaches have critical limitations: complete-case analysis discards informative data and is potentially biased; doubly robust estimators for non-monotone missingness, where…

September 18, 2025
Towards a Physics Foundation Model

arXiv:2509.13805v1 Announce Type: cross Abstract: Foundation models have revolutionized natural language processing through a “train once, deploy anywhere” paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative, democratizing access to high-fidelity simulations, accelerating scientific discovery,…

September 18, 2025
Holdout cross-validation for large non-Gaussian covariance matrix estimation using Weingarten calculus

arXiv:2509.13923v1 Announce Type: cross Abstract: Cross-validation is one of the most widely used methods for model selection and evaluation; its efficiency for large covariance matrix estimation appears robust in practice, but little is known about the theoretical behavior of its error. In this paper,…

September 18, 2025
Analysis of Sales Shift in Retail with Causal Impact: A Case Study at Carrefour

Applying causal inference to measure the effect of product unavailability on retail sales at Carrefour. The post Analysis of Sales Shift in Retail with Causal Impact: A Case Study at Carrefour appeared first on Towards Data Science. Thanh Liêm NGUYEN

September 18, 2025
RAG Explained: Understanding Embeddings, Similarity, and Retrieval

Let’s take a closer look at how the retrieval mechanism works. The post RAG Explained: Understanding Embeddings, Similarity, and Retrieval appeared first on Towards Data Science. Maria Mouschoutzi

September 18, 2025
Evaluating Your RAG Solution

A guide to building and evaluating RAG solutions by leveraging LLM-as-a-Judge capabilities. The post Evaluating Your RAG Solution appeared first on Towards Data Science. Alex Davis

September 18, 2025
Deploying AI Safely and Responsibly

Experts debunk the biggest myths about trustworthy AI. The post Deploying AI Safely and Responsibly appeared first on Towards Data Science. Stephanie Kirmer

September 18, 2025
ROC AUC Explained: A Beginner’s Guide to Evaluating Classification Models

Understand how ROC curves and AUC help you go beyond accuracy with visuals and examples. The post ROC AUC Explained: A Beginner’s Guide to Evaluating Classification Models appeared first on Towards Data Science. Nikhil Dasari

September 18, 2025
PBPK-iPINNs: Inverse Physics-Informed Neural Networks for Physiologically Based Pharmacokinetic Brain Models

arXiv:2509.12666v1 Announce Type: new Abstract: Physics-Informed Neural Networks (PINNs) leverage machine learning with differential equations to solve direct and inverse problems, ensuring predictions follow physical laws. Physiologically based pharmacokinetic (PBPK) modeling advances beyond classical compartmental approaches by using a mechanistic, physiology-focused framework.…

September 17, 2025
SURGIN: SURrogate-guided Generative INversion for subsurface multiphase flow with quantified uncertainty

arXiv:2509.13189v1 Announce Type: new Abstract: We present a direct inverse modeling method named SURGIN, a SURrogate-guided Generative INversion framework tailored for subsurface multiphase flow data assimilation. Unlike existing inversion methods that require adaptation for each new observational configuration, SURGIN features a zero-shot conditional generation…

September 17, 2025
Jackknife Variance Estimation for Hájek-Dominated Generalized U-Statistics

arXiv:2509.12356v1 Announce Type: cross Abstract: We prove ratio-consistency of the jackknife variance estimator, and certain variants, for a broad class of generalized U-statistics whose variance is asymptotically dominated by their Hájek projection, with the classical fixed-order case recovered as a special instance. This Hájek projection dominance condition unifies…

September 17, 2025
Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization

arXiv:2509.12387v1 Announce Type: cross Abstract: Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human-like intelligence (robust, sample-efficient learning) stems from…

September 17, 2025
Reduced Order Modeling of Energetic Materials Using Physics-Aware Recurrent Convolutional Neural Networks in a Latent Space (LatentPARC)

arXiv:2509.12401v1 Announce Type: cross Abstract: Physics-aware deep learning (PADL) has gained popularity for use in complex spatiotemporal dynamics (field evolution) simulations, such as those that arise frequently in computational modeling of energetic materials (EM). Here, we show that…

September 17, 2025
Building a Unified Intent Recognition Engine

How modular design can simplify and scale intent classification in enterprise AI systems. The post Building a Unified Intent Recognition Engine appeared first on Towards Data Science. Shruti Tiwari

September 17, 2025
Using Python to Build a Calculator

A beginner-friendly Python project to understand conditional statements, loops and recursive functions. The post Using Python to Build a Calculator appeared first on Towards Data Science. Mahnoor Javed

September 17, 2025
My Experiments with NotebookLM for Teaching

Exploring NotebookLM as a teaching companion. The post My Experiments with NotebookLM for Teaching appeared first on Towards Data Science. Parul Pandey

September 17, 2025
How to Enrich LLM Context to Significantly Enhance Capabilities

Learn how to empower your LLMs by leveraging additional metadata. The post How to Enrich LLM Context to Significantly Enhance Capabilities appeared first on Towards Data Science. Eivind Kjosbakken

September 17, 2025
Why Your A/B Test Winner Might Just Be Random Noise

What a coach’s warm-up trial can teach us about running better experiments. The post Why Your A/B Test Winner Might Just Be Random Noise appeared first on Towards Data Science. Pol Marin

September 17, 2025
Variable Selection Using Relative Importance Rankings

arXiv:2509.10853v1 Announce Type: new Abstract: Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable ranking and filter-based selection before model creation. Specifically, we anticipate…

September 16, 2025
Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning

arXiv:2509.11070v1 Announce Type: new Abstract: We develop a stochastic approximation framework for learning nonlinear operators between infinite-dimensional spaces utilizing general Mercer operator-valued kernels. Our framework encompasses two key classes: (i) compact kernels, which admit discrete spectral decompositions, and (ii) diagonal kernels of the form $K(x,x')=k(x,x')T$, where $k$…

September 16, 2025
Maximum diversity, weighting and invariants of time series

arXiv:2509.11146v1 Announce Type: new Abstract: Magnitude, obtained as a special case of Euler characteristic of enriched category, represents a sense of the size of metric spaces and is related to classical notions such as cardinality, dimension, and volume. While the studies have explained the meaning of magnitude…

September 16, 2025
Predictable Compression Failures: Why Language Models Actually Hallucinate

arXiv:2509.11208v1 Announce Type: new Abstract: Large language models perform near-Bayesian inference yet violate permutation invariance on exchangeable data. We resolve this by showing transformers minimize expected conditional description length (cross-entropy) over orderings, $\mathbb{E}_\pi[\ell(Y \mid \Gamma_\pi(X))]$, which admits a Kolmogorov-complexity interpretation up to additive constants, rather than the…

September 16, 2025
Contrastive Network Representation Learning

arXiv:2509.11316v1 Announce Type: new Abstract: Network representation learning seeks to embed networks into a low-dimensional space while preserving the structural and semantic properties, thereby facilitating downstream tasks such as classification, trait prediction, edge identification, and community detection. Motivated by challenges in brain connectivity data analysis that is characterized by subject-specific, high-dimensional,…

September 16, 2025
A Visual Guide to Tuning Gradient Boosted Trees

Introduction: My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the triplet, I’ll visually explore gradient boosted trees! There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this I’m going…

September 16, 2025
Implementing the Coffee Machine Project in Python Using Object Oriented Programming

Understanding classes, objects, attributes, and methods. The post Implementing the Coffee Machine Project in Python Using Object Oriented Programming appeared first on Towards Data Science. Mahnoor Javed

September 16, 2025
You Only Need 3 Things to Turn AI Experiments into AI Advantage

Trapped in a purgatory of POCs, enterprises need to focus and build just 3 pillars to realize value from AI. The post You Only Need 3 Things to Turn AI Experiments into AI Advantage appeared first on Towards Data Science. Shreshth Sharma

September 16, 2025
Learn How to Use Transformers with HuggingFace and SpaCy

Mastering NLP with spaCy: Part 4. The post Learn How to Use Transformers with HuggingFace and SpaCy appeared first on Towards Data Science. Marcello Politi

September 16, 2025
How to Become a Machine Learning Engineer (Step-by-Step)

Your one-stop guide to becoming a machine learning engineer. The post How to Become a Machine Learning Engineer (Step-by-Step) appeared first on Towards Data Science. Egor Howell

September 16, 2025
An Information-Theoretic Framework for Credit Risk Modeling: Unifying Industry Practice with Statistical Theory for Fair and Interpretable Scorecards

arXiv:2509.09855v1 Announce Type: new Abstract: Credit risk modeling relies extensively on Weight of Evidence (WoE) and Information Value (IV) for feature engineering, and Population Stability Index (PSI) for drift monitoring, yet their theoretical foundations remain disconnected. We…

September 15, 2025
Why does your graph neural network fail on some graphs? Insights from exact generalisation error

arXiv:2509.10337v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) are widely used in learning on graph-structured data, yet a principled understanding of why they succeed or fail remains elusive. While prior works have examined architectural limitations such as over-smoothing and…

September 15, 2025
Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance

arXiv:2509.10166v1 Announce Type: new Abstract: In this paper, we consider the problem of computing the integral of a function on the unit sphere, in any dimension, using Monte Carlo methods. Although the methods we present are general, our guiding thread is the sliced Wasserstein…

September 15, 2025
Differentially Private Decentralized Dataset Synthesis Through Randomized Mixing with Correlated Noise

arXiv:2509.10385v1 Announce Type: new Abstract: In this work, we explore differentially private synthetic data generation in a decentralized-data setting by building on the recently proposed Differentially Private Class-Centric Data Aggregation (DP-CDA). DP-CDA synthesizes data in a centralized setting by mixing multiple randomly-selected samples from…

September 15, 2025
Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

arXiv:2509.09802v1 Announce Type: cross Abstract: We propose and study Sparse Polyak, a variant of Polyak’s adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak…

September 15, 2025
Weekly Entering & Transitioning – Thread 15 Sep, 2025 – 22 Sep, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

September 15, 2025
Has anyone validated synthetic financial data (Gaussian Copula vs CTGAN) in practice?

I’ve been experimenting with generating synthetic datasets for financial indicators (GDP, inflation, unemployment, etc.) and found that CTGAN offered stronger privacy protection in simple linkage tests, but its overall analytical utility was much weaker. In contrast, Gaussian Copula provided reasonably strong privacy and…

September 15, 2025
Texts for creating better visualizations/presentations?

I started working for an HR team and have been tasked with creating visualizations, both in PowerPoint (I’ve been using Seaborn and Matplotlib for visualizations) and PowerBI Dashboards. I’ve been having a lot of fun creating visualizations, but I’m looking for a few texts or maybe courses/videos about design. Anything…

September 15, 2025
Does Meta only have product analytics?

I have been told that all Meta data scientists are product analysts, meaning that they do A/B tests and SQL. Despite this, I’ve been told by friends of mine that Google, Amazon, Uber… all have two different types of data scientist: one doing product analytics and…

September 15, 2025
Database tools and method for tree structured data?

I have a database structure which I believe is very common, and very general, so I’m wondering how this is tackled. The database is structured like: -> Project (name of project) -> Category (simple word, ~20 categories) -> Study. Study is a directory containing: – README with date…

September 15, 2025
The Rise of Semantic Entity Resolution

Semantic entity resolution uses language models to bring an increased level of automation to schema alignment, blocking (grouping records into smaller, efficient blocks for all-pairs comparison at quadratic, n² complexity), matching and even merging duplicate nodes and edges. In the past, entity resolution systems relied on statistical tricks such…

September 15, 2025
No Peeking Ahead: Time-Aware Graph Fraud Detection

How to implement leak-free graph fraud detection. The post No Peeking Ahead: Time-Aware Graph Fraud Detection appeared first on Towards Data Science. Erika G. Gonçalves

September 15, 2025
Building Research Agents for Tech Insights

Using a controlled workflow, unique data & prompt chaining. The post Building Research Agents for Tech Insights appeared first on Towards Data Science. Ida Silfverskiöld

September 14, 2025
If we use AI to do our work – what is our job, then?

Images. Text. Audio. There’s no modality that is not handled by AI. And AI systems reach even further, planning advertisement and marketing campaigns, automating social media postings, … Most of this was unthinkable a mere ten years ago. But then, the…

September 13, 2025
Docling: The Document Alchemist

Why do we still wrestle with documents in 2025? Spend some time in any data-driven organisation, and you’ll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling those formats…

September 13, 2025
Generalists Can Also Dig Deep

Ida Silfverskiöld on AI agents, RAG, evals, and what design choice ended up mattering more than expected. The post Generalists Can Also Dig Deep appeared first on Towards Data Science. TDS Editors

September 13, 2025
A Focused Approach to Learning SQL

Data is everywhere, but how do you draw insights from it? Often, structured data is stored in relational databases, meaning collections of related tables of data. For instance, a company might store customer purchases in one table, customer demographics in another, and suppliers in a third table. These tables…

September 13, 2025
Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions using Wilson Score Kernel Density Estimation

arXiv:2509.09238v1 Announce Type: new Abstract: Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or evaluation of real-world experiments. Furthermore, these functions are often stochastic as repeated experiments are subject…

September 12, 2025
Scalable extensions to given-data Sobol’ index estimators

arXiv:2509.09078v1 Announce Type: new Abstract: Given-data methods for variance-based sensitivity analysis have significantly advanced the feasibility of Sobol’ index computation for computationally expensive models and models with many inputs. However, the limitations of existing methods still preclude their application to models with an extremely large number of inputs.…

September 12, 2025
Low-degree lower bounds via almost orthonormal bases

Low-degree lower bounds via almost orthonormal bases arXiv:2509.09353v1 Announce Type: new Abstract: Low-degree polynomials have emerged as a powerful paradigm for providing evidence of statistical-computational gaps across a variety of high-dimensional statistical models [Wein25]. For detection problems — where the goal is to test a planted distribution $\mathbb{P}'$ against a null distribution $\mathbb{P}$ with independent…

September 12, 2025
Uncertainty Estimation using Variance-Gated Distributions

Uncertainty Estimation using Variance-Gated Distributions arXiv:2509.08846v1 Announce Type: cross Abstract: Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications. A common approach is to use the predictive distribution from Bayesian or approximation models and decompose the corresponding predictive uncertainty into epistemic (model-related) and aleatoric (data-related) components. However, additive decomposition…

September 12, 2025
Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications

Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications arXiv:2509.08911v1 Announce Type: cross Abstract: The Matrix Multiplicative Weight Update (MMWU) is a seminal online learning algorithm with numerous applications. Applied to the matrix version of the Learning from Expert Advice (LEA) problem on the $d$-dimensional spectraplex, it is well known that MMWU achieves the minimax-optimal…

September 12, 2025
Why Context Is the New Currency in AI: From RAG to Context Engineering

Why Context Is the New Currency in AI: From RAG to Context Engineering Context, not computation, is the real currency of intelligent systems The post Why Context Is the New Currency in AI: From RAG to Context Engineering appeared first on Towards Data Science. Sudheer Singamsetty

September 12, 2025
How to Analyze and Optimize Your LLMs in 3 Steps

How to Analyze and Optimize Your LLMs in 3 Steps Learn to enhance your LLMs with my 3-step process: inspecting, improving, and iterating on your LLMs The post How to Analyze and Optimize Your LLMs in 3 Steps appeared first on Towards Data Science. Eivind Kjosbakken

September 12, 2025
The Crucial Role of Color Theory in Data Analysis and Visualization

The Crucial Role of Color Theory in Data Analysis and Visualization How research-backed color principles improved clarity and storytelling in my dashboards The post The Crucial Role of Color Theory in Data Analysis and Visualization appeared first on Towards Data Science. Benjamin Nweke

September 12, 2025
kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions arXiv:2509.08366v1 Announce Type: new Abstract: We study a missing-value imputation method, termed kNNSampler, that imputes a given unit’s missing response by randomly sampling from the observed responses of the $k$ most similar units to the given unit in terms of the observed covariates. This method can sample…

September 11, 2025
Gaussian Process Regression — Neural Network Hybrid with Optimized Redundant Coordinates

Gaussian Process Regression — Neural Network Hybrid with Optimized Redundant Coordinates arXiv:2509.08457v1 Announce Type: new Abstract: Recently, a Gaussian Process Regression – neural network (GPRNN) hybrid machine learning method was proposed, which is based on additive-kernel GPR in redundant coordinates constructed by rules [J. Phys. Chem. A 127 (2023) 7823]. The method combined the expressive…

September 11, 2025
PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research

PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research arXiv:2509.08553v1 Announce Type: new Abstract: Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major…

September 11, 2025
Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation

Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation arXiv:2509.08163v1 Announce Type: cross Abstract: Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks, such as insurance pricing or hiring score assessments, is equally…

September 11, 2025
A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo

A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo arXiv:2509.08619v1 Announce Type: new Abstract: The unadjusted Langevin algorithm is widely used for sampling from complex high-dimensional distributions. It is well known to be biased, with the bias typically scaling linearly with the dimension when measured in squared Wasserstein distance. However,…

September 11, 2025
Is Your Training Data Representative? A Guide to Checking with PSI in Python

Is Your Training Data Representative? A Guide to Checking with PSI in Python Comparing Variable Distributions Between Two Datasets Using Population Stability Index (PSI) and Cramér’s V. The post Is Your Training Data Representative? A Guide to Checking with PSI in Python appeared first on Towards Data Science. JUNIOR JUMBONG

September 11, 2025
Fighting Back Against Attacks in Federated Learning

Fighting Back Against Attacks in Federated Learning Lessons from a multi-node simulator The post Fighting Back Against Attacks in Federated Learning appeared first on Towards Data Science. Salman Toor

September 11, 2025
When A Difference Actually Makes A Difference

When A Difference Actually Makes A Difference Bite-Sized Analytics for Business Decision-Makers (1) The post When A Difference Actually Makes A Difference appeared first on Towards Data Science. Mena Wang

September 11, 2025
Why Task-Based Evaluations Matter

Why Task-Based Evaluations Matter This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype to Production: Evaluation Strategies for Agentic Applications. Task-based evaluations, which measure an AI system’s performance in use-case-specific, real-world settings, are underadopted and understudied. There is still an outsized focus in AI literature on foundation model benchmarks.…

September 11, 2025
How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n

How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n Email → n8n → LangGraph → FastAPI: turning budget requests into optimised CAPEX portfolios that maximise ROI for decision-makers. The post How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n appeared first…

September 11, 2025
NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice

NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice arXiv:2509.07123v1 Announce Type: new Abstract: Nested logit (NL) has been commonly used for discrete choice analysis, including a wide range of applications such as travel mode choice, automobile ownership, or location decisions. However, the classical NL models are restricted by…

September 10, 2025
ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression

ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression arXiv:2509.07108v1 Announce Type: new Abstract: Survival analysis is a fundamental tool for modeling time-to-event outcomes in healthcare. Recent advances have introduced flexible neural network approaches for improved predictive performance. However, most of these models do not provide interpretable insights into the association between exposures and…

September 10, 2025
Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space

Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space arXiv:2509.07289v1 Announce Type: new Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives (such as invariance to augmentations, variance preservation, and feature decorrelation) without requiring labels. However, most existing methods operate in Euclidean space, limiting their ability to capture…

September 10, 2025
Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression

Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression arXiv:2509.07300v1 Announce Type: new Abstract: Recent advances in neuroimaging analysis have enabled accurate decoding of mental state from brain activation patterns during functional magnetic resonance imaging scans. A commonly applied tool for this purpose is principal components regression regularized with the least absolute shrinkage and…

September 10, 2025
Asynchronous Gossip Algorithms for Rank-Based Statistical Methods

Asynchronous Gossip Algorithms for Rank-Based Statistical Methods arXiv:2509.07543v1 Announce Type: new Abstract: As decentralized AI and edge intelligence become increasingly prevalent, ensuring robustness and trustworthiness in such distributed settings has become a critical issue, especially in the presence of corrupted or adversarial data. Traditional decentralized algorithms are vulnerable to data contamination as they typically rely on…

September 10, 2025
LangChain for EDA: Build a CSV Sanity-Check Agent in Python

LangChain for EDA: Build a CSV Sanity-Check Agent in Python A practical LangChain tutorial for data scientists to inspect CSVs The post LangChain for EDA: Build a CSV Sanity-Check Agent in Python appeared first on Towards Data Science. Sarah Schürch

September 10, 2025
How to Build Effective AI Agents to Process Millions of Requests

How to Build Effective AI Agents to Process Millions of Requests Learn how to build production-ready systems using AI agents The post How to Build Effective AI Agents to Process Millions of Requests appeared first on Towards Data Science. Eivind Kjosbakken

September 10, 2025
The Hungarian Algorithm and Its Applications in Computer Vision

The Hungarian Algorithm and Its Applications in Computer Vision Introduction Multi-object tracking (MOT) is a task in which an algorithm must detect and track multiple objects in a video. Most known algorithms are based on using simple detectors (e.g. YOLO) designed for processing individual images. The overall method involves separately using a detector on consecutive video…

September 10, 2025
LangGraph 201: Adding Human Oversight to Your Deep Research Agent

LangGraph 201: Adding Human Oversight to Your Deep Research Agent Losing control of your AI agent in the middle of the workflow is a common pain point. If you have built your own agentic applications, you’ve most likely already seen this happen. While LLMs nowadays are incredibly capable, they’re still not quite there yet to…

September 10, 2025
Exploring Merit Order and Marginal Abatement Cost Curve in Python

Exploring Merit Order and Marginal Abatement Cost Curve in Python To achieve the global temperature limit goals of 1.5°C by the end of the century set by the Paris Agreement, different institutions have come up with different scenarios. There is a consensus among the mitigation scenarios that the share of low-carbon technologies such as renewable energy needs…

September 10, 2025
Cryo-EM as a Stochastic Inverse Problem

Cryo-EM as a Stochastic Inverse Problem arXiv:2509.05541v1 Announce Type: new Abstract: Cryo-electron microscopy (Cryo-EM) enables high-resolution imaging of biomolecules, but structural heterogeneity remains a major challenge in 3D reconstruction. Traditional methods assume a discrete set of conformations, limiting their ability to recover continuous structural variability. In this work, we formulate cryo-EM reconstruction as a stochastic…

September 9, 2025