Category: aimlds

  • Model-free algorithms for fast node clustering in SBM type graphs and application to social role inference in animals

    arXiv:2509.15989v1 Announce Type: new Abstract: We propose a novel family of model-free algorithms for node clustering and parameter inference in graphs generated from the Stochastic Block Model (SBM), a fundamental framework in community detection. Drawing inspiration from…

  • What is a good matching of probability measures? A counterfactual lens on transport maps

    arXiv:2509.16027v1 Announce Type: new Abstract: Coupling probability measures lies at the core of many problems in statistics and machine learning, from domain adaptation to transfer learning and causal inference. Yet, even when restricted to deterministic transports, such couplings are not identifiable:…

  • Weekly Entering & Transitioning – Thread 22 Sep, 2025 – 29 Sep, 2025

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

  • Need input from mid-career data scientists (2-5 year range)

    I am a DS with 2 YOE (plus about 6 co-ops). I’m looking for feedback from folks who have specifically transitioned out of the early-career phase and into the mid-career phase. (Unfortunately, I don’t have any in my immediate network.) Context: I’m coming up to 2 years in my role and have…

  • Is it due to the tech recession?

    We know that in many companies Data Scientists are Product Analytics / Data Analysts. I thought it was because MLEs had absorbed the duties of DSs, but I have noticed that this may not be exactly the case. There are basically three distinct roles: Data Analyst / Product…

  • What’s the right thing to say to the salary expectations question?

    I have usually come across two types of scenarios here and I am not sure what’s the best way to deal with them. I ask for a range and they give you a range. Should you just say you’re okay with the range? But what if I make…

  • Updated based on subreddit feedback. Applying for mid-senior based roles. Thank you

    submitted by /u/StormyT

  • Data Visualization Explained: What It Is and Why It Matters

    A brief introduction to data visualization and its importance in today’s technological landscape. The post Data Visualization Explained: What It Is and Why It Matters appeared first on Towards Data Science. Murtaza Ali

  • Python Can Now Call Mojo

    Boost your runtimes with lightning-fast Mojo code. The post Python Can Now Call Mojo appeared first on Towards Data Science. Thomas Reid

  • Building LLM Apps That Can See, Think, and Integrate: Using o3 with Multimodal Input and Structured Output

    A hands-on example of building a time-series anomaly detection system entirely through visualization and prompting. The post Building LLM Apps That Can See, Think, and Integrate: Using o3 with Multimodal Input and Structured Output appeared first on Towards…

  • The SyncNet Research Paper, Clearly Explained

    A Deep Dive into “Out of Time: Automated Lip Sync in the Wild”. The post The SyncNet Research Paper, Clearly Explained appeared first on Towards Data Science. Aman Agrawal

  • Deploying a PICO Extractor in Five Steps

    Lessons learned deploying a domain-specific NER model. The post Deploying a PICO Extractor in Five Steps appeared first on Towards Data Science. Elena Jolkver

  • An Interactive Guide to 4 Fundamental Computer Vision Tasks Using Transformers

    An overview of 4 fundamental computer vision tasks – image classification, image segmentation, image captioning and visual question answering, with transformer models. Compare ViT, DETR, BLIP, and ViLT performance interactively by providing a practical Streamlit app implementation guide. The post An Interactive Guide to…

  • How to Select the 5 Most Relevant Documents for AI Search

    Improve the document retrieval step of your RAG pipeline. The post How to Select the 5 Most Relevant Documents for AI Search appeared first on Towards Data Science. Eivind Kjosbakken

  • Towards universal property prediction in Cartesian space: TACE is all you need

    arXiv:2509.14961v1 Announce Type: new Abstract: Machine learning has revolutionized atomistic simulations and materials science, yet current approaches often depend on spherical-harmonic representations. Here we introduce the Tensor Atomic Cluster Expansion and Tensor Moment Potential, the first unified framework formulated entirely in Cartesian space…

  • Benefits of Online Tilted Empirical Risk Minimization: A Case Study of Outlier Detection and Robust Regression

    arXiv:2509.15141v1 Announce Type: new Abstract: Empirical Risk Minimization (ERM) is a foundational framework for supervised learning but primarily optimizes average-case performance, often neglecting fairness and robustness considerations. Tilted Empirical Risk Minimization (TERM) extends ERM by introducing an exponential tilt…

  • Learning Rate Should Scale Inversely with High-Order Data Moments in High-Dimensional Online Independent Component Analysis

    arXiv:2509.15127v1 Announce Type: new Abstract: We investigate the impact of high-order moments on the learning dynamics of an online Independent Component Analysis (ICA) algorithm under a high-dimensional data model composed of a weighted sum of two non-Gaussian random variables. This…

  • Next-Depth Lookahead Tree

    arXiv:2509.15143v1 Announce Type: new Abstract: This paper proposes the Next-Depth Lookahead Tree (NDLT), a single-tree model designed to improve performance by evaluating node splits not only at the node being optimized but also by evaluating the quality of the next depth level. Jaeho Lee, Kangjin Kim, Gyeong Taek Lee

  • Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models

    arXiv:2509.15152v1 Announce Type: new Abstract: We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the…

  • TDS Newsletter: How to Make Smarter Business Decisions with AI

    Research agents, budget planners, and more. The post TDS Newsletter: How to Make Smarter Business Decisions with AI appeared first on Towards Data Science. TDS Editors

  • How I Built and Deployed an App in 2 days with Lovable, Supabase, and Netlify

    All ideas can be turned into action in a matter of time now. The post How I Built and Deployed an App in 2 days with Lovable, Supabase, and Netlify appeared first on Towards Data Science. Soner Yıldırım

  • From Python to JavaScript: A Playbook for Data Analytics in n8n with Code Node Examples

    Learn the basics of JavaScript through tiny n8n Code node snippets for sales data analytics. The post From Python to JavaScript: A Playbook for Data Analytics in n8n with Code Node Examples appeared first on Towards Data Science. Samir Saci…

  • Rapid Prototyping of Chatbots with Streamlit and Chainlit

    End-to-end demos, comparison of pros and cons, and practical recommendations. The post Rapid Prototyping of Chatbots with Streamlit and Chainlit appeared first on Towards Data Science. Chinmay Kakatkar

  • From Amnesia to Awareness: Giving Retrieval-Only Chatbots Memory

    Achieve natural multi-turn conversations without sacrificing content control. The post From Amnesia to Awareness: Giving Retrieval-Only Chatbots Memory appeared first on Towards Data Science. Nicole Ren

  • On the Rate of Gaussian Approximation for Linear Regression Problems

    arXiv:2509.14039v1 Announce Type: new Abstract: In this paper, we consider the problem of Gaussian approximation for the online linear regression task. We derive the corresponding rates for the setting of a constant learning rate and study the explicit dependence of the convergence rate upon the…

  • Field of View Enhanced Signal Dependent Binauralization with Mixture of Experts Framework for Continuous Source Motion

    arXiv:2509.13548v1 Announce Type: cross Abstract: We propose a novel mixture of experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress…

  • Imputation-Powered Inference

    arXiv:2509.13778v1 Announce Type: cross Abstract: Modern multi-modal and multi-site data frequently suffer from blockwise missingness, where subsets of features are missing for groups of individuals, creating complex patterns that challenge standard inference methods. Existing approaches have critical limitations: complete-case analysis discards informative data and is potentially biased; doubly robust estimators for non-monotone missingness-where…

  • Towards a Physics Foundation Model

    arXiv:2509.13805v1 Announce Type: cross Abstract: Foundation models have revolutionized natural language processing through a “train once, deploy anywhere” paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative: democratizing access to high-fidelity simulations, accelerating scientific discovery,…

  • Holdout cross-validation for large non-Gaussian covariance matrix estimation using Weingarten calculus

    arXiv:2509.13923v1 Announce Type: cross Abstract: Cross-validation is one of the most widely used methods for model selection and evaluation; its efficiency for large covariance matrix estimation appears robust in practice, but little is known about the theoretical behavior of its error. In this paper,…

  • Analysis of Sales Shift in Retail with Causal Impact: A Case Study at Carrefour

    Applying causal inference to measure the effect of product unavailability on retail sales at Carrefour. The post Analysis of Sales Shift in Retail with Causal Impact: A Case Study at Carrefour appeared first on Towards Data Science. Thanh Liêm NGUYEN

  • RAG Explained: Understanding Embeddings, Similarity, and Retrieval

    Let’s take a closer look at how the retrieval mechanism works. The post RAG Explained: Understanding Embeddings, Similarity, and Retrieval appeared first on Towards Data Science. Maria Mouschoutzi

  • Evaluating Your RAG Solution

    A guide to building and evaluating RAG solutions by leveraging LLM-as-a-Judge capabilities. The post Evaluating Your RAG Solution appeared first on Towards Data Science. Alex Davis

  • Deploying AI Safely and Responsibly

    Experts debunk the biggest myths about trustworthy AI. The post Deploying AI Safely and Responsibly appeared first on Towards Data Science. Stephanie Kirmer

  • ROC AUC Explained: A Beginner’s Guide to Evaluating Classification Models

    Understand how ROC curves and AUC help you go beyond accuracy with visuals and examples. The post ROC AUC Explained: A Beginner’s Guide to Evaluating Classification Models appeared first on Towards Data Science. Nikhil Dasari

  • PBPK-iPINNs : Inverse Physics-Informed Neural Networks for Physiologically Based Pharmacokinetic Brain Models

    arXiv:2509.12666v1 Announce Type: new Abstract: Physics-Informed Neural Networks (PINNs) leverage machine learning with differential equations to solve direct and inverse problems, ensuring predictions follow physical laws. Physiologically based pharmacokinetic (PBPK) modeling advances beyond classical compartmental approaches by using a mechanistic, physiology-focused framework.…

  • SURGIN: SURrogate-guided Generative INversion for subsurface multiphase flow with quantified uncertainty

    arXiv:2509.13189v1 Announce Type: new Abstract: We present a direct inverse modeling method named SURGIN, a SURrogate-guided Generative INversion framework tailored for subsurface multiphase flow data assimilation. Unlike existing inversion methods that require adaptation for each new observational configuration, SURGIN features a zero-shot conditional generation…

  • Jackknife Variance Estimation for Hájek-Dominated Generalized U-Statistics

    arXiv:2509.12356v1 Announce Type: cross Abstract: We prove ratio-consistency of the jackknife variance estimator, and certain variants, for a broad class of generalized U-statistics whose variance is asymptotically dominated by their Hájek projection, with the classical fixed-order case recovered as a special instance. This Hájek projection dominance condition unifies…

  • Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization

    arXiv:2509.12387v1 Announce Type: cross Abstract: Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human-like intelligence (robust, sample-efficient learning) stems from…

  • Reduced Order Modeling of Energetic Materials Using Physics-Aware Recurrent Convolutional Neural Networks in a Latent Space (LatentPARC)

    arXiv:2509.12401v1 Announce Type: cross Abstract: Physics-aware deep learning (PADL) has gained popularity for use in complex spatiotemporal dynamics (field evolution) simulations, such as those that arise frequently in computational modeling of energetic materials (EM). Here, we show that…

  • Building a Unified Intent Recognition Engine

    How modular design can simplify and scale intent classification in enterprise AI systems. The post Building a Unified Intent Recognition Engine appeared first on Towards Data Science. Shruti Tiwari

  • Using Python to Build a Calculator

    A beginner-friendly Python project to understand conditional statements, loops and recursive functions. The post Using Python to Build a Calculator appeared first on Towards Data Science. Mahnoor Javed

  • My Experiments with NotebookLM for Teaching 

    Exploring NotebookLM as a teaching companion. The post My Experiments with NotebookLM for Teaching appeared first on Towards Data Science. Parul Pandey

  • Why Your A/B Test Winner Might Just Be Random Noise

    What a coach’s warm-up trial can teach us about running better experiments. The post Why Your A/B Test Winner Might Just Be Random Noise appeared first on Towards Data Science. Pol Marin

  • Variable Selection Using Relative Importance Rankings

    arXiv:2509.10853v1 Announce Type: new Abstract: Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable ranking and filter-based selection before model creation. Specifically, we anticipate…

  • Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning

    arXiv:2509.11070v1 Announce Type: new Abstract: We develop a stochastic approximation framework for learning nonlinear operators between infinite-dimensional spaces utilizing general Mercer operator-valued kernels. Our framework encompasses two key classes: (i) compact kernels, which admit discrete spectral decompositions, and (ii) diagonal kernels of the form $K(x,x')=k(x,x')T$, where $k$…

  • Maximum diversity, weighting and invariants of time series

    arXiv:2509.11146v1 Announce Type: new Abstract: Magnitude, obtained as a special case of Euler characteristic of enriched category, represents a sense of the size of metric spaces and is related to classical notions such as cardinality, dimension, and volume. While the studies have explained the meaning of magnitude…

  • Predictable Compression Failures: Why Language Models Actually Hallucinate

    arXiv:2509.11208v1 Announce Type: new Abstract: Large language models perform near-Bayesian inference yet violate permutation invariance on exchangeable data. We resolve this by showing transformers minimize expected conditional description length (cross-entropy) over orderings, $\mathbb{E}_\pi[\ell(Y \mid \Gamma_\pi(X))]$, which admits a Kolmogorov-complexity interpretation up to additive constants, rather than the…

  • Contrastive Network Representation Learning

    arXiv:2509.11316v1 Announce Type: new Abstract: Network representation learning seeks to embed networks into a low-dimensional space while preserving the structural and semantic properties, thereby facilitating downstream tasks such as classification, trait prediction, edge identification, and community detection. Motivated by challenges in brain connectivity data analysis that is characterized by subject-specific, high-dimensional,…

  • A Visual Guide to Tuning Gradient Boosted Trees

    Introduction: My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the triplet, I’ll visually explore gradient boosted trees! There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this I’m going…

  • Implementing the Coffee Machine Project in Python Using Object Oriented Programming

    Understanding classes, objects, attributes, and methods. The post Implementing the Coffee Machine Project in Python Using Object Oriented Programming appeared first on Towards Data Science. Mahnoor Javed

  • You Only Need 3 Things to Turn AI Experiments into AI Advantage

    Trapped in a purgatory of POCs, enterprises need to focus and build just 3 pillars to realize value from AI. The post You Only Need 3 Things to Turn AI Experiments into AI Advantage appeared first on Towards Data Science. Shreshth Sharma

  • Learn How to Use Transformers with HuggingFace and SpaCy

    Mastering NLP with spaCy: Part 4. The post Learn How to Use Transformers with HuggingFace and SpaCy appeared first on Towards Data Science. Marcello Politi

  • How to Become a Machine Learning Engineer (Step-by-Step)

    Your one-stop guide to becoming a machine learning engineer. The post How to Become a Machine Learning Engineer (Step-by-Step) appeared first on Towards Data Science. Egor Howell

  • An Information-Theoretic Framework for Credit Risk Modeling: Unifying Industry Practice with Statistical Theory for Fair and Interpretable Scorecards

    arXiv:2509.09855v1 Announce Type: new Abstract: Credit risk modeling relies extensively on Weight of Evidence (WoE) and Information Value (IV) for feature engineering, and Population Stability Index (PSI) for drift monitoring, yet their theoretical foundations remain disconnected. We…

  • Why does your graph neural network fail on some graphs? Insights from exact generalisation error

    arXiv:2509.10337v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) are widely used in learning on graph-structured data, yet a principled understanding of why they succeed or fail remains elusive. While prior works have examined architectural limitations such as over-smoothing and…

  • Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance

    arXiv:2509.10166v1 Announce Type: new Abstract: In this paper, we consider the problem of computing the integral of a function on the unit sphere, in any dimension, using Monte Carlo methods. Although the methods we present are general, our guiding thread is the sliced Wasserstein…

  • Differentially Private Decentralized Dataset Synthesis Through Randomized Mixing with Correlated Noise

    arXiv:2509.10385v1 Announce Type: new Abstract: In this work, we explore differentially private synthetic data generation in a decentralized-data setting by building on the recently proposed Differentially Private Class-Centric Data Aggregation (DP-CDA). DP-CDA synthesizes data in a centralized setting by mixing multiple randomly-selected samples from…

  • Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

    arXiv:2509.09802v1 Announce Type: cross Abstract: We propose and study Sparse Polyak, a variant of Polyak’s adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak…

  • Weekly Entering & Transitioning – Thread 15 Sep, 2025 – 22 Sep, 2025

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

  • Has anyone validated synthetic financial data (Gaussian Copula vs CTGAN) in practice?

    I’ve been experimenting with generating synthetic datasets for financial indicators (GDP, inflation, unemployment, etc.) and found that CTGAN offered stronger privacy protection in simple linkage tests, but its overall analytical utility was much weaker. In contrast, Gaussian Copula provided reasonably strong privacy and…

  • Texts for creating better visualizations/presentations?

    I started working for an HR team and have been tasked with creating visualizations, both in PowerPoint (I’ve been using Seaborn and Matplotlib for visualizations) and PowerBI Dashboards. I’ve been having a lot of fun creating visualizations, but I’m looking for a few texts or maybe courses/videos about design. Anything…

  • Does Meta only have product analytics?

    I have been told that all Meta data scientists are product analysts, meaning that they do A/B tests and SQL. Despite this, I’ve been told by friends of mine that Google, Amazon, Uber… all have two different types of data scientist: one doing product analytics and…

  • Database tools and method for tree structured data?

    I have a database structure which I believe is very common, and very general, so I’m wondering how this is tackled. The database is structured like: -> Project (Name of project) -> Category (simple word, ~20 categories) -> Study. Study is a directory containing: – README with date…

  • The Rise of Semantic Entity Resolution

    Semantic entity resolution uses language models to bring an increased level of automation to schema alignment, blocking (grouping records into smaller, efficient blocks for all-pairs comparison at quadratic, n² complexity), matching and even merging duplicate nodes and edges. In the past, entity resolution systems relied on statistical tricks such…

  • No Peeking Ahead: Time-Aware Graph Fraud Detection

    How to implement leak-free graph fraud detection. The post No Peeking Ahead: Time-Aware Graph Fraud Detection appeared first on Towards Data Science. Erika G. Gonçalves

  • Building Research Agents for Tech Insights

    Using a controlled workflow, unique data & prompt chaining. The post Building Research Agents for Tech Insights appeared first on Towards Data Science. Ida Silfverskiöld

  • If we use AI to do our work – what is our job, then?

    Images. Text. Audio. There’s no modality that is not handled by AI. And AI systems reach even further, planning advertisement and marketing campaigns, automating social media postings, … Most of this was unthinkable a mere ten years ago. But then, the…

  • Docling: The Document Alchemist

    Why do we still wrestle with documents in 2025? Spend some time in any data-driven organisation, and you’ll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling those formats…

  • Generalists Can Also Dig Deep

    Ida Silfverskiöld on AI agents, RAG, evals, and what design choice ended up mattering more than expected. The post Generalists Can Also Dig Deep appeared first on Towards Data Science. TDS Editors

  • A Focused Approach to Learning SQL

    Data is everywhere, but how do you draw insights from it? Often, structured data is stored in relational databases, meaning collections of related tables of data. For instance, a company might store customer purchases in one table, customer demographics in another, and suppliers in a third table. These tables…

  • Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions using Wilson Score Kernel Density Estimation

    arXiv:2509.09238v1 Announce Type: new Abstract: Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or evaluation of real-world experiments. Furthermore, these functions are often stochastic as repeated experiments are subject…

  • Scalable extensions to given-data Sobol’ index estimators

    arXiv:2509.09078v1 Announce Type: new Abstract: Given-data methods for variance-based sensitivity analysis have significantly advanced the feasibility of Sobol’ index computation for computationally expensive models and models with many inputs. However, the limitations of existing methods still preclude their application to models with an extremely large number of inputs.…

  • Low-degree lower bounds via almost orthonormal bases

    arXiv:2509.09353v1 Announce Type: new Abstract: Low-degree polynomials have emerged as a powerful paradigm for providing evidence of statistical-computational gaps across a variety of high-dimensional statistical models [Wein25]. For detection problems, where the goal is to test a planted distribution $\mathbb{P}'$ against a null distribution $\mathbb{P}$ with independent…

  • Uncertainty Estimation using Variance-Gated Distributions

    arXiv:2509.08846v1 Announce Type: cross Abstract: Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications. A common approach is to use the predictive distribution from Bayesian or approximation models and decompose the corresponding predictive uncertainty into epistemic (model-related) and aleatoric (data-related) components. However, additive decomposition…

  • Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications

    arXiv:2509.08911v1 Announce Type: cross Abstract: The Matrix Multiplicative Weight Update (MMWU) is a seminal online learning algorithm with numerous applications. Applied to the matrix version of the Learning from Expert Advice (LEA) problem on the $d$-dimensional spectraplex, it is well known that MMWU achieves the minimax-optimal…

  • Why Context Is the New Currency in AI: From RAG to Context Engineering

    Why Context Is the New Currency in AI: From RAG to Context Engineering Context, not computation, is the real currency of intelligent systems The post Why Context Is the New Currency in AI: From RAG to Context Engineering appeared first on Towards Data Science. Sudheer Singamsetty

  • How to Analyze and Optimize Your LLMs in 3 Steps

    How to Analyze and Optimize Your LLMs in 3 Steps Learn to enhance your LLMs with my 3-step process: inspecting, improving, and iterating on your LLMs The post How to Analyze and Optimize Your LLMs in 3 Steps appeared first on Towards Data Science. Eivind Kjosbakken

  • The Crucial Role of Color Theory in Data Analysis and Visualization

    The Crucial Role of Color Theory in Data Analysis and Visualization How research-backed color principles improved clarity and storytelling in my dashboards The post The Crucial Role of Color Theory in Data Analysis and Visualization appeared first on Towards Data Science. Benjamin Nweke

  • kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions

    kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions arXiv:2509.08366v1 Announce Type: new Abstract: We study a missing-value imputation method, termed kNNSampler, that imputes a given unit’s missing response by randomly sampling from the observed responses of the $k$ most similar units to the given unit in terms of the observed covariates. This method can sample…
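
    The sampling step described in this abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the function name `knn_sampler_impute`, the Euclidean distance on covariates, and uniform sampling among the neighbours are assumptions made for the example.

```python
import math
import random


def knn_sampler_impute(x_obs, y_obs, x_miss, k, rng=None):
    """Impute one response per unit in x_miss (hypothetical helper).

    x_obs / y_obs: covariates and responses of units with observed responses.
    x_miss: covariates of units whose response is missing.
    """
    rng = rng or random.Random()
    imputed = []
    for x in x_miss:
        # Rank observed units by Euclidean distance in covariate space
        # (the distance choice is an assumption of this sketch).
        order = sorted(range(len(x_obs)), key=lambda i: math.dist(x, x_obs[i]))
        neighbours = order[:k]
        # Draw one observed response uniformly at random among the k neighbours.
        imputed.append(y_obs[rng.choice(neighbours)])
    return imputed


x_obs = [(0.0,), (1.0,), (10.0,), (11.0,)]
y_obs = [1.0, 2.0, 50.0, 60.0]
draws = knn_sampler_impute(x_obs, y_obs, [(0.5,), (10.4,)], k=2, rng=random.Random(0))
```

    Because each imputed value is a random draw rather than a neighbourhood average, repeated calls produce different imputations, all taken from the responses actually observed among the nearest units.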

  • Gaussian Process Regression — Neural Network Hybrid with Optimized Redundant Coordinates

    Gaussian Process Regression — Neural Network Hybrid with Optimized Redundant Coordinates arXiv:2509.08457v1 Announce Type: new Abstract: Recently, a Gaussian Process Regression – neural network (GPRNN) hybrid machine learning method was proposed, which is based on additive-kernel GPR in redundant coordinates constructed by rules [J. Phys. Chem. A 127 (2023) 7823]. The method combined the expressive…

  • PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research

    PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research arXiv:2509.08553v1 Announce Type: new Abstract: Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major…

  • Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation

    Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation arXiv:2509.08163v1 Announce Type: cross Abstract: Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks, such as insurance pricing or hiring score assessments, is equally…

  • A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo

    A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo arXiv:2509.08619v1 Announce Type: new Abstract: The unadjusted Langevin algorithm is widely used for sampling from complex high-dimensional distributions. It is well known to be biased, with the bias typically scaling linearly with the dimension when measured in squared Wasserstein distance. However,…

  • Is Your Training Data Representative? A Guide to Checking with PSI in Python

    Is Your Training Data Representative? A Guide to Checking with PSI in Python Comparing Variable Distributions Between Two Datasets Using Population Stability Index (PSI) and Cramér’s V. The post Is Your Training Data Representative? A Guide to Checking with PSI in Python appeared first on Towards Data Science. JUNIOR JUMBONG
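
    For readers unfamiliar with PSI, a rough sketch of the usual definition follows: bin both samples, then sum (actual share − expected share) × ln(actual share / expected share) over the bins. This is not code from the linked article; the function name `psi`, the equal-width bins over the reference sample's range, and the small floor `eps` for empty bins are assumptions of this example.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a comparison sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to width 1.0
    # when the reference sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the first or last bin.
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


reference = [float(i) for i in range(100)]
shifted = [v + 30.0 for v in reference]
same_score = psi(reference, reference)   # identical samples give zero
drift_score = psi(reference, shifted)    # a shifted sample gives a positive score
```

    Every term in the sum is non-negative, so the index is zero only when the binned shares match and grows as the two distributions diverge.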

  • Fighting Back Against Attacks in Federated Learning 

    Fighting Back Against Attacks in Federated Learning Lessons from a multi-node simulator The post Fighting Back Against Attacks in Federated Learning appeared first on Towards Data Science. Salman Toor

  • When A Difference Actually Makes A Difference

    When A Difference Actually Makes A Difference Bite-Sized Analytics for Business Decision-Makers (1) The post When A Difference Actually Makes A Difference appeared first on Towards Data Science. Mena Wang

  • Why Task-Based Evaluations Matter

    Why Task-Based Evaluations Matter This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype to Production: Evaluation Strategies for Agentic Applications. Task-based evaluations, which measure an AI system’s performance in use-case-specific, real-world settings, are underadopted and understudied. There is still an outsized focus in AI literature on foundation model benchmarks.…

  • How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n

    How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n Email → n8n → LangGraph → FastAPI: turning budget requests into optimised CAPEX portfolios that maximise ROI for decision-makers. The post How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n appeared first…

  • NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice

    NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice arXiv:2509.07123v1 Announce Type: new Abstract: Nested logit (NL) has been commonly used for discrete choice analysis, including a wide range of applications such as travel mode choice, automobile ownership, or location decisions. However, the classical NL models are restricted by…

  • ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression

    ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression arXiv:2509.07108v1 Announce Type: new Abstract: Survival analysis is a fundamental tool for modeling time-to-event outcomes in healthcare. Recent advances have introduced flexible neural network approaches for improved predictive performance. However, most of these models do not provide interpretable insights into the association between exposures and…

  • Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space

    Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space arXiv:2509.07289v1 Announce Type: new Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives (such as invariance to augmentations, variance preservation, and feature decorrelation) without requiring labels. However, most existing methods operate in Euclidean space, limiting their ability to capture…

  • Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression

    Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression arXiv:2509.07300v1 Announce Type: new Abstract: Recent advances in neuroimaging analysis have enabled accurate decoding of mental state from brain activation patterns during functional magnetic resonance imaging scans. A commonly applied tool for this purpose is principal components regression regularized with the least absolute shrinkage and…

  • Asynchronous Gossip Algorithms for Rank-Based Statistical Methods

    Asynchronous Gossip Algorithms for Rank-Based Statistical Methods arXiv:2509.07543v1 Announce Type: new Abstract: As decentralized AI and edge intelligence become increasingly prevalent, ensuring robustness and trustworthiness in such distributed settings has become a critical issue, especially in the presence of corrupted or adversarial data. Traditional decentralized algorithms are vulnerable to data contamination as they typically rely on…

  • LangChain for EDA: Build a CSV Sanity-Check Agent in Python

    LangChain for EDA: Build a CSV Sanity-Check Agent in Python A practical LangChain tutorial for data scientists to inspect CSVs The post LangChain for EDA: Build a CSV Sanity-Check Agent in Python appeared first on Towards Data Science. Sarah Schürch

  • How to Build Effective AI Agents to Process Millions of Requests

    How to Build Effective AI Agents to Process Millions of Requests Learn how to build production-ready systems using AI agents The post How to Build Effective AI Agents to Process Millions of Requests appeared first on Towards Data Science. Eivind Kjosbakken

  • The Hungarian Algorithm and Its Applications in Computer Vision

    The Hungarian Algorithm and Its Applications in Computer Vision Introduction Multi-object tracking (MOT) is a task in which an algorithm must detect and track multiple objects in a video. Most known algorithms are based on using simple detectors (e.g. YOLO) designed for processing individual images. The overall method involves separately using a detector on consecutive video…

  • LangGraph 201: Adding Human Oversight to Your Deep Research Agent

    LangGraph 201: Adding Human Oversight to Your Deep Research Agent Losing control of your AI agent in the middle of the workflow is a common pain point. If you have built your own agentic applications, you’ve most likely already seen this happen. While LLMs nowadays are incredibly capable, they’re still not quite there yet to…

  • Exploring Merit Order and Marginal Abatement Cost Curve in Python

    Exploring Merit Order and Marginal Abatement Cost Curve in Python To achieve the global temperature limit goals of 1.5°C by the end of the century set by the Paris Agreement, different institutions have come up with different scenarios. There is a consensus among the mitigation scenarios that the share of low-carbon technologies such as renewable energy needs…

  • Cryo-EM as a Stochastic Inverse Problem

    Cryo-EM as a Stochastic Inverse Problem arXiv:2509.05541v1 Announce Type: new Abstract: Cryo-electron microscopy (Cryo-EM) enables high-resolution imaging of biomolecules, but structural heterogeneity remains a major challenge in 3D reconstruction. Traditional methods assume a discrete set of conformations, limiting their ability to recover continuous structural variability. In this work, we formulate cryo-EM reconstruction as a stochastic…