Category: aimlds
-
Model-free algorithms for fast node clustering in SBM type graphs and application to social role inference in animals
arXiv:2509.15989v1 Announce Type: new Abstract: We propose a novel family of model-free algorithms for node clustering and parameter inference in graphs generated from the Stochastic Block Model (SBM), a fundamental framework in community detection. Drawing inspiration from…
-
What is a good matching of probability measures? A counterfactual lens on transport maps
arXiv:2509.16027v1 Announce Type: new Abstract: Coupling probability measures lies at the core of many problems in statistics and machine learning, from domain adaptation to transfer learning and causal inference. Yet, even when restricted to deterministic transports, such couplings are not identifiable:…
-
Weekly Entering & Transitioning – Thread 22 Sep, 2025 – 29 Sep, 2025
Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…
-
Need input from mid-career data Scientists (2-5 year range)
I am a DS with 2YOE (plus about 6 coops). I’m looking for feedback from folks who have specifically transitioned out of early career and into the mid-career phase. (Unfortunately I don’t have any in my immediate network.) Context: I’m coming up to 2 years in my role and have…
-
Is it due to the tech recession?
We know that in many companies Data Scientists are Product Analytics / Data Analysts. I thought it was because MLEs had absorbed the duties of DSs, but I have noticed that this may not be exactly the case. There are basically three distinct roles: Data Analyst / Product…
-
What’s the right thing to say to salary expectations question?
I usually come across two types of scenarios here, and I am not sure of the best way to deal with them. I ask for a range and they give you a range. Should you just say you’re okay with the range? But what if I make…
-
Updated based on subreddit feedback. Applying for mid-senior based roles. Thank you
Submitted by /u/StormyT.
-
Data Visualization Explained: What It Is and Why It Matters
A brief introduction to data visualization and its importance in today’s technological landscape. By Murtaza Ali, Towards Data Science.
-
Python Can Now Call Mojo
Boost your runtimes with lightning-fast Mojo code. By Thomas Reid, Towards Data Science.
-
Building LLM Apps That Can See, Think, and Integrate: Using o3 with Multimodal Input and Structured Output
A hands-on example of building a time-series anomaly detection system entirely through visualization and prompting. First appeared on Towards…
-
The SyncNet Research Paper, Clearly Explained
A Deep Dive into “Out of Time: Automated Lip Sync in the Wild”. By Aman Agrawal, Towards Data Science.
-
An Interactive Guide to 4 Fundamental Computer Vision Tasks Using Transformers
An overview of 4 fundamental computer vision tasks (image classification, image segmentation, image captioning and visual question answering) with transformer models. Compare ViT, DETR, BLIP, and ViLT performance interactively by providing a practical Streamlit app implementation guide.
-
How to Select the 5 Most Relevant Documents for AI Search
Improve the document retrieval step of your RAG pipeline. By Eivind Kjosbakken, Towards Data Science.
-
Towards universal property prediction in Cartesian space: TACE is all you need
arXiv:2509.14961v1 Announce Type: new Abstract: Machine learning has revolutionized atomistic simulations and materials science, yet current approaches often depend on spherical-harmonic representations. Here we introduce the Tensor Atomic Cluster Expansion and Tensor Moment Potential, the first unified framework formulated entirely in Cartesian space…
-
Benefits of Online Tilted Empirical Risk Minimization: A Case Study of Outlier Detection and Robust Regression
arXiv:2509.15141v1 Announce Type: new Abstract: Empirical Risk Minimization (ERM) is a foundational framework for supervised learning but primarily optimizes average-case performance, often neglecting fairness and robustness considerations. Tilted Empirical Risk Minimization (TERM) extends ERM by introducing an exponential tilt…
-
Learning Rate Should Scale Inversely with High-Order Data Moments in High-Dimensional Online Independent Component Analysis
arXiv:2509.15127v1 Announce Type: new Abstract: We investigate the impact of high-order moments on the learning dynamics of an online Independent Component Analysis (ICA) algorithm under a high-dimensional data model composed of a weighted sum of two non-Gaussian random variables. This…
-
Next-Depth Lookahead Tree
arXiv:2509.15143v1 Announce Type: new Abstract: This paper proposes the Next-Depth Lookahead Tree (NDLT), a single-tree model designed to improve performance by evaluating node splits not only at the node being optimized but also by evaluating the quality of the next depth level. Jaeho Lee, Kangjin Kim, Gyeong Taek Lee
-
Asymptotic Study of In-context Learning with Random Transformers through Equivalent Models
arXiv:2509.15152v1 Announce Type: new Abstract: We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head where the first layer is randomly initialized and fixed while the…
-
How I Built and Deployed an App in 2 days with Lovable, Supabase, and Netlify
All ideas can be turned into action in a matter of time now. By Soner Yıldırım, Towards Data Science.
-
From Python to JavaScript: A Playbook for Data Analytics in n8n with Code Node Examples
Learn the basics of JavaScript through tiny n8n Code node snippets for sales data analytics. By Samir Saci, Towards Data Science.
-
Rapid Prototyping of Chatbots with Streamlit and Chainlit
End-to-end demos, comparison of pros and cons, and practical recommendations. By Chinmay Kakatkar, Towards Data Science.
-
From Amnesia to Awareness: Giving Retrieval-Only Chatbots Memory
Achieve natural multi-turn conversations without sacrificing content control. By Nicole Ren, Towards Data Science.
-
On the Rate of Gaussian Approximation for Linear Regression Problems
arXiv:2509.14039v1 Announce Type: new Abstract: In this paper, we consider the problem of Gaussian approximation for the online linear regression task. We derive the corresponding rates for the setting of a constant learning rate and study the explicit dependence of the convergence rate upon the…
-
Field of View Enhanced Signal Dependent Binauralization with Mixture of Experts Framework for Continuous Source Motion
arXiv:2509.13548v1 Announce Type: cross Abstract: We propose a novel mixture of experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress…
-
Imputation-Powered Inference
arXiv:2509.13778v1 Announce Type: cross Abstract: Modern multi-modal and multi-site data frequently suffer from blockwise missingness, where subsets of features are missing for groups of individuals, creating complex patterns that challenge standard inference methods. Existing approaches have critical limitations: complete-case analysis discards informative data and is potentially biased; doubly robust estimators for non-monotone missingness-where…
-
Towards a Physics Foundation Model
arXiv:2509.13805v1 Announce Type: cross Abstract: Foundation models have revolutionized natural language processing through a “train once, deploy anywhere” paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative, democratizing access to high-fidelity simulations, accelerating scientific discovery,…
-
Holdout cross-validation for large non-Gaussian covariance matrix estimation using Weingarten calculus
arXiv:2509.13923v1 Announce Type: cross Abstract: Cross-validation is one of the most widely used methods for model selection and evaluation; its efficiency for large covariance matrix estimation appears robust in practice, but little is known about the theoretical behavior of its error. In this paper,…
-
Analysis of Sales Shift in Retail with Causal Impact: A Case Study at Carrefour
Applying causal inference to measure the effect of product unavailability on retail sales at Carrefour. By Thanh Liêm NGUYEN, Towards Data Science.
-
RAG Explained: Understanding Embeddings, Similarity, and Retrieval
Let’s take a closer look at how the retrieval mechanism works. By Maria Mouschoutzi, Towards Data Science.
-
Evaluating Your RAG Solution
A guide to building and evaluating RAG solutions by leveraging LLM-as-a-Judge capabilities. By Alex Davis, Towards Data Science.
-
ROC AUC Explained: A Beginner’s Guide to Evaluating Classification Models
Understand how ROC curves and AUC help you go beyond accuracy with visuals and examples. By Nikhil Dasari, Towards Data Science.
-
PBPK-iPINNs: Inverse Physics-Informed Neural Networks for Physiologically Based Pharmacokinetic Brain Models
arXiv:2509.12666v1 Announce Type: new Abstract: Physics-Informed Neural Networks (PINNs) leverage machine learning with differential equations to solve direct and inverse problems, ensuring predictions follow physical laws. Physiologically based pharmacokinetic (PBPK) modeling advances beyond classical compartmental approaches by using a mechanistic, physiology-focused framework.…
-
SURGIN: SURrogate-guided Generative INversion for subsurface multiphase flow with quantified uncertainty
arXiv:2509.13189v1 Announce Type: new Abstract: We present a direct inverse modeling method named SURGIN, a SURrogate-guided Generative INversion framework tailored for subsurface multiphase flow data assimilation. Unlike existing inversion methods that require adaptation for each new observational configuration, SURGIN features a zero-shot conditional generation…
-
Jackknife Variance Estimation for Hájek-Dominated Generalized U-Statistics
arXiv:2509.12356v1 Announce Type: cross Abstract: We prove ratio-consistency of the jackknife variance estimator, and certain variants, for a broad class of generalized U-statistics whose variance is asymptotically dominated by their Hájek projection, with the classical fixed-order case recovered as a special instance. This Hájek projection dominance condition unifies…
-
Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization
arXiv:2509.12387v1 Announce Type: cross Abstract: Modern deep learning models excel at pattern recognition but remain fundamentally limited by their reliance on spurious correlations, leading to poor generalization and a demand for massive datasets. We argue that a key ingredient for human-like intelligence (robust, sample-efficient learning) stems from…
-
Reduced Order Modeling of Energetic Materials Using Physics-Aware Recurrent Convolutional Neural Networks in a Latent Space (LatentPARC)
arXiv:2509.12401v1 Announce Type: cross Abstract: Physics-aware deep learning (PADL) has gained popularity for use in complex spatiotemporal dynamics (field evolution) simulations, such as those that arise frequently in computational modeling of energetic materials (EM). Here, we show that…
-
Building a Unified Intent Recognition Engine
How modular design can simplify and scale intent classification in enterprise AI systems. By Shruti Tiwari, Towards Data Science.
-
Using Python to Build a Calculator
A beginner-friendly Python project to understand conditional statements, loops and recursive functions. By Mahnoor Javed, Towards Data Science.
-
My Experiments with NotebookLM for Teaching
Exploring NotebookLM as a teaching companion. By Parul Pandey, Towards Data Science.
-
Why Your A/B Test Winner Might Just Be Random Noise
What a coach’s warm-up trial can teach us about running better experiments. By Pol Marin, Towards Data Science.
-
Variable Selection Using Relative Importance Rankings
arXiv:2509.10853v1 Announce Type: new Abstract: Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable ranking and filter-based selection before model creation. Specifically, we anticipate…
-
Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning
arXiv:2509.11070v1 Announce Type: new Abstract: We develop a stochastic approximation framework for learning nonlinear operators between infinite-dimensional spaces utilizing general Mercer operator-valued kernels. Our framework encompasses two key classes: (i) compact kernels, which admit discrete spectral decompositions, and (ii) diagonal kernels of the form $K(x,x')=k(x,x')T$, where $k$…
-
Maximum diversity, weighting and invariants of time series
arXiv:2509.11146v1 Announce Type: new Abstract: Magnitude, obtained as a special case of Euler characteristic of enriched category, represents a sense of the size of metric spaces and is related to classical notions such as cardinality, dimension, and volume. While the studies have explained the meaning of magnitude…
-
Predictable Compression Failures: Why Language Models Actually Hallucinate
arXiv:2509.11208v1 Announce Type: new Abstract: Large language models perform near-Bayesian inference yet violate permutation invariance on exchangeable data. We resolve this by showing transformers minimize expected conditional description length (cross-entropy) over orderings, $\mathbb{E}_\pi[\ell(Y \mid \Gamma_\pi(X))]$, which admits a Kolmogorov-complexity interpretation up to additive constants, rather than the…
-
Contrastive Network Representation Learning
arXiv:2509.11316v1 Announce Type: new Abstract: Network representation learning seeks to embed networks into a low-dimensional space while preserving the structural and semantic properties, thereby facilitating downstream tasks such as classification, trait prediction, edge identification, and community detection. Motivated by challenges in brain connectivity data analysis that is characterized by subject-specific, high-dimensional,…
-
A Visual Guide to Tuning Gradient Boosted Trees
Introduction: My previous posts looked at the bog-standard decision tree and the wonder of a random forest. Now, to complete the triplet, I’ll visually explore gradient boosted trees! There are a bunch of gradient boosted tree libraries, including XGBoost, CatBoost, and LightGBM. However, for this I’m going…
-
Implementing the Coffee Machine Project in Python Using Object Oriented Programming
Understanding classes, objects, attributes, and methods. By Mahnoor Javed, Towards Data Science.
-
You Only Need 3 Things to Turn AI Experiments into AI Advantage
Trapped in a purgatory of POCs, enterprises need to focus and build just 3 pillars to realize value from AI. By Shreshth Sharma, Towards Data Science.
-
Learn How to Use Transformers with HuggingFace and SpaCy
Mastering NLP with spaCy: Part 4. By Marcello Politi, Towards Data Science.
-
How to Become a Machine Learning Engineer (Step-by-Step)
Your one-stop guide to becoming a machine learning engineer. By Egor Howell, Towards Data Science.
-
An Information-Theoretic Framework for Credit Risk Modeling: Unifying Industry Practice with Statistical Theory for Fair and Interpretable Scorecards
arXiv:2509.09855v1 Announce Type: new Abstract: Credit risk modeling relies extensively on Weight of Evidence (WoE) and Information Value (IV) for feature engineering, and Population Stability Index (PSI) for drift monitoring, yet their theoretical foundations remain disconnected. We…
-
Why does your graph neural network fail on some graphs? Insights from exact generalisation error
arXiv:2509.10337v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) are widely used in learning on graph-structured data, yet a principled understanding of why they succeed or fail remains elusive. While prior works have examined architectural limitations such as over-smoothing and…
-
Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance
arXiv:2509.10166v1 Announce Type: new Abstract: In this paper, we consider the problem of computing the integral of a function on the unit sphere, in any dimension, using Monte Carlo methods. Although the methods we present are general, our guiding thread is the sliced Wasserstein…
-
Differentially Private Decentralized Dataset Synthesis Through Randomized Mixing with Correlated Noise
arXiv:2509.10385v1 Announce Type: new Abstract: In this work, we explore differentially private synthetic data generation in a decentralized-data setting by building on the recently proposed Differentially Private Class-Centric Data Aggregation (DP-CDA). DP-CDA synthesizes data in a centralized setting by mixing multiple randomly-selected samples from…
-
Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation
arXiv:2509.09802v1 Announce Type: cross Abstract: We propose and study Sparse Polyak, a variant of Polyak’s adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak…
-
Weekly Entering & Transitioning – Thread 15 Sep, 2025 – 22 Sep, 2025
Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…
-
Has anyone validated synthetic financial data (Gaussian Copula vs CTGAN) in practice?
I’ve been experimenting with generating synthetic datasets for financial indicators (GDP, inflation, unemployment, etc.) and found that CTGAN offered stronger privacy protection in simple linkage tests, but its overall analytical utility was much weaker. In contrast, Gaussian Copula provided reasonably strong privacy and…
-
Texts for creating better visualizations/presentations?
I started working for an HR team and have been tasked with creating visualizations, both in PowerPoint (I’ve been using Seaborn and Matplotlib for visualizations) and PowerBI Dashboards. I’ve been having a lot of fun creating visualizations, but I’m looking for a few texts or maybe courses/videos about design. Anything…
-
Database tools and method for tree structured data?
I have a database structure which I believe is very common, and very general, so I’m wondering how this is tackled. The database is structured like: -> Project (Name of project) -> Category (simple word, ~20 categories) -> Study. Study is a directory containing: – README with date…
-
The Rise of Semantic Entity Resolution
Semantic entity resolution uses language models to bring an increased level of automation to schema alignment, blocking (grouping records into smaller, efficient blocks for all-pairs comparison at quadratic, n² complexity), matching and even merging duplicate nodes and edges. In the past, entity resolution systems relied on statistical tricks such…
-
No Peeking Ahead: Time-Aware Graph Fraud Detection
How to implement leak-free graph fraud detection. By Erika G. Gonçalves, Towards Data Science.
-
Building Research Agents for Tech Insights
Using a controlled workflow, unique data & prompt chaining. By Ida Silfverskiöld, Towards Data Science.
-
If we use AI to do our work – what is our job, then?
Images. Text. Audio. There’s no modality that is not handled by AI. And AI systems reach even further, planning advertisement and marketing campaigns, automating social media postings, … Most of this was unthinkable a mere ten years ago. But then, the…
-
Docling: The Document Alchemist
Why do we still wrestle with documents in 2025? Spend some time in any data-driven organisation, and you’ll encounter a host of PDFs, Word files, PowerPoints, half-scanned images, handwritten notes, and the occasional surprise CSV lurking in a SharePoint folder. Business and data analysts waste hours converting, splitting, and cajoling those formats…
-
A Focused Approach to Learning SQL
Data is everywhere, but how do you draw insights from it? Often, structured data is stored in relational databases, meaning collections of related tables of data. For instance, a company might store customer purchases in one table, customer demographics in another, and suppliers in a third table. These tables…
-
Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions using Wilson Score Kernel Density Estimation
arXiv:2509.09238v1 Announce Type: new Abstract: Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or evaluation of real-world experiments. Furthermore, these functions are often stochastic as repeated experiments are subject…
-
Scalable extensions to given-data Sobol’ index estimators
arXiv:2509.09078v1 Announce Type: new Abstract: Given-data methods for variance-based sensitivity analysis have significantly advanced the feasibility of Sobol’ index computation for computationally expensive models and models with many inputs. However, the limitations of existing methods still preclude their application to models with an extremely large number of inputs.…
-
Low-degree lower bounds via almost orthonormal bases
arXiv:2509.09353v1 Announce Type: new Abstract: Low-degree polynomials have emerged as a powerful paradigm for providing evidence of statistical-computational gaps across a variety of high-dimensional statistical models [Wein25]. For detection problems, where the goal is to test a planted distribution $\mathbb{P}'$ against a null distribution $\mathbb{P}$ with independent…
-
Uncertainty Estimation using Variance-Gated Distributions
arXiv:2509.08846v1 Announce Type: cross Abstract: Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications. A common approach is to use the predictive distribution from Bayesian or approximation models and decompose the corresponding predictive uncertainty into epistemic (model-related) and aleatoric (data-related) components. However, additive decomposition…
-
Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications
arXiv:2509.08911v1 Announce Type: cross Abstract: The Matrix Multiplicative Weight Update (MMWU) is a seminal online learning algorithm with numerous applications. Applied to the matrix version of the Learning from Expert Advice (LEA) problem on the $d$-dimensional spectraplex, it is well known that MMWU achieves the minimax-optimal…
-
Why Context Is the New Currency in AI: From RAG to Context Engineering
Context, not computation, is the real currency of intelligent systems. By Sudheer Singamsetty, Towards Data Science.
-
How to Analyze and Optimize Your LLMs in 3 Steps
Learn to enhance your LLMs with my 3-step process: inspecting, improving and iterating on your LLMs. By Eivind Kjosbakken, Towards Data Science.
-
The Crucial Role of Color Theory in Data Analysis and Visualization
How research-backed color principles improved clarity and storytelling in my dashboards. By Benjamin Nweke, Towards Data Science.
-
kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
arXiv:2509.08366v1 Announce Type: new Abstract: We study a missing-value imputation method, termed kNNSampler, that imputes a given unit’s missing response by randomly sampling from the observed responses of the $k$ most similar units to the given unit in terms of the observed covariates. This method can sample…
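The sampling step described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `knn_sampler_impute`, the use of Euclidean distance as the similarity measure, and the uniform draw among the neighbours are all assumptions made for the example.

```python
import numpy as np

def knn_sampler_impute(X, y, k=5, rng=None):
    """Fill NaN entries of y by sampling from the responses of nearby units.

    Hypothetical helper: X holds fully observed covariates (n x d),
    y holds responses with NaN marking the missing ones.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    observed = ~np.isnan(y)
    X_obs, y_obs = X[observed], y[observed]
    if y_obs.size == 0:
        raise ValueError("at least one observed response is required")
    k = min(k, y_obs.size)

    imputed = y.copy()
    for i in np.flatnonzero(~observed):
        # Assumed similarity: Euclidean distance on the covariates.
        dists = np.linalg.norm(X_obs - X[i], axis=1)
        neighbours = np.argsort(dists)[:k]
        # Draw one observed response uniformly from the k nearest units.
        imputed[i] = rng.choice(y_obs[neighbours])
    return imputed

# Small usage example with two well-separated groups of units.
X_demo = [[0.0], [0.1], [0.2], [10.0], [10.1], [0.05], [10.05]]
y_demo = [1.0, 1.0, 1.0, 5.0, 5.0, float("nan"), float("nan")]
filled = knn_sampler_impute(X_demo, y_demo, k=2, rng=0)
```

Because each imputed value is a random draw rather than a neighbour average, repeated calls with different seeds give different completions, which is what distinguishes this kind of sampler from plain kNN mean imputation.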
-
Gaussian Process Regression — Neural Network Hybrid with Optimized Redundant Coordinates
arXiv:2509.08457v1 Announce Type: new Abstract: Recently, a Gaussian Process Regression – neural network (GPRNN) hybrid machine learning method was proposed, which is based on additive-kernel GPR in redundant coordinates constructed by rules [J. Phys. Chem. A 127 (2023) 7823]. The method combined the expressive…
-
PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research
PEHRT: A Common Pipeline for Harmonizing Electronic Health Record data for Translational Research arXiv:2509.08553v1 Announce Type: new Abstract: Integrative analysis of multi-institutional Electronic Health Record (EHR) data enhances the reliability and generalizability of translational research by leveraging larger, more diverse patient cohorts and incorporating multiple data modalities. However, harmonizing EHR data across institutions poses major…
-
Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation
Machine Learning with Multitype Protected Attributes: Intersectional Fairness through Regularisation arXiv:2509.08163v1 Announce Type: cross Abstract: Ensuring equitable treatment (fairness) across protected attributes (such as gender or ethnicity) is a critical issue in machine learning. Most existing literature focuses on binary classification, but achieving fairness in regression tasks, such as insurance pricing or hiring score assessments, is equally…
-
A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo
A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo arXiv:2509.08619v1 Announce Type: new Abstract: The unadjusted Langevin algorithm is widely used for sampling from complex high-dimensional distributions. It is well known to be biased, with the bias typically scaling linearly with the dimension when measured in squared Wasserstein distance. However,…
-
Is Your Training Data Representative? A Guide to Checking with PSI in Python
Is Your Training Data Representative? A Guide to Checking with PSI in Python Comparing Variable Distributions Between Two Datasets Using Population Stability Index (PSI) and Cramér’s V. The post Is Your Training Data Representative? A Guide to Checking with PSI in Python appeared first on Towards Data Science. JUNIOR JUMBONG
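For readers unfamiliar with PSI, a minimal sketch for a single numeric variable follows. It uses the standard definition, the sum over bins of (actual share - expected share) × ln(actual share / expected share). It is not taken from the linked post: the function name `psi`, the equal-width bins over the reference range, and the small `eps` floor for empty bins are assumptions made for illustration.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a comparison sample.

    Assumptions (illustrative only): equal-width bins spanning the reference
    sample, out-of-range values clipped into the edge bins, and empty bins
    floored at eps so the logarithm stays defined.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant reference

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(n_bins - 1, idx))] += 1
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    # Every term is non-negative, so PSI is 0 only when the binned shares match.
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Calling `psi(train_values, new_values)` on two lists of numbers returns 0 when the binned distributions coincide and grows as they drift apart; `train_values` and `new_values` are placeholder names.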
-
Fighting Back Against Attacks in Federated Learning
Fighting Back Against Attacks in Federated Learning Lessons from a multi-node simulator. The post Fighting Back Against Attacks in Federated Learning appeared first on Towards Data Science. Salman Toor
-
When A Difference Actually Makes A Difference
When A Difference Actually Makes A Difference Bite-Sized Analytics for Business Decision-Makers (1). The post When A Difference Actually Makes A Difference appeared first on Towards Data Science. Mena Wang
-
Why Task-Based Evaluations Matter
Why Task-Based Evaluations Matter This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype to Production: Evaluation Strategies for Agentic Applications. Task-based evaluations, which measure an AI system’s performance in use-case-specific, real-world settings, are underadopted and understudied. There is still an outsized focus in AI literature on foundation model benchmarks.…
-
How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n
How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n Email → n8n → LangGraph → FastAPI: turning budget requests into optimised CAPEX portfolios that maximise ROI for decision-makers. The post How to Build an AI Budget-Planning Optimizer for Your 2026 CAPEX Review: LangGraph, FastAPI, and n8n appeared first…
-
NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice
NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice arXiv:2509.07123v1 Announce Type: new Abstract: Nested logit (NL) has been commonly used for discrete choice analysis, including a wide range of applications such as travel mode choice, automobile ownership, or location decisions. However, the classical NL models are restricted by…
-
ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression
ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression arXiv:2509.07108v1 Announce Type: new Abstract: Survival analysis is a fundamental tool for modeling time-to-event outcomes in healthcare. Recent advances have introduced flexible neural network approaches for improved predictive performance. However, most of these models do not provide interpretable insights into the association between exposures and…
-
Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space
Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space arXiv:2509.07289v1 Announce Type: new Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives (such as invariance to augmentations, variance preservation, and feature decorrelation) without requiring labels. However, most existing methods operate in Euclidean space, limiting their ability to capture…
-
Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression
Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression arXiv:2509.07300v1 Announce Type: new Abstract: Recent advances in neuroimaging analysis have enabled accurate decoding of mental state from brain activation patterns during functional magnetic resonance imaging scans. A commonly applied tool for this purpose is principal components regression regularized with the least absolute shrinkage and…
-
Asynchronous Gossip Algorithms for Rank-Based Statistical Methods
Asynchronous Gossip Algorithms for Rank-Based Statistical Methods arXiv:2509.07543v1 Announce Type: new Abstract: As decentralized AI and edge intelligence become increasingly prevalent, ensuring robustness and trustworthiness in such distributed settings has become a critical issue, especially in the presence of corrupted or adversarial data. Traditional decentralized algorithms are vulnerable to data contamination as they typically rely on…
-
LangChain for EDA: Build a CSV Sanity-Check Agent in Python
LangChain for EDA: Build a CSV Sanity-Check Agent in Python A practical LangChain tutorial for data scientists to inspect CSVs. The post LangChain for EDA: Build a CSV Sanity-Check Agent in Python appeared first on Towards Data Science. Sarah Schürch
-
How to Build Effective AI Agents to Process Millions of Requests
How to Build Effective AI Agents to Process Millions of Requests Learn how to build production-ready systems using AI agents. The post How to Build Effective AI Agents to Process Millions of Requests appeared first on Towards Data Science. Eivind Kjosbakken
-
The Hungarian Algorithm and Its Applications in Computer Vision
The Hungarian Algorithm and Its Applications in Computer Vision Multi-object tracking (MOT) is a task in which an algorithm must detect and track multiple objects in a video. Most known algorithms are based on using simple detectors (e.g. YOLO) designed for processing individual images. The overall method involves separately using a detector on consecutive video…
-
LangGraph 201: Adding Human Oversight to Your Deep Research Agent
LangGraph 201: Adding Human Oversight to Your Deep Research Agent Losing control of your AI agent in the middle of the workflow is a common pain point. If you have built your own agentic applications, you’ve most likely already seen this happen. While LLMs nowadays are incredibly capable, they’re still not quite there yet to…
-
Exploring Merit Order and Marginal Abatement Cost Curve in Python
Exploring Merit Order and Marginal Abatement Cost Curve in Python To achieve the global temperature limit goals of 1.5°C by the end of the century set by the Paris Agreement, different institutions have come up with different scenarios. There is a consensus among the mitigation scenarios that the share of low-carbon technologies such as renewable energy needs…
-
Cryo-EM as a Stochastic Inverse Problem
Cryo-EM as a Stochastic Inverse Problem arXiv:2509.05541v1 Announce Type: new Abstract: Cryo-electron microscopy (Cryo-EM) enables high-resolution imaging of biomolecules, but structural heterogeneity remains a major challenge in 3D reconstruction. Traditional methods assume a discrete set of conformations, limiting their ability to recover continuous structural variability. In this work, we formulate cryo-EM reconstruction as a stochastic…