Category: aimlds

Graph neural networks for residential location choice: connection to classical logit models

arXiv:2507.21334v1 Announce Type: new Abstract: Researchers have adopted deep learning for classical discrete choice analysis as it can capture complex feature relationships and achieve higher predictive performance. However, the existing deep learning approaches cannot explicitly capture the relationship among choice alternatives, which has…

July 30, 2025
From Sublinear to Linear: Fast Convergence in Deep Networks via Locally Polyak-Lojasiewicz Regions

arXiv:2507.21429v1 Announce Type: new Abstract: The convergence of gradient descent (GD) on the non-convex loss landscapes of deep neural networks (DNNs) presents a fundamental theoretical challenge. While recent work has established that GD converges to a stationary point at a sublinear rate…

July 30, 2025
From Global to Local: A Scalable Benchmark for Local Posterior Sampling

arXiv:2507.21449v1 Announce Type: new Abstract: Degeneracy is an inherent feature of the loss landscape of neural networks, but it is not well understood how stochastic gradient MCMC (SGMCMC) algorithms interact with this degeneracy. In particular, current global convergence guarantees for common SGMCMC algorithms rely…

July 30, 2025
Measuring Sample Quality with Copula Discrepancies

arXiv:2507.21434v1 Announce Type: new Abstract: The scalable Markov chain Monte Carlo (MCMC) algorithms that underpin modern Bayesian machine learning, such as Stochastic Gradient Langevin Dynamics (SGLD), sacrifice asymptotic exactness for computational speed, creating a critical diagnostic gap: traditional sample quality measures fail catastrophically when applied to biased samplers. While…

July 30, 2025
Stochastic forest transition model dynamics and parameter estimation via deep learning

arXiv:2507.21486v1 Announce Type: new Abstract: Forest transitions, characterized by dynamic shifts between forest, agricultural, and abandoned lands, are complex phenomena. This study developed a stochastic differential equation model to capture the intricate dynamics of these transitions. We established the existence of global positive solutions…

July 30, 2025
Skills vs. AI Skills

Which skills are timeless, and where is the gap? The post Skills vs. AI Skills appeared first on Towards Data Science. Marina Tosic

July 30, 2025
How Your Prompts Lead AI Astray

Practical tips to recognise and avoid prompt bias. The post How Your Prompts Lead AI Astray appeared first on Towards Data Science. Daphne de Klerk

July 30, 2025
How to Evaluate Graph Retrieval in MCP Agentic Systems

A framework for measuring retrieval quality in Model Context Protocol agents. The post How to Evaluate Graph Retrieval in MCP Agentic Systems appeared first on Towards Data Science. Tomaz Bratanic

July 30, 2025
Physics-Informed Neural Networks for Inverse PDE Problems

Solving the Heat Equation using DeepXDE. The post Physics-Informed Neural Networks for Inverse PDE Problems appeared first on Towards Data Science. Marco Hening Tallarico

July 30, 2025
Mastering NLP with spaCy — Part 1

Learn about tokenization, lemmatization and the core operations. The post Mastering NLP with spaCy — Part 1 appeared first on Towards Data Science. Marcello Politi

July 30, 2025
Bayesian symbolic regression: Automated equation discovery from a physicists’ perspective

arXiv:2507.19540v1 Announce Type: new Abstract: Symbolic regression automates the process of learning closed-form mathematical models from data. Standard approaches to symbolic regression, as well as newer deep learning approaches, rely on heuristic model selection criteria, heuristic regularization, and heuristic exploration of model space. Here, we…

July 29, 2025
Adaptive Bayesian Data-Driven Design of Reliable Solder Joints for Micro-electronic Devices

arXiv:2507.19663v1 Announce Type: new Abstract: Solder joint reliability related to failures due to thermomechanical loading is a critically important yet physically complex engineering problem. As a result, simulated behavior is oftentimes computationally expensive. In an increasingly data-driven world, the usage of efficient data-driven design…

July 29, 2025
Sparse-mode Dynamic Mode Decomposition for Disambiguating Local and Global Structures

arXiv:2507.19787v1 Announce Type: new Abstract: The dynamic mode decomposition (DMD) is a data-driven approach that extracts the dominant features from spatiotemporal data. In this work, we introduce sparse-mode DMD, a new variant of the optimized DMD framework that specifically leverages sparsity-promoting regularization in order to…

July 29, 2025
Bag of Coins: A Statistical Probe into Neural Confidence Structures

arXiv:2507.19774v1 Announce Type: new Abstract: Modern neural networks, despite their high accuracy, often produce poorly calibrated confidence scores, limiting their reliability in high-stakes applications. Existing calibration methods typically post-process model outputs without interrogating the internal consistency of the predictions themselves. In this work, we introduce…

July 29, 2025
Predicting Parkinson’s Disease Progression Using Statistical and Neural Mixed Effects Models: A Comparative Study on Longitudinal Biomarkers

arXiv:2507.20058v1 Announce Type: new Abstract: Predicting Parkinson’s Disease (PD) progression is crucial, and voice biomarkers offer a non-invasive method for tracking symptom severity (UPDRS scores) through telemonitoring. Analyzing this longitudinal data is challenging due to within-subject correlations and…

July 29, 2025
The Stanford Framework That Turns AI into Your PM Superpower

A human-centric guide to AI automation for product managers. The post The Stanford Framework That Turns AI into Your PM Superpower appeared first on Towards Data Science. Rahul Vir

July 29, 2025
Talk to my Agent

The exciting new world of designing conversation driven APIs for LLMs. The post Talk to my Agent appeared first on Towards Data Science. Roni Dover

July 29, 2025
End-to-End AWS RDS Setup with Bastion Host Using Terraform

Learn how to automate secure AWS infrastructure using Terraform — including VPC, public/private subnets, a MySQL RDS database, and a Bastion host for secure access. The post End-to-End AWS RDS Setup with Bastion Host Using Terraform appeared first on Towards Data Science. Yagmur Gulec

July 29, 2025
Central limit theorems for the eigenvalues of graph Laplacians on data clouds

arXiv:2507.18803v1 Announce Type: new Abstract: Given i.i.d. samples $X_n = \{x_1, \dots, x_n\}$ from a distribution supported on a low dimensional manifold ${M}$ embedded in Euclidean space, we consider the graph Laplacian operator $\Delta_n$ associated to an $\varepsilon$-proximity graph over $X_n$ and…

July 28, 2025
Perfect Clustering in Very Sparse Diverse Multiplex Networks

arXiv:2507.19423v1 Announce Type: new Abstract: The paper studies the DIverse MultiPLEx Signed Generalized Random Dot Product Graph (DIMPLE-SGRDPG) network model (Pensky (2024)), where all layers of the network have the same collection of nodes. In addition, all layers can be partitioned into groups such that the layers…

July 28, 2025
Probably Approximately Correct Causal Discovery

arXiv:2507.18903v1 Announce Type: new Abstract: The discovery of causal relationships is a foundational problem in artificial intelligence, statistics, epidemiology, economics, and beyond. While elegant theories exist for accurate causal discovery given infinite data, real-world applications are inherently resource-constrained. Effective methods for inferring causal relationships from observational data must perform well…

July 28, 2025
Deep Neural Network Driven Simulation Based Inference Method for Pole Position Estimation under Model Misspecification

arXiv:2507.18824v1 Announce Type: cross Abstract: Simulation Based Inference (SBI) is shown to yield more accurate resonance parameter estimates than traditional chi-squared minimization in certain cases of model misspecification, demonstrated through a case study of pi-pi scattering and the rho(770) resonance.…

July 28, 2025
Flow Stochastic Segmentation Networks

arXiv:2507.18838v1 Announce Type: cross Abstract: We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank…

July 28, 2025
Weekly Entering & Transitioning – Thread 28 Jul, 2025 – 04 Aug, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

July 28, 2025
New Grad Data Scientist feeling overwhelmed and disillusioned at first job

Hi all, I recently graduated with a degree in Data Science and just started my first job as a data scientist. The company is very focused on staying ahead/keeping up with the AI hype train and wants my team (which has no other data…

July 28, 2025
Why does OneHotEncoder give better results than get_dummies/reindex?

I can’t figure out why I get a better score with OneHotEncoder:

    preprocessor = ColumnTransformer(
        transformers=[('cat', categorical_transformer, categorical_cols)],
        remainder='passthrough'  # <-- this keeps the numerical columns
    )
    model_GBR = GradientBoostingRegressor(n_estimators=1100, loss='squared_error', subsample=0.35, learning_rate=0.05, random_state=1)
    GBR_Pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model_GBR)])

than with get_dummies/reindex: X_test…

July 28, 2025
Can LLMs Reason – I don’t know, depends on the definition of reasoning. Denny Zhou – Founder/Lead of Google Deepmind LLM Reasoning Team

AI influencers: LLMs can think given this godly prompt bene gesserit oracle of the world blahblah, hence xxx/yyy/zzz is dead. See more below. Meanwhile, literally the founder/lead of the reasoning team: https://preview.redd.it/z9uwnummqeff1.png?width=652&format=png&auto=webp&s=c84727d328d059504adf64768b8badac45d20611…

July 28, 2025
Anomaly detection with only categorical variables

Hello everyone, I have an anomaly detection project but all of my data is categorical. I suppose I could try and ask them to change it prediction but does anyone have any advice. The goal is to there are groups within the data and do an analysis to…

July 28, 2025
Declarative and Imperative Prompt Engineering for Generative AI

Conceptual overview and practical considerations. The post Declarative and Imperative Prompt Engineering for Generative AI appeared first on Towards Data Science. Chinmay Kakatkar

July 26, 2025
What Is a Query Folding in Power BI and Why should You Care?

“Will that break a query folding?” “Does your query fold?”… Maybe someone asked you those questions, but you were like: “Query…Whaaaat?!” In this article, we demystify query folding and its importance for efficient data refresh in Power BI. The post What…

July 26, 2025
How I Fine-Tuned Granite-Vision 2B to Beat a 90B Model — Insights and Lessons Learned

A hands-on journey exploring fine-tuning techniques that unlock the power of small vision models. The post How I Fine-Tuned Granite-Vision 2B to Beat a 90B Model — Insights and Lessons Learned appeared first on Towards Data Science. Julio Sanchez

July 26, 2025
Sliding Window Informative Canonical Correlation Analysis

arXiv:2507.17921v1 Announce Type: new Abstract: Canonical correlation analysis (CCA) is a technique for finding correlated sets of features between two datasets. In this paper, we propose a novel extension of CCA to the online, streaming data setting: Sliding Window Informative Canonical Correlation Analysis (SWICCA). Our method uses a streaming…

July 25, 2025
Learning graphons from data: Random walks, transfer operators, and spectral clustering

arXiv:2507.18147v1 Announce Type: new Abstract: Many signals evolve in time as a stochastic process, randomly switching between states over discretely sampled time points. Here we make an explicit link between the underlying stochastic process of a signal that can take on a bounded continuum…

July 25, 2025
A Two-armed Bandit Framework for A/B Testing

arXiv:2507.18118v1 Announce Type: new Abstract: A/B testing is widely used in modern technology companies for policy evaluation and product deployment, with the goal of comparing the outcomes under a newly-developed policy against a standard control. Various causal inference and reinforcement learning methods developed in the literature are applicable…

July 25, 2025
On Reconstructing Training Data From Bayesian Posteriors and Trained Models

arXiv:2507.18372v1 Announce Type: new Abstract: Publicly releasing the specification of a model with its trained parameters means an adversary can attempt to reconstruct information about the training data via training data reconstruction attacks, a major vulnerability of modern machine learning methods. This paper makes three…

July 25, 2025
DriftMoE: A Mixture of Experts Approach to Handle Concept Drifts

arXiv:2507.18464v1 Announce Type: new Abstract: Learning from non-stationary data streams subject to concept drift requires models that can adapt on-the-fly while remaining resource-efficient. Existing adaptive ensemble methods often rely on coarse-grained adaptation mechanisms or simple voting schemes that fail to optimally leverage specialized knowledge. This…

July 25, 2025
When 50/50 Isn’t Optimal: Debunking Even Rebalancing

A new theory of class imbalance demonstrates that the optimal training imbalance in a binary problem is not 50%. The post When 50/50 Isn’t Optimal: Debunking Even Rebalancing appeared first on Towards Data Science. Marco Baity-Jesi

July 25, 2025
Getting AI Discovery Right

A guide to ideating, validating, and prioritizing your AI use cases. The post Getting AI Discovery Right appeared first on Towards Data Science. Dr. Janna Lipenkova

July 25, 2025
How Do Grayscale Images Affect Visual Anomaly Detection?

A practical exploration focusing on performance and speed. The post How Do Grayscale Images Affect Visual Anomaly Detection? appeared first on Towards Data Science. Aimira Baitieva

July 25, 2025
Optimize for Impact: How to Stay Ahead of Gen AI and Thrive as a Data Scientist

The data scientists who survive won’t be the ones who code better than ChatGPT—they’ll be the ones who think strategically. The post Optimize for Impact: How to Stay Ahead of Gen AI and Thrive as a Data Scientist appeared…

July 25, 2025
Automating Ticket Creation in Jira With the OpenAI Agents SDK: A Step-by-Step Guide

Learn how to create AI Agents using the OpenAI Agents SDK to automate Jira ticket creation from a meeting transcript. The post Automating Ticket Creation in Jira With the OpenAI Agents SDK: A Step-by-Step Guide appeared first on Towards Data Science. Juan…

July 25, 2025
Fundamental limits of distributed covariance matrix estimation via a conditional strong data processing inequality

arXiv:2507.16953v1 Announce Type: new Abstract: Estimating high-dimensional covariance matrices is a key task across many fields. This paper explores the theoretical limits of distributed covariance estimation in a feature-split setting, where communication between agents is constrained. Specifically, we study a scenario…

July 24, 2025
Bayesian preference elicitation for decision support in multiobjective optimization

arXiv:2507.16999v1 Announce Type: new Abstract: We present a novel approach to help decision-makers efficiently identify preferred solutions from the Pareto set of a multi-objective optimization problem. Our method uses a Bayesian model to estimate the decision-maker’s utility function based on pairwise comparisons. Aided by this model,…

July 24, 2025
The surprising strength of weak classifiers for validating neural posterior estimates

arXiv:2507.17026v1 Announce Type: new Abstract: Neural Posterior Estimation (NPE) has emerged as a powerful approach for amortized Bayesian inference when the true posterior $p(\theta \mid y)$ is intractable or difficult to sample. But evaluating the accuracy of neural posterior estimates remains challenging, with existing…

July 24, 2025
CoLT: The conditional localization test for assessing the accuracy of neural posterior estimates

arXiv:2507.17030v1 Announce Type: new Abstract: We consider the problem of validating whether a neural posterior estimate $q(\theta \mid x)$ is an accurate approximation to the true, unknown posterior $p(\theta \mid x)$. Existing methods for evaluating the quality…

July 24, 2025
Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability

arXiv:2507.17316v1 Announce Type: new Abstract: We consider the problem of estimating a discrete distribution $p$ with support of size $K$ and provide both upper and lower bounds with high probability in KL divergence. We prove that in the worst case, for any estimator $\widehat{p}$,…

July 24, 2025
How Not to Mislead with Your Data-Driven Story

Data storytelling can enlighten—but it can also deceive. When persuasive narratives meet biased framing, cherry-picked data, or misleading visuals, insights risk becoming illusions. This article explores the hidden biases embedded in data-driven storytelling—from the seduction of beautiful charts to the quiet influence of AI-generated insights—and offers practical…

July 24, 2025
Torchvista: Building an Interactive Pytorch Visualization Package for Notebooks

Building a tool to interactively visualize the forward pass of any PyTorch model from within notebooks. The post Torchvista: Building an Interactive Pytorch Visualization Package for Notebooks appeared first on Towards Data Science. Sachin Hosmani

July 24, 2025
Structural DID with ML: Theory, Simulation, and a Roadmap for Applied Research

arXiv:2507.15899v1 Announce Type: new Abstract: Causal inference in observational panel data has become a central concern in economics, policy analysis, and the broader social sciences. To address the core contradiction where traditional difference-in-differences (DID) struggles with high-dimensional confounding variables in observational panel data, while machine learning (ML)…

July 23, 2025
Generative AI Models for Learning Flow Maps of Stochastic Dynamical Systems in Bounded Domains

arXiv:2507.15990v1 Announce Type: new Abstract: Simulating stochastic differential equations (SDEs) in bounded domains presents significant computational challenges due to particle exit phenomena, which require accurate modeling of interior stochastic dynamics and boundary interactions. Despite the success of machine learning-based methods in…

July 23, 2025
Estimating Treatment Effects with Independent Component Analysis

arXiv:2507.16467v1 Announce Type: new Abstract: The field of causal inference has developed a variety of methods to accurately estimate treatment effects in the presence of nuisance. Meanwhile, the field of identifiability theory has developed methods like Independent Component Analysis (ICA) to identify latent sources and mixing weights from…

July 23, 2025
PAC Off-Policy Prediction of Contextual Bandits

arXiv:2507.16236v1 Announce Type: new Abstract: This paper investigates off-policy evaluation in contextual bandits, aiming to quantify the performance of a target policy using data collected under a different and potentially unknown behavior policy. Recently, methods based on conformal prediction have been developed to construct reliable prediction intervals that guarantee…

July 23, 2025
Structural Effect and Spectral Enhancement of High-Dimensional Regularized Linear Discriminant Analysis

arXiv:2507.16682v1 Announce Type: new Abstract: Regularized linear discriminant analysis (RLDA) is a widely used tool for classification and dimensionality reduction, but its performance in high-dimensional scenarios is inconsistent. Existing theoretical analyses of RLDA often lack clear insight into how data structure affects classification performance.…

July 23, 2025
Things I Wish I Had Known Before Starting ML

Part 1: Data, Sales Pitches, Bugs, and Breakthroughs. The post Things I Wish I Had Known Before Starting ML appeared first on Towards Data Science. Pascal Janetzky

July 23, 2025
NumPy API on a GPU?

It’s here already from Nvidia and it’s called cuNumeric. The post NumPy API on a GPU? appeared first on Towards Data Science. Thomas Reid

July 23, 2025
A Well-Designed Experiment Can Teach You More Than a Time Machine!

How experimentation is more powerful than knowing counterfactuals. The post A Well-Designed Experiment Can Teach You More Than a Time Machine! appeared first on Towards Data Science. Jarom Hulet

July 23, 2025
From Rules to Relationships: How Machines Are Learning to Understand Each Other

Using knowledge graphs to handle the unexpected in semantic communication. The post From Rules to Relationships: How Machines Are Learning to Understand Each Other appeared first on Towards Data Science. Shireesh Kumar Singh

July 23, 2025
What Optimization Terminologies for Linear Programming Really Mean

Understanding the duality of optimization problems, primal to dual conversion, and the optimality conditions for linear problems. The post What Optimization Terminologies for Linear Programming Really Mean appeared first on Towards Data Science. Himalaya Bir Shrestha

July 23, 2025
Statistical and Algorithmic Foundations of Reinforcement Learning

arXiv:2507.14444v1 Announce Type: new Abstract: As a paradigm for sequential decision making in unknown environments, reinforcement learning (RL) has received a flurry of attention in recent years. However, the explosion of model complexity in emerging applications and the presence of nonconvexity exacerbate the challenge of achieving efficient RL…

July 22, 2025
Diffusion Models for Time Series Forecasting: A Survey

arXiv:2507.14507v1 Announce Type: new Abstract: Diffusion models, initially developed for image synthesis, demonstrate remarkable generative capabilities. Recently, their application has expanded to time series forecasting (TSF), yielding promising results. In this survey, we firstly introduce the standard diffusion models and their prevalent variants, explaining their adaptation to…

July 22, 2025
Deep Learning-Based Survival Analysis with Copula-Based Activation Functions for Multivariate Response Prediction

arXiv:2507.14641v1 Announce Type: new Abstract: This research integrates deep learning, copula functions, and survival analysis to effectively handle highly correlated and right-censored multivariate survival data. It introduces copula-based activation functions (Clayton, Gumbel, and their combinations) to model the nonlinear dependencies inherent in such…

July 22, 2025
When few labeled target data suffice: a theory of semi-supervised domain adaptation via fine-tuning from multiple adaptive starts

arXiv:2507.14661v1 Announce Type: new Abstract: Semi-supervised domain adaptation (SSDA) aims to achieve high predictive performance in the target domain with limited labeled target data by exploiting abundant source and unlabeled target data. Despite its significance in numerous…

July 22, 2025
Accelerating Hamiltonian Monte Carlo for Bayesian Inference in Neural Networks and Neural Operators

arXiv:2507.14652v1 Announce Type: new Abstract: Hamiltonian Monte Carlo (HMC) is a powerful and accurate method to sample from the posterior distribution in Bayesian inference. However, HMC techniques are computationally demanding for Bayesian neural networks due to the high dimensionality of the network’s…

July 22, 2025
How To Significantly Enhance LLMs by Leveraging Context Engineering

The benefits and practical aspects of context engineering for LLMs. The post How To Significantly Enhance LLMs by Leveraging Context Engineering appeared first on Towards Data Science. Eivind Kjosbakken

July 22, 2025
Hands‑On with Agents SDK: Your First API‑Calling Agent

A practical, beginner‑friendly guide to building an AI weather assistant with Python, OpenAI Agents SDK, API tools, and Streamlit. The post Hands‑On with Agents SDK: Your First API‑Calling Agent appeared first on Towards Data Science. Iqbal Rahmadhan

July 22, 2025
I Analysed 25,000 Hotel Names and Found Four Surprising Truths

Why are there so many hotels named after cities they are not in? Follow along for a data analysis on hotel names. The post I Analysed 25,000 Hotel Names and Found Four Surprising Truths appeared first on Towards Data Science. Anna Gordun Peiro

July 22, 2025
MCP Client Development with Streamlit: Build Your AI-Powered Web App

MCP client development with Streamlit to enhance the tool calling capabilities of remote MCP servers, from setting up your development environment and securing API keys, handling user input, connecting to remote MCP servers, and displaying AI-generated responses. The post MCP Client Development with Streamlit: Build…

July 22, 2025
Advanced Topic Modeling with LLMs

A deep dive into topic modeling by leveraging representation models and generative AI with BERTopic. The post Advanced Topic Modeling with LLMs appeared first on Towards Data Science. Alex Davis

July 22, 2025
Differential Privacy in Kernelized Contextual Bandits via Random Projections

arXiv:2507.13639v1 Announce Type: new Abstract: We consider the problem of contextual kernel bandits with stochastic contexts, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space. We study this problem under an additional constraint of Differential Privacy, where the agent needs to ensure…

July 21, 2025
Conformal Data Contamination Tests for Trading or Sharing of Data

arXiv:2507.13835v1 Announce Type: new Abstract: The amount of quality data in many machine learning tasks is limited to what is available locally to data owners. The set of quality data can be expanded through trading or sharing with external data agents. However, data buyers need…

July 21, 2025
A Survey of Dimension Estimation Methods

arXiv:2507.13887v1 Announce Type: new Abstract: It is a standard assumption that datasets in high dimension have an internal structure which means that they in fact lie on, or near, subsets of a lower dimension. In many instances it is important to understand the real dimension of the data, hence…

July 21, 2025
Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design

Step-DAD: Semi-Amortized Policy-Based Bayesian Experimental Design arXiv:2507.14057v1 Announce Type: new Abstract: We develop a semi-amortized, policy-based approach to Bayesian experimental design (BED) called Stepwise Deep Adaptive Design (Step-DAD). Like existing fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before the experiment. However, rather than keeping this policy fixed, Step-DAD periodically updates it…

July 21, 2025
Conformalized Regression for Continuous Bounded Outcomes

Conformalized Regression for Continuous Bounded Outcomes arXiv:2507.14023v1 Announce Type: new Abstract: Regression problems with bounded continuous outcomes frequently arise in real-world statistical and machine learning applications, such as the analysis of rates and proportions. A central challenge in this setting is predicting a response associated with a new covariate value. Most of the existing statistical…

July 21, 2025
Weekly Entering & Transitioning – Thread 21 Jul, 2025 – 28 Jul, 2025

Weekly Entering & Transitioning – Thread 21 Jul, 2025 – 28 Jul, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

July 21, 2025
Company Killed University Programs

Company Killed University Programs Normally, I would have a post around this time hyping up fall recruiting and trying to provide pointers. The company I work for has decided to hire no additional entry level data scientists this year outside of intern return offers. They have also cut the number of intern positions in half…

July 21, 2025
Detect LLM hallucinations using uncertainty quantification techniques with UQLM

Detect LLM hallucinations using uncertainty quantification techniques with UQLM UQLM (uncertainty quantification for language models) is an open-source Python package for generation-time, zero-resource hallucination detection. It leverages state-of-the-art uncertainty quantification (UQ) techniques from the academic literature to compute response-level confidence scores based on response consistency (in multiple responses to the same prompt), token…

July 21, 2025
Generating random noise for media data

Generating random noise for media data Hey everyone – I work on an ML team in the industry, and I’m currently building a predictive model to catch signals in live media data to sense when potential viral moments or crises are happening for brands. We have live media trackers at my company that capture all…

July 21, 2025
How would you structure a project (data frame) to scrape and track listing changes over time?

How would you structure a project (data frame) to scrape and track listing changes over time? I’m working on a project where I want to scrape data daily (e.g., real estate listings from a site like RentFaster or Zillow) and track how each listing changes over time. I want to be able to answer questions…

July 21, 2025
Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 2)

Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 2) Let’s observe the matter on the atomic level. The post Exploratory Data Analysis: Gamma Spectroscopy in Python (Part 2) appeared first on Towards Data Science. Dmitrii Eliuseev

July 19, 2025
The Hidden Trap of Fixed and Random Effects

The Hidden Trap of Fixed and Random Effects My lesson of how blindly over-controlling for noise can erase the effects you are measuring. The post The Hidden Trap of Fixed and Random Effects appeared first on Towards Data Science. Ngoc Doan

July 19, 2025
Gain a Better Understanding of Computer Vision: Dynamic SOLO (SOLOv2) with TensorFlow

Gain a Better Understanding of Computer Vision: Dynamic SOLO (SOLOv2) with TensorFlow A practical approach to instance segmentation using SOLOv2 and TensorFlow. The post Gain a Better Understanding of Computer Vision: Dynamic SOLO (SOLOv2) with TensorFlow appeared first on Towards Data Science. Pavel Timonin

July 19, 2025
From Reactive to Predictive: Forecasting Network Congestion with Machine Learning and INT

From Reactive to Predictive: Forecasting Network Congestion with Machine Learning and INT Learn how machine learning can predict network congestion before it happens. The post From Reactive to Predictive: Forecasting Network Congestion with Machine Learning and INT appeared first on Towards Data Science. Shireesh Kumar Singh

July 19, 2025
TDS Authors Can Now Edit Their Published Articles

TDS Authors Can Now Edit Their Published Articles One of our guiding principles as a publication is that authors’ work remains theirs. This hasn’t changed since launching our independent site earlier this year; quite the opposite. On a practical level, however, we knew that an important element was missing as long as authors couldn’t directly…

July 19, 2025
Physics constrained learning of stochastic characteristics

Physics constrained learning of stochastic characteristics arXiv:2507.12661v1 Announce Type: new Abstract: Accurate state estimation requires careful consideration of uncertainty surrounding the process and measurement models; these characteristics are usually not well-known and need an experienced designer to select the covariance matrices. An error in the selection of covariance matrices could impact the accuracy of the…

July 18, 2025
Self Balancing Neural Network: A Novel Method to Estimate Average Treatment Effect

Self Balancing Neural Network: A Novel Method to Estimate Average Treatment Effect arXiv:2507.12818v1 Announce Type: new Abstract: In observational studies, confounding variables affect both treatment and outcome. Moreover, instrumental variables also influence the treatment assignment mechanism. This situation sets the study apart from a standard randomized controlled trial, where the treatment assignment is random. Due…

July 18, 2025
Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights

Finite-Dimensional Gaussian Approximation for Deep Neural Networks: Universality in Random Weights arXiv:2507.12686v1 Announce Type: new Abstract: We study the Finite-Dimensional Distributions (FDDs) of deep neural networks with randomly initialized weights that have finite-order moments. Specifically, we establish Gaussian approximation bounds in the Wasserstein-$1$ norm between the FDDs and their Gaussian limit assuming a Lipschitz activation…

July 18, 2025
Bayesian Modeling and Estimation of Linear Time-Variant Systems using Neural Networks and Gaussian Processes

Bayesian Modeling and Estimation of Linear Time-Variant Systems using Neural Networks and Gaussian Processes arXiv:2507.12878v1 Announce Type: new Abstract: The identification of Linear Time-Variant (LTV) systems from input-output data is a fundamental yet challenging ill-posed inverse problem. This work introduces a unified Bayesian framework that models the system’s impulse response, $h(t, \tau)$, as a stochastic…

July 18, 2025
When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values

When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values arXiv:2507.13024v1 Announce Type: new Abstract: Predicting a response with partially missing inputs remains a challenging task even in parametric models, since parameter estimation in itself is not sufficient to predict on partially observed inputs. Several works study prediction in linear models. In…

July 18, 2025
The Age of Self-Evolving AI Is Here

The Age of Self-Evolving AI Is Here How Meta’s latest breakthrough lets models learn, adapt, and improve, all on their own. The post The Age of Self-Evolving AI Is Here appeared first on Towards Data Science. Moulik Gupta

July 18, 2025
Estimating Disease Rates Without Diagnosis

Estimating Disease Rates Without Diagnosis Immune genes as predictors of disease. The post Estimating Disease Rates Without Diagnosis appeared first on Towards Data Science. David Wells

July 18, 2025
Don’t Waste Your Labeled Anomalies: 3 Practical Strategies to Boost Anomaly Detection Performance

Don’t Waste Your Labeled Anomalies: 3 Practical Strategies to Boost Anomaly Detection Performance A few labels go a long way in anomaly detection. The post Don’t Waste Your Labeled Anomalies: 3 Practical Strategies to Boost Anomaly Detection Performance appeared first on Towards Data Science. Shuai Guo

July 18, 2025
Your 1M+ Context Window LLM Is Less Powerful Than You Think

Your 1M+ Context Window LLM Is Less Powerful Than You Think Why working memory is a more important bottleneck than raw context window size. The post Your 1M+ Context Window LLM Is Less Powerful Than You Think appeared first on Towards Data Science. Tobias Schnabel

July 18, 2025
LLMs are Bayesian, in Expectation, not in Realization

LLMs are Bayesian, in Expectation, not in Realization arXiv:2507.11768v1 Announce Type: new Abstract: Large language models demonstrate remarkable in-context learning capabilities, adapting to new tasks without parameter updates. While this phenomenon has been successfully modeled as implicit Bayesian inference, recent empirical findings reveal a fundamental contradiction: transformers systematically violate the martingale property, a cornerstone requirement…

July 17, 2025
Choosing the Better Bandit Algorithm under Data Sharing: When Do A/B Experiments Work?

Choosing the Better Bandit Algorithm under Data Sharing: When Do A/B Experiments Work? arXiv:2507.11891v1 Announce Type: new Abstract: We study A/B experiments that are designed to compare the performance of two recommendation algorithms. Prior work has shown that the standard difference-in-means estimator is biased in estimating the global treatment effect (GTE) due to a particular…

July 17, 2025
Newfluence: Boosting Model interpretability and Understanding in High Dimensions

Newfluence: Boosting Model interpretability and Understanding in High Dimensions arXiv:2507.11895v1 Announce Type: new Abstract: The increasing complexity of machine learning (ML) and artificial intelligence (AI) models has created a pressing need for tools that help scientists, engineers, and policymakers interpret and refine model decisions and predictions. Influence functions, originating from robust statistics, have emerged as…

July 17, 2025
Incorporating Fairness Constraints into Archetypal Analysis

Incorporating Fairness Constraints into Archetypal Analysis arXiv:2507.12021v1 Announce Type: new Abstract: Archetypal Analysis (AA) is an unsupervised learning method that represents data as convex combinations of extreme patterns called archetypes. While AA provides interpretable and low-dimensional representations, it can inadvertently encode sensitive attributes, leading to fairness concerns. In this work, we propose Fair Archetypal Analysis…

July 17, 2025
Distribution-Free Uncertainty-Aware Virtual Sensing via Conformalized Neural Operators

Distribution-Free Uncertainty-Aware Virtual Sensing via Conformalized Neural Operators arXiv:2507.11574v1 Announce Type: cross Abstract: Robust uncertainty quantification (UQ) remains a critical barrier to the safe deployment of deep learning in real-time virtual sensing, particularly in high-stakes domains where sparse, noisy, or non-collocated sensor data are the norm. We introduce the Conformalized Monte Carlo Operator (CMCO), a…

July 17, 2025
Midyear 2025 AI Reflection

Midyear 2025 AI Reflection Impressions on agentic AI progress and the AI-2027 Jobocalypse scenario. The post Midyear 2025 AI Reflection appeared first on Towards Data Science. Marina Tosic

July 17, 2025
Exploring Prompt Learning: Using English Feedback to Optimize LLM Systems

Exploring Prompt Learning: Using English Feedback to Optimize LLM Systems Prompt learning presents a compelling approach for continuous improvement of AI applications. The post Exploring Prompt Learning: Using English Feedback to Optimize LLM Systems appeared first on Towards Data Science. Aparna Dhinakaran

July 17, 2025
How to Overlay a Heatmap on a Real Map with Python

How to Overlay a Heatmap on a Real Map with Python Visualizing historical tornado trends. The post How to Overlay a Heatmap on a Real Map with Python appeared first on Towards Data Science. Lee Vaughan

July 17, 2025