Category: aimlds

Overfitting has a limitation: a model-independent generalization error bound based on Rényi entropy

arXiv:2506.00182v1 Announce Type: new Abstract: Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization error, which is the impact of overfitting. Understanding generalization error behavior of increasingly large-scale…

June 3, 2025
Riemannian Principal Component Analysis

arXiv:2506.00226v1 Announce Type: new Abstract: This paper proposes an innovative extension of Principal Component Analysis (PCA) that transcends the traditional assumption of data lying in Euclidean space, enabling its application to data on Riemannian manifolds. The primary challenge addressed is the lack of vector space operations on such manifolds. Fletcher et…

June 3, 2025
Beyond Winning: Margin of Victory Relative to Expectation Unlocks Accurate Skill Ratings

arXiv:2506.00348v1 Announce Type: new Abstract: Knowledge of accurate relative skills in any competitive system is essential, but foundational approaches such as ELO discard extremely relevant performance data by concentrating exclusively on binary outcomes. While margin of victory (MOV) extensions exist, they often lack…

June 3, 2025
Bayesian Data Sketching for Varying Coefficient Regression Models

arXiv:2506.00270v1 Announce Type: new Abstract: Varying coefficient models are popular for estimating nonlinear regression functions in functional data models. Their Bayesian variants have received limited attention in large data applications, primarily due to prohibitively slow posterior computations using Markov chain Monte Carlo (MCMC) algorithms. We introduce Bayesian…

June 3, 2025
LLMs + Pandas: How I Use Generative AI to Generate Pandas DataFrame Summaries

Local Large Language Models can convert massive DataFrames to presentable Markdown reports; here’s how. First appeared on Towards Data Science. By Dario Radečić.

June 3, 2025
Evaluating LLMs for Inference, or Lessons from Teaching for Machine Learning

It’s like grading papers, but your student is an LLM. First appeared on Towards Data Science. By Stephanie Kirmer.

June 3, 2025
Vision Transformer on a Budget

Introduction: The vanilla ViT is problematic. If you take a look at the original ViT paper [1], you’ll notice that although this deep learning model proved to work extremely well, it requires hundreds of millions of labeled training images to achieve this. Well, that’s a lot. This requirement of an enormous…

June 3, 2025
Inside Google’s Agent2Agent (A2A) Protocol: Teaching AI Agents to Talk to Each Other

Exploring how Google’s A2A enables plug-and-play communication between LLM-powered agents across frameworks. First appeared on Towards Data Science. By Hailey Quach.

June 3, 2025
Your DNA Is a Machine Learning Model: It’s Already Out There

Even if you never sequenced your genome, predictive systems already know a lot about it. Genomic inference has become a population-scale model, and you’re probably in it. The post Your DNA Is a Machine Learning Model: It’s Already Out There appeared first on Towards…

June 3, 2025
Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning

arXiv:2505.23783v1 Announce Type: new Abstract: In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable performances in classification. While calibration techniques are proposed to…

June 2, 2025
Gibbs randomness-compression proposition: An efficient deep learning

arXiv:2505.23869v1 Announce Type: new Abstract: A proposition that connects randomness and compression is put forward via Gibbs entropy over a set of measurement vectors associated with a compression process. The proposition states that a lossy compression process is equivalent to directed randomness that preserves information content. The proposition originated…

June 2, 2025
Conformal Object Detection by Sequential Risk Control

arXiv:2505.24038v1 Announce Type: new Abstract: Recent advances in object detectors have led to their adoption for industrial uses. However, their deployment in critical applications is hindered by the inherent lack of reliability of neural networks and the complex structure of object detection models. To address these challenges, we…

June 2, 2025
Performative Risk Control: Calibrating Models for Reliable Deployment under Performativity

arXiv:2505.24097v1 Announce Type: new Abstract: Calibrating blackbox machine learning models to achieve risk control is crucial to ensure reliable decision-making. A rich line of literature has been studying how to calibrate a model so that its predictions satisfy explicit finite-sample statistical guarantees under a fixed,…

June 2, 2025
A Mathematical Perspective On Contrastive Learning

arXiv:2505.24134v1 Announce Type: new Abstract: Multimodal contrastive learning is a methodology for linking different data modalities; the canonical example is linking image and text data. The methodology is typically framed as the identification of a set of encoders, one for each modality, that align representations within a common latent…

June 2, 2025
Weekly Entering & Transitioning – Thread 02 Jun, 2025 – 09 Jun, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

June 2, 2025
How I scraped 4.1 million jobs with GPT4o-mini

Background: During my PhD in Data Science at Stanford, I got sick and tired of ghost jobs & 3rd party offshore agencies on LinkedIn & Indeed. So I wrote a script that fetches jobs from 100k+ company websites’ career pages and uses GPT4o-mini to extract relevant information…

June 2, 2025
Can data science be used in computer networking (if not can it be used in cybersecurity)?

Hi, I’m a high schooler (junior year) who is extremely interested in data science to the point where it is the main career field I want to go into. However, I got enrolled in a program where we train…

June 2, 2025
Advice on processing ~1M jobs/month with LLaMA for cost savings

I’m using GPT-4o-mini to process ~1 million jobs/month. It’s doing things like deduplication, classification, title normalization, and enrichment. This setup is fast and easy, but the cost is starting to hurt. I’m considering distilling this pipeline into an open-source LLM, like LLaMA 3 or Mistral,…

June 2, 2025
What is your functional area?

I don’t mean industry. I mean product, operations, etc. I work in operations. I don’t grow the business. I keep the business alive. Submitted by /u/Trick-Interaction396.

June 2, 2025
Agentic RAG Applications: Company Knowledge Slack Agents

Lessons learnt using LlamaIndex and Modal. First appeared on Towards Data Science. By Ida Silfverskiöld.

May 31, 2025
Hands-On Attention Mechanism for Time Series Classification, with Python

This is how to use the attention mechanism in a time series classification framework. First appeared on Towards Data Science. By Piero Paialunga.

May 31, 2025
The Secret Power of Data Science in Customer Support

Customer support is a data goldmine. Here’s how to unlock its full potential with data science. First appeared on Towards Data Science. By Yu Dong.

May 31, 2025
Gaining Strategic Clarity in AI

Introducing the AI strategy playbook. First appeared on Towards Data Science. By Dr. Janna Lipenkova.

May 31, 2025
LLM Optimization: LoRA and QLoRA

Scalable fine-tuning techniques for large language models. First appeared on Towards Data Science. By Vyacheslav Efimov.

May 31, 2025
Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games

arXiv:2505.22781v1 Announce Type: new Abstract: We introduce Mean-Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean-Field Games (MFG) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting,…

May 30, 2025
Highly Efficient and Effective LLMs with Multi-Boolean Architectures

arXiv:2505.22811v1 Announce Type: new Abstract: Weight binarization has emerged as a promising strategy to drastically reduce the complexity of large language models (LLMs). It is mainly classified into two approaches: post-training binarization and finetuning with training-aware binarization methods. The first approach, while having low complexity, leads to…

May 30, 2025
Theoretical Foundations of the Deep Copula Classifier: A Generative Approach to Modeling Dependent Features

arXiv:2505.22997v1 Announce Type: new Abstract: Traditional classifiers often assume feature independence or rely on overly simplistic relationships, leading to poor performance in settings where real-world dependencies matter. We introduce the Deep Copula Classifier (DCC), a generative model that separates the learning…

May 30, 2025
JAPAN: Joint Adaptive Prediction Areas with Normalising-Flows

arXiv:2505.23196v1 Announce Type: new Abstract: Conformal prediction provides a model-agnostic framework for uncertainty quantification with finite-sample validity guarantees, making it an attractive tool for constructing reliable prediction sets. However, existing approaches commonly rely on residual-based conformity scores, which impose geometric constraints and struggle when the underlying distribution is…

May 30, 2025
Stable Thompson Sampling: Valid Inference via Variance Inflation

arXiv:2505.23260v1 Announce Type: new Abstract: We consider the problem of statistical inference when the data is collected via a Thompson Sampling-type algorithm. While Thompson Sampling (TS) is known to be both asymptotically optimal and empirically effective, its adaptive sampling scheme poses challenges for constructing confidence intervals for…

May 30, 2025
GAIA: The LLM Agent Benchmark Everyone’s Talking About

What practitioners need to know about this LLM agent benchmark. First appeared on Towards Data Science. By Shuai Guo.

May 30, 2025
A Bird’s Eye View of Linear Algebra: The Basics

We think basis-free, we write basis-free, but when the chips are down we close the office door and compute with matrices like fury. First appeared on Towards Data Science. By Rohit Pandey.

May 30, 2025
A Practical Introduction to Google Analytics

Learn the key concepts and reports of Google Analytics while practising with the platform. First appeared on Towards Data Science. By Eugenia Anello.

May 30, 2025
The Hidden Security Risks of LLMs

And why self-hosting might be the safer bet. First appeared on Towards Data Science. By Anouk Dutrée.

May 30, 2025
I Transitioned from Data Science to AI Engineering: Here’s Everything You Need to Know

A personal guide to the skills, tools, and mindset behind the title. First appeared on Towards Data Science. By Sara Nobrega.

May 30, 2025
A Kernelised Stein Discrepancy for Assessing the Fit of Inhomogeneous Random Graph Models

arXiv:2505.21580v1 Announce Type: new Abstract: Complex data are often represented as a graph, which in turn can often be viewed as a realisation of a random graph, such as of an inhomogeneous random graph model (IRG). For general fast goodness-of-fit tests in…

May 29, 2025
STACI: Spatio-Temporal Aleatoric Conformal Inference

arXiv:2505.21658v1 Announce Type: new Abstract: Fitting Gaussian Processes (GPs) provides interpretable aleatoric uncertainty quantification for estimation of spatio-temporal fields. Spatio-temporal deep learning models, while scalable, typically assume a simplistic independent covariance matrix for the response, failing to capture the underlying correlation structure. However, spatio-temporal GPs suffer from issues of scalability…

May 29, 2025
Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference

arXiv:2505.21721v1 Announce Type: new Abstract: We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at an almost dimension-independent rate. Specifically, for strongly log-concave and log-smooth targets, the number of iterations for BBVI with a sub-Gaussian family to achieve…

May 29, 2025
Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks

arXiv:2505.21791v1 Announce Type: new Abstract: Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of…

May 29, 2025
A General-Purpose Theorem for High-Probability Bounds of Stochastic Approximation with Polyak Averaging

arXiv:2505.21796v1 Announce Type: new Abstract: Polyak-Ruppert averaging is a widely used technique to achieve the optimal asymptotic variance of stochastic approximation (SA) algorithms, yet its high-probability performance guarantees remain underexplored in general settings. In this paper, we present a general framework for establishing…

May 29, 2025
From Data to Stories: Code Agents for KPI Narratives

HuggingFace’s smolagents framework in action. First appeared on Towards Data Science. By Mariya Mansurova.

May 29, 2025
Multi-Agent Communication with the A2A Python SDK

The Agent Card helps discover agents, but how does communication between agents actually work in practice? First appeared on Towards Data Science. By Deborah Mesquita.

May 29, 2025
JAX: Is This Google’s NumPy killer?

Auto differentiation and JIT compilation make a compelling case. First appeared on Towards Data Science. By Thomas Reid.

May 29, 2025
Detecting Malicious URLs Using LSTM and Google’s BERT Models

A progressive approach to implementing AI-powered webpage detection applications into production. First appeared on Towards Data Science. By Toluwase Babalola.

May 29, 2025
Tree of Thought Prompting: Teaching LLMs to Think Slowly

Playing Minesweeper with Augmented Reasoning. First appeared on Towards Data Science. By Shuyang.

May 29, 2025
Differentially private ratio statistics

arXiv:2505.20351v1 Announce Type: new Abstract: Ratio statistics, such as relative risk and odds ratios, play a central role in hypothesis testing, model evaluation, and decision-making across many areas of machine learning, including causal inference and fairness analysis. However, despite privacy concerns surrounding many datasets and despite increasing adoption of differential privacy, differentially private…

May 28, 2025
Learning with Expected Signatures: Theory and Applications

arXiv:2505.20465v1 Announce Type: new Abstract: The expected signature maps a collection of data streams to a lower dimensional representation, with a remarkable property: the resulting feature tensor can fully characterize the data generating distribution. This “model-free” embedding has been successfully leveraged to build multiple domain-agnostic machine learning (ML)…

May 28, 2025
Kernel Quantile Embeddings and Associated Probability Metrics

arXiv:2505.20433v1 Announce Type: new Abstract: Embedding probability distributions into reproducing kernel Hilbert spaces (RKHS) has enabled powerful nonparametric methods such as the maximum mean discrepancy (MMD), a statistical distance with strong theoretical and computational properties. At its core, the MMD relies on kernel mean embeddings to represent distributions…

May 28, 2025
Covariate-Adjusted Deep Causal Learning for Heterogeneous Panel Data Models

arXiv:2505.20536v1 Announce Type: new Abstract: This paper studies the task of estimating heterogeneous treatment effects in causal panel data models, in the presence of covariate effects. We propose a novel Covariate-Adjusted Deep Causal Learning (CoDEAL) for panel data models, that employs flexible model structures and powerful…

May 28, 2025
Balancing Performance and Costs in Best Arm Identification

arXiv:2505.20583v1 Announce Type: new Abstract: We consider the problem of identifying the best arm in a multi-armed bandit model. Despite a wealth of literature in the traditional fixed budget and fixed confidence regimes of the best arm identification problem, it still remains a mystery to most practitioners…

May 28, 2025
Bayesian Optimization for Hyperparameter Tuning of Deep Learning Models

Explore how Bayesian Optimization outperforms Grid Search in efficiency and performance over binary classification tasks. First appeared on Towards Data Science. By Kuriko Iwai.

May 28, 2025
How Microsoft Power BI Elevated My Data Analysis and Visualization Workflow

Explaining useful features every data analyst needs. First appeared on Towards Data Science. By Benjamin Nweke.

May 28, 2025
Reinforcement Learning Made Simple: Build a Q-Learning Agent in Python

Inspired by AlphaGo’s Move 37: learn how agents explore, exploit, and win. First appeared on Towards Data Science. By Sarah Schürch.

May 28, 2025
Why Regularization Isn’t Enough: A Better Way to Train Neural Networks with Two Objectives

Why splitting your objectives and your model might be the key to better performance and clearer trade-offs in deep learning. The post Why Regularization Isn’t Enough: A Better Way to Train Neural Networks with Two Objectives appeared first on Towards Data…

May 28, 2025
Preconditioned Langevin Dynamics with Score-Based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems

arXiv:2505.18276v1 Announce Type: new Abstract: Designing algorithms for solving high-dimensional Bayesian inverse problems directly in infinite-dimensional function spaces – where such problems are naturally formulated – is crucial to ensure stability and convergence as the discretization of the underlying problem is refined.…

May 27, 2025
Operator Learning for Schrödinger Equation: Unitarity, Error Bounds, and Time Generalization

arXiv:2505.18288v1 Announce Type: new Abstract: We consider the problem of learning the evolution operator for the time-dependent Schrödinger equation, where the Hamiltonian may vary with time. Existing neural network-based surrogates often ignore fundamental properties of the Schrödinger equation, such as linearity and unitarity, and…

May 27, 2025
On the Mechanisms of Weak-to-Strong Generalization: A Theoretical Perspective

arXiv:2505.18346v1 Announce Type: new Abstract: Weak-to-strong generalization, where a student model trained on imperfect labels generated by a weaker teacher nonetheless surpasses that teacher, has been widely observed but the mechanisms that enable it have remained poorly understood. In this paper, through a theoretical analysis of…

May 27, 2025
Online Statistical Inference of Constrained Stochastic Optimization via Random Scaling

arXiv:2505.18327v1 Announce Type: new Abstract: Constrained stochastic nonlinear optimization problems have attracted significant attention for their ability to model complex real-world scenarios in physics, economics, and biology. As datasets continue to grow, online inference methods have become crucial for enabling real-time decision-making without the need…

May 27, 2025
Identifiability of latent causal graphical models without pure children

arXiv:2505.18410v1 Announce Type: new Abstract: This paper considers a challenging problem of identifying a causal graphical model under the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the…

May 27, 2025
Code Agents: The Future of Agentic AI

HuggingFace smolagents framework in action. First appeared on Towards Data Science. By Mariya Mansurova.

May 27, 2025
How to Reduce Your Power BI Model Size by 90%

Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? Learn on a real-life example about data model optimization and general rules for reducing data model The post How to Reduce Your Power BI Model Size by 90%…

May 27, 2025
How to Generate Synthetic Data: A Comprehensive Guide Using Bayesian Sampling and Univariate Distributions

Data makes the engine run in many organisations. But what if the number of observations is too low or there is only expert knowledge? I will demonstrate how to generate synthetic data with applications in predictive maintenance. The post How to…

May 27, 2025
The Best AI Books & Courses for Getting a Job

A comprehensive guide to the books and courses that helped me learn AI. First appeared on Towards Data Science. By Egor Howell.

May 27, 2025
Understanding Matrices | Part 1: Matrix-Vector Multiplication

The physical meaning of multiplying a matrix by a vector, and how it works on several special matrices. First appeared on Towards Data Science. By Tigran Hayrapetyan.

May 27, 2025
Liouville PDE-based sliced-Wasserstein flow for fair regression

arXiv:2505.17204v1 Announce Type: new Abstract: The sliced Wasserstein flow (SWF), a nonparametric and implicit generative gradient flow, is applied to fair regression. We have improved the SWF in a few aspects. First, the stochastic diffusive term from the Fokker-Planck equation-based Monte Carlo is transformed to Liouville partial differential…

May 26, 2025
Learning Probabilities of Causation from Finite Population Data

arXiv:2505.17133v1 Announce Type: new Abstract: Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities…

May 26, 2025
Deconfounded Warm-Start Thompson Sampling with Applications to Precision Medicine

arXiv:2505.17283v1 Announce Type: new Abstract: Randomized clinical trials often require large patient cohorts before drawing definitive conclusions, yet abundant observational data from parallel studies remains underutilized due to confounding and hidden biases. To bridge this gap, we propose Deconfounded Warm-Start Thompson Sampling (DWTS), a practical approach…

May 26, 2025
Learning to Choose or Choosing to Learn: Best-of-N vs. Supervised Fine-Tuning for Bit String Generation

arXiv:2505.17288v1 Announce Type: new Abstract: Using the bit string generation problem as a case study, we theoretically compare two standard methods for adapting large language models to new tasks. The first, referred to as supervised fine-tuning, involves training a new…

May 26, 2025
Optimal Transport with Heterogeneously Missing Data

arXiv:2505.17291v1 Announce Type: new Abstract: We consider the problem of solving the optimal transport problem between two empirical distributions with missing values. Our main assumption is that the data is missing completely at random (MCAR), but we allow for heterogeneous missingness probabilities across features and across the two distributions.…

May 26, 2025
Weekly Entering & Transitioning – Thread 26 May, 2025 – 02 Jun, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

May 26, 2025
2025 stack check: which DS/ML tools am I missing?

Hi all, I work in ad-tech, where my job is to improve the product with data-driven algorithms, mostly on tabular datasets (CTR models, bidding, attribution, the usual). Current work stack (quite classic I guess): pandas, numpy, scikit-learn, xgboost, statsmodels; PyTorch (light use); JupyterLab & notebooks; matplotlib,…

May 26, 2025
Found a really amazing video, providing context to the breakthrough as well as the misconceived hype around AlphaEvolve

I am sure by now most of us would have seen or at least heard about AlphaEvolve and its many breakthroughs including the 4*4 MM improvement. While this was a fantastic step forward in constrained optimisation problems…

May 26, 2025
Can you explain to me the product analytics job?

I’ve watched videos about Data Scientist Product Analytics but I still don’t understand if the job would excite me. Can someone explain it more in depth so that I can understand if I like it? I like the data science job (I am pursuing a…

May 26, 2025
Is studying Data Science still worth it?

Hi everyone, I’m currently studying data science, but I’ve been hearing that the demand for data scientists is decreasing significantly. I’ve also been told that many data scientists are essentially becoming analysts, while the machine learning side of things is increasingly being handled by engineers. Does it still…

May 26, 2025
Prototyping Gradient Descent in Machine Learning

Prototyping Gradient Descent in Machine Learning Mathematical theorem and credit transaction prediction using Stochastic / Batch GD The post Prototyping Gradient Descent in Machine Learning appeared first on Towards Data Science. Kuriko Iwai

May 24, 2025
Estimating Product-Level Price Elasticities Using Hierarchical Bayesian

Estimating Product-Level Price Elasticities Using Hierarchical Bayesian Using one model to personalize ML results The post Estimating Product-Level Price Elasticities Using Hierarchical Bayesian appeared first on Towards Data Science. Derek Tran

May 24, 2025
Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype

Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype Improve static analysis and run-time validation with full generic specification The post Do More with NumPy Array Type Hints: Annotate & Validate Shape & Dtype appeared first on Towards Data Science. Christopher Ariza

May 24, 2025
New to LLMs? Start Here

New to LLMs? Start Here A guide to Agents, LLMs, RAG, Fine-tuning, LangChain with practical examples to start building The post New to LLMs? Start Here appeared first on Towards Data Science. ALESSANDRA COSTA

May 24, 2025
How to Evaluate LLMs and Algorithms — The Right Way

How to Evaluate LLMs and Algorithms — The Right Way All the hard work it takes to integrate large language models and powerful algorithms into your workflows can go to waste…

May 24, 2025
PO-Flow: Flow-based Generative Models for Sampling Potential Outcomes and Counterfactuals

PO-Flow: Flow-based Generative Models for Sampling Potential Outcomes and Counterfactuals arXiv:2505.16051v1 Announce Type: new Abstract: We propose PO-Flow, a novel continuous normalizing flow (CNF) framework for causal inference that jointly models potential outcomes and counterfactuals. Trained via flow matching, PO-Flow provides a unified framework for individualized potential outcome prediction, counterfactual predictions, and uncertainty-aware density learning.…

May 23, 2025
CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision

CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision arXiv:2505.15927v1 Announce Type: new Abstract: Learning complex functions that involve multi-step reasoning poses a significant challenge for standard supervised learning from input-output examples. Chain-of-thought (CoT) supervision, which provides intermediate reasoning steps together with the final output, has emerged as a powerful empirical technique, underpinning much of the…

May 23, 2025
Oh SnapMMD! Forecasting Stochastic Dynamics Beyond the Schrödinger Bridge’s End

Oh SnapMMD! Forecasting Stochastic Dynamics Beyond the Schrödinger Bridge’s End arXiv:2505.16082v1 Announce Type: new Abstract: Scientists often want to make predictions beyond the observed time horizon of “snapshot” data following latent stochastic dynamics. For example, in time course single-cell mRNA profiling, scientists have access to cellular transcriptional state measurements (snapshots) from different biological replicates at…

May 23, 2025
Dimension-adapted Momentum Outscales SGD

Dimension-adapted Momentum Outscales SGD arXiv:2505.16098v1 Announce Type: new Abstract: We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target…

May 23, 2025
Exponential Convergence of CAVI for Bayesian PCA

Exponential Convergence of CAVI for Bayesian PCA arXiv:2505.16145v1 Announce Type: new Abstract: Probabilistic principal component analysis (PCA) and its Bayesian variant (BPCA) are widely used for dimension reduction in machine learning and statistics. The main advantage of probabilistic PCA over the traditional formulation is allowing uncertainty quantification. The parameters of BPCA are typically learned using…

May 23, 2025
Google’s AlphaEvolve: Getting Started with Evolutionary Coding Agents

Google’s AlphaEvolve: Getting Started with Evolutionary Coding Agents Introduction AlphaEvolve [1] is a promising new coding agent by Google’s DeepMind. Let’s look at what it is and why it is generating hype. Much of the Google paper is on the claim that AlphaEvolve is facilitating novel research through its ability to improve code until it solves…

May 23, 2025
Multiple Linear Regression Analysis

Multiple Linear Regression Analysis Implementation of multiple linear regression on real data: Assumption checks, model evaluation, and interpretation of results using Python. The post Multiple Linear Regression Analysis appeared first on Towards Data Science. JUNIOR JUMBONG

May 23, 2025
Inheritance: A Software Engineering Concept Data Scientists Must Know To Succeed

Inheritance: A Software Engineering Concept Data Scientists Must Know To Succeed Coding concepts that distinguish an amateur from a professional data scientist The post Inheritance: A Software Engineering Concept Data Scientists Must Know To Succeed appeared first on Towards Data Science. Benjamin Lee

May 23, 2025
What Statistics Can Tell Us About NBA Coaches

What Statistics Can Tell Us About NBA Coaches Using Python to determine where NBA coaches come from and what makes them successful The post What Statistics Can Tell Us About NBA Coaches appeared first on Towards Data Science. Brayden Gerrard

May 23, 2025
About Calculating Date Ranges in DAX

About Calculating Date Ranges in DAX When performing date calculations, creating date ranges can be helpful. But how can we do this, and which DAX function can help us in which case? Now you can learn more about this topic. The post About Calculating Date Ranges in DAX appeared first on Towards Data Science. Salvatore…

May 23, 2025
Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective

Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective arXiv:2505.14808v1 Announce Type: new Abstract: This work aims to demystify the out-of-distribution (OOD) capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. With such a parameterization, we can model distribution shifts as a varying angle between the subspace of the…

May 22, 2025
LOBSTUR: A Local Bootstrap Framework for Tuning Unsupervised Representations in Graph Neural Networks

LOBSTUR: A Local Bootstrap Framework for Tuning Unsupervised Representations in Graph Neural Networks arXiv:2505.14867v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) are increasingly used in conjunction with unsupervised learning techniques to learn powerful node representations, but their deployment is hindered by their high sensitivity to hyperparameter tuning and the absence of established methodologies for…

May 22, 2025
Convergence of Adam in Deep ReLU Networks via Directional Complexity and Kakeya Bounds

Convergence of Adam in Deep ReLU Networks via Directional Complexity and Kakeya Bounds arXiv:2505.15013v1 Announce Type: new Abstract: First-order adaptive optimization methods like Adam are the default choices for training modern deep neural networks. Despite their empirical success, the theoretical understanding of these methods in non-smooth settings, particularly in Deep ReLU networks, remains limited. ReLU…

May 22, 2025
A Linear Approach to Data Poisoning

A Linear Approach to Data Poisoning arXiv:2505.15175v1 Announce Type: new Abstract: We investigate the theoretical foundations of data poisoning attacks in machine learning models. Our analysis reveals that the Hessian with respect to the input serves as a diagnostic tool for detecting poisoning, exhibiting spectral signatures that characterize compromised datasets. We use random matrix theory…

May 22, 2025
Infinite hierarchical contrastive clustering for personal digital envirotyping

Infinite hierarchical contrastive clustering for personal digital envirotyping arXiv:2505.15022v1 Announce Type: new Abstract: Daily environments have profound influence on our health and behavior. Recent work has shown that digital envirotyping, where computer vision is applied to images of daily environments taken during ecological momentary assessment (EMA), can be used to identify meaningful relationships between environmental…

May 22, 2025
Use PyTorch to Easily Access Your GPU

Use PyTorch to Easily Access Your GPU Let’s say you are lucky enough to have access to a system with an Nvidia Graphics Processing Unit (GPU). Did you know there is an absurdly easy method to use your GPU’s capabilities using a Python library intended and predominantly used for machine learning (ML) applications? Don’t worry…

May 22, 2025
Top Machine Learning Jobs and How to Prepare For Them

Top Machine Learning Jobs and How to Prepare For Them These days, job titles like data scientist, machine learning engineer, and AI engineer are everywhere — and if you are anything like me, it can be hard to understand what each of them actually does if you are not working within the field. And then there are titles…

May 22, 2025
Building AI Applications in Ruby

Building AI Applications in Ruby This is the second in a multi-part series on creating web applications with generative AI integration. Part 1 focused on explaining the AI stack and why the application layer is the best place in the stack to be. Check it out here. Table of Contents Introduction I thought spas were supposed…

May 22, 2025
Continuous Domain Generalization

Continuous Domain Generalization arXiv:2505.13519v1 Announce Type: new Abstract: Real-world data distributions often shift continuously across multiple latent factors such as time, geography, and socioeconomic context. However, existing domain generalization approaches typically treat domains as discrete or evolving along a single axis (e.g., time), which fails to capture the complex, multi-dimensional nature of real-world variation. This…

May 21, 2025
Data Balancing Strategies: A Survey of Resampling and Augmentation Methods

Data Balancing Strategies: A Survey of Resampling and Augmentation Methods arXiv:2505.13518v1 Announce Type: new Abstract: Imbalanced data poses a significant obstacle in machine learning, as an unequal distribution of class labels often results in skewed predictions and diminished model accuracy. To mitigate this problem, various resampling strategies have been developed, encompassing both oversampling and undersampling…

May 21, 2025
Randomised Optimism via Competitive Co-Evolution for Matrix Games with Bandit Feedback

Randomised Optimism via Competitive Co-Evolution for Matrix Games with Bandit Feedback arXiv:2505.13562v1 Announce Type: new Abstract: Learning in games is a fundamental problem in machine learning and artificial intelligence, with numerous applications (Silver et al., 2016; Schrittwieser et al., 2020). This work investigates two-player zero-sum matrix games with an unknown payoff matrix and bandit feedback, where each player observes their actions and the…

May 21, 2025
Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles

Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles arXiv:2505.13585v1 Announce Type: new Abstract: This work introduces a new method called scalable Bayesian Monte Carlo (SBMC). The model interpolates between a point estimator and the posterior, and the algorithm is a parallel implementation of a consistent (asymptotically unbiased) Bayesian deep learning algorithm: sequential Monte…

May 21, 2025