Category: aimlds

How to Build An AI Agent with Function Calling and GPT-5

How an AI agent works: a step-by-step guide. Source: Towards Data Science. Author: Ayoola Olafenwa.

October 21, 2025
How to Use Frontier Vision LLMs: Qwen3-VL

Learn how to apply VLMs to advanced document understanding tasks. Source: Towards Data Science. Author: Eivind Kjosbakken.

October 21, 2025
How I Tailored the Resume That Landed Me $100K+ Data Science and ML Offers

How to write a data science and machine learning resume that actually lands jobs. Source: Towards Data Science. Author: Egor Howell.

October 21, 2025
Things I Learned by Participating in GenAI Hackathons Over the Past 6 Months

Sharing my two cents from the building-in-public journey so far. Source: Towards Data Science. Author: Parul Pandey.

October 21, 2025
From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons

arXiv:2510.15012v1. Abstract: We revisit the Universal Approximation Theorem (UAT) through the lens of the tropical geometry of neural networks and introduce a constructive, geometry-aware initialization for sigmoidal multi-layer perceptrons (MLPs). Tropical geometry shows that Rectified Linear Unit (ReLU) networks admit decision functions with…

October 20, 2025
Reliable data clustering with Bayesian community detection

arXiv:2510.15013v1. Abstract: From neuroscience and genomics to systems biology and ecology, researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround…

October 20, 2025
The Tree-SNE Tree Exists

arXiv:2510.15014v1. Abstract: The clustering and visualisation of high-dimensional data is a ubiquitous task in modern data science. Popular techniques include nonlinear dimensionality reduction methods like t-SNE or UMAP. These methods face the ‘scale-problem’ of clustering: when dealing with the MNIST dataset, do we want to distinguish different digits…

October 20, 2025
The Coverage Principle: How Pre-training Enables Post-Training

arXiv:2510.15020v1. Abstract: Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross-entropy…

October 20, 2025
The Minimax Lower Bound of Kernel Stein Discrepancy Estimation

arXiv:2510.15058v1. Abstract: Kernel Stein discrepancies (KSDs) have emerged as a powerful tool for quantifying goodness-of-fit over the last decade, featuring numerous successful applications. To the best of our knowledge, all existing KSD estimators with known rate achieve $\sqrt{n}$-convergence. In this work, we…

October 20, 2025
How to perform synthetic control for multiple treated units? What are the things to keep in mind while performing it? Also, what Python package could I use? Also have questions about metrics

Hi, I have never done synthetic control. I want to work on a small project (like small data. My task is to find…

October 20, 2025
Weekly Entering & Transitioning – Thread 20 Oct, 2025 – 27 Oct, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

October 20, 2025
Anyone else tired of the non-stop LLM hype in personal and/or professional life?

I have a complex relationship with LLMs. At work, I’m told they’re the best thing since the invention of the internet, electricity, or [insert other trite comparison here], and that I’ll lose my job to people who do use them if I…

October 20, 2025
I built a project and I thought I might share it with the group

Disclaimer: It’s UK focused. Hi everyone, When I was looking to buy a house, a big annoyance I had was that I couldn’t easily tell if I was getting value for money. Although, in my opinion, any property is expensive as…

October 20, 2025
Transformers, Time Series, and the Myth of Permutation Invariance

There’s a common misconception in ML/DL that Transformers shouldn’t be used for forecasting because attention is permutation-invariant. Recent evidence suggests otherwise: in experiments with Google’s latest model, the model performs just as well with or without positional embeddings. You can find an…

October 20, 2025
Conceptual Frameworks for Data Science Projects

An overview of common framework types and a simple process for building custom frameworks. Source: Towards Data Science. Author: Chinmay Kakatkar.

October 20, 2025
How to Build Guardrails for Effective Agents

Learn how to set up effective guardrails to enforce desired behaviour from your agents. Source: Towards Data Science. Author: Eivind Kjosbakken.

October 20, 2025
Can We Save the AI Economy?

And do we want to? Source: Towards Data Science. Author: Stephanie Kirmer.

October 19, 2025
Python 3.14 and the End of the GIL

Exploring the opportunities and challenges of a GIL-free Python. Source: Towards Data Science. Author: Thomas Reid.

October 19, 2025
Machine Learning Meets Panel Data: What Practitioners Need to Know

How to avoid overestimating machine learning models’ performance, usefulness, and real-world applicability due to hidden data leakage. Source: Towards Data Science. Author: Marco Letta.

October 18, 2025
How to Classify Lung Cancer Subtype from DNA Copy Numbers Using PyTorch

A step-by-step introduction to understanding cancer from the perspective of a data scientist. Source: Towards Data Science. Author: Adam Streck.

October 18, 2025
How I Used Machine Learning to Predict 41% of Project Delays Before They Happened

How data science can help project managers anticipate risks and save time. Source: Towards Data Science. Author: Yassin Zehar.

October 18, 2025
Statistical Method mcRigor Enhances the Rigor of Metacell Partitioning in Single-Cell Data Analysis

mcRigor detects dubious metacells within each metacell partition and selects the optimal metacell partitioning method and hyperparameter for a given dataset. Source: Towards Data Science.…

October 18, 2025
Exact Dynamics of Multi-class Stochastic Gradient Descent

arXiv:2510.14074v1. Abstract: We develop a framework for analyzing the training and learning rate dynamics on a variety of high-dimensional optimization problems trained using one-pass stochastic gradient descent (SGD) with data generated from multiple anisotropic classes. We give exact expressions for a large class of…

October 17, 2025
deFOREST: Fusing Optical and Radar satellite data for Enhanced Sensing of Tree-loss

arXiv:2510.14092v1. Abstract: In this paper we develop a deforestation detection pipeline that incorporates optical and Synthetic Aperture Radar (SAR) data. A crucial component of the pipeline is the construction of anomaly maps of the optical data, which is done using…

October 17, 2025
High-Dimensional BWDM: A Robust Nonparametric Clustering Validation Index for Large-Scale Data

arXiv:2510.14145v1. Abstract: Determining the appropriate number of clusters in unsupervised learning is a central problem in statistics and data science. Traditional validity indices such as Calinski-Harabasz, Silhouette, and Davies-Bouldin depend on centroid-based distances and therefore degrade in high-dimensional or contaminated data. This…

October 17, 2025
Personalized federated learning, Row-wise fusion regularization, Multivariate modeling, Sparse estimation

arXiv:2510.14413v1. Abstract: We study personalized federated learning for multivariate responses where client models are heterogeneous yet share variable-level structure. Existing entry-wise penalties ignore cross-response dependence, while matrix-wise fusion over-couples clients. We propose a Sparse Row-wise Fusion (SROF) regularizer that clusters row vectors…

October 17, 2025
A novel Information-Driven Strategy for Optimal Regression Assessment

arXiv:2510.14222v1. Abstract: In Machine Learning (ML), a regression algorithm aims to minimize a loss function based on data. An assessment method in this context seeks to quantify the discrepancy between the optimal response for an input-output system and the estimate produced by a learned…

October 17, 2025
Feature Detection, Part 1: Image Derivatives, Gradients, and Sobel Operator

Applying calculus fundamentals to computer vision for edge detection. Source: Towards Data Science. Author: Vyacheslav Efimov.

October 17, 2025
Stop Feeling Lost: How to Master ML System Design

What machine learning system design is and how to prepare for it. Source: Towards Data Science. Author: Egor Howell.

October 17, 2025
A Beginner’s Guide to Robotics with Python

Build 3D simulations with PyBullet. Source: Towards Data Science. Author: Mauro Di Pietro.

October 17, 2025
How to Evaluate Retrieval Quality in RAG Pipelines: Precision@k, Recall@k, and F1@k

In my previous posts, I have walked you through putting together a very basic RAG pipeline in Python, as well as chunking large text documents. We’ve also looked into how documents are transformed into embeddings, allowing us to quickly search for similar documents…
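
The three metrics named in the title can be illustrated with a minimal sketch. This is not code from the linked post: the function name `precision_recall_f1_at_k`, the document IDs, and the choice to divide precision by `k` (rather than by the number of results actually returned) are assumptions made for illustration.

```python
def precision_recall_f1_at_k(retrieved, relevant, k):
    """Compute Precision@k, Recall@k and F1@k for one query.

    retrieved: ranked list of document IDs returned by the retriever.
    relevant:  set of document IDs judged relevant for the query.
    k:         cutoff rank (must be positive).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved[:k]
    # Count how many of the top-k results are in the relevant set.
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k  # assumption: denominator is k, not len(top_k)
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical query result: five ranked IDs, three relevant documents.
ranked = ["d1", "d7", "d3", "d9", "d4"]
gold = {"d1", "d3", "d5"}
print(precision_recall_f1_at_k(ranked, gold, k=3))
```

Averaging these per-query values over a set of test queries gives a single retrieval score for a pipeline.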

October 17, 2025
Efficient Inference for Coupled Hidden Markov Models in Continuous Time and Discrete Space

arXiv:2510.12916v1. Abstract: Systems of interacting continuous-time Markov chains are a powerful model class, but inference is typically intractable in high dimensional settings. Auxiliary information, such as noisy observations, is typically only available at discrete times, and incorporating it via…

October 16, 2025
Simplicial Gaussian Models: Representation and Inference

arXiv:2510.12983v1. Abstract: Probabilistic graphical models (PGMs) are powerful tools for representing statistical dependencies through graphs in high-dimensional systems. However, they are limited to pairwise interactions. In this work, we propose the simplicial Gaussian model (SGM), which extends Gaussian PGM to simplicial complexes. SGM jointly models random…

October 16, 2025
Conformal Inference for Open-Set and Imbalanced Classification

arXiv:2510.13037v1. Abstract: This paper presents a conformal prediction method for classification in highly imbalanced and open-set settings, where there are many possible classes and not all may be represented in the data. Existing approaches require a finite, known label space and typically involve random sample…

October 16, 2025
A Multi-dimensional Semantic Surprise Framework Based on Low-Entropy Semantic Manifolds for Fine-Grained Out-of-Distribution Detection

arXiv:2510.13093v1. Abstract: Out-of-Distribution (OOD) detection is a cornerstone for the safe deployment of AI systems in the open world. However, existing methods treat OOD detection as a binary classification problem, a cognitive flattening that fails to distinguish between…

October 16, 2025
Gaussian Certified Unlearning in High Dimensions: A Hypothesis Testing Approach

arXiv:2510.13094v1. Abstract: Machine unlearning seeks to efficiently remove the influence of selected data while preserving generalization. Significant progress has been made in low dimensions $(p \ll n)$, but high dimensions pose serious theoretical challenges as standard optimization assumptions of $\Omega(1)$ strong convexity…

October 16, 2025
First Principles Thinking for Data Scientists

The mindset that turns good data scientists into great ones. Source: Towards Data Science. Author: Greg Rafferty.

October 16, 2025
Prompt Engineering for Time-Series Analysis with Large Language Models

Part 1: Prompts for Core Strategies in Time-Series. Source: Towards Data Science. Author: Sara Nobrega.

October 16, 2025
Beyond Requests: Why httpx is the Modern HTTP Client You Need (Sometimes)

A comprehensive comparison of these two Python libraries. Source: Towards Data Science. Author: Thomas Reid.

October 16, 2025
How to Build Tools for AI Agents

Learn how to design and build effective tools to be used by AI Agents. Source: Towards Data Science. Author: Eivind Kjosbakken.

October 16, 2025
Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

arXiv:2510.11789v1. Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending…

October 15, 2025
On Thompson Sampling and Bilateral Uncertainty in Additive Bayesian Optimization

arXiv:2510.11792v1. Abstract: In Bayesian Optimization (BO), additive assumptions can mitigate the twin difficulties of modeling and searching a complex function in high dimension. However, common acquisition functions, like the Additive Lower Confidence Bound, ignore pairwise covariances between dimensions, which we’ll call \textit{bilateral…

October 15, 2025
Active Subspaces in Infinite Dimension

arXiv:2510.11871v1. Abstract: Active subspace analysis uses the leading eigenspace of the gradient’s second moment to conduct supervised dimension reduction. In this article, we extend this methodology to real-valued functionals on Hilbert space. We define an operator which coincides with the active subspace matrix when applied to a…

October 15, 2025
High-Probability Bounds For Heterogeneous Local Differential Privacy

arXiv:2510.11895v1. Abstract: We study statistical estimation under local differential privacy (LDP) when users may hold heterogeneous privacy levels and accuracy must be guaranteed with high probability. Departing from the common in-expectation analyses, and for one-dimensional and multi-dimensional mean estimation problems, we develop finite sample upper…

October 15, 2025
Simplifying Optimal Transport through Schatten-$p$ Regularization

arXiv:2510.11910v1. Abstract: We propose a new general framework for recovering low-rank structure in optimal transport using Schatten-$p$ norm regularization. Our approach extends existing methods that promote sparse and interpretable transport maps or plans, while providing a unified and principled family of convex programs that encourage low-dimensional…

October 15, 2025
Learning Triton One Kernel at a Time: Matrix Multiplication

Tiled GEMM, GPU memory, coalescing, and much more! Source: Towards Data Science. Author: Ryan Pégoud.

October 15, 2025
Building A Successful Relationship With Stakeholders

Show your value by moving beyond the technical. Source: Towards Data Science. Author: Kristopher McGlinchey.

October 15, 2025
Why AI Still Can’t Replace Analysts: A Predictive Maintenance Example

Learn about the limitations of AI in analytics through the example of bearing vibration data analysis. Source: Towards Data Science. Author: Illia Smoliienko.

October 15, 2025
Human Won’t Replace Python

Why vibe-coding is not a step up from “classic” coding, and why it matters. Source: Towards Data Science. Author: Elisha Rosensweig.

October 15, 2025
Learning with Incomplete Context: Linear Contextual Bandits with Pretrained Imputation

arXiv:2510.09908v1. Abstract: The rise of large-scale pretrained models has made it feasible to generate predictive or synthetic features at low cost, raising the question of how to incorporate such surrogate predictions into downstream decision-making. We study this problem in the setting of…

October 14, 2025
Calibrating Generative Models

arXiv:2510.10020v1. Abstract: Generative models frequently suffer miscalibration, wherein class probabilities and other statistics of the sampling distribution deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback-Leibler divergence satisfying calibration constraints. To address the intractability of imposing these constraints exactly,…

October 14, 2025
Kernel Treatment Effects with Adaptively Collected Data

arXiv:2510.10245v1. Abstract: Adaptive experiments improve efficiency by adjusting treatment assignments based on past outcomes, but this adaptivity breaks the i.i.d. assumptions that underpin classical asymptotics. At the same time, many questions of interest are distributional, extending beyond average effects. Kernel treatment effects (KTE) provide a…

October 14, 2025
Neural variational inference for cutting feedback during uncertainty propagation

arXiv:2510.10268v1. Abstract: In many scientific applications, uncertainty of estimates from an earlier (upstream) analysis needs to be propagated in subsequent (downstream) Bayesian analysis, without feedback. Cutting feedback methods, also termed cut-Bayes, achieve this by constructing a cut-posterior distribution that prevents backward information flow.…

October 14, 2025
On some practical challenges of conformal prediction

arXiv:2510.10324v1. Abstract: Conformal prediction is a model-free machine learning method for creating prediction regions with a guaranteed coverage probability level. However, a data scientist often faces three challenges in practice: (i) the determination of a conformal prediction region is only approximate, jeopardizing the finite-sample validity…

October 14, 2025
How to Spin Up a Project Structure with Cookiecutter

If you’re anything like me, “procrastination” might as well be your middle name. There’s always that nagging hesitation before starting a new project. Just thinking about setting up the project structure, creating documentation, or writing a decent README is enough to trigger yawns. It feels like…

October 14, 2025
A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization

arXiv:2510.08916v1. Abstract: The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual…

October 13, 2025
Gradient-Guided Furthest Point Sampling for Robust Training Set Selection

arXiv:2510.08906v1. Abstract: Smart training set selection procedures enable the reduction of data needs and improve predictive robustness in machine learning problems relevant to chemistry. We introduce Gradient Guided Furthest Point Sampling (GGFPS), a simple extension of Furthest Point Sampling (FPS) that leverages molecular…

October 13, 2025
Mirror Flow Matching with Heavy-Tailed Priors for Generative Modeling on Convex Domains

arXiv:2510.08929v1. Abstract: We study generative modeling on convex domains using flow matching and mirror maps, and identify two fundamental challenges. First, standard log-barrier mirror maps induce heavy-tailed dual distributions, leading to ill-posed dynamics. Second, coupling with Gaussian priors performs poorly…

October 13, 2025
Distributionally robust approximation property of neural networks

arXiv:2510.09177v1. Abstract: The universal approximation property uniformly with respect to weakly compact families of measures is established for several classes of neural networks. To that end, we prove that these neural networks are dense in Orlicz spaces, thereby extending classical universal approximation theorems even beyond…

October 13, 2025
A unified Bayesian framework for adversarial robustness

arXiv:2510.09288v1. Abstract: The vulnerability of machine learning models to adversarial attacks remains a critical security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. However, these deterministic approaches do not account for uncertainty in the adversary’s attack. While stochastic…

October 13, 2025
Weekly Entering & Transitioning – Thread 13 Oct, 2025 – 20 Oct, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

October 13, 2025
From data scientist to a new role?

Hi everyone, I’m 25, currently working as a Data Scientist & AI Engineer at a large Space company in Europe, with ~2.5 years of experience. My focus has been on LLM R&D, RAG pipelines, satellite telemetry anomaly detection, surrogate modeling, and some FPGA-compatible ML for onboard systems.…

October 13, 2025
Clustering very different values

I have 200 observations and 3 variables (somewhat correlated). For v1, the median is 300 dollars, but I have a really long tail. When I plot the histogram, 100 observations are near 0 and the others form a really long tail, even when I cap outliers. What is the best way to cluster?…

October 13, 2025
What should I ask my potential managers when choosing between two jobs?

I’m deciding between two mid-level data science offers at large tech companies. These are more applied scientist type of roles than analytics. Comp and level are similar, so I’m really trying to figure out which one will set me up for a stronger…

October 13, 2025
Free data set that links company to type of activity?

Best resource to classify companies, for example: Walmart, food (top classification), supermarket (sub-classification). I also work with European companies. Thanks. Submitted by /u/Due-Duty961.

October 13, 2025
10 Data + AI Observations for Fall 2025

What’s happening, and what’s next, for data and AI at the close of 2025. Source: Towards Data Science. Author: Barr Moses.

October 11, 2025
Dreaming in Blocks — MineWorld, the Minecraft World Model

Explaining “MineWorld: A real-time and open-source interactive world model on Minecraft” in simple terms. Source: Towards Data Science. Author: Youssef Farag.

October 11, 2025
Evaluating and Learning Optimal Dynamic Treatment Regimes under Truncation by Death

arXiv:2510.07501v1. Abstract: Truncation by death, a prevalent challenge in critical care, renders traditional dynamic treatment regime (DTR) evaluation inapplicable due to ill-defined potential outcomes. We introduce a principal stratification-based method, focusing on the always-survivor value function. We derive a semiparametrically efficient,…

October 10, 2025
From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation

arXiv:2510.07624v1. Abstract: Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work has highlighted its limitations, particularly in…

October 10, 2025
When Robustness Meets Conservativeness: Conformalized Uncertainty Calibration for Balanced Decision Making

arXiv:2510.07750v1. Abstract: Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet its effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches…

October 10, 2025
A Honest Cross-Validation Estimator for Prediction Performance

arXiv:2510.07649v1. Abstract: Cross-validation is a standard tool for obtaining an honest assessment of the performance of a prediction model. The commonly used version repeatedly splits data, trains the prediction model on the training set, evaluates the model performance on the test set, and averages the…
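
The "commonly used version" described in the abstract (split, train, evaluate, average) can be sketched as plain k-fold cross-validation. This is the standard procedure only, not the estimator the paper proposes; the helper names (`k_fold_cv`, `fit_mean`, `predict_mean`, `squared_error`) and the toy data are assumptions for illustration.

```python
import random
from statistics import mean


def k_fold_cv(xs, ys, fit, predict, loss, k=5, seed=0):
    """Average held-out loss over k folds (standard k-fold cross-validation)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    fold_scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # Train on everything except the current fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Evaluate on the held-out fold only.
        fold_scores.append(mean(loss(ys[i], predict(model, xs[i])) for i in fold))
    return mean(fold_scores)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_xs, train_ys):
    return mean(train_ys)


def predict_mean(model, x):
    return model


def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
print(k_fold_cv(xs, ys, fit_mean, predict_mean, squared_error, k=5))
```

Passing `fit`, `predict`, and `loss` as functions keeps the sketch independent of any particular modeling library.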

October 10, 2025
Surrogate Graph Partitioning for Spatial Prediction

arXiv:2510.07832v1. Abstract: Spatial prediction refers to the estimation of unobserved values from spatially distributed observations. Although recent advances have improved the capacity to model diverse observation types, adoption in practice remains limited in industries that demand interpretability. To mitigate this gap, surrogate models that explain black-box…

October 10, 2025
Past is Prologue: How Conversational Analytics Is Changing Data Work

Past is Prologue: How Conversational Analytics Is Changing Data Work The future of reporting will be about encoding the value proposition of a product into prompt design. The post Past is Prologue: How Conversational Analytics Is Changing Data Work appeared first on Towards Data Science. Whitney Marks

October 10, 2025
How the Rise of Tabular Foundation Models Is Reshaping Data Science

How the Rise of Tabular Foundation Models Is Reshaping Data Science A turning point for data analysis? The post How the Rise of Tabular Foundation Models Is Reshaping Data Science appeared first on Towards Data Science. Pirmin Lemberger

October 10, 2025
Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy

Online Matching via Reinforcement Learning: An Expert Policy Orchestration Strategy arXiv:2510.06515v1 Announce Type: new Abstract: Online matching problems arise in many complex systems, from cloud services and online marketplaces to organ exchange networks, where timely, principled decisions are critical for maintaining high system performance. Traditional heuristics in these settings are simple and interpretable but typically…

October 9, 2025
A General Constructive Upper Bound on Shallow Neural Nets Complexity

A General Constructive Upper Bound on Shallow Neural Nets Complexity arXiv:2510.06372v1 Announce Type: new Abstract: We provide an upper bound on the number of neurons required in a shallow neural network to approximate a continuous function on a compact set with a given accuracy. This method, inspired by a specific proof of the Stone-Weierstrass theorem,…

October 9, 2025
Q-Learning with Fine-Grained Gap-Dependent Regret

Q-Learning with Fine-Grained Gap-Dependent Regret arXiv:2510.06647v1 Announce Type: new Abstract: We study fine-grained gap-dependent regret bounds for model-free reinforcement learning in episodic tabular Markov Decision Processes. Existing model-free algorithms achieve minimax worst-case regret, but their gap-dependent bounds remain coarse and fail to fully capture the structure of suboptimality gaps. We address this limitation by establishing…

October 9, 2025
Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix

Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix arXiv:2510.06685v1 Announce Type: new Abstract: Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of…

October 9, 2025
Bayesian Nonparametric Dynamical Clustering of Time Series

Bayesian Nonparametric Dynamical Clustering of Time Series arXiv:2510.06919v1 Announce Type: new Abstract: We present a method that models the evolution of an unbounded number of time series clusters by switching among an unknown number of regimes with linear dynamics. We develop a Bayesian non-parametric approach using a hierarchical Dirichlet process as a prior on the…

October 9, 2025
Know Your Real Birthday: Astronomical Computation and Geospatial-Temporal Analytics in Python

Know Your Real Birthday: Astronomical Computation and Geospatial-Temporal Analytics in Python A hands-on walkthrough using skyfield, timezonefinder, geopy, and pytz, and further practical applications The post Know Your Real Birthday: Astronomical Computation and Geospatial-Temporal Analytics in Python appeared first on Towards Data Science. Chinmay Kakatkar

October 9, 2025
Data Visualization Explained (Part 3): The Role of Color

Data Visualization Explained (Part 3): The Role of Color A simple and powerful guide to using color for more impactful data stories. The post Data Visualization Explained (Part 3): The Role of Color appeared first on Towards Data Science. Murtaza Ali

October 9, 2025
Minima and Critical Points of the Bethe Free Energy Are Invariant Under Deformation Retractions of Factor Graphs

Minima and Critical Points of the Bethe Free Energy Are Invariant Under Deformation Retractions of Factor Graphs arXiv:2510.05380v1 Announce Type: new Abstract: In graphical models, factor graphs, and more generally energy-based models, the interactions between variables are encoded by a graph, a hypergraph, or, in the most general case, a partially ordered set (poset). Inference…

October 8, 2025
Refereed Learning

Refereed Learning arXiv:2510.05440v1 Announce Type: new Abstract: We initiate an investigation of learning tasks in a setting where the learner is given access to two competing provers, only one of which is honest. Specifically, we consider the power of such learners in assessing purported properties of opaque models. Following prior work that considers the power…

October 8, 2025
Domain-Shift-Aware Conformal Prediction for Large Language Models

Domain-Shift-Aware Conformal Prediction for Large Language Models arXiv:2510.05566v1 Announce Type: new Abstract: Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under…

October 8, 2025
A Probabilistic Basis for Low-Rank Matrix Learning

A Probabilistic Basis for Low-Rank Matrix Learning arXiv:2510.05447v1 Announce Type: new Abstract: Low rank inference on matrices is widely conducted by optimizing a cost function augmented with a penalty proportional to the nuclear norm $\Vert \cdot \Vert_*$. However, despite the assortment of computational methods for such problems, there is a surprising lack of understanding of…

October 8, 2025
Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes

Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes arXiv:2510.05568v1 Announce Type: new Abstract: Methods for solving scientific computing and inference problems, such as kernel- and neural network-based approaches for partial differential equations (PDEs), inverse problems, and supervised learning tasks, depend crucially on the choice of hyperparameters. Specifically, the…

October 8, 2025
This Puzzle Shows Just How Far LLMs Have Progressed in a Little Over a Year

This Puzzle Shows Just How Far LLMs Have Progressed in a Little Over a Year What took GPT-4o 2 hours to solve, Sonnet 4.5 does in 5 seconds The post This Puzzle Shows Just How Far LLMs Have Progressed in a Little Over a Year appeared first on Towards Data Science. Thomas Reid

October 8, 2025
Quantile-Scaled Bayesian Optimization Using Rank-Only Feedback

Quantile-Scaled Bayesian Optimization Using Rank-Only Feedback arXiv:2510.03277v1 Announce Type: new Abstract: Bayesian Optimization (BO) is widely used for optimizing expensive black-box functions, particularly in hyperparameter tuning. However, standard BO assumes access to precise objective values, which may be unavailable, noisy, or unreliable in real-world settings where only relative or rank-based feedback can be obtained. In…

October 7, 2025
Mathematically rigorous proofs for Shapley explanations

Mathematically rigorous proofs for Shapley explanations arXiv:2510.03281v1 Announce Type: new Abstract: Machine Learning is becoming increasingly important in today’s world. It is therefore very important to provide understanding of the decision-making process of machine-learning models. A popular way to do this is by looking at the Shapley-Values of these models as introduced by Lundberg…

October 7, 2025
Transformed $\ell_1$ Regularizations for Robust Principal Component Analysis: Toward a Fine-Grained Understanding

Transformed $\ell_1$ Regularizations for Robust Principal Component Analysis: Toward a Fine-Grained Understanding arXiv:2510.03624v1 Announce Type: new Abstract: Robust Principal Component Analysis (RPCA) aims to recover a low-rank structure from noisy, partially observed data that is also corrupted by sparse, potentially large-magnitude outliers. Traditional RPCA models rely on convex relaxations, such as nuclear norm and $\ell_1$…

October 7, 2025
The analogy theorem in Hoare logic

The analogy theorem in Hoare logic arXiv:2510.03685v1 Announce Type: new Abstract: The introduction of machine learning methods has led to significant advances in automation, optimization, and discoveries in various fields of science and technology. However, their widespread application faces a fundamental limitation: the transfer of models between data domains generally lacks a rigorous mathematical justification.…

October 7, 2025
Spectral Thresholds for Identifiability and Stability: Finite-Sample Phase Transitions in High-Dimensional Learning

Spectral Thresholds for Identifiability and Stability: Finite-Sample Phase Transitions in High-Dimensional Learning arXiv:2510.03809v1 Announce Type: new Abstract: In high-dimensional learning, models remain stable until they collapse abruptly once the sample size falls below a critical level. This instability is not algorithm-specific but a geometric mechanism: when the weakest Fisher eigendirection falls beneath sample-level fluctuations, identifiability fails.…

October 7, 2025
How to Perform Effective Agentic Context Engineering

How to Perform Effective Agentic Context Engineering Learn how to optimize the context of your agents, for powerful agentic performance The post How to Perform Effective Agentic Context Engineering appeared first on Towards Data Science. Eivind Kjosbakken

October 7, 2025
How I Used ChatGPT to Land My Next Data Science Role

How I Used ChatGPT to Land My Next Data Science Role Practical AI hacks for every stage of the job search — with real prompts and examples The post How I Used ChatGPT to Land My Next Data Science Role appeared first on Towards Data Science. Yu Dong

October 7, 2025
How To Build Effective Technical Guardrails for AI Applications

How To Build Effective Technical Guardrails for AI Applications Exploring the most practical guardrails to implement at ground level The post How To Build Effective Technical Guardrails for AI Applications appeared first on Towards Data Science. Nidhin Karunakaran Ponon

October 7, 2025
Plotly Dash — A Structured Framework for a Multi-Page Dashboard

Plotly Dash — A Structured Framework for a Multi-Page Dashboard An easy starting point for larger and more complicated Dash dashboards The post Plotly Dash — A Structured Framework for a Multi-Page Dashboard appeared first on Towards Data Science. Michael Clayton

October 7, 2025
Higher-arity PAC learning, VC dimension and packing lemma

Higher-arity PAC learning, VC dimension and packing lemma arXiv:2510.02420v1 Announce Type: new Abstract: The aim of this note is to overview some of our work in Chernikov, Towsner’20 (arXiv:2010.00726) developing higher arity VC theory (VC$_n$ dimension), including a generalization of Haussler packing lemma, and an associated tame (slice-wise) hypergraph regularity lemma; and to demonstrate that…

October 6, 2025
Predictive inference for time series: why is split conformal effective despite temporal dependence?

Predictive inference for time series: why is split conformal effective despite temporal dependence? arXiv:2510.02471v1 Announce Type: new Abstract: We consider the problem of uncertainty quantification for prediction in a time series: if we use past data to forecast the next time point, can we provide valid prediction intervals around our forecasts? To avoid placing distributional…

October 6, 2025
Beyond Linear Diffusions: Improved Representations for Rare Conditional Generative Modeling

Beyond Linear Diffusions: Improved Representations for Rare Conditional Generative Modeling arXiv:2510.02499v1 Announce Type: new Abstract: Diffusion models have emerged as powerful generative frameworks with widespread applications across machine learning and artificial intelligence systems. While current research has predominantly focused on linear diffusions, these approaches can face significant challenges when modeling a conditional distribution, $P(Y|X=x)$, when…

October 6, 2025
Adaptive randomized pivoting and volume sampling

Adaptive randomized pivoting and volume sampling arXiv:2510.02513v1 Announce Type: new Abstract: Adaptive randomized pivoting (ARP) is a recently proposed and highly effective algorithm for column subset selection. This paper reinterprets the ARP algorithm by drawing connections to the volume sampling distribution and active learning algorithms for linear regression. As consequences, this paper presents new analysis…

October 6, 2025