Category: aimlds

Robust variational neural posterior estimation for simulation-based inference

arXiv:2509.05724v1 Announce Type: new Abstract: Recent advances in neural density estimation have enabled powerful simulation-based inference (SBI) methods that can flexibly approximate Bayesian inference for intractable stochastic models. Although these methods have demonstrated reliable posterior estimation when the simulator accurately represents the underlying data generative process (GDP),…

September 9, 2025
Risk-averse Fair Multi-class Classification

arXiv:2509.05771v1 Announce Type: new Abstract: We develop a new classification framework based on the theory of coherent risk measures and systemic risk. The proposed approach is suitable for multi-class problems when the data is noisy, scarce (relative to the dimension of the problem), and the labeling might be unreliable. In the…

September 9, 2025
Fisher Random Walk: Automatic Debiasing Contextual Preference Inference for Large Language Model Evaluation

arXiv:2509.05852v1 Announce Type: new Abstract: Motivated by the need for rigorous and scalable evaluation of large language models, we study contextual preference inference for pairwise comparison functionals of context-dependent preference score functions across domains. Focusing on the contextual Bradley-Terry-Luce model, we develop…

September 9, 2025
Causal Clustering for Conditional Average Treatment Effects Estimation and Subgroup Discovery

arXiv:2509.05775v1 Announce Type: new Abstract: Estimating heterogeneous treatment effects is critical in domains such as personalized medicine, resource allocation, and policy evaluation. A central challenge lies in identifying subpopulations that respond differently to interventions, thereby enabling more targeted and effective decision-making. While clustering methods…

September 9, 2025
Implementing the Gaussian Challenge in Python

Beginner-friendly tutorial to understand the range function and Python loops. Originally posted on Towards Data Science by Mahnoor Javed.

September 9, 2025
Agentic AI and the Future of Python Project Management Tooling

Introducing a pyramid framework of evolution, accelerating and decelerating factors, and strategic recommendations for incumbents and new entrants. Originally posted on Towards Data Science by Chinmay Kakatkar.

September 9, 2025
From Tokens to Theorems: Building a Neuro-Symbolic AI Mathematician

The next Gauss may not be born; they may be spun up in the cloud. Originally posted on Towards Data Science by Sean Moran.

September 9, 2025
The End-to-End Data Scientist’s Prompt Playbook

Part 3: Prompts for docs, DevOps, and stakeholder communication. Originally posted on Towards Data Science by Sara Nobrega.

September 9, 2025
Implementing the Coffee Machine in Python

A beginner-friendly step-by-step guide to coding a Coffee Maker in Python. Originally posted on Towards Data Science by Mahnoor Javed.

September 9, 2025
Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment

arXiv:2509.04852v1 Announce Type: new Abstract: Estimating density ratios is a fundamental problem in machine learning, but existing methods often trade off accuracy for efficiency. We propose Interval-annealed Secant Alignment Density Ratio Estimation (ISA-DRE), a framework that enables accurate, any-step estimation without numerical integration. Instead of modeling infinitesimal…

September 8, 2025
Optimal Variance and Covariance Estimation under Differential Privacy in the Add-Remove Model and Beyond

arXiv:2509.04919v1 Announce Type: new Abstract: In this paper, we study the problem of estimating the variance and covariance of datasets under differential privacy in the add-remove model. While estimation in the swap model has been extensively studied in the literature, the…

September 8, 2025
Probabilistic operator learning: generative modeling and uncertainty quantification for foundation models of differential equations

arXiv:2509.05186v1 Announce Type: new Abstract: In-context operator networks (ICON) are a class of operator learning methods based on the novel architectures of foundation models. Trained on a diverse set of datasets of initial and boundary conditions paired with corresponding solutions to…

September 8, 2025
Spectral Algorithms in Misspecified Regression: Convergence under Covariate Shift

arXiv:2509.05106v1 Announce Type: new Abstract: This paper investigates the convergence properties of spectral algorithms, a class of regularization methods originating from inverse problems, under covariate shift. In this setting, the marginal distributions of inputs differ between source and target domains, while the conditional distribution…

September 8, 2025
Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

arXiv:2509.04631v1 Announce Type: cross Abstract: Transductive conformal prediction addresses the simultaneous prediction for multiple data points. Given a desired confidence level, the objective is to construct a prediction set that includes the true outcomes with the prescribed confidence. We demonstrate a fundamental trade-off between confidence and…

September 8, 2025
Weekly Entering & Transitioning – Thread 08 Sep, 2025 – 15 Sep, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

September 8, 2025
🚀 Perpetual ML Suite: Now Live on the Snowflake Marketplace!

Submitted by /u/mutlu_simsek.

September 8, 2025
Europe Salary Thread 2025 – What’s your role and salary?

The yearly Europe-centric salary thread. You can find the last one here: https://old.reddit.com/r/datascience/comments/1fxrmzl/europe_salary_thread_2024_whats_your_role_and/ I think it’s worthwhile to learn from one another and see what different flavours of data scientists, analysts and engineers are out there in the wild. In my opinion, this is especially…

September 8, 2025
Help me evaluate a new job offer – Stay or go?

Hi all, I’m having a really hard time deciding whether or not to take an offer I’ve recently received, and would really appreciate some advice and a sense check. For context, I generally feel my current role is comfortable, but I’m starting to plateau after…

September 8, 2025
How to evaluate data transformations?

There are several well-established benchmarks for text-to-SQL tasks like BIRD, Spider, and WikiSQL. However, I’m working on a data transformation system that handles per-row transformations with contextual understanding of the input data. The challenge is that most existing benchmarks focus on either: pure SQL generation (BIRD, Spider); simple data cleaning…

September 8, 2025
The Beauty of Space-Filling Curves: Understanding the Hilbert Curve

A quick journey from theory to implementation and application. Originally posted on Towards Data Science by Paul Fröhling.

September 8, 2025
Preventing Context Overload: Controlled Neo4j MCP Cypher Responses for LLMs

How timeouts, truncation, and result sanitization keep Cypher outputs LLM-ready. Originally posted on Towards Data Science by Tomaz Bratanic.

September 8, 2025
Hands-On with Agents SDK: Safeguarding Input and Output with Guardrails

A practical exploration of how guardrails safeguard multi-agent systems in Python using OpenAI Agents SDK, Streamlit, and Pydantic. Originally posted on Towards Data Science by Iqbal Rahmadhan.

September 7, 2025
Extracting Structured Data with LangExtract: A Deep Dive into LLM-Orchestrated Workflows

A guide to building modular workflows for structured intelligence. Originally posted on Towards Data Science by Subha Ganapathi.

September 7, 2025
How to Context Engineer to Optimize Question Answering Pipelines

Learn how to apply context engineering to enhance your question answering systems. Originally posted on Towards Data Science by Eivind Kjosbakken.

September 6, 2025
Showcasing Your Work on HuggingFace Spaces

Building an app is exciting – but sharing it is where the real value kicks in. Back when Heroku offered a free tier, deploying demos was effortless. Those days are gone, and finding a simple, free way to showcase machine learning apps has become harder. That’s where Hugging Face…

September 6, 2025
AI Operations Under the Hood: Challenges and Best Practices

Building robust, reproducible, and reliable GenAI applications requires a framework of continuous improvement, rigorous evaluation, and systematic validation. Originally posted on Towards Data Science by Erika G. Gonçalves.

September 6, 2025
Zero-Inflated Data: A Comparison of Regression Models

How to detect it and which model to choose. Originally posted on Towards Data Science by Arnaud Capitaine.

September 6, 2025
Tool Masking: The Layer MCP Forgot

Tool masking for AI improves AI agents: shape MCP tool surfaces to cut tokens and errors, and to boost speed and reliability. Start prompt engineering your tools. Originally posted on Towards Data Science by Frank Wittkampf.

September 6, 2025
Energy-Weighted Flow Matching: Unlocking Continuous Normalizing Flows for Efficient and Scalable Boltzmann Sampling

arXiv:2509.03726v1 Announce Type: new Abstract: Sampling from unnormalized target distributions, e.g. Boltzmann distributions $\mu_{\text{target}}(x) \propto \exp(-E(x)/T)$, is fundamental to many scientific applications yet computationally challenging due to complex, high-dimensional energy landscapes. Existing approaches applying modern generative models to Boltzmann distributions either require…

September 5, 2025
Testing for correlation between network structure and high-dimensional node covariates

arXiv:2509.03772v1 Announce Type: new Abstract: In many application domains, networks are observed with node-level features. In such settings, a common problem is to assess whether or not nodal covariates are correlated with the network structure itself. Here, we present four novel methods for addressing this…

September 5, 2025
Diffusion Generative Models Meet Compressed Sensing, with Applications to Image Data and Financial Time Series

arXiv:2509.03898v1 Announce Type: new Abstract: This paper develops dimension reduction techniques for accelerating diffusion model inference in the context of synthetic data generation. The idea is to integrate compressed sensing into diffusion models: (i) compress the data into a latent…

September 5, 2025
Batched Stochastic Matching Bandits

arXiv:2509.04194v1 Announce Type: new Abstract: In this study, we introduce a novel bandit framework for stochastic matching based on the Multi-nomial Logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its…

September 5, 2025
An invertible generative model for forward and inverse problems

arXiv:2509.03910v1 Announce Type: new Abstract: We formulate the inverse problem in a Bayesian framework and aim to train a generative model that allows us to simulate (i.e., sample from the likelihood) and do inference (i.e., sample from the posterior). We review the use of triangular normalizing…

September 5, 2025
Should We Use LLMs As If They Were Swiss Knives?

A logic game performance comparison between popular LLMs and a custom-made algorithm. Originally posted on Towards Data Science by Nicolas Garcia Aramouni.

September 5, 2025
A Visual Guide to Tuning Random Forest Hyperparameters

How hyperparameter tuning visually changes random forests. Originally posted on Towards Data Science by James Gibbins.

September 5, 2025
MobileNetV1 Paper Walkthrough: The Tiny Giant

Understanding and implementing MobileNetV1 from scratch with PyTorch. Originally posted on Towards Data Science by Muhammad Ardi.

September 5, 2025
Using LangGraph and MCP Servers to Create My Own Voice Assistant

Built over 14 days, all locally run, no API keys, cloud services, or subscription fees. Originally posted on Towards Data Science by Benjamin Lee.

September 5, 2025
Boosting Your Anomaly Detection With LLMs

The 7 emerging application patterns you should know. Originally posted on Towards Data Science by Shuai Guo.

September 5, 2025
Fast kernel methods: Sobolev, physics-informed, and additive models

arXiv:2509.02649v1 Announce Type: new Abstract: Kernel methods are powerful tools in statistical learning, but their cubic complexity in the sample size n limits their use on large-scale datasets. In this work, we introduce a scalable framework for kernel regression with O(n log n) complexity, fully leveraging GPU…

September 4, 2025
Gaussian process surrogate with physical law-corrected prior for multi-coupled PDEs defined on irregular geometry

arXiv:2509.02617v1 Announce Type: new Abstract: Parametric partial differential equations (PDEs) are fundamental mathematical tools for modeling complex physical systems, yet their numerical evaluation across parameter spaces remains computationally intensive when using conventional high-fidelity solvers. To address this challenge, we propose a…

September 4, 2025
Scale-Adaptive Generative Flows for Multiscale Scientific Data

arXiv:2509.02971v1 Announce Type: new Abstract: Flow-based generative models can face significant challenges when modeling scientific data with multiscale Fourier spectra, often producing large errors in fine-scale features. We address this problem within the framework of stochastic interpolants, via principled design of noise distributions and interpolation schedules. The key…

September 4, 2025
Bayesian Additive Regression Trees for functional ANOVA model

arXiv:2509.03317v1 Announce Type: new Abstract: Bayesian Additive Regression Trees (BART) is a powerful statistical model that leverages the strengths of Bayesian inference and regression trees. It has received significant attention for capturing complex non-linear relationships and interactions among predictors. However, the accuracy of BART often comes at…

September 4, 2025
Understanding and Improving the Shampoo Optimizer via Kullback-Leibler Minimization

arXiv:2509.03378v1 Announce Type: new Abstract: As an adaptive method, Shampoo employs a structured second-moment estimation, and its effectiveness has attracted growing attention. Prior work has primarily analyzed its estimation scheme through the Frobenius norm. Motivated by the natural connection between the second moment and a covariance…

September 4, 2025
Useful Python Libraries You Might Not Have Heard Of: Freezegun

Bring time to a standstill in your Python tests. Originally posted on Towards Data Science by Thomas Reid.

September 4, 2025
AI FOMO, Shadow AI, and Other Business Problems

What’s the state of AI in business these days, and how much does it cost us? Originally posted on Towards Data Science by Stephanie Kirmer.

September 4, 2025
Hands On Time Series Modeling of Rare Events, with Python

How to model rare event occurrences in a time series in a few lines of code. Originally posted on Towards Data Science by Piero Paialunga.

September 4, 2025
Stochastic Differential Equations and Temperature — NASA Climate Data pt. 2

The Ornstein-Uhlenbeck process in Python. Originally posted on Towards Data Science by Marco Hening Tallarico.

September 4, 2025
What Being a Data Scientist at a Startup Really Looks Like

What I learned about growth, visibility, and chaos over the past five years. Originally posted on Towards Data Science by Yu Dong.

September 4, 2025
Simulation-based inference of yeast centromeres

arXiv:2509.00200v1 Announce Type: new Abstract: Chromatin folding and the spatial arrangement of chromosomes in the cell play a crucial role in DNA replication and gene expression. Improper chromatin folding could lead to malfunctions and, over time, diseases. For eukaryotes, centromeres are essential for proper chromosome segregation and folding.…

September 3, 2025
Assessing One-Dimensional Cluster Stability by Extreme-Point Trimming

arXiv:2509.00258v1 Announce Type: new Abstract: We develop a probabilistic method for assessing the tail behavior and geometric stability of n one-dimensional i.i.d. samples by tracking how their span contracts when the most extreme points are trimmed. Central to our approach is the diameter-shrinkage ratio, which quantifies the relative…

September 3, 2025
Probit Monotone BART

arXiv:2509.00263v1 Announce Type: new Abstract: Bayesian Additive Regression Trees (BART) of Chipman et al. (2010) has proven to be a powerful tool for nonparametric modeling and prediction. Monotone BART (Chipman et al., 2022) is a recent development that allows BART to be more precise in estimating monotonic functions. We further these developments…

September 3, 2025
The Nondecreasing Rank

arXiv:2509.00265v1 Announce Type: new Abstract: In this article the notion of the nondecreasing (ND) rank of a matrix or tensor is introduced. A tensor has an ND rank of r if it can be represented as a sum of r outer products of vectors, with each vector satisfying a monotonicity constraint. It…

September 3, 2025
Partial Functional Dynamic Backdoor Diffusion-based Causal Model

arXiv:2509.00472v1 Announce Type: new Abstract: We introduce a Partial Functional Dynamic Backdoor Diffusion-based Causal Model (PFD-BDCM), specifically designed for causal inference in the presence of unmeasured confounders with spatial heterogeneity and temporal dependency. The proposed PFD-BDCM framework addresses the restrictions of the existing approaches by uniquely integrating models…

September 3, 2025
Implementing the Caesar Cipher in Python

Julius Caesar was a Roman ruler known for his military strategies and excellent leadership. Named after him, the Caesar Cipher is a fascinating cryptographic technique that Julius Caesar employed to send secret signals and messages to his military personnel. The Caesar Cipher is quite basic in its working. It…

September 3, 2025
A Deep Dive into RabbitMQ & Python’s Celery: How to Optimise Your Queues

Key lessons I’ve learned running RabbitMQ + Celery in production. Originally posted on Towards Data Science by Clara Chong.

September 3, 2025
How to Scale Your AI Search to Handle 10M Queries with 5 Powerful Techniques

Optimize your AI search with RAG, contextual retrieval and evaluations. Originally posted on Towards Data Science by Eivind Kjosbakken.

September 3, 2025
What is Universality in LLMs? How to Find Universal Neurons

How independently trained transformers form the same neurons. Originally posted on Towards Data Science by Shuyang.

September 3, 2025
3 Greedy Algorithms for Decision Trees, Explained with Examples

Learn the inner workings of decision trees. Originally posted on Towards Data Science by Kuriko Iwai.

September 3, 2025
The Generalist: The New All-Around Type of Data Professional?

Is over-specialization ending and are data generalists on the rise? Originally posted on Towards Data Science by Loizos Loizou.

September 2, 2025
Quantum-inspired probability metrics define a complete, universal space for statistical learning

arXiv:2508.21086v1 Announce Type: new Abstract: Comparing probability distributions is a core challenge across the natural, social, and computational sciences. Existing methods, such as Maximum Mean Discrepancy (MMD), struggle in high-dimensional and non-compact domains. Here we introduce quantum probability metrics (QPMs), derived by embedding probability…

September 1, 2025
Weighted Support Points from Random Measures: An Interpretable Alternative for Generative Modeling

arXiv:2508.21255v1 Announce Type: new Abstract: Support points summarize a large dataset through a smaller set of representative points that can be used for data operations, such as Monte Carlo integration, without requiring access to the full dataset. In this sense, support points offer…

September 1, 2025
Adaptive generative moment matching networks for improved learning of dependence structures

arXiv:2508.21531v1 Announce Type: new Abstract: An adaptive bandwidth selection procedure for the mixture kernel in the maximum mean discrepancy (MMD) for fitting generative moment matching networks (GMMNs) is introduced, and its ability to improve the learning of copula random number generators is demonstrated. Based…

September 1, 2025
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks

arXiv:2508.21146v1 Announce Type: cross Abstract: Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and…

September 1, 2025
BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

arXiv:2508.21184v1 Announce Type: cross Abstract: We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to…

September 1, 2025
Weekly Entering & Transitioning – Thread 01 Sep, 2025 – 08 Sep, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

September 1, 2025
How do I prepare for my data science job as a new grad?

I just graduated with my bachelor’s in May. Recently, I’ve been fortunate enough to receive an offer as a Data Scientist I at a unicorn where most of the people on the DS team have PhDs. My job starts in a month…

September 1, 2025
Let’s Build Something Together

Hey everyone, after my last post about my struggles in finding a remote job, I was honestly blown away. I got over 50 messages, not with job offers but with stories, frustrations, and suggestions. The common theme? Many of us are stuck. Some are trying to break into the market, others…

September 1, 2025
Advice for DS/AS/MLE interviews

I am looking for data scientist (ML heavy), applied scientist or ML engineer roles in product-based companies. For my interview preparation, I am unsure about which book or resources to pick so that I can cover the rigor of ML rounds in these interviews. I have a background in CS and…

September 1, 2025
Career Dilemma

Submitted by /u/NervousVictory1792.

September 1, 2025
How to Develop a Bilingual Voice Assistant

Exploring ways to make voice assistants more personal. Originally posted on Towards Data Science by Deepak Krishnamurthy.

September 1, 2025
The Machine Learning Lessons I’ve Learned This Month

August 2025: logging, lab notebooks, overnight runs. Originally posted on Towards Data Science by Pascal Janetzky.

September 1, 2025
Understanding Matrices | Part 4: Matrix Inverse

The physical meaning of matrix inversion, related formulas, and how inversion behaves on several special types of matrices. Originally posted on Towards Data Science by Tigran Hayrapetyan.

August 31, 2025
Crafting a Custom Voice Assistant with Perplexity

How to build a fully functional, hands-free voice assistant on a Raspberry Pi. Originally posted on Towards Data Science by Deepak Krishnamurthy.

August 31, 2025
Marginal Effect of Hyperparameter Tuning with XGBoost

Demystifying Bayesian hyperparameter optimization and comparing hyperparameter tuning paradigms. Originally posted on Towards Data Science by Noah Swan.

August 30, 2025
Toward Digital Well-Being: Using Generative AI to Detect and Mitigate Bias in Social Networks

This research answered the question: How can machine learning and artificial intelligence help us to unlearn bias? Originally posted on Towards Data Science by Celia Banks…

August 30, 2025
Unlocking Multimodal Video Transcription with Gemini

Explore how to transcribe videos with speaker identification in a single prompt. Originally posted on Towards Data Science by Laurent Picard.

August 30, 2025
How to Import Pre-Annotated Data into Label Studio and Run the Full Stack with Docker

From VOC to JSON: Importing pre-annotations made simple. The post How to Import Pre-Annotated Data into Label Studio and Run the Full Stack with Docker appeared first on Towards Data Science. Yagmur Gulec

August 30, 2025
Stochastic Gradients under Nuisances

arXiv:2508.20326v1 Announce Type: new Abstract: Stochastic gradient optimization is the dominant learning paradigm for a variety of scenarios, from classical supervised learning to modern self-supervised learning. We consider stochastic gradient algorithms for learning problems whose objectives rely on unknown nuisance parameters, and establish non-asymptotic convergence guarantees. Our results show that, while…

August 29, 2025
Towards Trustworthy Amortized Bayesian Model Comparison

arXiv:2508.20614v1 Announce Type: new Abstract: Amortized Bayesian model comparison (BMC) enables fast probabilistic ranking of models via simulation-based training of neural surrogates. However, the reliability of neural surrogates deteriorates when simulation models are misspecified – the very case where model comparison is most needed. Thus, we supplement simulation-based training…

August 29, 2025
Polynomial Chaos Expansion for Operator Learning

arXiv:2508.20886v1 Announce Type: new Abstract: Operator learning (OL) has emerged as a powerful tool in scientific machine learning (SciML) for approximating mappings between infinite-dimensional functional spaces. One of its main applications is learning the solution operator of partial differential equations (PDEs). While much of the progress in this area…

August 29, 2025
Transfer Learning for Classification under Decision Rule Drift with Application to Optimal Individualized Treatment Rule Estimation

arXiv:2508.20942v1 Announce Type: new Abstract: In this paper, we extend the transfer learning classification framework from regression function-based methods to decision rules. We propose a novel methodology for modeling posterior drift through Bayes decision rules. By exploiting the geometric…

August 29, 2025
Discovering equations from data: symbolic regression in dynamical systems

arXiv:2508.20257v1 Announce Type: cross Abstract: The process of discovering equations from data lies at the heart of physics and in many other areas of research, including mathematical ecology and epidemiology. Recently, machine learning methods known as symbolic regression have automated this process. As several methods are…

August 29, 2025
Implementing the Hangman Game in Python

A beginner-friendly project to understand variables, loops, and conditions in Python. The post Implementing the Hangman Game in Python appeared first on Towards Data Science. Mahnoor Javed

August 29, 2025
Stepwise Selection Made Simple: Improve Your Regression Models in Python

Dimensionality reduction in linear regression: classical stepwise methods and a Python application on real-world data. The post Stepwise Selection Made Simple: Improve Your Regression Models in Python appeared first on Towards Data Science. JUNIOR JUMBONG

August 29, 2025
Graph Coloring for Data Science: A Comprehensive Guide

From theoretical puzzles to practical applications. The post Graph Coloring for Data Science: A Comprehensive Guide appeared first on Towards Data Science. Chinmay Kakatkar

August 29, 2025
A Visual Guide to Tuning Decision-Tree Hyperparameters

How hyperparameter tuning visually changes decision trees. The post A Visual Guide to Tuning Decision-Tree Hyperparameters appeared first on Towards Data Science. James Gibbins

August 29, 2025
Air for Tomorrow: Why Openness in Air Quality Research and Implementation Matters for Global Equity

Understand how open source can help you unravel air quality. The post Air for Tomorrow: Why Openness in Air Quality Research and Implementation Matters for Global Equity appeared first on Towards Data Science. Prithviraj Pramanik

August 29, 2025
Fractal Flow: Hierarchical and Interpretable Normalizing Flow via Topic Modeling and Recursive Strategy

arXiv:2508.19750v1 Announce Type: new Abstract: Normalizing Flows provide a principled framework for high-dimensional density estimation and generative modeling by constructing invertible transformations with tractable Jacobian determinants. We propose Fractal Flow, a novel normalizing flow architecture that enhances both expressiveness and interpretability through…

August 28, 2025
Conditional Normalizing Flow Surrogate for Monte Carlo Prediction of Radiative Properties in Nanoparticle-Embedded Layers

arXiv:2508.19841v1 Announce Type: new Abstract: We present a probabilistic, data-driven surrogate model for predicting the radiative properties of nanoparticle embedded scattering media. The model uses conditional normalizing flows, which learn the conditional distribution of optical outputs, including reflectance, absorbance, and transmittance,…

August 28, 2025
The Information Dynamics of Generative Diffusion

arXiv:2508.19897v1 Announce Type: new Abstract: Generative diffusion models have emerged as a powerful class of models in machine learning, yet a unified theoretical understanding of their operation is still developing. This perspective paper provides an integrated perspective on generative diffusion by connecting their dynamic, information-theoretic, and thermodynamic properties under…

August 28, 2025
Track Component Failure Detection Using Data Analytics over existing STDS Track Circuit data

arXiv:2508.11693v1 Announce Type: cross Abstract: Track Circuits (TC) are the main signalling devices used to detect the presence of a train on a rail track. It has been used since the 19th century and nowadays there are many types depending on the…

August 28, 2025
Physics-Informed Regression: Parameter Estimation in Parameter-Linear Nonlinear Dynamic Models

arXiv:2508.19249v1 Announce Type: cross Abstract: We present a new efficient hybrid parameter estimation method based on the idea, that if nonlinear dynamic models are stated in terms of a system of equations that is linear in terms of the parameters, then regularized ordinary least squares can…

August 28, 2025
Get AI-Ready: How to Prepare for a World of Agentic AI as Tech Professionals

Explore how Agentic AI is reshaping tech careers, from data to decision-making, and how professionals can prepare for the future of work. The post Get AI-Ready: How to Prepare for a World of Agentic AI as Tech Professionals appeared first…

August 28, 2025
Everything I Studied to Become a Machine Learning Engineer (No CS Background)

The books, courses, and resources I used in my journey. The post Everything I Studied to Become a Machine Learning Engineer (No CS Background) appeared first on Towards Data Science. Egor Howell

August 28, 2025
Time Series Forecasting Made Simple (Part 4.1): Understanding Stationarity in a Time Series

An intuitive guide to stationarity in a time series. The post Time Series Forecasting Made Simple (Part 4.1): Understanding Stationarity in a Time Series appeared first on Towards Data Science. Nikhil Dasari

August 28, 2025
A Brief History of GPT Through Papers

Language models are becoming really good. But where did they come from? The post A Brief History of GPT Through Papers appeared first on Towards Data Science. Rohit Pandey

August 28, 2025
The Math You Need to Pan and Tilt 360° Images

Panning a spherical image is just a horizontal roll, but tilting it vertically is much trickier. Let’s see the math! The post The Math You Need to Pan and Tilt 360° Images appeared first on Towards Data Science. Thomas Rouch

August 28, 2025
Deterministic Coreset Construction via Adaptive Sensitivity Trimming

arXiv:2508.18340v1 Announce Type: new Abstract: We develop a rigorous framework for deterministic coreset construction in empirical risk minimization (ERM). Our central contribution is the Adaptive Deterministic Uniform-Weight Trimming (ADUWT) algorithm, which constructs a coreset by excising points with the lowest sensitivity bounds and applying a data-dependent uniform weight…

August 27, 2025
Revisiting Follow-the-Perturbed-Leader with Unbounded Perturbations in Bandit Problems

arXiv:2508.18604v1 Announce Type: new Abstract: Follow-the-Regularized-Leader (FTRL) policies have achieved Best-of-Both-Worlds (BOBW) results in various settings through hybrid regularizers, whereas analogous results for Follow-the-Perturbed-Leader (FTPL) remain limited due to inherent analytical challenges. To advance the analytical foundations of FTPL, we revisit classical FTRL-FTPL duality for unbounded perturbations…

August 27, 2025
Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits

arXiv:2508.18768v1 Announce Type: new Abstract: We introduce the first best-of-both-worlds algorithm for contextual combinatorial semi-bandits that simultaneously guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial regime and $\widetilde{\mathcal{O}}(\ln T)$ regret in the corrupted stochastic regime. Our approach builds on the Follow-the-Regularized-Leader (FTRL) framework equipped with a Shannon entropy regularizer, yielding…

August 27, 2025