Category: aimlds

  • 3 Steps to Context Engineering a Crystal-Clear Project

    3 Steps to Context Engineering a Crystal-Clear Project Learn three easy steps for gaining an intelligent picture for any project by using the skill of context engineering. The post 3 Steps to Context Engineering a Crystal-Clear Project appeared first on Towards Data Science. Kory Becker

  • The Power of Building from Scratch

    The Power of Building from Scratch Mauro Di Pietro discusses building AI agents with open-source tools, bridging theory and practice, and why he’s still nostalgic for scikit-learn. The post The Power of Building from Scratch appeared first on Towards Data Science. TDS Editors

  • TaylorPODA: A Taylor Expansion-Based Method to Improve Post-Hoc Attributions for Opaque Models

    TaylorPODA: A Taylor Expansion-Based Method to Improve Post-Hoc Attributions for Opaque Models arXiv:2507.10643v1 Announce Type: new Abstract: Existing post-hoc model-agnostic methods generate external explanations for opaque models, primarily by locally attributing the model output to its input features. However, they often lack an explicit and systematic framework for quantifying the contribution of individual features. Building…

  • Robust Multi-Manifold Clustering via Simplex Paths

    Robust Multi-Manifold Clustering via Simplex Paths arXiv:2507.10710v1 Announce Type: new Abstract: This article introduces a novel, geometric approach for multi-manifold clustering (MMC), i.e. for clustering a collection of potentially intersecting, d-dimensional manifolds into the individual manifold components. We first compute a locality graph on d-simplices, using the dihedral angle in between adjacent simplices as the…

  • GOLFS: Feature Selection via Combining Both Global and Local Information for High Dimensional Clustering

    GOLFS: Feature Selection via Combining Both Global and Local Information for High Dimensional Clustering arXiv:2507.10956v1 Announce Type: new Abstract: It is important to identify the discriminative features for high dimensional clustering. However, due to the lack of cluster labels, the regularization methods developed for supervised feature selection cannot be directly applied. To learn the…

  • Interpretable Bayesian Tensor Network Kernel Machines with Automatic Rank and Feature Selection

    Interpretable Bayesian Tensor Network Kernel Machines with Automatic Rank and Feature Selection arXiv:2507.11136v1 Announce Type: new Abstract: Tensor Network (TN) Kernel Machines speed up model learning by representing parameters as low-rank TNs, reducing computation and memory use. However, most TN-based Kernel methods are deterministic and ignore parameter uncertainty. Further, they require manual tuning of model…

  • How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction

    How does Labeling Error Impact Contrastive Learning? A Perspective from Data Dimensionality Reduction arXiv:2507.11161v1 Announce Type: new Abstract: In recent years, contrastive learning has achieved state-of-the-art performance in the territory of self-supervised representation learning. Many previous works have attempted to provide the theoretical understanding underlying the success of contrastive learning. Almost all of them rely…

  • Do You Really Need a Foundation Model?

    Do You Really Need a Foundation Model? LLM or custom model: how should you choose the right solution? The post Do You Really Need a Foundation Model? appeared first on Towards Data Science. Vincent Vandenbussche

  • How to Ensure Reliability in LLM Applications

    How to Ensure Reliability in LLM Applications Learn how to make your LLM applications more robust. The post How to Ensure Reliability in LLM Applications appeared first on Towards Data Science. Eivind Kjosbakken

  • How Metrics (and LLMs) Can Trick You: A Field Guide to Paradoxes

    How Metrics (and LLMs) Can Trick You: A Field Guide to Paradoxes When numbers lie and your metrics mislead you. The post How Metrics (and LLMs) Can Trick You: A Field Guide to Paradoxes appeared first on Towards Data Science. Subha Ganapathi

  • Deploy a Streamlit App to AWS

    Deploy a Streamlit App to AWS Using the Elastic Beanstalk service. The post Deploy a Streamlit App to AWS appeared first on Towards Data Science. Thomas Reid

  • From Equal Weights to Smart Weights: OTPO’s Approach to Better LLM Alignment

    From Equal Weights to Smart Weights: OTPO’s Approach to Better LLM Alignment Using optimal transport to weight what matters most in LLM-generated responses. The post From Equal Weights to Smart Weights: OTPO’s Approach to Better LLM Alignment appeared first on Towards Data Science. Sudheer Singh

  • The Bayesian Approach to Continual Learning: An Overview

    The Bayesian Approach to Continual Learning: An Overview arXiv:2507.08922v1 Announce Type: new Abstract: Continual learning is an online paradigm where a learner continually accumulates knowledge from different tasks encountered over sequential time steps. Importantly, the learner is required to extend and update its knowledge without forgetting about the learning experience acquired from the past, and…

  • Physics-informed machine learning: A mathematical framework with applications to time series forecasting

    Physics-informed machine learning: A mathematical framework with applications to time series forecasting arXiv:2507.08906v1 Announce Type: new Abstract: Physics-informed machine learning (PIML) is an emerging framework that integrates physical knowledge into machine learning models. This physical prior often takes the form of a partial differential equation (PDE) system that the regression function must satisfy. In the…

  • Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization

    Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization arXiv:2507.09093v1 Announce Type: new Abstract: We study convergence in high-probability of SGD-type methods in non-convex optimization and the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts.…

  • Fixed-Confidence Multiple Change Point Identification under Bandit Feedback

    Fixed-Confidence Multiple Change Point Identification under Bandit Feedback arXiv:2507.08994v1 Announce Type: new Abstract: Piecewise constant functions describe a variety of real-world phenomena in domains ranging from chemistry to manufacturing. In practice, it is often required to confidently identify the locations of the abrupt changes in these functions as quickly as possible. For this, we introduce…

  • CoVAE: Consistency Training of Variational Autoencoders

    CoVAE: Consistency Training of Variational Autoencoders arXiv:2507.09103v1 Announce Type: new Abstract: Current state-of-the-art generative approaches frequently rely on a two-stage training procedure, where an autoencoder (often a VAE) first performs dimensionality reduction, followed by training a generative model on the learned latent space. While effective, this introduces computational overhead and increased sampling times. We challenge…

  • What Can the History of Data Tell Us About the Future of AI?

    What Can the History of Data Tell Us About the Future of AI? A 40-Year Look at Data, Business Models, and the Forces Shaping Intelligent Systems. The post What Can the History of Data Tell Us About the Future of AI? appeared first on Towards Data Science. Steve Hedden

  • Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need

    Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need A deep dive into advanced evaluation for data scientists. The post Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need appeared first on Towards Data Science. Pol Marin

  • Topic Model Labelling with LLMs

    Topic Model Labelling with LLMs Python tutorial for reproducible labeling of cutting-edge topic models with GPT-4o mini. The post Topic Model Labelling with LLMs appeared first on Towards Data Science. Petr Koráb

  • There and Back Again: An AI Career Journey

    There and Back Again: An AI Career Journey A full circle moment 30 years in the making. The post There and Back Again: An AI Career Journey appeared first on Towards Data Science. David Martin

  • Dynamic Inventory Optimization with Censored Demand

    Dynamic Inventory Optimization with Censored Demand A sequential decision framework with Bayesian learning. The post Dynamic Inventory Optimization with Censored Demand appeared first on Towards Data Science. Mert Ersoz

  • Mallows Model with Learned Distance Metrics: Sampling and Maximum Likelihood Estimation

    Mallows Model with Learned Distance Metrics: Sampling and Maximum Likelihood Estimation arXiv:2507.08108v1 Announce Type: new Abstract: \textit{Mallows model} is a widely-used probabilistic framework for learning from ranking data, with applications ranging from recommendation systems and voting to aligning language models with human preferences~\cite{chen2024mallows, kleinberg2021algorithmic, rafailov2024direct}. Under this model, observed rankings are noisy perturbations of a…

  • CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk

    CLEAR: Calibrated Learning for Epistemic and Aleatoric Risk arXiv:2507.08150v1 Announce Type: new Abstract: Accurate uncertainty quantification is critical for reliable predictive modeling, especially in regression tasks. Existing methods typically address either aleatoric uncertainty from measurement noise or epistemic uncertainty from limited data, but not necessarily both in a balanced way. We propose CLEAR, a calibration…

  • MIRRAMS: Towards Training Models Robust to Missingness Distribution Shifts

    MIRRAMS: Towards Training Models Robust to Missingness Distribution Shifts arXiv:2507.08280v1 Announce Type: new Abstract: In real-world data analysis, missingness distributional shifts between training and test input datasets frequently occur, posing a significant challenge to achieving robust prediction performance. In this study, we propose a novel deep learning framework designed to address such shifts in missingness…

  • Admissibility of Stein Shrinkage for Batch Normalization in the Presence of Adversarial Attacks

    Admissibility of Stein Shrinkage for Batch Normalization in the Presence of Adversarial Attacks arXiv:2507.08261v1 Announce Type: new Abstract: Batch normalization (BN) is a ubiquitous operation in deep neural networks used primarily to achieve stability and regularization during network training. BN involves feature map centering and scaling using sample means and variances, respectively. Since these statistics…

  • Optimal and Practical Batched Linear Bandit Algorithm

    Optimal and Practical Batched Linear Bandit Algorithm arXiv:2507.08438v1 Announce Type: new Abstract: We study the linear bandit problem under limited adaptivity, known as the batched linear bandit. While existing approaches can achieve near-optimal regret in theory, they are often computationally prohibitive or underperform in practice. We propose \texttt{BLAE}, a novel batched algorithm that integrates arm…

  • Weekly Entering & Transitioning – Thread 14 Jul, 2025 – 21 Jul, 2025

    Weekly Entering & Transitioning – Thread 14 Jul, 2025 – 21 Jul, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

  • How much DSA for FAANG+ ?

    How much DSA for FAANG+ ? Hello all, I am going to be graduating in 6 months and have been practicing Leetcode as I believe this to be my weakest point. I have solved 250 LC with 130 Easy and 120 Hard, covering concepts like arrays, hashing, binary trees, SQL, linked list, two pointers, stack,…

  • Toto: A Foundation Time-Series Model Optimized for Observability Data

    Toto: A Foundation Time-Series Model Optimized for Observability Data Datadog open-sourced Toto (Time Series Optimized Transformer for Observability), a model purpose-built for observability data. Toto is currently the most extensively pretrained time-series foundation model: The pretraining corpus contains 2.36 trillion tokens, with ~70% coming from Datadog’s private telemetry dataset. Also, Toto currently ranks 2nd in…

  • How do you efficiently traverse hundreds of features in the dataset?

    How do you efficiently traverse hundreds of features in the dataset? Currently, working on a fintech classification algorithm with close to a thousand features, which is very tiresome. I’m not a domain expert, so creating sensible hypotheses is difficult. How do you tackle EDA and forming reasonable hypotheses in these cases? Even with proper documentation…

  • The right questions to find clusters (tangles)

    The right questions to find clusters (tangles) Hey everyone, I’m currently working on my bachelor’s thesis and I’m hitting a creative block on a central part – maybe you have some ideas or impulses for me. My dataset consists of 100,000 cleaned job postings from Kaggle (title + description). The goal of my thesis is…

  • Are You Being Unfair to LLMs?

    Are You Being Unfair to LLMs? They may deserve better. The post Are You Being Unfair to LLMs? appeared first on Towards Data Science. Julian Mendel

  • Hitchhiker’s Guide to RAG: From Tiny Files to Tolstoy with OpenAI’s API and LangChain

    Hitchhiker’s Guide to RAG: From Tiny Files to Tolstoy with OpenAI’s API and LangChain Scaling a simple RAG pipeline from simple notes to full books. The post Hitchhiker’s Guide to RAG: From Tiny Files to Tolstoy with OpenAI’s API and LangChain appeared first on Towards Data Science. Maria Mouschoutzi

  • Topological Machine Learning with Unreduced Persistence Diagrams

    Topological Machine Learning with Unreduced Persistence Diagrams arXiv:2507.07156v1 Announce Type: new Abstract: Supervised machine learning pipelines trained on features derived from persistent homology have been experimentally observed to ignore much of the information contained in a persistence diagram. Computing persistence diagrams is often the most computationally demanding step in such a pipeline, however. To explore…

  • Class conditional conformal prediction for multiple inputs by p-value aggregation

    Class conditional conformal prediction for multiple inputs by p-value aggregation arXiv:2507.07150v1 Announce Type: new Abstract: Conformal prediction methods are statistical tools designed to quantify uncertainty and generate predictive sets with guaranteed coverage probabilities. This work introduces an innovative refinement to these methods for classification tasks, specifically tailored for scenarios where multiple observations (multi-inputs) of a…

  • Bayesian Double Descent

    Bayesian Double Descent arXiv:2507.07338v1 Announce Type: new Abstract: Double descent is a phenomenon of over-parameterized statistical models. Our goal is to view double descent from a Bayesian perspective. Over-parameterized models such as deep neural networks have an interesting re-descending property in their risk characteristics. This is a recent phenomenon in machine learning and has been…

  • Hess-MC2: Sequential Monte Carlo Squared using Hessian Information and Second Order Proposals

    Hess-MC2: Sequential Monte Carlo Squared using Hessian Information and Second Order Proposals arXiv:2507.07461v1 Announce Type: new Abstract: When performing Bayesian inference using Sequential Monte Carlo (SMC) methods, two considerations arise: the accuracy of the posterior approximation and computational efficiency. To address computational demands, Sequential Monte Carlo Squared (SMC$^2$) is well-suited for high-performance computing (HPC) environments.…

  • Galerkin-ARIMA: A Two-Stage Polynomial Regression Framework for Fast Rolling One-Step-Ahead Forecasting

    Galerkin-ARIMA: A Two-Stage Polynomial Regression Framework for Fast Rolling One-Step-Ahead Forecasting arXiv:2507.07469v1 Announce Type: new Abstract: Time-series models like ARIMA remain widely used for forecasting but are limited by linear assumptions and high computational cost in large and complex datasets. We propose Galerkin-ARIMA, which generalizes the AR component of ARIMA and replaces it with a flexible…

  • Building a Custom MCP Chatbot

    Building a Custom MCP Chatbot Understanding all the details of the model context protocol. The post Building a Custom MCP Chatbot appeared first on Towards Data Science. Mariya Mansurova

  • Reducing Time to Value for Data Science Projects: Part 3

    Reducing Time to Value for Data Science Projects: Part 3 Setting up a robust experimentation process. The post Reducing Time to Value for Data Science Projects: Part 3 appeared first on Towards Data Science. Kristopher McGlinchey

  • Scene Understanding in Action: Real-World Validation of Multimodal AI Integration

    Scene Understanding in Action: Real-World Validation of Multimodal AI Integration A deep dive into real-world case studies: from indoor space and urban streets to world-famous landmarks. The post Scene Understanding in Action: Real-World Validation of Multimodal AI Integration appeared first on Towards Data Science. Eric Chung

  • Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare

    Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare How metrics and monitoring combine with human expertise to build trustworthy AI in healthcare. The post Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare appeared first on Towards Data Science. Robert Martin-Short

  • On the Hardness of Unsupervised Domain Adaptation: Optimal Learners and Information-Theoretic Perspective

    On the Hardness of Unsupervised Domain Adaptation: Optimal Learners and Information-Theoretic Perspective arXiv:2507.06552v1 Announce Type: new Abstract: This paper studies the hardness of unsupervised domain adaptation (UDA) under covariate shift. We model the uncertainty that the learner faces by a distribution $\pi$ in the ground-truth triples $(p, q, f)$, which we call a UDA…

  • Semi-parametric Functional Classification via Path Signatures Logistic Regression

    Semi-parametric Functional Classification via Path Signatures Logistic Regression arXiv:2507.06637v1 Announce Type: new Abstract: We propose Path Signatures Logistic Regression (PSLR), a semi-parametric framework for classifying vector-valued functional data with scalar covariates. Classical functional logistic regression models rely on linear assumptions and fixed basis expansions, which limit flexibility and degrade performance under irregular sampling. PSLR overcomes…

  • Fast Gaussian Processes under Monotonicity Constraints

    Fast Gaussian Processes under Monotonicity Constraints arXiv:2507.06677v1 Announce Type: new Abstract: Gaussian processes (GPs) are widely used as surrogate models for complicated functions in scientific and engineering applications. In many cases, prior knowledge about the function to be approximated, such as monotonicity, is available and can be leveraged to improve model fidelity. Incorporating such constraints…

  • Conformal Prediction for Long-Tailed Classification

    Conformal Prediction for Long-Tailed Classification arXiv:2507.06867v1 Announce Type: new Abstract: Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii)…

  • Adaptive collaboration for online personalized distributed learning with heterogeneous clients

    Adaptive collaboration for online personalized distributed learning with heterogeneous clients arXiv:2507.06844v1 Announce Type: new Abstract: We study the problem of online personalized decentralized learning with $N$ statistically heterogeneous clients collaborating to accelerate local training. An important challenge in this setting is to select relevant collaborators to reduce gradient variance while mitigating the introduced bias. To…

  • The Crucial Role of NUMA Awareness in High-Performance Deep Learning

    The Crucial Role of NUMA Awareness in High-Performance Deep Learning PyTorch model performance analysis and optimization, Part 10. The post The Crucial Role of NUMA Awareness in High-Performance Deep Learning appeared first on Towards Data Science. Chaim Rand

  • Work Data Is the Next Frontier for GenAI

    Work Data Is the Next Frontier for GenAI 9 reasons why work data is the single most valuable data source for LLM training, uniquely capable of propelling LLM performance to unprecedented heights. The post Work Data Is the Next Frontier for GenAI appeared first on Towards Data Science. Zsombor Varnagy-Toth

  • Recap of all types of LLM Agents

    Recap of all types of LLM Agents Regular, ReAct, Chain-of-Thought, Reflexion, ToT, GoT, PoT. The post Recap of all types of LLM Agents appeared first on Towards Data Science. Mauro Di Pietro

  • AI Agents Are Shaping the Future of Work Task by Task, Not Job by Job

    AI Agents Are Shaping the Future of Work Task by Task, Not Job by Job What two groundbreaking studies reveal about the future of human-AI collaboration, and the enterprise playbook for thriving in the AI agent era. The post AI Agents Are Shaping the Future of Work Task by Task, Not Job by Job appeared…

  • How to Perform Effective Data Cleaning for Machine Learning

    How to Perform Effective Data Cleaning for Machine Learning Learn how you can improve your machine learning models using effective data cleaning. The post How to Perform Effective Data Cleaning for Machine Learning appeared first on Towards Data Science. Eivind Kjosbakken

  • Enjoying Non-linearity in Multinomial Logistic Bandits

    Enjoying Non-linearity in Multinomial Logistic Bandits arXiv:2507.05306v1 Announce Type: new Abstract: We consider the multinomial logistic bandit problem, a variant of generalized linear bandits where a learner interacts with an environment by selecting actions to maximize expected rewards based on probabilistic feedback from multiple possible outcomes. In the binary setting, recent work has focused on…

  • Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting

    Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting arXiv:2507.05470v1 Announce Type: new Abstract: We propose Temporal Conformal Prediction (TCP), a novel framework for constructing prediction intervals in financial time-series with guaranteed finite-sample validity. TCP integrates quantile regression with a conformal calibration layer that adapts online via a decaying…

  • A Malliavin calculus approach to score functions in diffusion generative models

    A Malliavin calculus approach to score functions in diffusion generative models arXiv:2507.05550v1 Announce Type: new Abstract: Score-based diffusion generative models have recently emerged as a powerful tool for modelling complex data distributions. These models aim at learning the score function, which defines a map from a known probability distribution to the target data distribution via…

  • Property Elicitation on Imprecise Probabilities

    Property Elicitation on Imprecise Probabilities arXiv:2507.05857v1 Announce Type: new Abstract: Property elicitation studies which attributes of a probability distribution can be determined by minimising a risk. We investigate a generalisation of property elicitation to imprecise probabilities (IP). This investigation is motivated by multi-distribution learning, which takes the classical machine learning paradigm of minimising a single…

  • Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis

    Best-of-N through the Smoothing Lens: KL Divergence and Regret Analysis arXiv:2507.05913v1 Announce Type: new Abstract: A simple yet effective method for inference-time alignment of generative models is Best-of-$N$ (BoN), where $N$ outcomes are sampled from a reference policy, evaluated using a proxy reward model, and the highest-scoring one is selected. While prior work argues that…

  • How to Fine-Tune Small Language Models to Think with Reinforcement Learning

    How to Fine-Tune Small Language Models to Think with Reinforcement Learning A visual tour and from-scratch guide to train GRPO reasoning models in PyTorch. The post How to Fine-Tune Small Language Models to Think with Reinforcement Learning appeared first on Towards Data Science. Avishek Biswas

  • What I Learned in my First 18 Months as a Freelance Data Scientist

    What I Learned in my First 18 Months as a Freelance Data Scientist The taxes and health insurance edition. The post What I Learned in my First 18 Months as a Freelance Data Scientist appeared first on Towards Data Science. CJ Sullivan

  • Build Interactive Machine Learning Apps with Gradio

    Build Interactive Machine Learning Apps with Gradio Create a fun text-to-speech demo in minutes. The post Build Interactive Machine Learning Apps with Gradio appeared first on Towards Data Science. Ehssan Khan

  • Microsoft’s Revolutionary Diagnostic Medical AI, Explained

    Microsoft’s Revolutionary Diagnostic Medical AI, Explained Microsoft’s latest paper discusses a path to medical superintelligence. How close are we, really? The post Microsoft’s Revolutionary Diagnostic Medical AI, Explained appeared first on Towards Data Science. Ryan D’Cunha

  • Beyond SEO: A Transformer-Based Approach for Reinventing Web Content Optimisation

    Beyond SEO: A Transformer-Based Approach for Reinventing Web Content Optimisation arXiv:2507.03169v1 Announce Type: new Abstract: The rise of generative AI search engines is disrupting traditional SEO, with Gartner predicting 25% reduction in conventional search usage by 2026. This necessitates new approaches for web content visibility in AI-driven search environments. We present a domain-specific fine-tuning approach…

  • LILI clustering algorithm: Limit Inferior Leaf Interval Integrated into Causal Forest for Causal Interference

    LILI clustering algorithm: Limit Inferior Leaf Interval Integrated into Causal Forest for Causal Interference arXiv:2507.03271v1 Announce Type: new Abstract: Causal forest methods are powerful tools in causal inference. Similar to traditional random forest in machine learning, causal forest independently considers each causal tree. However, this independence consideration increases the likelihood that classification errors in one…

  • Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data

    Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data arXiv:2507.03681v1 Announce Type: new Abstract: Randomized trials are typically designed to detect average treatment effects but often lack the statistical power to uncover effect heterogeneity over patient characteristics, limiting their value for personalized decision-making. To address this, we propose the QR-learner, a model-agnostic…

  • Determination of Particle-Size Distributions from Light-Scattering Measurement Using Constrained Gaussian Process Regression

    Determination of Particle-Size Distributions from Light-Scattering Measurement Using Constrained Gaussian Process Regression arXiv:2507.03736v1 Announce Type: new Abstract: In this work, we propose a novel methodology for robustly estimating particle size distributions from optical scattering measurements using constrained Gaussian process regression. The estimation of particle size distributions is commonly formulated as a Fredholm integral equation of…

  • Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis

    Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis arXiv:2507.03756v1 Announce Type: new Abstract: The success of denoising diffusion models raises important questions regarding their generalisation behaviour, particularly in high-dimensional settings. Notably, it has been shown that when training and sampling are performed perfectly, these models memorise training data — implying that some form of…

  • Run Your Python Code up to 80x Faster Using the Cython Library

    Run Your Python Code up to 80x Faster Using the Cython Library A four-step plan for C language speed where it matters most. The post Run Your Python Code up to 80x Faster Using the Cython Library appeared first on Towards Data Science. Thomas Reid

  • The Five-Second Fingerprint: Inside Shazam’s Instant Song ID

    The Five-Second Fingerprint: Inside Shazam’s Instant Song ID How Shazam recognizes songs in seconds. The post The Five-Second Fingerprint: Inside Shazam’s Instant Song ID appeared first on Towards Data Science. Ashton Gribble

  • Your Personal Analytics Toolbox

    Your Personal Analytics Toolbox Leveraging MCP for automating your daily routine. The post Your Personal Analytics Toolbox appeared first on Towards Data Science. Mariya Mansurova

  • POSET Representations in Python Can Have a Huge Impact on Business

    POSET Representations in Python Can Have a Huge Impact on Business Discover how POSET indicators transform data into coherent scoring systems, enabling meaningful comparisons while preserving the data’s multi-dimensional semantic structure. The post POSET Representations in Python Can Have a Huge Impact on Business appeared first on Towards Data Science. Andrea D’Agostino

  • Build Algorithm-Agnostic ML Pipelines in a Breeze

    Build Algorithm-Agnostic ML Pipelines in a Breeze The framework is now an open-source Python package for streamlined ML workflows. The post Build Algorithm-Agnostic ML Pipelines in a Breeze appeared first on Towards Data Science. Mena Wang

  • Weekly Entering & Transitioning – Thread 07 Jul, 2025 – 14 Jul, 2025

    Weekly Entering & Transitioning – Thread 07 Jul, 2025 – 14 Jul, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

  • Long-timers at companies — what’s your secret?

    Long-timers at companies — what’s your secret? Hi everyone, I’ve been a job hopper throughout my career—never stayed at one place for more than 1-2 years, usually for various reasons. Now, I’m entering a phase where I want to get more settled. I’m about to start a new job and would love to hear from…

  • Reliable DS Adjacent Fields Hiring for Bachelor’s Degree?

    Reliable DS Adjacent Fields Hiring for Bachelor’s Degree? Hello all. To try and condense a lot of context for this question, I am an adult who went back to school to complete my bachelor’s, in order to support myself and my partner on one income. Admittedly, I did this because I heard how good data…

  • A Brief Guide to UV

    A Brief Guide to UV Python has been largely devoid of easy to use environment and package management tooling, with various developers employing their own cocktail of pip, virtualenv, poetry, and conda to get the job done. However, it looks like uv is rapidly emerging to be a standard in the industry, and I’m super…

  • With Generative AI looking so ominous, would there be any further research in any other domains like Computer Vision or NLP or Graph Analytics ever?

    With Generative AI looking so ominous, would there be any further research in any other domains like Computer Vision or NLP or Graph Analytics ever? So as the title suggests, the last few years have been just Generative AI all over the place. Every new research is somehow focussed towards it. So does this mean other…

  • Rethinking Data Science Interviews in the Age of AI

    Rethinking Data Science Interviews in the Age of AI How AI is transforming data science interviews—and what hiring managers and candidates should do to adapt The post Rethinking Data Science Interviews in the Age of AI appeared first on Towards Data Science. Yu Dong Go to original source

  • My Honest Advice for Aspiring Machine Learning Engineers

    My Honest Advice for Aspiring Machine Learning Engineers What it really takes to become a machine learning engineer The post My Honest Advice for Aspiring Machine Learning Engineers appeared first on Towards Data Science. Egor Howell Go to original source

  • Change-Aware Data Validation with Column-Level Lineage

    Change-Aware Data Validation with Column-Level Lineage Data transformation tools like dbt make constructing SQL data pipelines easy and systematic. But even with the added structure and clearly defined data models, pipelines can still become complex, which makes debugging issues and validating changes to data models difficult. The post Change-Aware Data Validation with Column-Level Lineage appeared…

  • Explainable Anomaly Detection with RuleFit: An Intuitive Guide

    Explainable Anomaly Detection with RuleFit: An Intuitive Guide Creating interpretable rules to characterize the identified anomalies The post Explainable Anomaly Detection with RuleFit: An Intuitive Guide appeared first on Towards Data Science. Shuai Guo Go to original source

  • Hybrid least squares for learning functions from highly noisy data

    Hybrid least squares for learning functions from highly noisy data arXiv:2507.02215v1 Announce Type: new Abstract: Motivated by the need for efficient estimation of conditional expectations, we consider a least-squares function approximation problem with heavily polluted data. Existing methods that are powerful in the small noise regime are suboptimal when large noise is present. We propose…

  • Adaptive Iterative Soft-Thresholding Algorithm with the Median Absolute Deviation

    Adaptive Iterative Soft-Thresholding Algorithm with the Median Absolute Deviation arXiv:2507.02084v1 Announce Type: new Abstract: The adaptive Iterative Soft-Thresholding Algorithm (ISTA) has been a popular algorithm for finding a desirable solution to the LASSO problem without explicitly tuning the regularization parameter $\lambda$. Despite that the adaptive ISTA is a successful practical algorithm, few theoretical results exist.…

  • Transfer Learning for Matrix Completion

    Transfer Learning for Matrix Completion arXiv:2507.02248v1 Announce Type: new Abstract: In this paper, we explore the knowledge transfer under the setting of matrix completion, which aims to enhance the estimation of a low-rank target matrix with auxiliary data available. We propose a transfer learning procedure given prior information on which source datasets are favorable. We…

  • It’s Hard to Be Normal: The Impact of Noise on Structure-agnostic Estimation

    It’s Hard to Be Normal: The Impact of Noise on Structure-agnostic Estimation arXiv:2507.02275v1 Announce Type: new Abstract: Structure-agnostic causal inference studies how well one can estimate a treatment effect given black-box machine learning estimates of nuisance functions (like the impact of confounders on treatment and outcomes). Here, we find that the answer depends in a…

  • Sparse Gaussian Processes: Structured Approximations and Power-EP Revisited

    Sparse Gaussian Processes: Structured Approximations and Power-EP Revisited arXiv:2507.02377v1 Announce Type: new Abstract: Inducing-point-based sparse variational Gaussian processes have become the standard workhorse for scaling up GP models. Recent advances show that these methods can be improved by introducing a diagonal scaling matrix to the conditional posterior density given the inducing points. This paper first…

  • Fairness Pruning: Precision Surgery to Reduce Bias in LLMs

    Fairness Pruning: Precision Surgery to Reduce Bias in LLMs From unjustified shootings to neutral stories: how to fix toxic narratives with selective pruning The post Fairness Pruning: Precision Surgery to Reduce Bias in LLMs appeared first on Towards Data Science. Pere Martra Go to original source

  • GraphRAG in Action: A Simple Agent for Know-Your-Customer Investigations

    GraphRAG in Action: A Simple Agent for Know-Your-Customer Investigations This blog post provides a hands-on guide for AI engineers and developers on how to build an initial KYC agent prototype with the OpenAI Agents SDK. We’ll explore how to equip our agent with a suite of tools (including MCP Server tools) to uncover and investigate potential…

  • Asymptotic convexity of wide and shallow neural networks

    Asymptotic convexity of wide and shallow neural networks arXiv:2507.01044v1 Announce Type: new Abstract: For a simple model of shallow and wide neural networks, we show that the epigraph of its input-output map as a function of the network parameters approximates epigraph of a convex function in a precise sense. This leads to a plausible explanation…

  • Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles

    Parsimonious Gaussian mixture models with piecewise-constant eigenvalue profiles arXiv:2507.01542v1 Announce Type: new Abstract: Gaussian mixture models (GMMs) are ubiquitous in statistical learning, particularly for unsupervised problems. While full GMMs suffer from the overparameterization of their covariance matrices in high-dimensional spaces, spherical GMMs (with isotropic covariance matrices) certainly lack flexibility to fit certain anisotropic distributions. Connecting…

  • A generative modeling / Physics-Informed Neural Network approach to random differential equations

    A generative modeling / Physics-Informed Neural Network approach to random differential equations arXiv:2507.01687v1 Announce Type: new Abstract: The integration of Scientific Machine Learning (SciML) techniques with uncertainty quantification (UQ) represents a rapidly evolving frontier in computational science. This work advances Physics-Informed Neural Networks (PINNs) by incorporating probabilistic frameworks to effectively model uncertainty in complex systems.…

  • When Less Is More: Binary Feedback Can Outperform Ordinal Comparisons in Ranking Recovery

    When Less Is More: Binary Feedback Can Outperform Ordinal Comparisons in Ranking Recovery arXiv:2507.01613v1 Announce Type: new Abstract: Paired comparison data, where users evaluate items in pairs, play a central role in ranking and preference learning tasks. While ordinal comparison data intuitively offer richer information than binary comparisons, this paper challenges that conventional wisdom. We…

  • Proof of a perfect platonic representation hypothesis

    Proof of a perfect platonic representation hypothesis arXiv:2507.01098v1 Announce Type: cross Abstract: In this note, we elaborate on and explain in detail the proof given by Ziyin et al. (2025) of the “perfect” Platonic Representation Hypothesis (PRH) for the embedded deep linear network model (EDLN). We show that if trained with SGD, two EDLNs with…

  • Taking ResNet to the Next Level

    Taking ResNet to the Next Level Understanding how ResNeXt improves upon ResNet, with a comprehensive PyTorch implementation guide The post Taking ResNet to the Next Level appeared first on Towards Data Science. Muhammad Ardi Go to original source

  • Software Engineering in the LLM Era

    Software Engineering in the LLM Era On growing new software engineers, even when it’s inefficient The post Software Engineering in the LLM Era appeared first on Towards Data Science. Stephanie Kirmer Go to original source

  • Interactive Data Exploration for Computer Vision Projects with Rerun

    Interactive Data Exploration for Computer Vision Projects with Rerun Analyse dynamic signals in a computer vision pipeline in Python using OpenCV and Rerun The post Interactive Data Exploration for Computer Vision Projects with Rerun appeared first on Towards Data Science. Florian Trautweiler Go to original source

  • Four AI Minds in Concert: A Deep Dive into Multimodal AI Fusion

    Four AI Minds in Concert: A Deep Dive into Multimodal AI Fusion Introduction: From System Architecture to Algorithmic Execution In my previous article, I explored the architectural foundations of the VisionScout multimodal AI system, tracing its evolution from a simple object detection model into a modular framework. There, I highlighted how careful layering, module boundaries,…

  • Why We Should Focus on AI for Women

    Why We Should Focus on AI for Women A simulation study on gender disparities entrenched in AI. The post Why We Should Focus on AI for Women appeared first on Towards Data Science. Shuyang Go to original source

  • Enhancing Interpretability in Generative Modeling: Statistically Disentangled Latent Spaces Guided by Generative Factors in Scientific Datasets

    Enhancing Interpretability in Generative Modeling: Statistically Disentangled Latent Spaces Guided by Generative Factors in Scientific Datasets arXiv:2507.00298v1 Announce Type: new Abstract: This study addresses the challenge of statistically extracting generative factors from complex, high-dimensional datasets in unsupervised or semi-supervised settings. We investigate encoder-decoder-based generative models for nonlinear dimensionality reduction, focusing on disentangling low-dimensional latent variables…

  • Disentangled Feature Importance

    Disentangled Feature Importance arXiv:2507.00260v1 Announce Type: new Abstract: Feature importance quantification faces a fundamental challenge: when predictors are correlated, standard methods systematically underestimate their contributions. We prove that major existing approaches target identical population functionals under squared-error loss, revealing why they share this correlation-induced bias. To address this limitation, we introduce Disentangled Feature Importance (DFI),…