Category: aimlds

  • Backward Conformal Prediction

    Backward Conformal Prediction arXiv:2505.13732v1 Announce Type: new Abstract: We introduce $\textit{Backward Conformal Prediction}$, a method that guarantees conformal coverage while providing flexible control over the size of prediction sets. Unlike standard conformal prediction, which fixes the coverage level and allows the conformal set size to vary, our approach defines a rule that constrains how prediction…

  • What the Most Detailed Peer-Reviewed Study on AI in the Classroom Taught Us

    What the Most Detailed Peer-Reviewed Study on AI in the Classroom Taught Us The rapid proliferation and superb capabilities of widely available LLMs have ignited intense debate within the educational sector. On one side they offer students a 24/7 tutor who is always available to help; but then of course students can use LLMs to…

  • I Teach Data Viz with a Bag of Rocks

    I Teach Data Viz with a Bag of Rocks Last Thursday, my co-instructor and I showed up to the Data Visualization course we teach at the University of Washington with a bag of rocks. The bag consisted of a fairly diverse collection that I myself put together across a set of treks in various regions…

  • Optimizing Multi-Objective Problems with Desirability Functions

    Optimizing Multi-Objective Problems with Desirability Functions When working in Data Science, it is not uncommon to encounter problems with competing objectives. Whether designing products, tuning algorithms or optimizing portfolios, we often need to balance several metrics to get the best possible outcome. Sometimes, maximizing one metric comes at the expense of another, making it hard…

  • The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations

    The Stochastic Occupation Kernel (SOCK) Method for Learning Stochastic Differential Equations arXiv:2505.11622v1 Announce Type: new Abstract: We present a novel kernel-based method for learning multivariate stochastic differential equations (SDEs). The method follows a two-step procedure: we first estimate the drift term function, then the (matrix-valued) diffusion function given the drift. Occupation kernels are integral functionals…

  • Humble your Overconfident Networks: Unlearning Overfitting via Sequential Monte Carlo Tempered Deep Ensembles

    Humble your Overconfident Networks: Unlearning Overfitting via Sequential Monte Carlo Tempered Deep Ensembles arXiv:2505.11671v1 Announce Type: new Abstract: Sequential Monte Carlo (SMC) methods offer a principled approach to Bayesian uncertainty quantification but are traditionally limited by the need for full-batch gradient evaluations. We introduce a scalable variant by incorporating Stochastic Gradient Hamiltonian Monte Carlo (SGHMC)…

  • Missing Data Imputation by Reducing Mutual Information with Rectified Flows

    Missing Data Imputation by Reducing Mutual Information with Rectified Flows arXiv:2505.11749v1 Announce Type: new Abstract: This paper introduces a novel iterative method for missing data imputation that sequentially reduces the mutual information between data and their corresponding missing mask. Inspired by GAN-based approaches, which train generators to decrease the predictability of missingness patterns, our method…

  • Thompson Sampling-like Algorithms for Stochastic Rising Bandits

    Thompson Sampling-like Algorithms for Stochastic Rising Bandits arXiv:2505.12092v1 Announce Type: new Abstract: Stochastic rising rested bandit (SRRB) is a setting where the arms’ expected rewards increase as they are pulled. It models scenarios in which the performances of the different options grow as an effect of an underlying learning process (e.g., online model selection). Even…

  • Multi-Attribute Graph Estimation with Sparse-Group Non-Convex Penalties

    Multi-Attribute Graph Estimation with Sparse-Group Non-Convex Penalties arXiv:2505.11984v1 Announce Type: new Abstract: We consider the problem of inferring the conditional independence graph (CIG) of high-dimensional Gaussian vectors from multi-attribute data. Most existing methods for graph estimation are based on single-attribute models where one associates a scalar random variable with each node. In multi-attribute graphical models,…

  • An Exponential Averaging Process with Strong Convergence Properties

    An Exponential Averaging Process with Strong Convergence Properties arXiv:2505.10605v1 Announce Type: new Abstract: Averaging, or smoothing, is a fundamental approach to obtain stable, de-noised estimates from noisy observations. In certain scenarios, observations made along trajectories of random dynamical systems are of particular interest. One popular smoothing technique for such a scenario is exponential moving averaging…

  • Minimax learning rates for estimating binary classifiers under margin conditions

    Minimax learning rates for estimating binary classifiers under margin conditions arXiv:2505.10628v1 Announce Type: new Abstract: We study classification problems using binary estimators where the decision boundary is described by horizon functions and where the data distribution satisfies a geometric margin condition. We establish upper and lower bounds for the minimax learning rate over broad function…

  • Inexact Column Generation for Bayesian Network Structure Learning via Difference-of-Submodular Optimization

    Inexact Column Generation for Bayesian Network Structure Learning via Difference-of-Submodular Optimization arXiv:2505.11089v1 Announce Type: new Abstract: In this paper, we consider a score-based Integer Programming (IP) approach for solving the Bayesian Network Structure Learning (BNSL) problem. State-of-the-art BNSL IP formulations suffer from the exponentially large number of variables and constraints. A standard approach in IP…

  • Supervised Models Can Generalize Also When Trained on Random Label

    Supervised Models Can Generalize Also When Trained on Random Label arXiv:2505.11006v1 Announce Type: new Abstract: The success of unsupervised learning raises the question of whether also supervised models can be trained without using the information in the output $y$. In this paper, we demonstrate that this is indeed possible. The key step is to formulate…

  • Nash: Neural Adaptive Shrinkage for Structured High-Dimensional Regression

    Nash: Neural Adaptive Shrinkage for Structured High-Dimensional Regression arXiv:2505.11143v1 Announce Type: new Abstract: Sparse linear regression is a fundamental tool in data analysis. However, traditional approaches often fall short when covariates exhibit structure or arise from heterogeneous sources. In biomedical applications, covariates may stem from distinct modalities or be structured according to an underlying graph.…

  • Weekly Entering & Transitioning – Thread 19 May, 2025 – 26 May, 2025

    Weekly Entering & Transitioning – Thread 19 May, 2025 – 26 May, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

  • How to Set the Number of Trees in Random Forest

    How to Set the Number of Trees in Random Forest Scientific publication: T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC Bioinformatics, 26(1), 95. Random Forest: A Powerful Tool for Anyone…

  • How to Build an AI Journal with LlamaIndex

    How to Build an AI Journal with LlamaIndex This post will share how to build an AI journal with LlamaIndex. We will cover one essential function of this AI journal: asking for advice. We will start with the most basic implementation and iterate from there. We can see significant improvements for this function when…

  • The Automation Trap: Why Low-Code AI Models Fail When You Scale

    The Automation Trap: Why Low-Code AI Models Fail When You Scale In the beginning, building Machine Learning models was a skill only data scientists with knowledge of Python could master. However, low-code AI platforms have made things much easier now. Anyone can now directly make a model, link it to data, and publish it as…

  • Agentic AI 102: Guardrails and Agent Evaluation

    Agentic AI 102: Guardrails and Agent Evaluation Introduction In the first post of this series (Agentic AI 101: Starting Your Journey Building AI Agents), we talked about the fundamentals of creating AI Agents and introduced concepts like reasoning, memory, and tools. Of course, that first post touched only the surface of this new area of…

  • On Measuring Intrinsic Causal Attributions in Deep Neural Networks

    On Measuring Intrinsic Causal Attributions in Deep Neural Networks arXiv:2505.09660v1 Announce Type: new Abstract: Quantifying the causal influence of input features within neural networks has become a topic of increasing interest. Existing approaches typically assess direct, indirect, and total causal effects. This work treats NNs as structural causal models (SCMs) and extends our focus to…

  • LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data

    LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data arXiv:2505.09803v1 Announce Type: new Abstract: In many scientific and industrial applications, we are given a handful of instances (a ‘small ensemble’) of a spatially distributed quantity (a ‘field’) but would like to acquire many more. For example, a large ensemble of global temperature sensitivity fields…

  • Learning Multi-Attribute Differential Graphs with Non-Convex Penalties

    Learning Multi-Attribute Differential Graphs with Non-Convex Penalties arXiv:2505.09748v1 Announce Type: new Abstract: We consider the problem of estimating differences in two multi-attribute Gaussian graphical models (GGMs) which are known to have similar structure, using a penalized D-trace loss function with non-convex penalties. The GGM structure is encoded in its precision (inverse covariance) matrix. Existing methods…

  • A Scalable Gradient-Based Optimization Framework for Sparse Minimum-Variance Portfolio Selection

    A Scalable Gradient-Based Optimization Framework for Sparse Minimum-Variance Portfolio Selection arXiv:2505.10099v1 Announce Type: new Abstract: Portfolio optimization involves selecting asset weights to minimize a risk-reward objective, such as the portfolio variance in the classical minimum-variance framework. Sparse portfolio selection extends this by imposing a cardinality constraint: only $k$ assets from a universe of $p$ may…

  • Path Gradients after Flow Matching

    Path Gradients after Flow Matching arXiv:2505.10139v1 Announce Type: new Abstract: Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting. Recently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize…

  • Google’s AlphaEvolve Is Evolving New Algorithms — And It Could Be a Game Changer

    Google’s AlphaEvolve Is Evolving New Algorithms — And It Could Be a Game Changer AlphaEvolve imagined as a genetic algorithm coupled to a large language model. Picture created by the author using various tools including Dall-E3 via ChatGPT. Large Language Models have undeniably revolutionized how many of us approach coding, but they’re often more like a super-powered…

  • Understanding Random Forest using Python (scikit-learn)

    Understanding Random Forest using Python (scikit-learn) Decision trees are a popular supervised learning algorithm: they work for both regression and classification and are easy to interpret. However, decision trees aren’t the most performant algorithm and are prone to overfitting due to small variations in the training…

  • How to Learn the Math Needed for Machine Learning

    How to Learn the Math Needed for Machine Learning Maths can be a scary topic for people. Many of you want to work in machine learning, but the maths skills needed may seem overwhelming. I am here to tell you that it’s nowhere near as intimidating as you may think and to give you a roadmap, resources,…

  • How To Build a Benchmark for Your Models

    How To Build a Benchmark for Your Models I’ve been working as a data science consultant for the past three years, and I’ve had the opportunity to work on multiple projects across various industries. Yet, I noticed one common denominator among most of the clients I worked with: They rarely have a clear idea of…

  • 🚪🚪🐐 Lessons in Decision Making from the Monty Hall Problem

    🚪🚪🐐 Lessons in Decision Making from the Monty Hall Problem The Monty Hall Problem is a well-known brain teaser from which we can learn important lessons in Decision Making that are useful in general and in particular for data scientists. If you are not familiar with this problem, prepare to be perplexed. If you…

  • Lower Bounds on the MMSE of Adversarially Inferring Sensitive Features

    Lower Bounds on the MMSE of Adversarially Inferring Sensitive Features arXiv:2505.09004v1 Announce Type: new Abstract: We propose an adversarial evaluation framework for sensitive feature inference based on minimum mean-squared error (MMSE) estimation with a finite sample size and linear predictive models. Our approach establishes theoretical lower bounds on the true MMSE of inferring sensitive features…

  • Online Learning of Neural Networks

    Online Learning of Neural Networks arXiv:2505.09167v1 Announce Type: new Abstract: We study online learning of feedforward neural networks with the sign activation function that implement functions from the unit ball in $\mathbb{R}^d$ to a finite label set $\{1, \ldots, Y\}$. First, we characterize a margin condition that is sufficient and in some cases necessary for…

  • Risk Bounds For Distributional Regression

    Risk Bounds For Distributional Regression arXiv:2505.09075v1 Announce Type: new Abstract: This work examines risk bounds for nonparametric distributional regression estimators. For convex-constrained distributional regression, general upper bounds are established for the continuous ranked probability score (CRPS) and the worst-case mean squared error (MSE) across the domain. These theoretical results are applied to isotonic and trend…

  • Optimal Transport-Based Domain Adaptation for Rotated Linear Regression

    Optimal Transport-Based Domain Adaptation for Rotated Linear Regression arXiv:2505.09229v1 Announce Type: new Abstract: Optimal Transport (OT) has proven effective for domain adaptation (DA) by aligning distributions across domains with differing statistical properties. Building on the approach of Courty et al. (2016), who mapped source data to the target domain for improved model transfer, we focus…

  • Fairness-aware Bayes optimal functional classification

    Fairness-aware Bayes optimal functional classification arXiv:2505.09471v1 Announce Type: new Abstract: Algorithmic fairness has become a central topic in machine learning, and mitigating disparities across different subpopulations has emerged as a rapidly growing research area. In this paper, we systematically study the classification of functional data under fairness constraints, ensuring the disparity level of the classifier…

  • Boost 2-Bit LLM Accuracy with EoRA

    Boost 2-Bit LLM Accuracy with EoRA Quantization is one of the key techniques for reducing the memory footprint of large language models (LLMs). It works by converting the data type of model parameters from higher-precision formats such as 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) to lower-precision integer formats, typically INT8 or INT4.…

  • The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated

    The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated The saying goes that 80% of data collected, stored and maintained by governments can be associated with geographical locations. Although never empirically proven, it illustrates the importance of location within data. Ever-growing data volumes put constraints on systems that handle geospatial data. Common Big…

  • Strength in Numbers: Ensembling Models with Bagging and Boosting

    Strength in Numbers: Ensembling Models with Bagging and Boosting Bagging and boosting are two powerful ensemble techniques in machine learning – they are must-knows for data scientists! After reading this article, you are going to have a solid understanding of how bagging and boosting work and when to use them. We’ll cover the following topics,…

  • Efficient Graph Storage for Entity Resolution Using Clique-Based Compression

    Efficient Graph Storage for Entity Resolution Using Clique-Based Compression In the world of entity resolution (ER), one of the central challenges is managing and maintaining the complex relationships between records. At its core, Tilores models entities as graphs: each node represents a record, and edges represent rule-based matches between those records. This approach gives us…

  • Parquet File Format – Everything You Need to Know!

    Parquet File Format – Everything You Need to Know! With the amount of data growing exponentially in the last few years, one of the biggest challenges has become finding the optimal way to store various data flavors. Unlike in the (not so far) past, when relational databases were considered the only way to go,…

  • Wasserstein Distributionally Robust Nonparametric Regression

    Wasserstein Distributionally Robust Nonparametric Regression arXiv:2505.07967v1 Announce Type: new Abstract: Distributionally robust optimization has become a powerful tool for prediction and decision-making under model uncertainty. By focusing on the local worst-case risk, it enhances robustness by identifying the most unfavorable distribution within a predefined ambiguity set. While extensive research has been conducted in parametric settings,…

  • Diffusion-based supervised learning of generative models for efficient sampling of multimodal distributions

    Diffusion-based supervised learning of generative models for efficient sampling of multimodal distributions arXiv:2505.07825v1 Announce Type: new Abstract: We propose a hybrid generative model for efficient sampling of high-dimensional, multimodal probability distributions for Bayesian inference. Traditional Monte Carlo methods, such as the Metropolis-Hastings and Langevin Monte Carlo sampling methods, are effective for sampling from single-mode distributions…

  • Sharp Gaussian approximations for Decentralized Federated Learning

    Sharp Gaussian approximations for Decentralized Federated Learning arXiv:2505.08125v1 Announce Type: new Abstract: Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation…

  • SIM-Shapley: A Stable and Computationally Efficient Approach to Shapley Value Approximation

    SIM-Shapley: A Stable and Computationally Efficient Approach to Shapley Value Approximation arXiv:2505.08198v1 Announce Type: new Abstract: Explainable artificial intelligence (XAI) is essential for trustworthy machine learning (ML), particularly in high-stakes domains such as healthcare and finance. Shapley value (SV) methods provide a principled framework for feature attribution in complex models but incur high computational costs,…

  • Lie Group Symmetry Discovery and Enforcement Using Vector Fields

    Lie Group Symmetry Discovery and Enforcement Using Vector Fields arXiv:2505.08219v1 Announce Type: new Abstract: Symmetry-informed machine learning can exhibit advantages over machine learning which fails to account for symmetry. Additionally, recent attention has been given to continuous symmetry discovery using vector fields which serve as infinitesimal generators for Lie group symmetries. In this paper, we…

  • Survival Analysis When No One Dies: A Value-Based Approach

    Survival Analysis When No One Dies: A Value-Based Approach Survival Analysis is a statistical approach used to answer the question: “How long will something last?” That “something” could range from a patient’s lifespan to the durability of a machine component or the duration of a user’s subscription. One of the most widely used tools in…

  • Get Started with Rust: Installation and Your First CLI Tool – A Beginner’s Guide

    Get Started with Rust: Installation and Your First CLI Tool – A Beginner’s Guide Rust has become a popular programming language in recent years as it combines security and high performance and can be used in many applications. It combines the positive characteristics of C and C++ with the modern syntax and simplicity of other…

  • Non-Parametric Density Estimation: Theory and Applications

    Non-Parametric Density Estimation: Theory and Applications In this article, we’ll talk about what Density Estimation is and the role it plays in statistical analysis. We’ll analyze two popular density estimation methods, histograms and kernel density estimators, and analyze their theoretical properties as well as how they perform in practice. Finally, we’ll look at how density…

  • Rethinking the Environmental Costs of Training AI — Why We Should Look Beyond Hardware

    Rethinking the Environmental Costs of Training AI — Why We Should Look Beyond Hardware Summary of This Study Hardware choices – specifically hardware type and its quantity – along with training time, have a significant positive impact on energy, water, and carbon footprints during AI model training, whereas architecture-related factors do not. The interaction between…

  • TDS Authors Can Now Receive Payments Via Stripe

    TDS Authors Can Now Receive Payments Via Stripe We launched the TDS Author Payment Program back in February, and have since announced a major update—with a far more inclusive earning tier—just last month. Today we’re happy to share that the Program, which has already seen dozens of authors receive their first payout, is now a…

  • Fair Representation Learning for Continuous Sensitive Attributes using Expectation of Integral Probability Metrics

    Fair Representation Learning for Continuous Sensitive Attributes using Expectation of Integral Probability Metrics arXiv:2505.06435v1 Announce Type: new Abstract: AI fairness, also known as algorithmic fairness, aims to ensure that algorithms operate without bias or discrimination towards any individual or group. Among various AI algorithms, the Fair Representation Learning (FRL) approach has gained significant interest in…

  • High-Dimensional Importance-Weighted Information Criteria: Theory and Optimality

    High-Dimensional Importance-Weighted Information Criteria: Theory and Optimality arXiv:2505.06531v1 Announce Type: new Abstract: Imori and Ing (2025) proposed the importance-weighted orthogonal greedy algorithm (IWOGA) for model selection in high-dimensional misspecified regression models under covariate shift. To determine the number of IWOGA iterations, they introduced the high-dimensional importance-weighted information criterion (HDIWIC). They argued that the combined use…

  • Optimal Transport for Machine Learners

    Optimal Transport for Machine Learners arXiv:2505.06589v1 Announce Type: new Abstract: Optimal Transport is a foundational mathematical theory that connects optimization, partial differential equations, and probability. It offers a powerful framework for comparing probability distributions and has recently become an important tool in machine learning, especially for designing and evaluating generative models. These course notes cover…

  • Learning Guarantee of Reward Modeling Using Deep Neural Networks

    Learning Guarantee of Reward Modeling Using Deep Neural Networks arXiv:2505.06601v1 Announce Type: new Abstract: In this work, we study the learning theory of reward modeling with pairwise comparison data using deep neural networks. We establish a novel non-asymptotic regret bound for deep reward estimators in a non-parametric setting, which depends explicitly on the network architecture.…

  • Feature Representation Transferring to Lightweight Models via Perception Coherence

    Feature Representation Transferring to Lightweight Models via Perception Coherence arXiv:2505.06595v1 Announce Type: new Abstract: In this paper, we propose a method for transferring feature representation to lightweight student models from larger teacher models. We mathematically define a new notion called \textit{perception coherence}. Based on this notion, we propose a loss function, which takes into account…

  • Empowering LLMs to Think Deeper by Erasing Thoughts

    Empowering LLMs to Think Deeper by Erasing Thoughts Introduction Recent large language models (LLMs), such as OpenAI’s o1/o3, DeepSeek’s R1 and Anthropic’s Claude 3.7, demonstrate that allowing the model to think deeper and longer at test time can significantly enhance the model’s reasoning capability. The core approach underlying their deep thinking capability is called…

  • How I Finally Understood MCP — and Got It Working in Real Life

    How I Finally Understood MCP — and Got It Working in Real Life Table of Contents: Introduction: Why I Wrote This; The Evolution of Tool Integration with LLMs; What Is Model Context Protocol (MCP), Really?; Wait, MCP sounds like RAG… but is it?; In an MCP-based setup; In a traditional RAG system; Traditional RAG Implementation; MCP Implementation…

  • The Westworld Blunder

    The Westworld Blunder We’re entering an interesting moment in AI development. AI systems are getting memory, reasoning chains, self-critiques, and long-context recall. These capabilities are exactly some of the things that I’ve previously written would be prerequisites for an AI system to be conscious. Just to be clear, I don’t believe today’s AI systems are self-aware, but…

  • Pause Your ML Pipelines for Human Review Using AWS Step Functions + Slack

    Pause Your ML Pipelines for Human Review Using AWS Step Functions + Slack Have you ever wanted to pause an automated workflow to wait for a human decision? Maybe you need approval before provisioning cloud resources, promoting a machine learning model to production, or charging a customer’s credit card. In many data science and machine learning…

  • Running Python Programs in Your Browser

    Running Python Programs in Your Browser In recent years, WebAssembly (often abbreviated as WASM) has emerged as an interesting technology that extends web browsers’ capabilities far beyond the traditional realms of HTML, CSS, and JavaScript. As a Python developer, one particularly exciting application is the ability to run Python code directly in the browser. In this…

  • Optimal Regret of Bernoulli Bandits under Global Differential Privacy

    Optimal Regret of Bernoulli Bandits under Global Differential Privacy arXiv:2505.05613v1 Announce Type: new Abstract: As sequential learning algorithms are increasingly applied to real life, ensuring data privacy while maintaining their utilities emerges as a timely question. In this context, regret minimisation in stochastic bandits under $\epsilon$-global Differential Privacy (DP) has been widely studied. Unlike bandits…

  • An Efficient Transport-Based Dissimilarity Measure for Time Series Classification under Warping Distortions

    An Efficient Transport-Based Dissimilarity Measure for Time Series Classification under Warping Distortions arXiv:2505.05676v1 Announce Type: cross Abstract: Time Series Classification (TSC) is an important problem with numerous applications in science and technology. Dissimilarity-based approaches, such as Dynamic Time Warping (DTW), are classical methods for distinguishing time series when time deformations are confounding information. In this…

  • DaringFed: A Dynamic Bayesian Persuasion Pricing for Online Federated Learning under Two-sided Incomplete Information

    DaringFed: A Dynamic Bayesian Persuasion Pricing for Online Federated Learning under Two-sided Incomplete Information arXiv:2505.05842v1 Announce Type: cross Abstract: Online Federated Learning (OFL) is a real-time learning paradigm that sequentially executes parameter aggregation immediately for each random arriving client. To motivate clients to participate in OFL, it is crucial to offer appropriate incentives to offset…

  • Safe-EF: Error Feedback for Nonsmooth Constrained Optimization

    Safe-EF: Error Feedback for Nonsmooth Constrained Optimization arXiv:2505.06053v1 Announce Type: cross Abstract: Federated learning faces severe communication bottlenecks due to the high dimensionality of model updates. Communication compression with contractive compressors (e.g., Top-K) is often preferable in practice but can degrade performance without proper handling. Error feedback (EF) mitigates such issues but has been largely…

  • Mixed-Integer Optimization for Responsible Machine Learning

    Mixed-Integer Optimization for Responsible Machine Learning arXiv:2505.05857v1 Announce Type: cross Abstract: In the last few decades, Machine Learning (ML) has achieved significant success across domains ranging from healthcare, sustainability, and the social sciences, to criminal justice and finance. But its deployment in increasingly sophisticated, critical, and sensitive areas affecting individuals, the groups they belong to,…

  • Weekly Entering & Transitioning – Thread 12 May, 2025 – 19 May, 2025

    Weekly Entering & Transitioning – Thread 12 May, 2025 – 19 May, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…

  • What My GPT Stylist Taught Me About Prompting Better

    What My GPT Stylist Taught Me About Prompting Better When I built a GPT-powered fashion assistant, I expected runway looks—not memory loss, hallucinations, or semantic déjà vu. But what unfolded became a lesson in how prompting really works—and why LLMs are more like wild animals than tools. This article builds on my previous article on…

  • Log Link vs Log Transformation in R — The Difference that Misleads Your Entire Data Analysis

    Log Link vs Log Transformation in R — The Difference that Misleads Your Entire Data Analysis Although normal distributions are the most commonly used, a lot of real-world data unfortunately is not normal. When faced with extremely skewed data, it’s tempting for us to utilize log transformations to normalize the distribution and stabilize the variance. I…

  • Time Series Forecasting Made Simple (Part 2): Customizing Baseline Models

    Time Series Forecasting Made Simple (Part 2): Customizing Baseline Models Thank you for the kind response to Part 1, it’s been encouraging to see so many readers interested in time series forecasting. In Part 1 of this series, we broke down time series data into trend, seasonality, and noise, discussed when to use additive versus…

  • A Review of AccentFold: One of the Most Important Papers on African ASR

    A Review of AccentFold: One of the Most Important Papers on African ASR I really enjoyed reading this paper, not because I’ve met some of the authors before, but because it felt necessary. Most of the papers I’ve written about so far have made waves in the broader ML community, which is great. This one, though,…

  • How Not to Write an MCP Server

    How Not to Write an MCP Server I recently had the chance to create an MCP server for an observability application in order to provide the AI agent with dynamic code analysis capabilities. Because of its potential to transform applications, MCP is a technology I’m even more ecstatic about than I originally was about genAI…

  • Generalization Analysis for Contrastive Representation Learning under Non-IID Settings

    Generalization Analysis for Contrastive Representation Learning under Non-IID Settings arXiv:2505.04937v1 Announce Type: new Abstract: Contrastive Representation Learning (CRL) has achieved impressive success in various domains in recent years. Nevertheless, the theoretical understanding of the generalization behavior of CRL is limited. Moreover, to the best of our knowledge, the current literature only analyzes generalization bounds under…

  • Learning Linearized Models from Nonlinear Systems under Initialization Constraints with Finite Data

    Learning Linearized Models from Nonlinear Systems under Initialization Constraints with Finite Data arXiv:2505.04954v1 Announce Type: new Abstract: The identification of a linear system model from data has wide applications in control theory. The existing work that provides finite sample guarantees for linear system identification typically uses data from a single long system trajectory under i.i.d.…

  • Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach

    Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach arXiv:2505.04986v1 Announce Type: new Abstract: Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of…

  • A Two-Sample Test of Text Generation Similarity

    A Two-Sample Test of Text Generation Similarity arXiv:2505.05269v1 Announce Type: new Abstract: The surge in digitized text data requires reliable inferential methods on observed textual patterns. This article proposes a novel two-sample text test for comparing similarity between two groups of documents. The hypothesis is whether the probabilistic mapping generating the textual data is identical…

  • Boosting Statistic Learning with Synthetic Data from Pretrained Large Models

    Boosting Statistic Learning with Synthetic Data from Pretrained Large Models arXiv:2505.04992v1 Announce Type: new Abstract: The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose…

  • Clustering Eating Behaviors in Time: A Machine Learning Approach to Preventive Health

    Clustering Eating Behaviors in Time: A Machine Learning Approach to Preventive Health It’s well known that what we eat matters — but what if when and how often we eat matters just as much? In the midst of ongoing scientific debate around the benefits of intermittent fasting, this question becomes even more intriguing. As someone passionate about machine learning and healthy living,…

  • Model Compression: Make Your Machine Learning Models Lighter and Faster

    Model Compression: Make Your Machine Learning Models Lighter and Faster Introduction Whether you’re preparing for interviews or building Machine Learning systems at your job, model compression has become a must-have skill. In the era of LLMs, where models are getting larger and larger, the challenges around compressing these models to make them more efficient, smaller,…

  • ACP: The Internet Protocol for AI Agents

    ACP: The Internet Protocol for AI Agents With ACP (Agent Communication Protocol), AI agents can collaborate freely across teams, frameworks, technologies, and organizations. It’s a universal protocol that transforms the fragmented landscape of today’s AI Agents into inter-connected team mates. This unlocks new levels of interoperability, reuse, and scale. As an open-source standard with open…

  • The Dangers of Deceptive Data Part 2–Base Proportions and Bad Statistics

    The Dangers of Deceptive Data Part 2–Base Proportions and Bad Statistics This is a follow-up to my earlier article: The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines. My first article focused on how visualizations can be used to mislead, diving into a form of data presentation widely used in public matters. In this article,…

  • The Shadow Side of AutoML: When No-Code Tools Hurt More Than Help

    The Shadow Side of AutoML: When No-Code Tools Hurt More Than Help AutoML has become the gateway drug to machine learning for many organizations. It promises exactly what teams under pressure want to hear: you bring the data, and we’ll handle the modeling. There are no pipelines to manage, no hyperparameters to tune, and no…

  • Categorical and geometric methods in statistical, manifold, and machine learning

    Categorical and geometric methods in statistical, manifold, and machine learning arXiv:2505.03862v1 Announce Type: new Abstract: We present and discuss applications of the category of probabilistic morphisms, initially developed in \cite{Le2023}, as well as some geometric methods to several classes of problems in statistical, machine and manifold learning which shall be, along with many other topics,…

  • Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs

    Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs arXiv:2505.03814v1 Announce Type: new Abstract: As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of…

  • Variational Formulation of the Particle Flow Particle Filter

    Variational Formulation of the Particle Flow Particle Filter arXiv:2505.04007v1 Announce Type: new Abstract: This paper provides a formulation of the particle flow particle filter from the perspective of variational inference. We show that the transient density used to derive the particle flow particle filter follows a time-scaled trajectory of the Fisher-Rao gradient flow in the…

  • A Tutorial on Discriminative Clustering and Mutual Information

    A Tutorial on Discriminative Clustering and Mutual Information arXiv:2505.04484v1 Announce Type: new Abstract: To cluster data is to separate samples into distinctive groups that should ideally have some cohesive properties. Today, numerous clustering algorithms exist, and their differences lie essentially in what can be perceived as “cohesive properties”. Therefore, hypotheses on the nature of clusters…

  • From Two Sample Testing to Singular Gaussian Discrimination

    From Two Sample Testing to Singular Gaussian Discrimination arXiv:2505.04613v1 Announce Type: new Abstract: We establish that testing for the equality of two probability measures on a general separable and compact metric space is equivalent to testing for the singularity between two corresponding Gaussian measures on a suitable Reproducing Kernel Hilbert Space. The corresponding Gaussians are…

  • Generating Data Dictionary for Excel Files Using OpenPyxl and AI Agents

    Generating Data Dictionary for Excel Files Using OpenPyxl and AI Agents Introduction At every company I have worked for so far, there it was: the resilient MS Excel. Excel was first released in 1985 and has remained strong to this day. It has survived the rise of relational databases, the evolution of many programming languages, the Internet with…

  • A Practical Guide to BERTopic for Transformer-Based Topic Modeling

    A Practical Guide to BERTopic for Transformer-Based Topic Modeling Topic modeling has a wide range of use cases in the natural language processing (NLP) domain, such as document tagging, survey analysis, and content organization. It is an unsupervised learning technique, which makes it very cost-effective and reduces the resources required to…

  • Real-Time Interactive Sentiment Analysis in Python

    Real-Time Interactive Sentiment Analysis in Python You know what the best part of being an engineer is? You can just build stuff. It’s like a superpower. One rainy afternoon I had this random idea of creating a sentiment visualization of a text input with a smiley face that changes its expression based on how positive…

  • Uh-Uh, Not Guilty

    Uh-Uh, Not Guilty When the six merry murderesses of the Cook County Jail climbed the stage in the Chicago musical, they were aligned on the message: They had it coming, they had it coming all along. I didn’t do it. But if I’d done it, how could you tell me that I was wrong? And the part of…

  • From RGB to HSV — and Back Again

    From RGB to HSV — and Back Again Introduction A fundamental concept in Computer Vision is understanding how images are stored and represented. On disk, image files are encoded in various ways, from lossy, compressed JPEG files to lossless PNG files. Once you load an image into a program and decode it from the respective…

  • Modeling Spatial Extremes using Non-Gaussian Spatial Autoregressive Models via Convolutional Neural Networks

    Modeling Spatial Extremes using Non-Gaussian Spatial Autoregressive Models via Convolutional Neural Networks arXiv:2505.03034v1 Announce Type: new Abstract: Data derived from remote sensing or numerical simulations often have a regular gridded structure and are large in volume, making it challenging to find accurate spatial models that can fill in missing grid cells or simulate the process…

  • GeoERM: Geometry-Aware Multi-Task Representation Learning on Riemannian Manifolds

    GeoERM: Geometry-Aware Multi-Task Representation Learning on Riemannian Manifolds arXiv:2505.02972v1 Announce Type: new Abstract: Multi-Task Learning (MTL) seeks to boost statistical power and learning efficiency by discovering structure shared across related tasks. State-of-the-art MTL representation methods, however, usually treat the latent representation matrix as a point in ordinary Euclidean space, ignoring its often non-Euclidean geometry, thus…

  • A Symbolic and Statistical Learning Framework to Discover Bioprocessing Regulatory Mechanism: Cell Culture Example

    A Symbolic and Statistical Learning Framework to Discover Bioprocessing Regulatory Mechanism: Cell Culture Example arXiv:2505.03177v1 Announce Type: new Abstract: Bioprocess mechanistic modeling is essential for advancing intelligent digital twin representation of biomanufacturing, yet challenges persist due to complex intracellular regulation, stochastic system behavior, and limited experimental data. This paper introduces a symbolic and statistical learning…

  • Weighted Average Gradients for Feature Attribution

    Weighted Average Gradients for Feature Attribution arXiv:2505.03201v1 Announce Type: new Abstract: In explainable AI, Integrated Gradients (IG) is a widely adopted technique for assessing the significance of feature attributes of the input on model outputs by evaluating contributions from a baseline input to the current input. The choice of the baseline input significantly influences the…

  • Lower Bounds for Greedy Teaching Set Constructions

    Lower Bounds for Greedy Teaching Set Constructions arXiv:2505.03223v1 Announce Type: new Abstract: A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon…

  • Regression Discontinuity Design: How It Works and When to Use It

    Regression Discontinuity Design: How It Works and When to Use It You’re an avid data scientist and experimenter. You know that randomisation is the summit of Mount Evidence Credibility, and you also know that when you can’t randomise, you resort to observational data and…

  • We Need a Fourth Law of Robotics in the Age of AI

    We Need a Fourth Law of Robotics in the Age of AI Artificial Intelligence has become a mainstay of our daily lives, revolutionizing industries, accelerating scientific discoveries, and reshaping how we communicate. Yet, alongside its undeniable benefits, AI has also ignited a range of ethical and social dilemmas that our existing regulatory frameworks have struggled…

  • Retrieval Augmented Classification: Improving Text Classification with External Knowledge

    Retrieval Augmented Classification: Improving Text Classification with External Knowledge Text classification stands as one of the most basic yet most important applications of natural language processing. It plays a vital role in many real-world applications, ranging from filtering unwanted emails such as spam to detecting product categories or classifying user intent in a chatbot application. The…

  • How I Built Business-Automating Workflows with AI Agents

    How I Built Business-Automating Workflows with AI Agents AI agents and automation are no longer just a trend; they are transforming how companies operate. In a previous article, I shared several case studies of AI Agents supporting the sustainability roadmaps of small, medium and large companies. This is part of a…

  • The Total Derivative: Correcting the Misconception of Backpropagation’s Chain Rule

    The Total Derivative: Correcting the Misconception of Backpropagation’s Chain Rule This article uses concepts from this brilliant paper. For a deeper understanding of the mathematics please refer to the paper. Here we try to present the math in a more intuitive and explicit way, with some important nuances highlighted. 1 Introduction Discussions about Backpropagation often…