Category: aimlds
-
Optimal Scheduling of Dynamic Transport
Optimal Scheduling of Dynamic Transport arXiv:2504.14425v1 Announce Type: new Abstract: Flow-based methods for sampling and generative modeling use continuous-time dynamical systems to represent a transport map that pushes forward a source measure to a target measure. The introduction of a time axis provides considerable design freedom, and a central question is how to exploit this…
-
Expected Free Energy-based Planning as Variational Inference
Expected Free Energy-based Planning as Variational Inference arXiv:2504.14898v1 Announce Type: new Abstract: We address the problem of planning under uncertainty, where an agent must choose actions that not only achieve desired outcomes but also reduce uncertainty. Traditional methods often treat exploration and exploitation as separate objectives, lacking a unified inferential foundation. Active inference, grounded in…
-
On the Tunability of Random Survival Forests Model for Predictive Maintenance
On the Tunability of Random Survival Forests Model for Predictive Maintenance arXiv:2504.14744v1 Announce Type: new Abstract: This paper investigates the tunability of the Random Survival Forest (RSF) model in predictive maintenance, where accurate time-to-failure estimation is crucial. Although RSF is widely used due to its flexibility and ability to handle censored data, its performance is…
-
Advanced posterior analyses of hidden Markov models: finite Markov chain imbedding and hybrid decoding
Advanced posterior analyses of hidden Markov models: finite Markov chain imbedding and hybrid decoding arXiv:2504.15156v1 Announce Type: new Abstract: Two major tasks in applications of hidden Markov models are to (i) compute distributions of summary statistics of the hidden state sequence, and (ii) decode the hidden state sequence. We describe finite Markov chain imbedding (FMCI)…
-
Building a Personal API for Your Data Projects with FastAPI
Building a Personal API for Your Data Projects with FastAPI How many times have you had a messy Jupyter Notebook filled with copy-pasted code just to re-use some data wrangling logic? Whether you do it for passion or for work, if you code a lot, then you’ve probably answered something like “way too many”. You’re…
-
Beginner’s Guide to Creating a S3 Storage on AWS
Beginner’s Guide to Creating a S3 Storage on AWS Introduction AWS is a well-known cloud provider whose primary goal is to allocate server resources for software engineers to deploy their applications. AWS offers many services, one of which is EC2, providing virtual machines for running software applications in the cloud. However, for data-intensive applications, storing…
-
Retrieval Augmented Generation (RAG) — An Introduction
Retrieval Augmented Generation (RAG) — An Introduction The model hallucinated! It was giving me OK answers and then it just started hallucinating. We’ve all heard or experienced it. Natural Language Generation models can sometimes hallucinate, i.e., they start generating text that is not quite accurate for the prompt provided. In layman’s terms, they start making…
-
Beyond the Code: Unconventional Lessons from Empathetic Interviewing
Beyond the Code: Unconventional Lessons from Empathetic Interviewing Recently, I’ve been interviewing Computer Science students applying for data science and engineering internships with a 4-day turnaround from CV vetting to final decisions. With a small local office of 10 and no in-house HR, hiring managers handle the entire process. This article reflects on the lessons…
-
How to Write Queries for Tabular Models with DAX
How to Write Queries for Tabular Models with DAX Introduction EVALUATE is the statement used to query tabular models. Unfortunately, knowing SQL or any other query language doesn’t help, as EVALUATE follows a different concept. EVALUATE has only two “Parameters”: a table to show and a sort order (ORDER BY). You can pass a third parameter (START…
-
Predicting Forced Responses of Probability Distributions via the Fluctuation-Dissipation Theorem and Generative Modeling
Predicting Forced Responses of Probability Distributions via the Fluctuation-Dissipation Theorem and Generative Modeling arXiv:2504.13333v1 Announce Type: new Abstract: We present a novel data-driven framework for estimating the response of higher-order moments of nonlinear stochastic systems to small external perturbations. The classical Generalized Fluctuation-Dissipation Theorem (GFDT) links the unperturbed steady-state distribution to the system’s linear response.…
-
Gradient-Free Sequential Bayesian Experimental Design via Interacting Particle Systems
Gradient-Free Sequential Bayesian Experimental Design via Interacting Particle Systems arXiv:2504.13320v1 Announce Type: new Abstract: We introduce a gradient-free framework for Bayesian Optimal Experimental Design (BOED) in sequential settings, aimed at complex systems where gradient information is unavailable. Our method combines Ensemble Kalman Inversion (EKI) for design optimization with the Affine-Invariant Langevin Dynamics (ALDI) sampler for…
-
On the minimax optimality of Flow Matching through the connection to kernel density estimation
On the minimax optimality of Flow Matching through the connection to kernel density estimation arXiv:2504.13336v1 Announce Type: new Abstract: Flow Matching has recently gained attention in generative modeling as a simple and flexible alternative to diffusion models, the current state of the art. While existing statistical guarantees adapt tools from the analysis of diffusion models,…
-
On the Convergence of Irregular Sampling in Reproducing Kernel Hilbert Spaces
On the Convergence of Irregular Sampling in Reproducing Kernel Hilbert Spaces arXiv:2504.13623v1 Announce Type: new Abstract: We analyse the convergence of sampling algorithms for functions in reproducing kernel Hilbert spaces (RKHS). To this end, we discuss approximation properties of kernel regression under minimalistic assumptions on both the kernel and the input data. We first prove…
-
Near-optimal algorithms for private estimation and sequential testing of collision probability
Near-optimal algorithms for private estimation and sequential testing of collision probability arXiv:2504.13804v1 Announce Type: new Abstract: We present new algorithms for estimating and testing collision probability, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and…
-
Weekly Entering & Transitioning – Thread 21 Apr, 2025 – 28 Apr, 2025
Weekly Entering & Transitioning – Thread 21 Apr, 2025 – 28 Apr, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…
-
Load-Testing LLMs Using LLMPerf
Load-Testing LLMs Using LLMPerf Deploying your Large Language Model (LLM) is not necessarily the final step in productionizing your Generative AI application. An often forgotten, yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing at a high level…
-
Robust and Scalable Variational Bayes
Robust and Scalable Variational Bayes arXiv:2504.12528v1 Announce Type: new Abstract: We propose a robust and scalable framework for variational Bayes (VB) that effectively handles outliers and contamination of arbitrary nature in large datasets. Our approach divides the dataset into disjoint subsets, computes the posterior for each subset, and applies VB approximation independently to these posteriors.…
-
Resonances in reflective Hamiltonian Monte Carlo
Resonances in reflective Hamiltonian Monte Carlo arXiv:2504.12374v1 Announce Type: new Abstract: In high dimensions, reflective Hamiltonian Monte Carlo with inexact reflections exhibits slow mixing when the particle ensemble is initialised from a Dirac delta distribution and the uniform distribution is targeted. By quantifying the instantaneous non-uniformity of the distribution with the Sinkhorn divergence, we elucidate…
-
Spectral Algorithms under Covariate Shift
Spectral Algorithms under Covariate Shift arXiv:2504.12625v1 Announce Type: new Abstract: Spectral algorithms leverage spectral regularization techniques to analyze and process data, providing a flexible framework for addressing supervised learning problems. To deepen our understanding of their performance in real-world scenarios where the distributions of training and test data may differ, we conduct a rigorous investigation…
-
When do Random Forests work?
When do Random Forests work? arXiv:2504.12860v1 Announce Type: new Abstract: We study the effectiveness of randomizing split-directions in random forests. Prior literature has shown that, on the one hand, randomization can reduce variance through decorrelation, and, on the other hand, randomization regularizes and works in low signal-to-noise ratio (SNR) environments. First, we bring together and…
-
Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time
Propagation of Chaos in One-hidden-layer Neural Networks beyond Logarithmic Time arXiv:2504.13110v1 Announce Type: new Abstract: We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential…
-
When Physics Meets Finance: Using AI to Solve Black-Scholes
When Physics Meets Finance: Using AI to Solve Black-Scholes DISCLAIMER: This is not financial advice. I’m a PhD in Aerospace Engineering with a strong focus on Machine Learning: I’m not a financial advisor. This article is intended solely to demonstrate the power of Physics-Informed Neural Networks (PINNs) in a financial context. When I was 16,…
-
Google’s New AI System Outperforms Physicians in Complex Diagnoses
Google’s New AI System Outperforms Physicians in Complex Diagnoses Imagine going to the doctor with a baffling set of symptoms. Getting the right diagnosis quickly is crucial, but sometimes even experienced physicians face challenges piecing together the puzzle. Sometimes it might not be something serious at all; at other times a deep investigation might be required. No…
-
The Good-Enough Truth
The Good-Enough Truth Could Shopify be right in requiring teams to demonstrate why AI can’t do a job before approving new human hires? Will companies that prioritize AI solutions eventually evolve into AI entities with significantly fewer employees? These are open-ended questions that have puzzled me about where such transformations might leave us in our quest for…
-
FEAT: Free energy Estimators with Adaptive Transport
FEAT: Free energy Estimators with Adaptive Transport arXiv:2504.11516v1 Announce Type: new Abstract: We present Free energy Estimators with Adaptive Transport (FEAT), a novel framework for free energy estimation — a critical challenge across scientific domains. FEAT leverages learned transports implemented via stochastic interpolants and provides consistent, minimum-variance estimators based on escorted Jarzynski equality and controlled…
-
Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations
Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations arXiv:2504.11554v1 Announce Type: new Abstract: Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference…
-
Towards Interpretable Deep Generative Models via Causal Representation Learning
Towards Interpretable Deep Generative Models via Causal Representation Learning arXiv:2504.11609v1 Announce Type: new Abstract: Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods’ surprising performance is due in part to their ability to learn implicit “representations”…
-
Discrimination-free Insurance Pricing with Privatized Sensitive Attributes
Discrimination-free Insurance Pricing with Privatized Sensitive Attributes arXiv:2504.11775v1 Announce Type: new Abstract: Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes…
-
Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations
Generalized probabilistic canonical correlation analysis for multi-modal data integration with full or partial observations arXiv:2504.11610v1 Announce Type: new Abstract: Background: The integration and analysis of multi-modal data are increasingly essential across various domains including bioinformatics. As the volume and complexity of such data grow, there is a pressing need for computational models that not only…
-
When Predictors Collide: Mastering VIF in Multicollinear Regression
When Predictors Collide: Mastering VIF in Multicollinear Regression In regression models, the independent variables should be independent of each other, or at most slightly dependent, i.e. they should not be strongly correlated. If such a dependency does exist, it is referred to as multicollinearity and leads to unstable models and results that are difficult to interpret. The…
-
AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse
AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse arXiv:2504.10540v1 Announce Type: new Abstract: Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack…
-
Differentially Private Geodesic and Linear Regression
Differentially Private Geodesic and Linear Regression arXiv:2504.11304v1 Announce Type: new Abstract: In statistical applications it has become increasingly common to encounter data structures that live on non-linear spaces such as manifolds. Classical linear regression, one of the most fundamental methodologies of statistical learning, captures the relationship between an independent variable and a response variable which…
-
Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks
Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks arXiv:2504.10598v1 Announce Type: new Abstract: We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error — a…
-
Formalising Anti-Discrimination Law in Automated Decision Systems
Formalising Anti-Discrimination Law in Automated Decision Systems arXiv:2407.00400v2 Announce Type: cross Abstract: Algorithmic discrimination is a critical concern as machine learning models are used in high-stakes decision-making in legally protected contexts. Although substantial research on algorithmic bias and discrimination has led to the development of fairness metrics, several critical legal issues remain unaddressed in practice.…
-
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling arXiv:2504.10612v1 Announce Type: cross Abstract: Generative models often map noise to data by matching flows or scores, but these approaches become cumbersome for incorporating partial observations or additional priors. Inspired by recent advances in Wasserstein gradient flows, we propose Energy Matching, a framework that…
-
An Unbiased Review of Snowflake’s Document AI
An Unbiased Review of Snowflake’s Document AI As data professionals, we’re comfortable with tabular data… We can also handle words, json, xml feeds, and pictures of cats. But what about a cardboard box full of things like this? (Image by Annie Spratt, Unsplash) The info on this receipt wants so…
-
Plotly’s AI Tools Are Redefining Data Science Workflows
Plotly’s AI Tools Are Redefining Data Science Workflows Is there anything more frustrating than building a powerful data model but then struggling to turn it into a tool stakeholders can use to achieve their desired outcome? Data Science has never been short on potential but is also never short on complexity. You can refine algorithms…
-
Double Machine Learning for Causal Inference under Shared-State Interference
Double Machine Learning for Causal Inference under Shared-State Interference arXiv:2504.08836v1 Announce Type: new Abstract: Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain shared states, like prices, algorithmic recommendations or social signals. We formalize this structure, calling…
-
An Incremental Non-Linear Manifold Approximation Method
An Incremental Non-Linear Manifold Approximation Method arXiv:2504.09068v1 Announce Type: new Abstract: Analyzing high-dimensional data presents challenges due to the “curse of dimensionality”, making computations intensive. Dimension reduction techniques, categorized as linear or non-linear, simplify such data. Non-linear methods are particularly essential for efficiently visualizing and processing complex data structures in interactive and graphical applications. This…
-
Improving the evaluation of samplers on multi-modal targets
Improving the evaluation of samplers on multi-modal targets arXiv:2504.08916v1 Announce Type: new Abstract: Addressing multi-modality constitutes one of the major challenges of sampling. In this reflection paper, we advocate for a more systematic evaluation of samplers towards two sources of difficulty that are mode separation and dimension. For this, we propose a synthetic experimental setting…
-
Dose-finding design based on level set estimation in phase I cancer clinical trials
Dose-finding design based on level set estimation in phase I cancer clinical trials arXiv:2504.09157v1 Announce Type: new Abstract: The primary objective of phase I cancer clinical trials is to evaluate the safety of a new experimental treatment and to find the maximum tolerated dose (MTD). We show that the MTD estimation problem can be regarded…
-
No-Regret Generative Modeling via Parabolic Monge-Ampère PDE
No-Regret Generative Modeling via Parabolic Monge-Ampère PDE arXiv:2504.09279v1 Announce Type: new Abstract: We introduce a novel generative modeling framework based on a discretized parabolic Monge-Ampère PDE, which emerges as a continuous limit of the Sinkhorn algorithm commonly used in optimal transport. Our method performs iterative refinement in the space of Brenier maps using a mirror…
-
An LLM-Based Workflow for Automated Tabular Data Validation
An LLM-Based Workflow for Automated Tabular Data Validation This article is part of a series of articles on automating data cleaning for any tabular dataset: Effortless Spreadsheet Normalisation With LLM. You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration. What…
-
Layers of the AI Stack, Explained Simply
Layers of the AI Stack, Explained Simply This is the first in a multi-part series on creating web applications with Generative AI integration. Table of Contents Introduction The Virtues of the Application Layer Thick Wrappers The Return of Clippy Getting Stuff Done While You Sleep Introduction The AI space is a vast and complicated landscape. Matt…
-
Weekly Entering & Transitioning – Thread 14 Apr, 2025 – 21 Apr, 2025
Weekly Entering & Transitioning – Thread 14 Apr, 2025 – 21 Apr, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…
-
Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech
Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech Recently, Sesame AI published a demo of their latest Speech-to-Speech model. It is a conversational AI agent that is really good at speaking: it provides relevant answers, speaks with expression, and, honestly, is just very fun and interactive to play with. Note that a…
-
Learnings from a Machine Learning Engineer — Part 6: The Human Side
Learnings from a Machine Learning Engineer — Part 6: The Human Side In my previous articles, I have spent a lot of time talking about the technical aspects of an Image Classification problem from data collection, model evaluation, performance optimization, and a detailed look at model training. These elements require a certain degree of in-depth expertise, and they (usually) have well-defined…
-
Are You Sure Your Posterior Makes Sense?
Are You Sure Your Posterior Makes Sense? This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support. Introduction Parameter estimation has been for decades one of the most important topics in statistics. While frequentist approaches, such as Maximum Likelihood Estimations, used to…
-
Can SGD Select Good Fishermen? Local Convergence under Self-Selection Biases and Beyond
Can SGD Select Good Fishermen? Local Convergence under Self-Selection Biases and Beyond arXiv:2504.07133v1 Announce Type: new Abstract: We revisit the problem of estimating $k$ linear regressors with self-selection bias in $d$ dimensions with the maximum selection criterion, as introduced by Cherapanamjeri, Daskalakis, Ilyas, and Zampetakis [CDIZ23, STOC’23]. Our main result is a $\operatorname{poly}(d,k,1/\varepsilon) + k^{O(k)}$…
-
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents arXiv:2504.07347v1 Announce Type: new Abstract: As demand for Large Language Models (LLMs) and AI agents rapidly grows, optimizing systems for efficient LLM inference becomes critical. While significant efforts have targeted system-level engineering, little is explored through a mathematical modeling and queuing perspective. In this paper, we…
-
Performance of Rank-One Tensor Approximation on Incomplete Data
Performance of Rank-One Tensor Approximation on Incomplete Data arXiv:2504.07818v1 Announce Type: new Abstract: We are interested in the estimation of a rank-one tensor signal when only a portion $\varepsilon$ of its noisy observation is available. We show that the study of this problem can be reduced to that of a random matrix model whose spectral…
-
Gradient-based Sample Selection for Faster Bayesian Optimization
Gradient-based Sample Selection for Faster Bayesian Optimization arXiv:2504.07742v1 Announce Type: new Abstract: Bayesian optimization (BO) is an effective technique for black-box optimization. However, its applicability is typically limited to moderate-budget problems due to the cubic complexity in computing the Gaussian process (GP) surrogate model. In large-budget scenarios, directly employing the standard GP model faces significant…
-
Smoothed Distance Kernels for MMDs and Applications in Wasserstein Gradient Flows
Smoothed Distance Kernels for MMDs and Applications in Wasserstein Gradient Flows arXiv:2504.07820v1 Announce Type: new Abstract: Negative distance kernels $K(x,y) := -|x-y|$ were used in the definition of maximum mean discrepancies (MMDs) in statistics and lead to favorable numerical results in various applications. In particular, so-called slicing techniques for handling high-dimensional kernel summations profit…
-
The Basis of Cognitive Complexity: Teaching CNNs to See Connections
The Basis of Cognitive Complexity: Teaching CNNs to See Connections “Liberating education consists in acts of cognition, not transferrals of information.” (Paulo Freire) One of the most heated discussions around artificial intelligence is: What aspects of human learning is it capable of capturing? Many authors suggest that artificial intelligence models do not possess the same…
-
The Invisible Revolution: How Vectors Are (Re)defining Business Success
The Invisible Revolution: How Vectors Are (Re)defining Business Success In a world that focuses more on data, business leaders must understand vector thinking. At first, vectors may appear as complicated as algebra was in school, but they serve as a fundamental building block. Vectors are as essential as algebra for tasks like sharing a bill…
-
How to Measure Real Model Accuracy When Labels Are Noisy
How to Measure Real Model Accuracy When Labels Are Noisy Ground truth is never perfect. From scientific measurements to human annotations used to train deep learning models, ground truth always contains some errors. ImageNet, arguably the most well-curated image dataset, has 0.3% errors in human annotations. How, then, can we evaluate predictive models…
-
Ivory Tower Notes: The Problem
Ivory Tower Notes: The Problem Did you ever spend months on a Machine Learning project, only to discover you never defined the “correct” problem at the start? If so, or even if not, and you are only starting with the data science or AI field, welcome to my first Ivory Tower Note, where I will address…
-
Deep spatio-temporal point processes: Advances and new directions
Deep spatio-temporal point processes: Advances and new directions arXiv:2504.06364v1 Announce Type: new Abstract: Spatio-temporal point processes (STPPs) model discrete events distributed in time and space, with important applications in areas such as criminology, seismology, epidemiology, and social networks. Traditional models often rely on parametric kernels, limiting their ability to capture heterogeneous, nonstationary dynamics. Recent innovations…
-
Sparsified-Learning for Heavy-Tailed Locally Stationary Processes
Sparsified-Learning for Heavy-Tailed Locally Stationary Processes arXiv:2504.06477v1 Announce Type: new Abstract: Sparsified Learning is ubiquitous in many machine learning tasks. It aims to regularize the objective function by adding a penalization term that considers the constraints made on the learned parameters. This paper considers the problem of learning heavy-tailed LSP. We develop a flexible and…
-
Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks
Deep Fair Learning: A Unified Framework for Fine-tuning Representations with Sufficient Networks arXiv:2504.06470v1 Announce Type: new Abstract: Ensuring fairness in machine learning is a critical and challenging task, as biased data representations often lead to unfair predictions. To address this, we propose Deep Fair Learning, a framework that integrates nonlinear sufficient dimension reduction with deep…
-
StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization arXiv:2504.05804v1 Announce Type: cross Abstract: The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present StealthRank, a novel adversarial ranking attack that manipulates LLM-driven product recommendation systems while maintaining textual fluency and stealth. Unlike existing…
-
A Metropolis-Adjusted Langevin Algorithm for Sampling Jeffreys Prior
A Metropolis-Adjusted Langevin Algorithm for Sampling Jeffreys Prior arXiv:2504.06372v1 Announce Type: cross Abstract: Inference and estimation are fundamental aspects of statistics, system identification and machine learning. For most inference problems, prior knowledge is available on the system to be modeled, and Bayesian analysis is a natural framework to impose such prior information in the form…
-
Why CatBoost Works So Well: The Engineering Behind the Magic
Why CatBoost Works So Well: The Engineering Behind the Magic Gradient boosting is a cornerstone technique for modeling tabular data due to its speed and simplicity. It delivers great results without any fuss. When you look around you’ll see multiple options like LightGBM, XGBoost, etc. Catboost is one such variant. In this post, we will…
-
Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o
Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o Introduction I’ve always been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I started wondering: could we replicate that dynamic using AI agents—having them debate each…
-
Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models
Time Series Forecasting Made Simple (Part 1): Decomposition and Baseline Models I used to avoid time series analysis. Every time I took an online course, I’d see a module titled “Time Series Analysis” with subtopics like Fourier Transforms, autocorrelation functions and other intimidating terms. I don’t know why, but I always found a reason to avoid…
-
Mining Rules from Data
Mining Rules from Data Working with products, we might face a need to introduce some “rules”. Let me explain what I mean by “rules” in practical examples: Imagine that we’re seeing a massive wave of fraud in our product, and we want to restrict onboarding for a particular segment of customers to lower this risk. For…
-
Hyperflows: Pruning Reveals the Importance of Weights
Hyperflows: Pruning Reveals the Importance of Weights arXiv:2504.05349v1 Announce Type: new Abstract: Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most existing methods struggle to accurately assess the importance of individual weights due to their inherent interrelatedness, leading to poor performance, especially at extreme sparsity levels. We…
-
Survey on Algorithms for multi-index models
Survey on Algorithms for multi-index models arXiv:2504.05426v1 Announce Type: new Abstract: We review the literature on algorithms for estimating the index space in a multi-index model. The primary focus is on computationally efficient (polynomial-time) algorithms in Gaussian space, the assumptions under which consistency is guaranteed by these methods, and their sample complexity. In many cases,…
-
Actuarial Learning for Pension Fund Mortality Forecasting
Actuarial Learning for Pension Fund Mortality Forecasting arXiv:2504.05881v1 Announce Type: new Abstract: For the assessment of the financial soundness of a pension fund, it is necessary to take into account mortality forecasting so that longevity risk is consistently incorporated into future cash flows. In this article, we employ machine learning models applied to actuarial science…
-
Improved Inference of Inverse Ising Problems under Missing Observations in Restricted Boltzmann Machines
Improved Inference of Inverse Ising Problems under Missing Observations in Restricted Boltzmann Machines arXiv:2504.05643v1 Announce Type: new Abstract: Restricted Boltzmann machines (RBMs) are energy-based models analogous to the Ising model and are widely applied in statistical machine learning. The standard inverse Ising problem with a complete dataset requires computing both data and model expectations and…
-
Matched Topological Subspace Detector
Matched Topological Subspace Detector arXiv:2504.05892v1 Announce Type: new Abstract: Topological spaces, represented by simplicial complexes, capture richer relationships than graphs by modeling interactions not only between nodes but also among higher-order entities, such as edges or triangles. This motivates the representation of information defined in irregular domains as topological signals. By leveraging the spectral dualities…
-
A Data Scientist’s Guide to Docker Containers
A Data Scientist’s Guide to Docker Containers For an ML model to be useful, it needs to run somewhere. This somewhere is most likely not your local machine. A not-so-good model that runs in a production environment is better than a perfect model that never leaves your local machine. However, the production machine is usually…
-
Unlock the Power of ROC Curves: Intuitive Insights for Better Model Evaluation
Unlock the Power of ROC Curves: Intuitive Insights for Better Model Evaluation We’ve all been in that moment, right? Staring at a chart as if it’s some ancient script, wondering how we’re supposed to make sense of it all. That’s exactly how I felt when I was asked to explain the AUC for the ROC…
-
Circuit Tracing: A Step Closer to Understanding Large Language Models
Circuit Tracing: A Step Closer to Understanding Large Language Models Context Over the years, Transformer-based large language models (LLMs) have made substantial progress across a wide range of tasks, evolving from simple information retrieval systems to sophisticated agents capable of coding, writing, conducting research, and much more. But despite their capabilities, these models are still largely…
-
Batch Bayesian Optimization for High-Dimensional Experimental Design: Simulation and Visualization
Batch Bayesian Optimization for High-Dimensional Experimental Design: Simulation and Visualization arXiv:2504.03943v1 Announce Type: new Abstract: Bayesian Optimization (BO) is increasingly used to guide experimental optimization tasks. To elucidate BO behavior in noisy and high-dimensional settings typical for materials science applications, we perform batch BO of two six-dimensional test functions: an Ackley function representing a needle-in-a-haystack…
-
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning arXiv:2504.03784v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies…
-
Spatially-Heterogeneous Causal Bayesian Networks for Seismic Multi-Hazard Estimation: A Variational Approach with Gaussian Processes and Normalizing Flows
Spatially-Heterogeneous Causal Bayesian Networks for Seismic Multi-Hazard Estimation: A Variational Approach with Gaussian Processes and Normalizing Flows arXiv:2504.04013v1 Announce Type: new Abstract: Post-earthquake hazard and impact estimation are critical for effective disaster response, yet current approaches face significant limitations. Traditional models employ fixed parameters regardless of geographical context, misrepresenting how seismic effects vary across diverse…
-
Computational Efficient Informative Nonignorable Matrix Completion: A Row- and Column-Wise Matrix U-Statistic Pseudo-Likelihood Approach
Computational Efficient Informative Nonignorable Matrix Completion: A Row- and Column-Wise Matrix U-Statistic Pseudo-Likelihood Approach arXiv:2504.04016v1 Announce Type: new Abstract: In this study, we establish a unified framework to deal with the high dimensional matrix completion problem under flexible nonignorable missing mechanisms. Although the matrix completion problem has attracted much attention over the years, there are…
-
Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes
Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes arXiv:2504.04105v1 Announce Type: new Abstract: We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $\eta$. We show that after at most $1/\gamma^2$ burn-in steps, GD…
-
Avoiding Costly Mistakes with Uncertainty Quantification for Algorithmic Home Valuations
Avoiding Costly Mistakes with Uncertainty Quantification for Algorithmic Home Valuations When you’re about to buy a home, whether you’re an everyday buyer looking for your dream house or a seasoned property investor, there’s a good chance you’ve encountered automated valuation models, or AVMs. These clever tools use massive datasets filled with past property transactions to…
-
How to Optimize your Python Program for Slowness
How to Optimize your Python Program for Slowness Also available: A Rust version of this article. Everyone talks about making Python programs faster [1, 2, 3], but what if we pursue the opposite goal? Let’s explore how to make them slower — absurdly slower. Along the way, we’ll examine the nature of computation, the role of memory,…
-
Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together
Let’s Call a Spade a Spade: RDF and LPG — Cousins Who Should Learn to Live Together In recent years, there has been a proliferation of articles, LinkedIn posts, and marketing materials presenting graph data models from different perspectives. This article will refrain from discussing specific products and instead focus solely on the comparison of…
-
ConfEviSurrogate: A Conformalized Evidential Surrogate Model for Uncertainty Quantification
ConfEviSurrogate: A Conformalized Evidential Surrogate Model for Uncertainty Quantification arXiv:2504.02919v1 Announce Type: new Abstract: Surrogate models, crucial for approximating complex simulation data across sciences, inherently carry uncertainties that range from simulation noise to model prediction errors. Without rigorous uncertainty quantification, predictions become unreliable and hence hinder analysis. While methods like Monte Carlo dropout and ensemble…
-
High-dimensional ridge regression with random features for non-identically distributed data with a variance profile
High-dimensional ridge regression with random features for non-identically distributed data with a variance profile arXiv:2504.03035v1 Announce Type: new Abstract: The behavior of the random feature model in the high-dimensional regression framework has become a popular issue of interest in the machine learning literature. This model is generally considered for feature vectors $x_i = \Sigma^{1/2} x_i'$,…
-
A computational transition for detecting multivariate shuffled linear regression by low-degree polynomials
A computational transition for detecting multivariate shuffled linear regression by low-degree polynomials arXiv:2504.03097v1 Announce Type: new Abstract: In this paper, we study the problem of multivariate shuffled linear regression, where the correspondence between predictors and responses in a linear model is obfuscated by a latent permutation. Specifically, we investigate the model $Y=\tfrac{1}{\sqrt{1+\sigma^2}}(\Pi_* X Q_* +…
-
Accelerating Particle-based Energetic Variational Inference
Accelerating Particle-based Energetic Variational Inference arXiv:2504.03158v1 Announce Type: new Abstract: In this work, we propose a novel particle-based variational inference (ParVI) method that accelerates the EVI-Im. Inspired by energy quadratization (EQ) and operator splitting techniques for gradient flows, our approach efficiently drives particles towards the target distribution. Unlike EVI-Im, which employs the implicit Euler method…
-
Bayesian Optimization of Robustness Measures Using Randomized GP-UCB-based Algorithms under Input Uncertainty
Bayesian Optimization of Robustness Measures Using Randomized GP-UCB-based Algorithms under Input Uncertainty arXiv:2504.03172v1 Announce Type: new Abstract: Bayesian optimization based on Gaussian process upper confidence bound (GP-UCB) has a theoretical guarantee for optimizing black-box functions. Black-box functions often have input uncertainty, but even in this case, GP-UCB can be extended to optimize evaluation measures called…
-
Weekly Entering & Transitioning – Thread 07 Apr, 2025 – 14 Apr, 2025
Weekly Entering & Transitioning – Thread 07 Apr, 2025 – 14 Apr, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos); Traditional education (e.g. schools, degrees, electives); Alternative education (e.g.…
-
How I Would Learn To Code (If I Could Start Over)
How I Would Learn To Code (If I Could Start Over) According to various sources, the average salary for coding jobs is ~£47.5k in the UK, which is ~35% higher than the median salary of about £35k. So, coding is a very valuable skill that will earn you more money, not to mention it’s really fun.…
-
Creating an AI Agent to Write Blog Posts with CrewAI
Creating an AI Agent to Write Blog Posts with CrewAI Introduction I love writing. You may notice that if you follow me or my blog. For that reason, I am constantly producing new content and talking about Data Science and Artificial Intelligence. I discovered this passion a couple of years ago when I was just…
-
Analytical Discovery of Manifold with Machine Learning
Analytical Discovery of Manifold with Machine Learning arXiv:2504.02511v1 Announce Type: new Abstract: Understanding low-dimensional structures within high-dimensional data is crucial for visualization, interpretation, and denoising in complex datasets. Despite the advancements in manifold learning techniques, key challenges, such as limited global insight and the lack of interpretable analytical descriptions, remain unresolved. In this work, we introduce a…
-
Dynamic Assortment Selection and Pricing with Censored Preference Feedback
Dynamic Assortment Selection and Pricing with Censored Preference Feedback arXiv:2504.02324v1 Announce Type: new Abstract: In this study, we investigate the problem of dynamic multi-product selection and pricing by introducing a novel framework based on a censored multinomial logit (C-MNL) choice model. In this model, sellers present a set of products with prices, and buyers filter…
-
Online Multivariate Regularized Distributional Regression for High-dimensional Probabilistic Electricity Price Forecasting
Online Multivariate Regularized Distributional Regression for High-dimensional Probabilistic Electricity Price Forecasting arXiv:2504.02518v1 Announce Type: new Abstract: Probabilistic electricity price forecasting (PEPF) is a key task for market participants in short-term electricity markets. The increasing availability of high-frequency data and the need for real-time decision-making in energy markets require online estimation methods for efficient model updating.…
-
On Model Protection in Federated Learning against Eavesdropping Attacks
On Model Protection in Federated Learning against Eavesdropping Attacks arXiv:2504.02114v1 Announce Type: cross Abstract: In this study, we investigate the protection offered by federated learning algorithms against eavesdropping adversaries. In our model, the adversary is capable of intercepting model updates transmitted from clients to the server, enabling it to create its own estimate of the…
-
Towards Interpretable Soft Prompts
Towards Interpretable Soft Prompts arXiv:2504.02144v1 Announce Type: cross Abstract: Soft prompts have been popularized as a cheap and easy way to improve task-specific LLM performance beyond few-shot prompts. Despite their origin as an automated prompting method, however, soft prompts and other trainable prompts remain a black-box method with no immediately interpretable connections to prompting. We…
-
Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data
Are We Watching More Ads Than Content? Analyzing YouTube Sponsor Data I’m definitely not the only person who feels that YouTube sponsor segments have become longer and more frequent recently. Sometimes, I watch videos that seem to be trying to sell me something every couple of seconds. On one hand, it’s great that both small and…
-
Linear Programming: Managing Multiple Targets with Goal Programming
Linear Programming: Managing Multiple Targets with Goal Programming This is the sixth (and likely last) part of a Linear Programming series I’ve been writing. With the core concepts covered by the prior articles, this article focuses on goal programming, which is a less frequent linear programming (LP) use case. Goal programming is a specific linear…
-
Kernel Case Study: Flash Attention
Kernel Case Study: Flash Attention The attention mechanism is at the core of modern-day transformers. But scaling the context window of these transformers was a major challenge, and it still is, even though we are in the era of context windows of a million tokens or more (Qwen 2.5 [1]). There are both considerable compute and memory…
-
Fair Sufficient Representation Learning
Fair Sufficient Representation Learning arXiv:2504.01030v1 Announce Type: new Abstract: The main objective of fair statistical modeling and machine learning is to minimize or eliminate biases that may arise from the data or the model itself, ensuring that predictions and decisions are not unjustly influenced by sensitive attributes such as race, gender, age, or other protected…
-
Estimating Unbounded Density Ratios: Applications in Error Control under Covariate Shift
Estimating Unbounded Density Ratios: Applications in Error Control under Covariate Shift arXiv:2504.01031v1 Announce Type: new Abstract: The density ratio is an important metric for evaluating the relative likelihood of two probability distributions, with extensive applications in statistics and machine learning. However, existing estimation theories for density ratios often depend on stringent regularity conditions, mainly focusing…