Category: aimlds

  • TV-SurvCaus: Dynamic Representation Balancing for Causal Survival Analysis

    TV-SurvCaus: Dynamic Representation Balancing for Causal Survival Analysis arXiv:2505.01785v1 Announce Type: new Abstract: Estimating the causal effect of time-varying treatments on survival outcomes is a challenging task in many domains, particularly in medicine where treatment protocols adapt over time. While recent advances in representation learning have improved causal inference for static treatments, extending these methods…

  • Fast Likelihood-Free Parameter Estimation for Lévy Processes

    Fast Likelihood-Free Parameter Estimation for Lévy Processes arXiv:2505.01639v1 Announce Type: new Abstract: Lévy processes are widely used in financial modeling due to their ability to capture discontinuities and heavy tails, which are common in high-frequency asset return data. However, parameter estimation remains a challenge when associated likelihoods are unavailable or costly to compute. We propose…

  • Bayesian learning of the optimal action-value function in a Markov decision process

    Bayesian learning of the optimal action-value function in a Markov decision process arXiv:2505.01859v1 Announce Type: new Abstract: The Markov Decision Process (MDP) is a popular framework for sequential decision-making problems, and uncertainty quantification is an essential component of it to learn optimal decision-making strategies. In particular, a Bayesian framework is used to maintain beliefs about…

  • Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks

    Extended Fiducial Inference for Individual Treatment Effects via Deep Neural Networks arXiv:2505.01995v1 Announce Type: new Abstract: Individual treatment effect estimation has gained significant attention in recent data science literature. This work introduces the Double Neural Network (Double-NN) method to address this problem within the framework of extended fiducial inference (EFI). In the proposed method, deep…

  • Learning the Simplest Neural ODE

    Learning the Simplest Neural ODE arXiv:2505.02019v1 Announce Type: new Abstract: Since the advent of the “Neural Ordinary Differential Equation (Neural ODE)” paper, learning ODEs with deep learning has been applied to system identification, time-series forecasting, and related areas. Exploiting the diffeomorphic nature of ODE solution maps, neural ODEs have also been used in generative…

  • Think. Know. Act. How AI’s Core Capabilities Will Shape the Future of Work

    Think. Know. Act. How AI’s Core Capabilities Will Shape the Future of Work “It is not the strongest of the species that survives, nor the most intelligent, but the one most responsive to change.” – Charles Darwin, Originator of Evolutionary Theory Not long ago, I came across an article about a CEO, who was visibly…

  • The CNN That Challenges ViT

    The CNN That Challenges ViT Introduction The invention of ViT (Vision Transformer) led many to believe that CNNs are obsolete. But is this really true? It is widely believed that the impressive performance of ViT comes primarily from its transformer-based architecture. However, researchers from Meta argued that this is not entirely true. If we take a closer…

  • Diffusion Models, Explained Simply

    Diffusion Models, Explained Simply Introduction Generative AI is one of the most popular terms we hear today. Recently, there has been a surge in generative AI applications involving text, image, audio, and video generation. When it comes to image creation, Diffusion models have emerged as a state-of-the-art technique for content generation. Although they were first introduced…

  • Making Sense of KPI Changes

    Making Sense of KPI Changes As analysts, we are usually monitoring metrics. Quite often, metrics change. And when they do, it’s our job to figure out what’s going on: why did the conversion rate suddenly drop, or what is driving consistent revenue growth? I started my journey in data analytics as a KPI analyst. For almost…

  • Fine-Tuning vLLMs for Document Understanding

    Fine-Tuning vLLMs for Document Understanding In this article, I discuss how you can fine-tune VLMs (visual large language models, often called vLLMs) like Qwen 2.5 VL 7B. I will introduce you to a dataset of handwritten digits, which the base version of Qwen 2.5 VL struggles with. We will then inspect the dataset, annotate it,…

  • On the emergence of numerical instabilities in Next Generation Reservoir Computing

    On the emergence of numerical instabilities in Next Generation Reservoir Computing arXiv:2505.00846v1 Announce Type: new Abstract: Next Generation Reservoir Computing (NGRC) is a low-cost machine learning method for forecasting chaotic time series from data. However, ensuring the dynamical stability of NGRC models during autonomous prediction remains a challenge. In this work, we uncover a key…

  • DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects

    DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects arXiv:2505.00961v1 Announce Type: new Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) for contextual bandit policies leverage historical data to evaluate and optimize a target policy. Most existing OPE/OPL methods, based on importance weighting or imputation, assume common support between the target and logging policies. When this assumption…

  • Gaussian Differential Private Bootstrap by Subsampling

    Gaussian Differential Private Bootstrap by Subsampling arXiv:2505.01197v1 Announce Type: new Abstract: Bootstrap is a common tool for quantifying uncertainty in data analysis. However, besides additional computational costs in the application of the bootstrap on massive data, a challenging problem in bootstrap based inference under Differential Privacy consists in the fact that it requires repeated access…

  • Characterization and Learning of Causal Graphs from Hard Interventions

    Characterization and Learning of Causal Graphs from Hard Interventions arXiv:2505.01037v1 Announce Type: new Abstract: A fundamental challenge in the empirical sciences involves uncovering causal structure through observation and experimentation. Causal discovery entails linking the conditional independence (CI) invariances in observational data to their corresponding graphical constraints via d-separation. In this paper, we consider a general…

  • Provable Efficiency of Guidance in Diffusion Models for General Data Distribution

    Provable Efficiency of Guidance in Diffusion Models for General Data Distribution arXiv:2505.01382v1 Announce Type: new Abstract: Diffusion models have emerged as a powerful framework for generative modeling, with guidance techniques playing a crucial role in enhancing sample quality. Despite their empirical success, a comprehensive theoretical understanding of the guidance effect remains limited. Existing studies only…

  • Weekly Entering & Transitioning – Thread 05 May, 2025 – 12 May, 2025

    Weekly Entering & Transitioning – Thread 05 May, 2025 – 12 May, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • From a Point to L∞

    From a Point to L∞ Why you should read this As someone who did a Bachelor’s in Mathematics I was first introduced to L¹ and L² as a measure of Distance… now it seems to be a measure of error — where have we gone wrong? But jokes aside, there seems to be this misconception that L₁ and L₂…

  • Build and Query Knowledge Graphs with LLMs

    Build and Query Knowledge Graphs with LLMs Knowledge Graphs are relevant A Knowledge Graph could be defined as a structured representation of information that connects concepts, entities, and their relationships in a way that mimics human understanding. It is often used to organise and integrate data from various sources, enabling machines to reason, infer, and retrieve relevant…

  • Attaining LLM Certainty with AI Decision Circuits

    Attaining LLM Certainty with AI Decision Circuits The promise of AI agents has taken the world by storm. Agents can interact with the world around them, write articles (not this one though), take actions on your behalf, and generally make the difficult part of automating any task easy and approachable.  Agents take aim at the most…

  • The Shape‑First Tune‑Up Provides Organizations with a Means to Reduce MongoDB Expenses by 79%

    The Shape‑First Tune‑Up Provides Organizations with a Means to Reduce MongoDB Expenses by 79% TL;DR A fast‑growing SaaS woke up to a silent auto‑scale from M20 → M60, adding 20% to their cloud bill overnight. In a frantic 48‑hour sprint we: flattened N + 1 waterfalls with $lookup, tamed unbounded cursors with projection,…

  • Why I stopped Using Cursor and Reverted to VSCode

    Why I stopped Using Cursor and Reverted to VSCode Introduction In December 2024, I wrote an article sharing my experience using VSCode (GitHub Copilot) and Cursor (Claude 3.5 Sonnet) from the perspective of a Data Scientist. Should you switch from VSCode to Cursor? I concluded the article by stating: After using Cursor for the past two…

  • Inference for max-linear Bayesian networks with noise

    Inference for max-linear Bayesian networks with noise arXiv:2505.00229v1 Announce Type: new Abstract: Max-Linear Bayesian Networks (MLBNs) provide a powerful framework for causal inference in extreme-value settings; we consider MLBNs with noise parameters with a given topology in terms of the max-plus algebra by taking its logarithm. Then, we show that an estimator of a parameter…

  • On the expressivity of deep Heaviside networks

    On the expressivity of deep Heaviside networks arXiv:2505.00110v1 Announce Type: new Abstract: We show that deep Heaviside networks (DHNs) have limited expressiveness but that this can be overcome by including either skip connections or neurons with linear activation. We provide lower and upper bounds for the Vapnik-Chervonenkis (VC) dimensions and approximation rates of these network…

  • Reinforcement Learning with Continuous Actions Under Unmeasured Confounding

    Reinforcement Learning with Continuous Actions Under Unmeasured Confounding arXiv:2505.00304v1 Announce Type: new Abstract: This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we…

  • Statistical Learning for Heterogeneous Treatment Effects: Pretraining, Prognosis, and Prediction

    Statistical Learning for Heterogeneous Treatment Effects: Pretraining, Prognosis, and Prediction arXiv:2505.00310v1 Announce Type: new Abstract: Robust estimation of heterogeneous treatment effects is a fundamental challenge for optimal decision-making in domains ranging from personalized medicine to educational policy. In recent years, predictive machine learning has emerged as a valuable toolbox for causal estimation, enabling more flexible…

  • Hypothesis-free discovery from epidemiological data by automatic detection and local inference for tree-based nonlinearities and interactions

    Hypothesis-free discovery from epidemiological data by automatic detection and local inference for tree-based nonlinearities and interactions arXiv:2505.00571v1 Announce Type: new Abstract: In epidemiological settings, Machine Learning (ML) is gaining popularity for hypothesis-free discovery of risk (or protective) factors. Although ML is strong at discovering non-linearities and interactions, this power is currently compromised by a lack…

  • Talking to Kids About AI

    Talking to Kids About AI I’ve had the pleasant opportunity recently to be involved with a program called Skype a Scientist, which pairs scientists of various types (biologists, botanists, engineers, computer scientists, etc) with classrooms of kids to talk about our work and answer their questions. I’m pretty familiar with discussing AI and machine learning with…

  • Agentic AI 101: Starting Your Journey Building AI Agents

    Agentic AI 101: Starting Your Journey Building AI Agents Introduction The Artificial Intelligence industry is moving fast. It is impressive and many times overwhelming. I have been studying, learning, and building my foundations in this area of Data Science because I believe that the future of Data Science is strongly correlated with the development of…

  • Rust for Python Developers: Why You Should Take a Look at the Rust Programming Language

    Rust for Python Developers: Why You Should Take a Look at the Rust Programming Language The programming language Rust is now appearing in many feeds as it offers a performant and secure way to write programs. If you come from the Python world of Pandas, Jupyter or Flask, you might think that…

  • A Farewell to APMs — The Future of Observability is MCP tools

    A Farewell to APMs — The Future of Observability is MCP tools The past years have been an absolute rollercoaster (or joyride) of rapidly evolving generative AI technologies. In the twenty-five years I’ve counted myself a software developer, I cannot recall a tectonic shift of a similar magnitude, one that is already fundamentally changing…

  • Step-by-Step Guide to Build and Deploy an LLM-Powered Chat with Memory in Streamlit

    Step-by-Step Guide to Build and Deploy an LLM-Powered Chat with Memory in Streamlit In this post, I’ll show you step by step how to build and deploy a chat powered with LLM — Gemini — in Streamlit and monitor the API usage on Google Cloud Console. Streamlit is a Python framework that makes it super easy to turn your…

  • Kernel Density Machines

    Kernel Density Machines arXiv:2504.21419v1 Announce Type: new Abstract: We introduce kernel density machines (KDM), a novel density ratio estimator in a reproducing kernel Hilbert space setting. KDM applies to general probability measures on countably generated measurable spaces without restrictive assumptions on continuity, or the existence of a Lebesgue density. For computational efficiency, we incorporate a…

  • Generate-then-Verify: Reconstructing Data from Limited Published Statistics

    Generate-then-Verify: Reconstructing Data from Limited Published Statistics arXiv:2504.21199v1 Announce Type: new Abstract: We study the problem of reconstructing tabular data from aggregate statistics, in which the attacker aims to identify interesting claims about the sensitive data that can be verified with 100% certainty given the aggregates. Successful attempts in prior work have conducted studies in…

  • Wasserstein-Aitchison GAN for angular measures of multivariate extremes

    Wasserstein-Aitchison GAN for angular measures of multivariate extremes arXiv:2504.21438v1 Announce Type: new Abstract: Economically responsible mitigation of multivariate extreme risks — extreme rainfall in a large area, huge variations of many stock prices, widespread breakdowns in transportation systems — requires estimates of the probabilities that such risks will materialize in the future. This paper develops…

  • A comparison of generative deep learning methods for multivariate angular simulation

    A comparison of generative deep learning methods for multivariate angular simulation arXiv:2504.21505v1 Announce Type: new Abstract: With the recent development of new geometric and angular-radial frameworks for multivariate extremes, reliably simulating from angular variables in moderate-to-high dimensions is of increasing importance. Empirical approaches have the benefit of simplicity, and work reasonably well in low dimensions,…

  • Balancing Interpretability and Flexibility in Modeling Diagnostic Trajectories with an Embedded Neural Hawkes Process Model

    Balancing Interpretability and Flexibility in Modeling Diagnostic Trajectories with an Embedded Neural Hawkes Process Model arXiv:2504.21795v1 Announce Type: new Abstract: The Hawkes process (HP) is commonly used to model event sequences with self-reinforcing dynamics, including electronic health records (EHRs). Traditional HPs capture self-reinforcement via parametric impact functions that can be inspected to understand how each…

  • How Would I Learn to Code with ChatGPT if I Had to Start Again

    How Would I Learn to Code with ChatGPT if I Had to Start Again Coding has been a part of my life since I was 10. From modifying HTML & CSS for my Friendster profile during the simple internet days to exploring SQL injections for the thrill, building a three-legged robot for fun, and lately…

  • Why Are Convolutional Neural Networks Great For Images?

    Why Are Convolutional Neural Networks Great For Images? The Universal Approximation Theorem states that a neural network with a single hidden layer and a nonlinear activation function can approximate any continuous function. Practical issues aside, such as the number of neurons in this hidden layer growing enormously large, we do not need other network architectures. A simple…

  • Modern GUI Applications for Computer Vision in Python

    Modern GUI Applications for Computer Vision in Python Introduction I’m a huge fan of interactive visualizations. As a computer vision engineer, I deal almost daily with image processing related tasks and more often than not I am iterating on a problem where I need visual feedback to make decisions. Let’s think of a very simple image…

  • Beyond Glorified Curve Fitting: Exploring the Probabilistic Foundations of Machine Learning

    Beyond Glorified Curve Fitting: Exploring the Probabilistic Foundations of Machine Learning You see a math formula you don’t immediately understand. Your instinct? Stop reading. Don’t. That’s exactly what I told myself when I started reading Probabilistic Machine Learning – An Introduction by Kevin P. Murphy. And it was absolutely worth it. It changed how I…

  • Reinforcement Learning from One Example?

    Reinforcement Learning from One Example? Prompt engineering alone won’t get us to production. Fine-tuning is expensive. And reinforcement learning? That’s been reserved for well-funded labs with massive datasets until now. New research from Microsoft and academic collaborators has overturned that assumption. Using Reinforcement Learning with Verifiable Rewards (RLVR) and just a single training example, researchers…

  • Coreset selection for the Sinkhorn divergence and generic smooth divergences

    Coreset selection for the Sinkhorn divergence and generic smooth divergences arXiv:2504.20194v1 Announce Type: new Abstract: We introduce CO2, an efficient algorithm to produce convexly-weighted coresets with respect to generic smooth divergences. By employing a functional Taylor expansion, we show a local equivalence between sufficiently regular losses and their second order approximations, reducing the coreset selection…

  • Learning and Generalization with Mixture Data

    Learning and Generalization with Mixture Data arXiv:2504.20651v1 Announce Type: new Abstract: In many, if not most, machine learning applications the training data is naturally heterogeneous (e.g. federated learning, adversarial attacks and domain adaptation in neural net training). Data heterogeneity is identified as one of the major challenges in modern day large-scale learning. A classical way…

  • Sobolev norm inconsistency of kernel interpolation

    Sobolev norm inconsistency of kernel interpolation arXiv:2504.20617v1 Announce Type: new Abstract: We study the consistency of minimum-norm interpolation in reproducing kernel Hilbert spaces corresponding to bounded kernels. Our main result gives lower bounds for the generalization error of the kernel interpolation measured in a continuous scale of norms that interpolate between $L^2$ and the hypothesis…

  • Preference-centric Bandits: Optimality of Mixtures and Regret-efficient Algorithms

    Preference-centric Bandits: Optimality of Mixtures and Regret-efficient Algorithms arXiv:2504.20877v1 Announce Type: new Abstract: The objective of canonical multi-armed bandits is to identify and repeatedly select an arm with the largest reward, often in the form of the expected value of the arm’s probability distribution. Such a utilitarian perspective and focus on the probability models’ first…

  • Decoding Latent Spaces: Assessing the Interpretability of Time Series Foundation Models for Visual Analytics

    Decoding Latent Spaces: Assessing the Interpretability of Time Series Foundation Models for Visual Analytics arXiv:2504.20099v1 Announce Type: cross Abstract: The present study explores the interpretability of latent spaces produced by time series foundation models, focusing on their potential for visual analysis tasks. Specifically, we evaluate the MOMENT family of models, a set of transformer-based, pre-trained…

  • From FOMO to Opportunity: Analytical AI in the Era of LLM Agents

    From FOMO to Opportunity: Analytical AI in the Era of LLM Agents Are you feeling “fear of missing out” (FOMO) when it comes to LLM agents? Well, that was the case for me for quite a while. In recent months, it feels like my online feeds have been completely bombarded by “LLM Agents”: every other…

  • Data Analyst or Data Engineer or Analytics Engineer or BI Engineer?

    Data Analyst or Data Engineer or Analytics Engineer or BI Engineer? If you’ve followed me for a while, you probably know I started my career as a QA engineer before transitioning into the world of data analytics. I didn’t go to school for it, didn’t have a mentor, and didn’t land in a formal training…

  • Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini

    Building a Scalable and Accurate Audio Interview Transcription Pipeline with Google Gemini This article is co-authored by Ugo Pradère and David Haüet How hard can it be to transcribe an interview? You feed the audio to an AI model, wait a few minutes, and boom: perfect transcript, right? Well… not quite. When it comes to…

  • How to Level Up Your Technical Skills in This AI Era

    How to Level Up Your Technical Skills in This AI Era AI-assisted coding is here to stay. Tools like Cursor, V0, and Lovable have dramatically lowered the barrier to entry — building dashboards, pipelines, or entire apps can now be done in a fraction of the time. I use these tools daily, and they’ve definitely made me…

  • AI Agents for a More Sustainable World

    AI Agents for a More Sustainable World As political support for sustainability weakens, the need for long-term sustainable practices has never been more critical. How can we use analytics, boosted by agentic AI, to support companies in their green transformation? For years, the focus of my blog was always on using Supply Chain Analytics methodologies…

  • Statistical Inference for Clustering-based Anomaly Detection

    Statistical Inference for Clustering-based Anomaly Detection arXiv:2504.18633v1 Announce Type: new Abstract: Unsupervised anomaly detection (AD) is a fundamental problem in machine learning and statistics. A popular approach to unsupervised AD is clustering-based detection. However, this method lacks the ability to guarantee the reliability of the detected anomalies. In this paper, we propose SI-CLAD (Statistical Inference…

  • Local Polynomial Lp-norm Regression

    Local Polynomial Lp-norm Regression arXiv:2504.18695v1 Announce Type: new Abstract: The local least squares estimator for a regression curve cannot provide optimal results when non-Gaussian noise is present. Both theoretical and empirical evidence suggests that residuals often exhibit distributional properties different from those of a normal distribution, making it worthwhile to consider estimation based on other…

  • Foundations of Safe Online Reinforcement Learning in the Linear Quadratic Regulator: $\sqrt{T}$-Regret

    Foundations of Safe Online Reinforcement Learning in the Linear Quadratic Regulator: $\sqrt{T}$-Regret arXiv:2504.18657v1 Announce Type: new Abstract: Understanding how to efficiently learn while adhering to safety constraints is essential for using online reinforcement learning in practical applications. However, proving rigorous regret bounds for safety-constrained reinforcement learning is difficult due to the complex interaction between safety,…

  • A Dictionary of Closed-Form Kernel Mean Embeddings

    A Dictionary of Closed-Form Kernel Mean Embeddings arXiv:2504.18830v1 Announce Type: new Abstract: Kernel mean embeddings — integrals of a kernel with respect to a probability distribution — are essential in Bayesian quadrature, but also widely used in other computational tools for numerical integration or for statistical inference based on the maximum mean discrepancy. These methods…

  • ReLU integral probability metric and its applications

    ReLU integral probability metric and its applications arXiv:2504.18897v1 Announce Type: new Abstract: We propose a parametric integral probability metric (IPM) to measure the discrepancy between two probability measures. The proposed IPM leverages a specific parametric family of discriminators, such as single-node neural networks with ReLU activation, to effectively distinguish between distributions, making it applicable in…

  • If I Wanted to Become a Machine Learning Engineer, I’d Do This

    If I Wanted to Become a Machine Learning Engineer, I’d Do This If I wanted to become a machine learning engineer again, this is the exact process I would follow. Let’s get into it! First become a data scientist or software engineer I’ve said it before, but a machine learning engineer is not exactly an entry-level position.…

  • How to Ensure Your AI Solution Does What You Expect It to Do

    How to Ensure Your AI Solution Does What You Expect It to Do Generative AI (GenAI) is evolving fast — and it’s no longer just about fun chatbots or impressive image generation. 2025 is the year where the focus is on turning the AI hype into real value. Companies everywhere are looking into ways to…

  • Struggling to Land a Data Role in 2025? These 5 Tips Will Change That

    Struggling to Land a Data Role in 2025? These 5 Tips Will Change That Breaking into the tech world is no longer as easy (or glamorous) as it used to be. Lots of people are finding it difficult to find their way into the current tech market. This can be due to lots of reasons…

  • NumExpr: The “Faster than Numpy” Library Most Data Scientists Have Never Used

    NumExpr: The “Faster than Numpy” Library Most Data Scientists Have Never Used Browsing GitHub the other day, I came across a library I’d never heard of before. It was called NumExpr. I was immediately interested because of some claims made about the library. In particular, it stated that for some complex numerical calculations, it was…

  • When OpenAI Isn’t Always the Answer: Enterprise Risks Behind Wrapper-Based AI Agents

    When OpenAI Isn’t Always the Answer: Enterprise Risks Behind Wrapper-Based AI Agents “Wait… are you sending journal entries to OpenAI?” That was the first thing my friend asked when I showed her Feel-Write, an AI-powered journaling app I built during a hackathon in San Francisco. I shrugged. “It was an AI-themed hackathon, I had to…

  • Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels

    Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels arXiv:2504.18184v1 Announce Type: new Abstract: This paper investigates regularized stochastic gradient descent (SGD) algorithms for estimating nonlinear operators from a Polish space to a separable Hilbert space. We assume that the regression operator lies in a vector-valued reproducing kernel Hilbert space induced by an operator-valued…

  • Learning Enhanced Ensemble Filters

    Learning Enhanced Ensemble Filters arXiv:2504.17836v1 Announce Type: new Abstract: The filtering distribution in hidden Markov models evolves according to the law of a mean-field model in state–observation space. The ensemble Kalman filter (EnKF) approximates this mean-field model with an ensemble of interacting particles, employing a Gaussian ansatz for the joint distribution of the state and…

  • Post-Transfer Learning Statistical Inference in High-Dimensional Regression

    Post-Transfer Learning Statistical Inference in High-Dimensional Regression arXiv:2504.18212v1 Announce Type: new Abstract: Transfer learning (TL) for high-dimensional regression (HDR) is an important problem in machine learning, particularly when dealing with limited sample size in the target task. However, there is currently no method to quantify the statistical significance of the relationship between features and the…

  • Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior

    Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior arXiv:2504.18455v1 Announce Type: new Abstract: We study the problem of distributed multi-view representation learning. In this problem, $K$ agents each observe a distinct, possibly statistically correlated, view and independently extract from it a suitable representation in a manner that a…

  • Enhancing Visual Interpretability and Explainability in Functional Survival Trees and Forests

    Enhancing Visual Interpretability and Explainability in Functional Survival Trees and Forests arXiv:2504.18498v1 Announce Type: new Abstract: Functional survival models are key tools for analyzing time-to-event data with complex predictors, such as functional or high-dimensional inputs. Despite their predictive strength, these models often lack interpretability, which limits their value in practical decision-making and risk analysis. This…

  • Weekly Entering & Transitioning – Thread 28 Apr, 2025 – 05 May, 2025

    Weekly Entering & Transitioning – Thread 28 Apr, 2025 – 05 May, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • A Step-By-Step Guide To Powering Your Application With LLMs

    A Step-By-Step Guide To Powering Your Application With LLMs You might be wondering whether GenAI is just hype or external noise. I also thought this was hype, and I could sit this one out until the dust cleared. Oh, boy, was I wrong. GenAI has real-world applications. It also generates revenue for companies, so we expect…

  • Behind the Magic: How Tensors Drive Transformers

    Behind the Magic: How Tensors Drive Transformers Introduction Transformers have changed the way artificial intelligence works, especially in understanding language and learning from data. At the core of these models are tensors (a generalized type of mathematical matrices that help process information). As data moves through the different parts of a Transformer, these tensors…

  • LLM Evaluations: from Prototype to Production

    LLM Evaluations: from Prototype to Production Evaluation is the cornerstone of any machine learning product. Investing in quality measurement delivers significant returns. Let’s explore the potential business benefits. As management consultant and writer Peter Drucker once said, “If you can’t measure it, you can’t improve it.” Building a robust evaluation system helps you identify areas…

  • Physics-informed features in supervised machine learning

    Physics-informed features in supervised machine learning arXiv:2504.17112v1 Announce Type: new Abstract: Supervised machine learning involves approximating an unknown functional relationship from a limited dataset of features and corresponding labels. The classical approach to feature-based machine learning typically relies on applying linear regression to standardized features, without considering their physical meaning. This may limit model explainability,…

  • Causal rule ensemble approach for multi-arm data

    Causal rule ensemble approach for multi-arm data arXiv:2504.17166v1 Announce Type: new Abstract: Heterogeneous treatment effect (HTE) estimation is critical in medical research. It provides insights into how treatment effects vary among individuals, which can provide statistical evidence for precision medicine. While most existing methods focus on binary treatment situations, real-world applications often involve multiple interventions.…

  • Likelihood-Free Variational Autoencoders

    Likelihood-Free Variational Autoencoders arXiv:2504.17622v1 Announce Type: new Abstract: Variational Autoencoders (VAEs) typically rely on a probabilistic decoder with a predefined likelihood, most commonly an isotropic Gaussian, to model the data conditional on latent variables. While convenient for optimization, this choice often leads to likelihood misspecification, resulting in blurry reconstructions and poor data fidelity, especially for…

  • Evaluating Uncertainty in Deep Gaussian Processes

    Evaluating Uncertainty in Deep Gaussian Processes arXiv:2504.17719v1 Announce Type: new Abstract: Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines…

  • (Im)possibility of Automated Hallucination Detection in Large Language Models

    (Im)possibility of Automated Hallucination Detection in Large Language Models arXiv:2504.17004v1 Announce Type: cross Abstract: Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to…

  • AWS: Deploying a FastAPI App on EC2 in Minutes

    AWS: Deploying a FastAPI App on EC2 in Minutes AWS is a popular cloud provider that enables the deployment and scaling of large applications. Mastering at least one cloud platform is an essential skill for software engineers and data scientists. Running an application locally is not enough to make it usable in production; it…

  • Government Funding Graph RAG

    Government Funding Graph RAG In this article, I present my latest open-source project — Government Funding Graph. The inspiration for this project came from a desire to make better tooling for grant writing, namely to suggest research topics, funding bodies, research institutions, and researchers. I have made Innovate UK grant applications in the past, so I have…

  • Choose the Right One: Evaluating Topic Models for Business Intelligence

    Choose the Right One: Evaluating Topic Models for Business Intelligence Topic models are used in businesses to classify brand-related text datasets (such as product and site reviews, surveys, and social media comments) and to track how customer satisfaction metrics change over time. There is a myriad of recent topic models one can choose from: the…

  • Predicting the NBA Champion with Machine Learning

    Predicting the NBA Champion with Machine Learning Every NBA season, 30 teams compete for something only one will achieve: the legacy of a championship. From power rankings to trade deadline chaos and injuries, fans and analysts alike speculate endlessly about who will raise the Larry O’Brien Trophy. But what if we could go beyond the hot…

  • Covariate-dependent Graphical Model Estimation via Neural Networks with Statistical Guarantees

    Covariate-dependent Graphical Model Estimation via Neural Networks with Statistical Guarantees arXiv:2504.16356v1 Announce Type: new Abstract: Graphical models are widely used in diverse application domains to model the conditional dependencies amongst a collection of random variables. In this paper, we consider settings where the graph structure is covariate-dependent, and investigate a deep neural network-based approach to…

  • Behavior of prediction performance metrics with rare events

    Behavior of prediction performance metrics with rare events arXiv:2504.16185v1 Announce Type: new Abstract: Area under the receiver operating characteristic curve (AUC) is commonly reported alongside binary prediction models. However, there are concerns that AUC might be a misleading measure of prediction performance in the rare event setting. This setting is common since many events of…

  • Towards Accurate Forecasting of Renewable Energy: Building Datasets and Benchmarking Machine Learning Models for Solar and Wind Power in France

    Towards Accurate Forecasting of Renewable Energy: Building Datasets and Benchmarking Machine Learning Models for Solar and Wind Power in France arXiv:2504.16100v1 Announce Type: cross Abstract: Accurate prediction of non-dispatchable renewable energy sources is essential for grid stability and price prediction. Regional power supply forecasts are usually indirect through a bottom-up approach of plant-level forecasts,…

  • Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning

    Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning arXiv:2504.16172v1 Announce Type: cross Abstract: High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time…

  • Probabilistic Emulation of the Community Radiative Transfer Model Using Machine Learning

    Probabilistic Emulation of the Community Radiative Transfer Model Using Machine Learning arXiv:2504.16192v1 Announce Type: cross Abstract: The continuous improvement in weather forecast skill over the past several decades is largely due to the increasing quantity of available satellite observations and their assimilation into operational forecast systems. Assimilating these observations requires observation operators in the form…

  • How to Benchmark DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI’s simple-evals

    How to Benchmark DeepSeek-R1 Distilled Models on GPQA Using Ollama and OpenAI’s simple-evals The recent launch of the DeepSeek-R1 model sent ripples across the global AI community. It delivered breakthroughs on par with the reasoning models from Meta and OpenAI, achieving this in a fraction of the time and at a significantly lower cost. Beyond…

  • Exporting MLflow Experiments from Restricted HPC Systems

    Exporting MLflow Experiments from Restricted HPC Systems Many High-Performance Computing (HPC) environments, especially in research and educational institutions, restrict communications to outbound TCP connections. Running a simple command-line ping or curl with the MLflow tracking URL on the HPC bash shell to check packet transfer can be successful. However, communication fails and times out while…

  • An Existential Crisis of a Veteran Researcher in the Age of Generative AI

    An Existential Crisis of a Veteran Researcher in the Age of Generative AI I was a researcher fifteen years ago, a PhD candidate doing research for long days. I was swamped with many articles, annotations, emails, bookmarks, etc. When I found a citation manager tool, Mendeley, I felt so relaxed. It was like I had…

  • Why Most Cyber Risk Models Fail Before They Begin

    Why Most Cyber Risk Models Fail Before They Begin Cybersecurity leaders are being asked impossible questions. “What’s the likelihood of a breach this year?” “How much would it cost?” And “how much should we spend to stop it?” Yet most risk models used today are still built on guesswork, gut instinct, and colorful heatmaps, not…

  • Data Science: From School to Work, Part IV

    Data Science: From School to Work, Part IV Let’s start with a simple example that will appeal to most of us. If you want to check if the blinkers of your car are working properly, you sit in the car, turn on the ignition and test a turn signal to see if the front…

  • Transfer Learning for High-dimensional Reduced Rank Time Series Models

    Transfer Learning for High-dimensional Reduced Rank Time Series Models arXiv:2504.15691v1 Announce Type: new Abstract: The objective of transfer learning is to enhance estimation and inference in a target data by leveraging knowledge gained from additional sources. Recent studies have explored transfer learning for independent observations in complex, high-dimensional models assuming sparsity, yet research on time…

  • From predictions to confidence intervals: an empirical study of conformal prediction methods for in-context learning

    From predictions to confidence intervals: an empirical study of conformal prediction methods for in-context learning arXiv:2504.15722v1 Announce Type: new Abstract: Transformers have become a standard architecture in machine learning, demonstrating strong in-context learning (ICL) abilities that allow them to learn from the prompt at inference time. However, uncertainty quantification for ICL remains an open challenge,…

  • How Private is Your Attention? Bridging Privacy with In-Context Learning

    How Private is Your Attention? Bridging Privacy with In-Context Learning arXiv:2504.16000v1 Announce Type: new Abstract: In-context learning (ICL), the ability of transformer-based models to perform new tasks from examples provided at inference time, has emerged as a hallmark of modern language models. While recent works have investigated the mechanisms underlying ICL, its feasibility under formal privacy constraints…

  • Explainable Unsupervised Anomaly Detection with Random Forest

    Explainable Unsupervised Anomaly Detection with Random Forest arXiv:2504.16075v1 Announce Type: new Abstract: We describe the use of an unsupervised Random Forest for similarity learning and improved unsupervised anomaly detection. By training a Random Forest to discriminate between real data and synthetic data sampled from a uniform distribution over the real data bounds, a distance measure…

  • Significativity Indices for Agreement Values

    Significativity Indices for Agreement Values arXiv:2504.15325v1 Announce Type: cross Abstract: Agreement measures, such as Cohen’s kappa or intraclass correlation, gauge the matching between two or more classifiers. They are used in a wide range of contexts from medicine, where they evaluate the effectiveness of medical treatments and clinical trials, to artificial intelligence, where they can…

  • Explained: How Does L1 Regularization Perform Feature Selection?

    Explained: How Does L1 Regularization Perform Feature Selection? Feature Selection is the process of selecting an optimal subset of features from a given set of features; an optimal feature subset is the one which maximizes the performance of the model on the given task. Feature selection can be a manual or rather explicit process when…

  • Enterprise AI: From Build-or-Buy to Partner-and-Grow

    Enterprise AI: From Build-or-Buy to Partner-and-Grow Not long ago, a cooperation partner casually approached me with an AI use case at their organization. They wanted to make their onboarding process for new staff more efficient by using AI to answer the repetitive questions of newcomers. I suggested a practical chat approach that would integrate their…

  • How to Get Performance Data from Power BI with DAX Studio

    How to Get Performance Data from Power BI with DAX Studio To put things straight: I will not discuss how to optimize DAX code today. More articles will follow, concentrating on common mistakes and how to avoid them. But, before we can understand the performance metrics, we need to understand the architecture of the…

  • MapReduce: How It Powers Scalable Data Processing

    MapReduce: How It Powers Scalable Data Processing In this article, I’ll give a brief introduction to the MapReduce programming model. Hopefully after reading this, you leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational…

  • AI Agents Processing Time Series and Large Dataframes

    AI Agents Processing Time Series and Large Dataframes Agents are AI systems, powered by LLMs, that can reason about their objectives and take actions to achieve a final goal. They are designed not just to respond to queries, but to orchestrate a sequence of operations, including processing data (e.g. dataframes and time series). This…

  • Learning over von Mises-Fisher Distributions via a Wasserstein-like Geometry

    Learning over von Mises-Fisher Distributions via a Wasserstein-like Geometry arXiv:2504.14164v1 Announce Type: new Abstract: We introduce a novel, geometry-aware distance metric for the family of von Mises-Fisher (vMF) distributions, which are fundamental models for directional data on the unit hypersphere. Although the vMF distribution is widely employed in a variety of probabilistic learning tasks involving…