Category: aimlds

Support Collapse of Deep Gaussian Processes with Polynomial Kernels for a Wide Regime of Hyperparameters

Support Collapse of Deep Gaussian Processes with Polynomial Kernels for a Wide Regime of Hyperparameters arXiv:2503.12266v1 Announce Type: new Abstract: We analyze the prior that a Deep Gaussian Process with polynomial kernels induces. We observe that, even for relatively small depths, averaging effects occur within such a Deep Gaussian Process and that the prior can…

March 18, 2025
Nonlinear Principal Component Analysis with Random Bernoulli Features for Process Monitoring

Nonlinear Principal Component Analysis with Random Bernoulli Features for Process Monitoring arXiv:2503.12456v1 Announce Type: new Abstract: The process generates substantial amounts of data with highly complex structures, leading to the development of numerous nonlinear statistical methods. However, most of these methods rely on computations involving large-scale dense kernel matrices. This dependence poses significant challenges in…

March 18, 2025
SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement

SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement arXiv:2503.12760v1 Announce Type: new Abstract: To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes, while ensuring there are no undesirable changes in guardrail…

March 18, 2025
Learn then Decide: A Learning Approach for Designing Data Marketplaces

Learn then Decide: A Learning Approach for Designing Data Marketplaces arXiv:2503.10773v1 Announce Type: new Abstract: As data marketplaces become increasingly central to the digital economy, it is crucial to design efficient pricing mechanisms that optimize revenue while ensuring fair and adaptive pricing. We introduce the Maximum Auction-to-Posted Price (MAPP) mechanism, a novel two-stage approach that…

March 17, 2025
Exploiting Concavity Information in Gaussian Process Contextual Bandit Optimization

Exploiting Concavity Information in Gaussian Process Contextual Bandit Optimization arXiv:2503.10836v1 Announce Type: new Abstract: The contextual bandit framework is widely used to solve sequential optimization problems where the reward of each decision depends on auxiliary context variables. In settings such as medicine, business, and engineering, the decision maker often possesses additional structural information on the…

March 17, 2025
On the Identifiability of Causal Abstractions

On the Identifiability of Causal Abstractions arXiv:2503.10834v1 Announce Type: new Abstract: Causal representation learning (CRL) enhances machine learning models’ robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown…

March 17, 2025
Mamba time series forecasting with uncertainty propagation

Mamba time series forecasting with uncertainty propagation arXiv:2503.10873v1 Announce Type: new Abstract: State space models, such as Mamba, have recently garnered attention in time series forecasting due to their ability to capture sequence patterns. However, in electricity consumption benchmarks, Mamba forecasts exhibit a mean error of approximately 8%. Similarly, in traffic occupancy benchmarks, the mean…

March 17, 2025
Clustering Items through Bandit Feedback: Finding the Right Feature out of Many

Clustering Items through Bandit Feedback: Finding the Right Feature out of Many arXiv:2503.11209v1 Announce Type: new Abstract: We study the problem of clustering a set of items based on bandit feedback. Each of the $n$ items is characterized by a feature vector, with a possibly large dimension $d$. The items are partitioned into two unknown…

March 17, 2025
Weekly Entering & Transitioning – Thread 17 Mar, 2025 – 24 Mar, 2025

Weekly Entering & Transitioning – Thread 17 Mar, 2025 – 24 Mar, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

March 17, 2025
The Impact of GenAI and Its Implications for Data Scientists

The Impact of GenAI and Its Implications for Data Scientists GenAI systems affect how we work. This general notion is well known. However, we are still unaware of the exact impact of GenAI. For example, how much do these tools affect our work? Do they have a larger impact on certain tasks? What does this…

March 15, 2025
Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and…

March 15, 2025
Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs

Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs Creating efficient prompts for large language models often starts as a simple task… but it doesn’t always stay that way. Initially, following basic best practices seems sufficient: adopt the persona of a specialist, write clear instructions, require a specific response format, and…

March 15, 2025
Nine Pico PIO Wats with Rust (Part 2)

Nine Pico PIO Wats with Rust (Part 2) This is Part 2 of an exploration into the unexpected quirks of programming the Raspberry Pi Pico PIO with Micropython. If you missed Part 1, we uncovered four Wats that challenge assumptions about register count, instruction slots, the behavior of pull noblock, and smart yet cheap hardware.…

March 15, 2025
Forget About Cloud Computing. On-Premises Is All the Rage Again

Forget About Cloud Computing. On-Premises Is All the Rage Again Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins. The tides are turning though. As much…

March 15, 2025
Power Spectrum Signatures of Graphs

Power Spectrum Signatures of Graphs arXiv:2503.09660v1 Announce Type: new Abstract: Point signatures based on the Laplacian operators on graphs, point clouds, and manifolds have become popular tools in machine learning for graphs, clustering, and shape analysis. In this work, we propose a novel point signature, the power spectrum signature, a measure on $\mathbb{R}$ defined as…

March 14, 2025
Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks

Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks arXiv:2503.10496v1 Announce Type: new Abstract: Modeling natural phenomena with artificial neural networks (ANNs) often provides highly accurate predictions. However, ANNs often suffer from over-parameterization, complicating interpretation and raising uncertainty issues. Bayesian neural networks (BNNs) address the latter by representing weights as probability distributions, allowing…

March 14, 2025
Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures

Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures arXiv:2503.10576v1 Announce Type: new Abstract: A common approach to generative modeling is to split model-fitting into two blocks: define first how to sample noise (e.g. Gaussian) and choose next what to do with it (e.g. using a single map or flows). We…

March 14, 2025
Technical Insights and Legal Considerations for Advancing Federated Learning in Bioinformatics

Technical Insights and Legal Considerations for Advancing Federated Learning in Bioinformatics arXiv:2503.09649v1 Announce Type: cross Abstract: Federated learning leverages data across institutions to improve clinical discovery while complying with data-sharing restrictions and protecting patient privacy. As the evolution of biobanks in genetics and systems biology has proved, accessing more extensive and varied data pools leads…

March 14, 2025
Bags of Projected Nearest Neighbours: Competitors to Random Forests?

Bags of Projected Nearest Neighbours: Competitors to Random Forests? arXiv:2503.09651v1 Announce Type: cross Abstract: In this paper we introduce a simple and intuitive adaptive k nearest neighbours classifier, and explore its utility within the context of bootstrap aggregating (“bagging”). The approach is based on finding discriminant subspaces which are computationally efficient to compute, and are…

March 14, 2025
Essential Review Papers on Physics-Informed Neural Networks: A Curated Guide for Practitioners

Essential Review Papers on Physics-Informed Neural Networks: A Curated Guide for Practitioners Staying on top of a fast-growing research field is never easy. I face this challenge firsthand as a practitioner in Physics-Informed Neural Networks (PINNs). New papers, be they algorithmic advancements or cutting-edge applications, are published at an accelerating pace by both academia and…

March 14, 2025
Anatomy of a Parquet File

Anatomy of a Parquet File In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: faster query execution when only a subset of columns is being processed; quick calculation of statistics across all data; and reduced storage volume thanks to efficient compression. When combined…

March 14, 2025
Fourier Transform Applications in Literary Analysis

Fourier Transform Applications in Literary Analysis Poetry is often seen as a pure art form, ranging from the rigid structure of a haiku to the fluid, unconstrained nature of free-verse poetry. In analysing these works, though, to what extent can mathematics and Data Analysis be used to glean meaning from this free-flowing literature? Of course,…

March 14, 2025
Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components — HDFS for storage, MapReduce for processing,…

March 14, 2025
Are You Still Using LoRA to Fine-Tune Your LLM?

Are You Still Using LoRA to Fine-Tune Your LLM? LoRA (Low Rank Adaptation – arxiv.org/abs/2106.09685) is a popular technique for fine-tuning Large Language Models (LLMs) on the cheap. But 2024 has seen an explosion of new parameter-efficient fine-tuning techniques, an alphabet soup of LoRA alternatives: SVF, SVFT, MiLoRA, PiSSA, LoRA-XS … And most are based…

March 14, 2025
Learning Pareto manifolds in high dimensions: How can regularization help?

Learning Pareto manifolds in high dimensions: How can regularization help? arXiv:2503.08849v1 Announce Type: new Abstract: Simultaneously addressing multiple objectives is becoming increasingly important in modern machine learning. At the same time, data is often high-dimensional and costly to label. For a single objective such as prediction risk, conventional regularization techniques are known to improve generalization…

March 13, 2025
A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation

A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation arXiv:2503.08902v1 Announce Type: new Abstract: Mutual Information (MI) is a crucial measure for capturing dependencies between variables, but exact computation is challenging in high dimensions with intractable likelihoods, impacting accuracy and robustness. One idea is to use an auxiliary neural network to train an MI…

March 13, 2025
Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms

Risk-sensitive Bandits: Arm Mixture Optimality and Regret-efficient Algorithms arXiv:2503.08896v1 Announce Type: new Abstract: This paper introduces a general framework for risk-sensitive bandits that integrates the notions of risk-sensitive objectives by adopting a rich class of distortion riskmetrics. The introduced framework subsumes the various existing risk-sensitive models. An important and hitherto unknown observation is that for…

March 13, 2025
Self-Consistent Equation-guided Neural Networks for Censored Time-to-Event Data

Self-Consistent Equation-guided Neural Networks for Censored Time-to-Event Data arXiv:2503.09097v1 Announce Type: new Abstract: In survival analysis, estimating the conditional survival function given predictors is often of interest. There is a growing trend in the development of deep learning methods for analyzing censored time-to-event data, especially when dealing with high-dimensional predictors that are complexly interrelated. Many…

March 13, 2025
Addressing pitfalls in implicit unobserved confounding synthesis using explicit block hierarchical ancestral sampling

Addressing pitfalls in implicit unobserved confounding synthesis using explicit block hierarchical ancestral sampling arXiv:2503.09194v1 Announce Type: new Abstract: Unbiased data synthesis is crucial for evaluating causal discovery algorithms in the presence of unobserved confounding, given the scarcity of real-world datasets. A common approach, implicit parameterization, encodes unobserved confounding by modifying the off-diagonal entries of the…

March 13, 2025
Probabilistic Shielding for Safe Reinforcement Learning

Probabilistic Shielding for Safe Reinforcement Learning arXiv:2503.07671v1 Announce Type: new Abstract: In real-life scenarios, a Reinforcement Learning (RL) agent aiming to maximise their reward must often also behave in a safe manner, including at training time. Thus, much attention in recent years has been given to Safe RL, where an agent aims to learn an…

March 12, 2025
Personalized Convolutional Dictionary Learning of Physiological Time Series

Personalized Convolutional Dictionary Learning of Physiological Time Series arXiv:2503.07687v1 Announce Type: new Abstract: Human physiological signals tend to exhibit both global and local structures: the former are shared across a population, while the latter reflect inter-individual variability. For instance, kinetic measurements of the gait cycle during locomotion present common characteristics, although idiosyncrasies may be observed…

March 12, 2025
Uncertainty quantification and posterior sampling for network reconstruction

Uncertainty quantification and posterior sampling for network reconstruction arXiv:2503.07736v1 Announce Type: new Abstract: Network reconstruction is the task of inferring the unseen interactions between elements of a system, based only on their behavior or dynamics. This inverse problem is in general ill-posed, and admits many solutions for the same observation. Nevertheless, the vast majority of…

March 12, 2025
Cost-Aware Optimal Pairwise Pure Exploration

Cost-Aware Optimal Pairwise Pure Exploration arXiv:2503.07877v1 Announce Type: new Abstract: Pure exploration is one of the fundamental problems in multi-armed bandits (MAB). However, existing works mostly focus on specific pure exploration tasks, without a holistic view of the general pure exploration problem. This work fills this gap by introducing a versatile framework to study pure…

March 12, 2025
Pure Exploration with Feedback Graphs

Pure Exploration with Feedback Graphs arXiv:2503.07824v1 Announce Type: new Abstract: We study the sample complexity of pure exploration in an online learning problem with a feedback graph. This graph dictates the feedback available to the learner, covering scenarios between full-information, pure bandit feedback, and settings with no feedback on the chosen action. While variants of…

March 12, 2025
7 Powerful DBeaver Tips and Tricks to Improve Your SQL Workflow

7 Powerful DBeaver Tips and Tricks to Improve Your SQL Workflow DBeaver is the most powerful open-source SQL IDE, but there are several features people don’t know about. In this post, I will share with you several features to speed up your workflow, with zero fluff. I’ve learned these as I’m currently digging deeper into…

March 12, 2025
How to Switch from Data Analyst to Data Scientist

How to Switch from Data Analyst to Data Scientist Are you a Data Analyst looking to break into data science? If so, this post is for you. Many people start in analytics because it generally has a lower barrier to entry, but as they gain experience, they realize they want to take on more technical…

March 12, 2025
Experiments Illustrated: Can $1 Change Behavior More Than $100?

Experiments Illustrated: Can $1 Change Behavior More Than $100? I currently lead a small data team at a small tech company. With everything small, we have a lot of autonomy over what, when, and how we run experiments. In this series, I’m opening the vault from our years of experimenting, each story highlighting a key…

March 12, 2025
Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies

Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become…

March 12, 2025
How to Develop Complex DAX Expressions

How to Develop Complex DAX Expressions At some point or another, any Power BI developer must write complex DAX expressions to analyze data. But nobody tells you how to do it. What’s the process for doing it? What is the best way to do it, and how supportive can a development process be? These are the questions…

March 12, 2025
Analyzing the Role of Permutation Invariance in Linear Mode Connectivity

Analyzing the Role of Permutation Invariance in Linear Mode Connectivity arXiv:2503.06001v1 Announce Type: new Abstract: It was empirically observed in Entezari et al. (2021) that when accounting for the permutation invariance of neural networks, there is likely no loss barrier along the linear interpolation between two SGD solutions — a phenomenon known as linear mode…

March 11, 2025
Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature

Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature arXiv:2503.06079v1 Announce Type: new Abstract: Despite the significance of probabilistic time-series forecasting models, their evaluation metrics often involve intractable integrations. The most widely used metric, the continuous ranked probability score (CRPS), is a strictly proper scoring function; however, its computation requires approximation. We found…

March 11, 2025
On Statistical Estimation of Edge-Reinforced Random Walks

On Statistical Estimation of Edge-Reinforced Random Walks arXiv:2503.06115v1 Announce Type: new Abstract: Reinforced random walks (RRWs), including vertex-reinforced random walks (VRRWs) and edge-reinforced random walks (ERRWs), model random walks where the transition probabilities evolve based on prior visitation history \cite{mgr, fmk, tarres, volkov}. These models have found applications in various areas, such as network representation learning \cite{xzzs},…

March 11, 2025
Double Debiased Machine Learning for Mediation Analysis with Continuous Treatments

Double Debiased Machine Learning for Mediation Analysis with Continuous Treatments arXiv:2503.06156v1 Announce Type: new Abstract: Uncovering causal mediation effects is of significant value to practitioners seeking to isolate the direct treatment effect from the potential mediated effect. We propose a double machine learning (DML) algorithm for mediation analysis that supports continuous treatments. To estimate the…

March 11, 2025
Bayesian Optimization for Robust Identification of Ornstein-Uhlenbeck Model

Bayesian Optimization for Robust Identification of Ornstein-Uhlenbeck Model arXiv:2503.06381v1 Announce Type: new Abstract: This paper deals with the identification of the stochastic Ornstein-Uhlenbeck (OU) process error model, which is characterized by an inverse time constant, and the unknown variances of the process and observation noises. Although the availability of the explicit expression of the log-likelihood…

March 11, 2025
Platform-Mesh, Hub and Spoke, and Centralised | 3 Types of data team

Platform-Mesh, Hub and Spoke, and Centralised | 3 Types of data team Introduction In the “ever rapidly changing landscape of Data and AI” (!), understanding data and AI architecture has never been more critical. However, something many leaders overlook is the importance of data team structure. While many of you reading this probably identify as the data…

March 11, 2025
Linear Regression in Time Series: Sources of Spurious Regression

Linear Regression in Time Series: Sources of Spurious Regression 1. Introduction It’s pretty clear that most of our work will be automated by AI in the future. This will be possible because many researchers and professionals are working hard to make their work available online. These contributions not only help us understand fundamental concepts but…

March 11, 2025
From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities

From Fuzzy to Precise: How a Morphological Feature Extractor Enhances AI’s Recognition Capabilities Introduction: Can AI really distinguish dog breeds like human experts? One day while taking a walk, I saw a fluffy white puppy and wondered, Is that a Bichon Frise or a Maltese? No matter how closely I looked, they seemed almost identical.…

March 11, 2025
Experiments Illustrated: How Random Assignment Saved Us $1M in Marketing Spend

Experiments Illustrated: How Random Assignment Saved Us $1M in Marketing Spend Running cool experiments is easily one of my favorite parts of working in data science. Most experiments don’t deliver big wins, so the winners make for fun stories. We’ve had a few of these at IntelyCare, and I’m sharing each story in a way…

March 11, 2025
Experiments Illustrated: How We Optimized Premium Listings on Our Nursing Job Board

Experiments Illustrated: How We Optimized Premium Listings on Our Nursing Job Board Running experiments is a task that often falls to data scientists. If that’s you, congrats! It can be a rewarding and high-impact area of work, but also requires tools found outside the typical ML-heavy data science curriculum. Even with the best tools, only…

March 11, 2025
A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD

A Practical Introduction to Kernel Discrepancies: MMD, HSIC & KSD arXiv:2503.04820v1 Announce Type: new Abstract: This article provides a practical introduction to kernel discrepancies, focusing on the Maximum Mean Discrepancy (MMD), the Hilbert-Schmidt Independence Criterion (HSIC), and the Kernel Stein Discrepancy (KSD). Various estimators for these discrepancies are presented, including the commonly-used V-statistics and U-statistics,…

March 10, 2025
Boltzmann convolutions and Welford mean-variance layers with an application to time series forecasting and classification

Boltzmann convolutions and Welford mean-variance layers with an application to time series forecasting and classification arXiv:2503.04956v1 Announce Type: new Abstract: In this paper we propose a novel problem called the ForeClassing problem where the loss of a classification decision is only observed at a future time point after the classification decision has to be made.…

March 10, 2025
A characterization of sample adaptivity in UCB data

A characterization of sample adaptivity in UCB data arXiv:2503.04855v1 Announce Type: new Abstract: We characterize a joint CLT of the number of pulls and the sample mean reward of the arms in a stochastic two-armed bandit environment under UCB algorithms. Several implications of this result are in place: (1) a nonstandard CLT of the number…

March 10, 2025
Empirical Bound Information-Directed Sampling for Norm-Agnostic Bandits

Empirical Bound Information-Directed Sampling for Norm-Agnostic Bandits arXiv:2503.05098v1 Announce Type: new Abstract: Information-directed sampling (IDS) is a powerful framework for solving bandit problems which has shown strong results in both Bayesian and frequentist settings. However, frequentist IDS, like many other bandit algorithms, requires that one have prior knowledge of a (relatively) tight upper bound on…

March 10, 2025
Topology-Aware Conformal Prediction for Stream Networks

Topology-Aware Conformal Prediction for Stream Networks arXiv:2503.04981v1 Announce Type: new Abstract: Stream networks, a unique class of spatiotemporal graphs, exhibit complex directional flow constraints and evolving dependencies, making uncertainty quantification a critical yet challenging task. Traditional conformal prediction methods struggle in this setting due to the need for joint predictions across multiple interdependent locations and…

March 10, 2025
Weekly Entering & Transitioning – Thread 10 Mar, 2025 – 17 Mar, 2025

Weekly Entering & Transitioning – Thread 10 Mar, 2025 – 17 Mar, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

March 10, 2025
Custom Training Pipeline for Object Detection Models

Custom Training Pipeline for Object Detection Models What if you want to write the whole object detection training pipeline from scratch, so you can understand each step and be able to customize it? That’s what I set out to do. I examined several well-known object detection pipelines and designed one that best suits my needs…

March 8, 2025
Comprehensive Guide to Dependency Management in Python

Comprehensive Guide to Dependency Management in Python Introduction When learning Python, many beginners focus solely on the language and its libraries while completely ignoring virtual environments. As a result, managing Python projects can become a mess: dependencies installed for different projects may have conflicting versions, leading to compatibility issues. Even when I studied Python, nobody…

March 8, 2025
Using GPT-4 for Personal Styling

Using GPT-4 for Personal Styling I’ve always been fascinated by Fashion—collecting unique pieces and trying to blend them in my own way. But let’s just say my closet was more of a work-in-progress avalanche than a curated wonderland. Every time I tried to add something new, I risked toppling my carefully balanced piles. Why this…

March 8, 2025
Image Captioning, Transformer Mode On

Image Captioning, Transformer Mode On Introduction In my previous article, I discussed one of the earliest Deep Learning approaches for image captioning. If you’re interested in reading it, you can find the link to that article at the end of this one. Today, I would like to talk about Image Captioning again, but this time…

March 8, 2025
When You Just Can’t Decide on a Single Action

When You Just Can’t Decide on a Single Action In Game Theory, the players typically have to make assumptions about the other players’ actions. What will the other player do? Will he use rock, paper or scissors? You never know, but in some cases, you might have an idea of the probability of some actions…

March 8, 2025
Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Reheated Gradient-based Discrete Sampling for Combinatorial Optimization arXiv:2503.04047v1 Announce Type: new Abstract: Recently, gradient-based discrete sampling has emerged as a highly efficient, general-purpose solver for various combinatorial optimization (CO) problems, achieving performance comparable to or surpassing the popular data-driven approaches. However, we identify a critical issue in these methods, which we term “wandering in contours”.…

March 7, 2025
Conformal Prediction with Upper and Lower Bound Models

Conformal Prediction with Upper and Lower Bound Models arXiv:2503.04071v1 Announce Type: new Abstract: This paper studies a Conformal Prediction (CP) methodology for building prediction intervals in a regression setting, given only deterministic lower and upper bounds on the target variable. It proposes a new CP mechanism (CPUL) that goes beyond post-processing by adopting a model…

March 7, 2025
Generalization in Federated Learning: A Conditional Mutual Information Framework

Generalization in Federated Learning: A Conditional Mutual Information Framework arXiv:2503.04091v1 Announce Type: new Abstract: Federated Learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two components: the out-of-sample gap, which measures the gap between the empirical…

March 7, 2025
Learning Causal Response Representations through Direct Effect Analysis

Learning Causal Response Representations through Direct Effect Analysis arXiv:2503.04358v1 Announce Type: new Abstract: We propose a novel approach for learning causal response representations. Our method aims to extract directions in which a multidimensional outcome is most directly caused by a treatment variable. By bridging conditional independence testing with causal representation learning, we formulate an optimisation…

March 7, 2025
Time-varying Factor Augmented Vector Autoregression with Grouped Sparse Autoencoder

Time-varying Factor Augmented Vector Autoregression with Grouped Sparse Autoencoder arXiv:2503.04386v1 Announce Type: new Abstract: Recent economic events, including the global financial crisis and COVID-19 pandemic, have exposed limitations in linear Factor Augmented Vector Autoregressive (FAVAR) models for forecasting and structural analysis. Nonlinear dimension techniques, particularly autoencoders, have emerged as promising alternatives in a FAVAR framework,…

March 7, 2025
How to Spot and Prevent Model Drift Before it Impacts Your Business

How to Spot and Prevent Model Drift Before it Impacts Your Business Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business…

March 7, 2025
Applications of Entropy in Data Analysis and Machine Learning: A Review

Applications of Entropy in Data Analysis and Machine Learning: A Review arXiv:2503.02921v1 Announce Type: new Abstract: Since its origin in the thermodynamics of the 19th century, the concept of entropy has also permeated other fields of physics and mathematics, such as Classical and Quantum Statistical Mechanics, Information Theory, Probability Theory, Ergodic Theory and the Theory…

March 6, 2025
LAPD: Langevin-Assisted Bayesian Active Learning for Physical Discovery

LAPD: Langevin-Assisted Bayesian Active Learning for Physical Discovery arXiv:2503.02983v1 Announce Type: new Abstract: Discovering physical laws from data is a fundamental challenge in scientific research, particularly when high-quality data are scarce or costly to obtain. Traditional methods for identifying dynamical systems often struggle with noise sensitivity, inefficiency in data usage, and the inability to quantify…

March 6, 2025
PAC Learning with Improvements

PAC Learning with Improvements arXiv:2503.03184v1 Announce Type: new Abstract: One of the most basic lower bounds in machine learning is that in nearly any nontrivial setting, it takes $\textit{at least}$ $1/\epsilon$ samples to learn to error $\epsilon$ (and more, if the classifier being learned is complex). However, suppose that data points are agents who have…

March 6, 2025
Convergence Rates for Softmax Gating Mixture of Experts

Convergence Rates for Softmax Gating Mixture of Experts arXiv:2503.03213v1 Announce Type: new Abstract: Mixture of experts (MoE) has recently emerged as an effective framework to advance the efficiency and scalability of machine learning models by softly dividing complex tasks among multiple specialized sub-models termed experts. Central to the success of MoE is an adaptive softmax…

March 6, 2025
Exploring specialization and sensitivity of convolutional neural networks in the context of simultaneous image augmentations

Exploring specialization and sensitivity of convolutional neural networks in the context of simultaneous image augmentations arXiv:2503.03283v1 Announce Type: new Abstract: Drawing parallels with the way biological networks are studied, we adapt the treatment–control paradigm to explainable artificial intelligence research and enrich it through multi-parametric input alterations. In this study, we propose a framework for investigating…

March 6, 2025
One-Tailed Vs. Two-Tailed Tests

One-Tailed Vs. Two-Tailed Tests Introduction If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no—or if you’re not even sure what this means—then this blog post is for…

March 6, 2025
Kubernetes — Understanding and Utilizing Probes Effectively

Kubernetes — Understanding and Utilizing Probes Effectively Introduction Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits. Reducing deployment times, making your applications react better to scaling events, and managing the health of running pods all require fine-tuning your container…

March 6, 2025
Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation

Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation Introduction Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results,…

March 6, 2025
Generative AI Is Declarative

Generative AI Is Declarative ChatGPT launched in 2022 and kicked off the generative AI boom. In the two years since, academics, technologists, and armchair experts have written libraries' worth of articles on the technical underpinnings of generative AI and about the potential capabilities of both current and future generative AI models. Surprisingly little has been…

March 6, 2025
Mathematical Foundation of Interpretable Equivariant Surrogate Models

Mathematical Foundation of Interpretable Equivariant Surrogate Models arXiv:2503.01942v1 Announce Type: new Abstract: This paper introduces a rigorous mathematical framework for neural network explainability, and more broadly for the explainability of equivariant operators called Group Equivariant Operators (GEOs) based on Group Equivariant Non-Expansive Operators (GENEOs) transformations. The central concept involves quantifying the distance between GEOs by…

March 5, 2025
Gradient-free stochastic optimization for additive models

Gradient-free stochastic optimization for additive models arXiv:2503.02131v1 Announce Type: new Abstract: We address the problem of zero-order optimization from noisy observations for an objective function satisfying the Polyak-Łojasiewicz or the strong convexity condition. Additionally, we assume that the objective function has an additive structure and satisfies a higher-order smoothness property, characterized by the Hölder family…

March 5, 2025
Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification

Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification arXiv:2503.02110v1 Announce Type: new Abstract: We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. \citet{GL} previously established the lack of asymptotic…

March 5, 2025
Online Inference for Quantiles by Constant Learning-Rate Stochastic Gradient Descent

Online Inference for Quantiles by Constant Learning-Rate Stochastic Gradient Descent arXiv:2503.02178v1 Announce Type: new Abstract: This paper proposes an online inference method of the stochastic gradient descent (SGD) with a constant learning rate for quantile loss functions with theoretical guarantees. Since the quantile loss function is neither smooth nor strongly convex, we view such SGD…

March 5, 2025
Decentralized Reinforcement Learning for Multi-Agent Multi-Resource Allocation via Dynamic Cluster Agreements

Decentralized Reinforcement Learning for Multi-Agent Multi-Resource Allocation via Dynamic Cluster Agreements arXiv:2503.02437v1 Announce Type: new Abstract: This paper addresses the challenge of allocating heterogeneous resources among multiple agents in a decentralized manner. Our proposed method, LGTC-IPPO, builds upon Independent Proximal Policy Optimization (IPPO) by integrating dynamic cluster consensus, a mechanism that allows agents to form…

March 5, 2025
Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review

Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review “Conduct a comprehensive literature review on the state-of-the-art in Machine Learning and energy consumption. […]” With this prompt, I tested the new Deep Research function, which has been integrated into the OpenAI o3 reasoning model since the end of February — and conducted a state-of-the-art literature…

March 5, 2025
Practical SQL Puzzles That Will Level Up Your Skill

Practical SQL Puzzles That Will Level Up Your Skill There are some SQL patterns that, once you know them, you start seeing them everywhere. The solutions to the puzzles that I will show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries…

March 5, 2025
Mastering 1:1s as a Data Scientist: From Status Updates to Career Growth

Mastering 1:1s as a Data Scientist: From Status Updates to Career Growth I have been a data team manager for six months, and my team has grown from three to five. I wrote about my initial manager experiences back in November. In this article, I want to talk about something that is more essential to…

March 5, 2025
The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI

The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI Advancements in agentic artificial intelligence (AI) promise to bring significant opportunities to individuals and businesses in all sectors. However, as AI agents become more autonomous, they may use scheming behavior or break rules to achieve their functional goals. This can lead to the machine…

March 5, 2025
Approaching the Harm of Gradient Attacks While Only Flipping Labels

Approaching the Harm of Gradient Attacks While Only Flipping Labels arXiv:2503.00140v1 Announce Type: new Abstract: Availability attacks are one of the strongest forms of training-phase attacks in machine learning, making the model unusable. While prior work in distributed ML has demonstrated such effect via gradient attacks and, more recently, data poisoning, we ask: can similar…

March 4, 2025
An interpretation of the Brownian bridge as a physics-informed prior for the Poisson equation

An interpretation of the Brownian bridge as a physics-informed prior for the Poisson equation arXiv:2503.00213v1 Announce Type: new Abstract: Physics-informed machine learning is one of the most commonly used methods for fusing physical knowledge in the form of partial differential equations with experimental data. The idea is to construct a loss function where the physical…

March 4, 2025
Evolution of Information in Interactive Decision Making: A Case Study for Multi-Armed Bandits

Evolution of Information in Interactive Decision Making: A Case Study for Multi-Armed Bandits arXiv:2503.00273v1 Announce Type: new Abstract: We study the evolution of information in interactive decision making through the lens of a stochastic multi-armed bandit problem. Focusing on a fundamental example where a unique optimal arm outperforms the rest by a fixed margin, we…

March 4, 2025
LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention

LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention arXiv:2503.00387v1 Announce Type: new Abstract: Existing contextual multi-armed bandit (MAB) algorithms fail to effectively capture both long-term trends and local patterns across all arms, leading to suboptimal performance in environments with rapidly changing reward structures. They also rely on static exploration rates, which do not dynamically adjust…

March 4, 2025
Generalization Bounds for Equivariant Networks on Markov Data

Generalization Bounds for Equivariant Networks on Markov Data arXiv:2503.00292v1 Announce Type: new Abstract: Equivariant neural networks play a pivotal role in analyzing datasets with symmetry properties, particularly in complex data structures. However, integrating equivariance with Markov properties presents notable challenges due to the inherent dependencies within such data. Previous research has primarily concentrated on establishing…

March 4, 2025
How to Train LLMs to “Think” (o1 & DeepSeek-R1)

How to Train LLMs to “Think” (o1 & DeepSeek-R1) In September 2024, OpenAI released its o1 model, trained on large-scale reinforcement learning, giving it “advanced reasoning” capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the…

March 4, 2025
Generative AI and Civic Institutions

Generative AI and Civic Institutions Different sectors, different goals Recent events have got me thinking about AI as it relates to our civic institutions — think government, education, public libraries, and so on. We often forget that civic and governmental organizations are inherently deeply different from private companies and profit-making enterprises. They exist to enable…

March 4, 2025
LLM + RAG: Creating an AI-Powered File Reader Assistant

LLM + RAG: Creating an AI-Powered File Reader Assistant Introduction AI is everywhere. It is hard not to interact at least once a day with a Large Language Model (LLM). The chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.…

March 4, 2025
Data Science: From School to Work, Part II

Data Science: From School to Work, Part II In my previous article, I highlighted the importance of effective project management in Python development. Now, let’s shift our focus to the code itself and explore how to write clean, maintainable code — an essential practice in professional and collaborative environments. Readability & Maintainability: Well-structured code is easier to…

March 4, 2025
Avoidable and Unavoidable Randomness in GPT-4o

Avoidable and Unavoidable Randomness in GPT-4o Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces…

March 4, 2025
Transfer Learning through Enhanced Sufficient Representation: Enriching Source Domain Knowledge with Target Data

Transfer Learning through Enhanced Sufficient Representation: Enriching Source Domain Knowledge with Target Data arXiv:2502.20414v1 Announce Type: new Abstract: Transfer learning is an important approach for addressing the challenges posed by limited data availability in various applications. It accomplishes this by transferring knowledge from well-established source domains to a less familiar target domain. However, traditional transfer…

March 3, 2025
Efficient Risk-sensitive Planning via Entropic Risk Measures

Efficient Risk-sensitive Planning via Entropic Risk Measures arXiv:2502.20423v1 Announce Type: new Abstract: Risk-sensitive planning aims to identify policies maximizing some tail-focused metrics in Markov Decision Processes (MDPs). Such an optimization task can be very costly for the most widely used and interpretable metrics such as threshold probabilities or (Conditional) Values at Risk. Indeed, previous work…

March 3, 2025
Amortized Conditional Independence Testing

Amortized Conditional Independence Testing arXiv:2502.20925v1 Announce Type: new Abstract: Testing for the conditional independence structure in data is a fundamental and critical task in statistics and machine learning, which finds natural applications in causal discovery – a highly relevant problem to many scientific disciplines. Existing methods seek to design explicit test statistics that quantify the…

March 3, 2025
Learning Dynamics of Deep Linear Networks Beyond the Edge of Stability

Learning Dynamics of Deep Linear Networks Beyond the Edge of Stability arXiv:2502.20531v1 Announce Type: new Abstract: Deep neural networks trained using gradient descent with a fixed learning rate $\eta$ often operate in the regime of “edge of stability” (EOS), where the largest eigenvalue of the Hessian equilibrates about the stability threshold $2/\eta$. In this work,…

March 3, 2025
Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes

Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes arXiv:2502.20966v1 Announce Type: new Abstract: Uncertainty quantification in neural networks through methods such as Dropout, Bayesian neural networks and Laplace approximations is either prone to underfitting or computationally demanding, rendering these approaches impractical for large-scale datasets. In this work, we address these shortcomings by…

March 3, 2025
Weekly Entering & Transitioning – Thread 03 Mar, 2025 – 10 Mar, 2025

Weekly Entering & Transitioning – Thread 03 Mar, 2025 – 10 Mar, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

March 3, 2025