Category: aimlds

Learnings from a Machine Learning Engineer — Part 1: The Data

It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you…

February 14, 2025
Learnings from a Machine Learning Engineer — Part 4: The Model

In this latest part of my series, I will share what I have learned about selecting a model for image classification and how to fine-tune that model. I will also show how you can leverage the model to accelerate your labelling process, and…

February 14, 2025
SNAP: Sequential Non-Ancestor Pruning for Targeted Causal Effect Estimation With an Unknown Graph

arXiv:2502.07857v1 Announce Type: new Abstract: Causal discovery can be computationally demanding for large numbers of variables. If we only wish to estimate the causal effects on a small subset of target variables, we might not need to learn the causal graph for…

February 13, 2025
Discrete Markov Probabilistic Models

arXiv:2502.07939v1 Announce Type: new Abstract: This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in the space of bits $\{0,1\}^d$, where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels…

February 13, 2025
The Observational Partial Order of Causal Structures with Latent Variables

arXiv:2502.07891v1 Announce Type: new Abstract: For two causal structures with the same set of visible variables, one is said to observationally dominate the other if the set of distributions over the visible variables realizable by the first contains the set of distributions over the visible…

February 13, 2025
Optimizing Likelihoods via Mutual Information: Bridging Simulation-Based Inference and Bayesian Optimal Experimental Design

arXiv:2502.08004v1 Announce Type: new Abstract: Simulation-based inference (SBI) is a method to perform inference on a variety of complex scientific models with challenging inference (inverse) problems. Bayesian Optimal Experimental Design (BOED) aims to efficiently use experimental resources to make better inferences. Various…

February 13, 2025
Multi-View Oriented GPLVM: Expressiveness and Efficiency

arXiv:2502.08253v1 Announce Type: new Abstract: The multi-view Gaussian process latent variable model (MV-GPLVM) aims to learn a unified representation from multi-view data but is hindered by challenges such as limited kernel expressiveness and low computational efficiency. To overcome these issues, we first introduce a new duality between the spectral…

February 13, 2025
Method of Moments Estimation with Python Code

Let’s say you are in a customer care center, and you would like to know the probability distribution of the number of calls per minute, or in other words, you want to answer the question: what is the probability of receiving zero, one, two, … etc., calls per…

February 13, 2025
Should Data Scientists Care About Quantum Computing?

I am sure the quantum hype has reached every person in tech (and outside it, most probably), along with some over-the-top claims like “some company has proved quantum supremacy,” “the quantum revolution is here,” or my favorite, “quantum computers are here, and it will make classical computers obsolete.” I…

February 13, 2025
How to Measure the Reliability of a Large Language Model’s Response

The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability turns out to be incredibly sophisticated when it…

February 13, 2025
Manage Environment Variables with Pydantic

Introduction: Developers work on applications that are meant to be deployed on a server so that anyone can use them. Typically, on the machine where these apps live, developers set up environment variables that allow the app to run. These variables can be API keys of external services,…

February 13, 2025
Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1GB if you have plenty of RAM. Within these size limits, it is also…

February 13, 2025
Confidence Intervals for Evaluation of Data Mining

arXiv:2502.07016v1 Announce Type: new Abstract: In data mining, when binary prediction rules are used to predict a binary outcome, many performance measures are used in a vast array of literature for the purposes of evaluation and comparison. Some examples include classification accuracy, precision, recall, F measures, and Jaccard…

February 12, 2025
Epistemic Uncertainty in Conformal Scores: A Unified Approach

arXiv:2502.06995v1 Announce Type: new Abstract: Conformal prediction methods create prediction bands with distribution-free guarantees but do not explicitly capture epistemic uncertainty, which can lead to overconfident predictions in data-sparse regions. Although recent conformal scores have been developed to address this limitation, they are typically designed for specific…

February 12, 2025
Generative Distribution Prediction: A Unified Approach to Multimodal Learning

arXiv:2502.07090v1 Announce Type: new Abstract: Accurate prediction with multimodal data, encompassing tabular, textual, and visual inputs or outputs, is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a…

February 12, 2025
Online Covariance Matrix Estimation in Sketched Newton Methods

arXiv:2502.07114v1 Announce Type: new Abstract: Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched Newton method that leverages a randomized sketching technique…

February 12, 2025
Riemannian Proximal Sampler for High-accuracy Sampling on Manifolds

arXiv:2502.07265v1 Announce Type: new Abstract: We introduce the Riemannian Proximal Sampler, a method for sampling from densities defined on Riemannian manifolds. The performance of this sampler critically depends on two key oracles: the Manifold Brownian Increments (MBI) oracle and the Riemannian Heat-kernel (RHK) oracle. We establish high-accuracy…

February 12, 2025
Build a Decision Tree in Polars from Scratch

Decision Tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as scikit-learn, LightGBM, XGBoost and CatBoost have done a very good job…

February 12, 2025
Virtualization & Containers for Data Science Newbies

Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak. Many cloud services rely on virtualization. But other technologies, such…

February 12, 2025
4-Dimensional Data Visualization: Time in Bubble Charts

Bubble Charts elegantly compress large amounts of information into a single visualization, with bubble size adding a third dimension. However, comparing “before” and “after” states is often crucial. To address this, we propose adding a transition between these states, creating an intuitive user experience. Since we couldn’t find…

February 12, 2025
Understanding Model Calibration: A Gentle Introduction & Visual Exploration

How Reliable Are Your Predictions? To be considered reliable, a model must be calibrated so that its confidence in each decision closely reflects its true outcome. In this blog post we’ll take a look at the most commonly used definition for calibration and then dive…

February 12, 2025
Data vs. Business Strategy

There seems to be a consensus that leveraging data, analytics, and AI to create a data-driven organization requires a clear strategic approach. However, there is less clarity and agreement on exactly what this strategic approach should look like in practice. This article provides a short overview of what strategy work I…

February 12, 2025
Online Covariance Estimation in Nonsmooth Stochastic Approximation

arXiv:2502.05305v1 Announce Type: new Abstract: We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam.…

February 11, 2025
On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers

arXiv:2502.05672v1 Announce Type: new Abstract: This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks,…

February 11, 2025
dynoGP: Deep Gaussian Processes for dynamic system identification

arXiv:2502.05620v1 Announce Type: new Abstract: In this work, we present a novel approach to system identification for dynamical systems, based on a specific class of Deep Gaussian Processes (Deep GPs). These models are constructed by interconnecting linear dynamic GPs (equivalent to stochastic linear time-invariant dynamical systems) and…

February 11, 2025
Generalized Venn and Venn-Abers Calibration with Applications in Conformal Prediction

arXiv:2502.05676v1 Announce Type: new Abstract: Ensuring model calibration is critical for reliable predictions, yet popular distribution-free methods, such as histogram binning and isotonic regression, provide only asymptotic guarantees. We introduce a unified framework for Venn and Venn-Abers calibration, generalizing Vovk’s binary classification approach to arbitrary…

February 11, 2025
TD(0) Learning converges for Polynomial mixing and non-linear functions

arXiv:2502.05706v1 Announce Type: new Abstract: Theoretical work on Temporal Difference (TD) learning has provided finite-sample and high-probability guarantees for data generated from Markov chains. However, these bounds typically require linear function approximation, instance-dependent step sizes, algorithmic modifications, and restrictive mixing rates. We present theoretical findings for…

February 11, 2025
Six Ways to Control Style and Content in Diffusion Models

Stable Diffusion 1.5/2.0/2.1/XL 1.0, DALL-E, Imagen… In recent years, diffusion models have showcased stunning quality in image generation. However, while they produce great results on generic concepts, they struggle to reach the same quality for more specialised queries, for example generating images in a specific style,…

February 11, 2025
Sparsity-Based Interpolation of External, Internal and Swap Regret

arXiv:2502.04543v1 Announce Type: new Abstract: Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $\phi$-regret minimization, which measures the performance of an algorithm by its regret with respect to an arbitrary action modification rule $\phi$. With $d$ experts…

February 10, 2025
Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect

arXiv:2502.04673v1 Announce Type: new Abstract: Estimation and inference for the Average Treatment Effect (ATE) is a cornerstone of causal inference and often serves as the foundation for developing procedures for more complicated settings. Although traditionally analyzed in a batch setting, recent advances in martingale theory…

February 10, 2025
Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond

arXiv:2502.04575v1 Announce Type: new Abstract: Given an unnormalized probability density $\pi\propto\mathrm{e}^{-V}$, estimating its normalizing constant $Z=\int_{\mathbb{R}^d}\mathrm{e}^{-V(x)}\mathrm{d}x$ or free energy $F=-\log Z$ is a crucial problem in Bayesian statistics, statistical mechanics, and machine learning. It is challenging especially in high dimensions…

February 10, 2025
A Meta-learner for Heterogeneous Effects in Difference-in-Differences

arXiv:2502.04699v1 Announce Type: new Abstract: We address the problem of estimating heterogeneous treatment effects in panel data, adopting the popular Difference-in-Differences (DiD) framework under the conditional parallel trends assumption. We propose a novel doubly robust meta-learner for the Conditional Average Treatment Effect on the Treated (CATT), reducing the…

February 10, 2025
PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders

arXiv:2502.04730v1 Announce Type: new Abstract: Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack…

February 10, 2025
Weekly Entering & Transitioning – Thread 10 Feb, 2025 – 17 Feb, 2025

Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

February 10, 2025
The Gamma Hurdle Distribution

Which Outcome Matters? Here is a common scenario: an A/B test was conducted, where a random sample of units (e.g. customers) was selected for a campaign and received Treatment A. Another sample was selected to receive Treatment B. “A” could be a communication or offer and “B” could be…

February 8, 2025
Triangle Forecasting: Why Traditional Impact Estimates Are Inflated (And How to Fix Them)

Accurate impact estimations can make or break your business case. Yet, despite their importance, most teams use oversimplified calculations that can lead to inflated projections. These shot-in-the-dark numbers not only destroy credibility with stakeholders but can also result in misallocation of resources and…

February 8, 2025
I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms

Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also, too…

February 8, 2025
Synthetic Data Generation with LLMs

Popularity of RAG: Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value. Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation…

February 8, 2025
The Method of Moments Estimator for Gaussian Mixture Models

Audio processing is one of the most important application domains of digital signal processing (DSP) and machine learning. Modeling acoustic environments is an essential step in developing digital audio processing systems such as speech recognition, speech enhancement, acoustic echo cancellation, etc. Acoustic environments are filled with background…

February 8, 2025
Two in context learning tasks with complex functions

arXiv:2502.03503v1 Announce Type: new Abstract: We examine two in context learning (ICL) tasks with mathematical functions in several train and test settings for transformer models. Our study generalizes work on linear functions by showing that small transformers, even models with attention layers only, can approximate arbitrary polynomial…

February 7, 2025
Multivariate Conformal Prediction using Optimal Transport

arXiv:2502.03609v1 Announce Type: new Abstract: Conformal prediction (CP) quantifies the uncertainty of machine learning models by constructing sets of plausible outputs. These sets are constructed by leveraging a so-called conformity score, a quantity computed using the input point of interest, a prediction model, and past observations. CP sets are…

February 7, 2025
Online Learning Algorithms in Hilbert Spaces with $\beta-$ and $\phi-$Mixing Sequences

arXiv:2502.03551v1 Announce Type: new Abstract: In this paper, we study an online algorithm in a reproducing kernel Hilbert space (RKHS) based on a class of dependent processes, called the mixing process. For such a process, the degree of dependence is measured by various mixing…

February 7, 2025
Rule-based Evolving Fuzzy System for Time Series Forecasting: New Perspectives Based on Type-2 Fuzzy Sets Measures Approach

arXiv:2502.03650v1 Announce Type: new Abstract: Real-world data contain uncertainty and variations that can be correlated to external variables, known as randomness. An alternative cause of randomness is chaos, which can be an important component of chaotic time series.…

February 7, 2025
Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints

arXiv:2502.03792v1 Announce Type: new Abstract: We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures…

February 7, 2025
How to Create Network Graph Visualizations in Microsoft PowerBI

Microsoft PowerBI is one of the most popular Business Intelligence (BI) tools, and while it has all the features you need to create dynamic analytic reporting for stakeholders across the business, creating some advanced data visualizations is more challenging. This article will walk through how…

February 7, 2025
Introduction to Minimum Cost Flow Optimization in Python

Minimum cost flow optimization minimizes the cost of moving flow through a network of nodes and edges. Nodes include sources (supply) and sinks (demand), with different costs and capacity limits. The aim is to find the least costly way to move volume from sources to sinks while…

February 7, 2025
Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics

Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the…

February 7, 2025
A Visual Guide to How Diffusion Models Work

This article is aimed at those who want to understand exactly how Diffusion Models work, with no prior knowledge expected. I’ve tried to use illustrations wherever possible to provide visual intuitions on each part of these models. I’ve kept mathematical notation and equations to a minimum, and where…

February 7, 2025
Networks with Finite VC Dimension: Pro and Contra

arXiv:2502.02679v1 Announce Type: new Abstract: Approximation and learning of classifiers of large data sets by neural networks in terms of high-dimensional geometry and statistical learning theory are investigated. The influence of the VC dimension of sets of input-output functions of networks on approximation capabilities is compared with…

February 6, 2025
Achievable distributional robustness when the robust risk is only partially identified

arXiv:2502.02710v1 Announce Type: new Abstract: In safety-critical applications, machine learning models should generalize well under worst-case distribution shifts, that is, have a small robust risk. Invariance-based algorithms can provably take advantage of structural assumptions on the shifts when the training distributions are heterogeneous enough…

February 6, 2025
Algorithms with Calibrated Machine Learning Predictions

arXiv:2502.02861v1 Announce Type: new Abstract: The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. While this theoretical framework often assumes uniform reliability across all predictions, modern machine learning models can now provide instance-level uncertainty estimates. In this paper,…

February 6, 2025
Gap-Dependent Bounds for Federated $Q$-learning

arXiv:2502.02859v1 Announce Type: new Abstract: We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling…

February 6, 2025
Uncertainty Quantification with the Empirical Neural Tangent Kernel

arXiv:2502.02870v1 Announce Type: new Abstract: While neural networks have demonstrated impressive performance across various tasks, accurately quantifying uncertainty in their predictions is essential to ensure their trustworthiness and enable widespread adoption in critical systems. Several Bayesian uncertainty quantification (UQ) methods exist that are either cheap or reliable,…

February 6, 2025
Myths vs. Data: Does an Apple a Day Keep the Doctor Away?

Introduction: “Money can’t buy happiness.” “You can’t judge a book by its cover.” “An apple a day keeps the doctor away.” You’ve probably heard these sayings several times, but do they actually hold up when we look at the data? In this article series,…

February 6, 2025
Training Large Language Models: From TRPO to GRPO

DeepSeek has recently created quite a buzz in the AI community, thanks to its impressive performance at relatively low cost. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. In this article, we will focus on the Reinforcement Learning…

February 6, 2025
Supercharge Your RAG with Multi-Agent Self-RAG

Introduction: Many of us might have tried to build a RAG application and noticed it falls significantly short of addressing real-life needs. Why is that? It’s because many real-world problems require multiple steps of information retrieval and reasoning. We need our agent to perform those as humans normally do,…

February 6, 2025
Doubly Robust Monte Carlo Tree Search

arXiv:2502.01672v1 Announce Type: new Abstract: We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS…

February 5, 2025
Graph Canonical Correlation Analysis

arXiv:2502.01780v1 Announce Type: new Abstract: Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of multi-dimensional variables. Recent advancements in CCA methods have expanded their application to decipher the interactions of multiomics datasets, imaging-omics datasets, and more. However, conventional CCA methods are limited in their…

February 5, 2025
Poisson Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features-With Indications for Microbiome Species Sampling Models

arXiv:2502.01919v1 Announce Type: new Abstract: In this work, we present a comprehensive Bayesian posterior analysis of what we term Poisson Hierarchical Indian Buffet Processes, designed for complex random sparse count species sampling models that allow…

February 5, 2025
Local minima of the empirical risk in high dimension: General theorems and convex examples

arXiv:2502.01953v1 Announce Type: new Abstract: We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional isotropic Gaussian vectors, the model is parametrized by $\mathbf{\Theta}\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection…

February 5, 2025
Theoretical and Practical Analysis of Fréchet Regression via Comparison Geometry

arXiv:2502.01995v1 Announce Type: new Abstract: Fréchet regression extends classical regression methods to non-Euclidean metric spaces, enabling the analysis of data relationships on complex structures such as manifolds and graphs. This work establishes a rigorous theoretical analysis for Fréchet regression through the lens of comparison geometry…

February 5, 2025
From Resume to Cover Letter Using AI and LLM, with Python and Streamlit

DISCLAIMER: The idea of writing a cover letter or even a resume with AI obviously does not start with me. A lot of people have done this before (very successfully) and have built websites and even companies from the idea. This is just a…

February 5, 2025
ML Feature Management: A Practical Evolution Guide

In the world of machine learning, we obsess over model architectures, training pipelines, and hyper-parameter tuning, yet often overlook a fundamental aspect: how our features live and breathe throughout their lifecycle. From in-memory calculations that vanish after each prediction to the challenge of reproducing exact feature values months…

February 5, 2025
Learning Difference-of-Convex Regularizers for Inverse Problems: A Flexible Framework with Theoretical Guarantees

arXiv:2502.00240v1 Announce Type: new Abstract: Learning effective regularization is crucial for solving ill-posed inverse problems, which arise in a wide range of scientific and engineering applications. While data-driven methods that parameterize regularizers using deep neural networks have demonstrated strong empirical performance, they often…

February 4, 2025
Supervised Quadratic Feature Analysis: An Information Geometry Approach to Dimensionality Reduction

arXiv:2502.00168v1 Announce Type: new Abstract: Supervised dimensionality reduction aims to map labeled data to a low-dimensional feature space while maximizing class discriminability. Despite the availability of methods for learning complex non-linear features (e.g. Deep Learning), there is an enduring demand for dimensionality reduction methods…

February 4, 2025
Learning to Fuse Temporal Proximity Networks: A Case Study in Chimpanzee Social Interactions

arXiv:2502.00302v1 Announce Type: new Abstract: How can we identify groups of primate individuals which could be conjectured to drive social structure? To address this question, one of us has collected a time series of data for social interactions between chimpanzees. Here we…

February 4, 2025
Decentralized Inference for Distributed Geospatial Data Using Low-Rank Models

arXiv:2502.00309v1 Announce Type: new Abstract: Advancements in information technology have enabled the creation of massive spatial datasets, driving the need for scalable and efficient computational methodologies. While offering viable solutions, centralized frameworks are limited by vulnerabilities such as single-point failures and communication bottlenecks. This paper presents…

February 4, 2025
Variance Reduction via Resampling and Experience Replay

Variance Reduction via Resampling and Experience Replay arXiv:2502.00520v1 Announce Type: new Abstract: Experience replay is a foundational technique in reinforcement learning that enhances learning stability by storing past experiences in a replay buffer and reusing them during training. Despite its practical success, its theoretical properties remain underexplored. In this paper, we present a theoretical framework…

February 4, 2025
Towards Data Science is Launching as an Independent Publication

Towards Data Science is Launching as an Independent Publication Since founding Towards Data Science in 2016, we’ve built the largest publication on Medium with a dedicated community of readers and contributors focused on data science, machine learning, and AI. Medium built a fantastic platform, and we wouldn’t have been able to reach our audience without…

February 4, 2025
Show and Tell

Show and Tell Photo by Ståle Grut on Unsplash Introduction Natural Language Processing and Computer Vision used to be two completely different fields. Well, at least back when I started to learn machine learning and deep learning, I felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision,…

February 4, 2025
Neural Networks – Intuitively and Exhaustively Explained

Neural Networks – Intuitively and Exhaustively Explained An in-depth exploration of the most fundamental architecture in modern AI “The Thinking Part” by Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained. In this article we’ll form a thorough understanding of the neural network,…

February 4, 2025
How to Get Promoted as a Data Scientist

How to Get Promoted as a Data Scientist Image artificially generated using Grok 2. Introduction I have been working as a Data Scientist since 2017, and during that time I have been promoted from a junior/mid-level to a senior, and most recently to a Lead Data Scientist. There is a lot of content online regarding…

February 4, 2025
How to Find Seasonality Patterns in Time Series

How to Find Seasonality Patterns in Time Series Using Fourier Transforms to detect seasonal components In my professional life as a data scientist, I have encountered time series multiple times. Most of my knowledge comes from my academic experience, specifically my courses in Econometrics (I have a degree in Economics), where we studied statistical properties…

February 4, 2025
Adaptivity and Convergence of Probability Flow ODEs in Diffusion Generative Models

Adaptivity and Convergence of Probability Flow ODEs in Diffusion Generative Models arXiv:2501.18863v1 Announce Type: new Abstract: Score-based generative models, which transform noise into data by learning to reverse a diffusion process, have become a cornerstone of modern generative AI. This paper contributes to establishing theoretical guarantees for the probability flow ODE, a widely used diffusion-based…

February 3, 2025
A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization

A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization arXiv:2501.18756v1 Announce Type: new Abstract: Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to reduce uncertainty about the function’s optimum and…

February 3, 2025
Trustworthy Evaluation of Generative AI Models

Trustworthy Evaluation of Generative AI Models arXiv:2501.18897v1 Announce Type: new Abstract: Generative AI (GenAI) models have recently achieved remarkable empirical performance in various applications; however, their evaluations still lack uncertainty quantification. In this paper, we propose a method to compare two generative models based on an unbiased estimator of their relative performance gap. Statistically, our…

February 3, 2025
Optimizing Through Change: Bounds and Recommendations for Time-Varying Bayesian Optimization Algorithms

Optimizing Through Change: Bounds and Recommendations for Time-Varying Bayesian Optimization Algorithms arXiv:2501.18963v1 Announce Type: new Abstract: Time-Varying Bayesian Optimization (TVBO) is the go-to framework for optimizing a time-varying, expensive, noisy black-box function. However, most of the solutions proposed so far either rely on unrealistic assumptions on the nature of the objective function or do not…

February 3, 2025
Optimal Transport-based Conformal Prediction

Optimal Transport-based Conformal Prediction arXiv:2501.18991v1 Announce Type: new Abstract: Conformal Prediction (CP) is a principled framework for quantifying uncertainty in blackbox learning models, by constructing prediction sets with finite-sample coverage guarantees. Traditional approaches rely on scalar nonconformity scores, which fail to fully exploit the geometric structure of multivariate outputs, such as in multi-output regression or…

February 3, 2025
Weekly Entering & Transitioning – Thread 03 Feb, 2025 – 10 Feb, 2025

Weekly Entering & Transitioning – Thread 03 Feb, 2025 – 10 Feb, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

February 3, 2025
Awesome Plotly with code series (Part 9): To dot, to slope or to stack?

Awesome Plotly with code series (Part 9): To dot, to slope or to stack? Simple methods to replace cluttered bar charts with crisp, reader-friendly visuals. By Jose Parreño, on Towards Data Science.

February 3, 2025
5 Essential Tips Learned from My Data Science Journey

5 Essential Tips Learned from My Data Science Journey Personal reflections on my 10-year data odyssey. By Federico Rucci, on Towards Data Science.

February 3, 2025
The Cultural Backlash Against Generative AI

The Cultural Backlash Against Generative AI What’s making many people resent generative AI, and what impact does that have on the companies responsible? Photo by Joshua Hoehne on Unsplash The recent reveal of DeepSeek-R1, the large-scale LLM developed by a Chinese company (also named DeepSeek), has been a very interesting event for those of us…

February 3, 2025
How to Make a Data Science Portfolio That Stands Out

How to Make a Data Science Portfolio That Stands Out Create a data science portfolio with Cloudflare and Hugo. By Egor Howell, on Towards Data Science.

February 3, 2025
Improving Agent Systems & AI Reasoning

Improving Agent Systems & AI Reasoning DeepSeek-R1, OpenAI o1 & o3, Test-Time Compute Scaling, Model Post-Training and the Transition to Reasoning Language Models (RLMs) Image by author and GPT-4o meant to represent DeepSeek and other competitive GenAI model providers Introduction Over the past year generative AI adoption and AI Agent development have skyrocketed. Reports from LangChain…

February 3, 2025
Sparse AutoEncoder: from Superposition to interpretable features

Sparse AutoEncoder: from Superposition to interpretable features Disentangle features in complex Neural Network with superpositions Complex neural networks, such as Large Language Models (LLMs), suffer quite often from interpretability challenges. One of the most important reasons for such difficulty is superposition — a phenomenon of the neural network having fewer dimensions than the number of features it…

February 2, 2025
Injecting domain expertise into your AI system

Injecting domain expertise into your AI system How to connect the dots between AI technology and real life (Source: Getty Images) When starting their AI initiatives, many companies are trapped in silos and treat AI as a purely technical enterprise, sidelining domain experts or involving them too late. They end up with generic AI applications that miss…

February 2, 2025
Are Data Scientists at Risk in 2025?

Are Data Scientists at Risk in 2025? The impact of AI on data science jobs. By Natassha Selvaraj, on Towards Data Science.

February 2, 2025
Rapid Data Visualization with Copilot and Plotly

Rapid Data Visualization with Copilot and Plotly Code visualizations quickly and efficiently with Copilot, Plotly, and Streamlit. By Alan Jones, on Towards Data Science.

February 2, 2025
DeepSeek V3: A New Contender in AI-Powered Data Science

DeepSeek V3: A New Contender in AI-Powered Data Science How DeepSeek’s budget-friendly AI model stacks up against ChatGPT, Claude, and Gemini in SQL, EDA, and machine learning. By Yu Dong, on Towards Data Science.

February 2, 2025
Fine-tuning Multimodal Embedding Models

Fine-tuning Multimodal Embedding Models Adapting CLIP to YouTube Data (with Python Code) This is the 4th article in a larger series on multimodal AI. In the previous post, we discussed multimodal RAG systems, which can retrieve and synthesize information from different data modalities (e.g. text, images, audio). There, we saw how we could implement such a…

February 1, 2025
How Likely Is a Six Nations Grand Slam in 2025?

How Likely Is a Six Nations Grand Slam in 2025? Quantifying uncertainty in sports fixtures Photo by Thomas Serer on Unsplash Introduction For rugby fans the long wait is nearly over. Like Christmas, the Six Nations comes once a year to lift our spirits in the cold winter months. If you’re not very familiar with rugby, the…

February 1, 2025
Can Machines Dream? On the Creativity of Large Language Models

Can Machines Dream? On the Creativity of Large Language Models Exploring the Role of Hallucinations, Dependencies, and Imagination in AI Creativity. By Salvatore Raieli, on Towards Data Science.

February 1, 2025
Inequality in Practice: E-commerce Portfolio Analysis

Inequality in Practice: E-commerce Portfolio Analysis From Mathematical Theory to Actionable Insights: A 6-Year Shopify Case Study Image generated by DALL-E, based on author’s prompt, inspired by “The Bremen Town Musicians” Are your top-selling products making or breaking your business? It’s terrifying to think your entire revenue might collapse if one or two products fall out…

February 1, 2025
2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy

2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU. By Benjamin Marie, on Towards Data Science.

February 1, 2025
Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection

Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection arXiv:2501.17889v1 Announce Type: new Abstract: Variable selection plays a crucial role in enhancing modeling effectiveness across diverse fields, addressing the challenges posed by high-dimensional datasets of correlated variables. This work introduces a novel approach, namely Knockoff with over-parameterization (Knoop), to enhance Knockoff filters for variable…

January 31, 2025
Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks

Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks arXiv:2501.17882v1 Announce Type: new Abstract: We consider a multi-player multi-armed bandit setting in the presence of adversaries that attempt to negatively affect the rewards received by the players in the system. The reward distributions for any given arm are heterogeneous across the players. In the event of…

January 31, 2025
U-aggregation: Unsupervised Aggregation of Multiple Learning Algorithms

U-aggregation: Unsupervised Aggregation of Multiple Learning Algorithms arXiv:2501.18084v1 Announce Type: new Abstract: Across various domains, the growing advocacy for open science and open-source machine learning has made an increasing number of models publicly available. These models allow practitioners to integrate them into their own contexts, reducing the need for extensive data labeling, training, and calibration.…

January 31, 2025
Optimal Survey Design for Private Mean Estimation

Optimal Survey Design for Private Mean Estimation arXiv:2501.18121v1 Announce Type: new Abstract: This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation,…

January 31, 2025
Random Feature Representation Boosting

Random Feature Representation Boosting arXiv:2501.18283v1 Announce Type: new Abstract: We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits…

January 31, 2025
Distributed Tracing: A Powerful Approach to Debugging Complex Systems

Distributed Tracing: A Powerful Approach to Debugging Complex Systems Why distributed tracing is the key to resolving performance issues. By Hareesha Dandamudi, on Towards Data Science.

January 31, 2025