Category: aimldsaimlds
-
Learnings from a Machine Learning Engineer — Part 1: The Data
Learnings from a Machine Learning Engineer — Part 1: The Data It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you…
-
Learnings from a Machine Learning Engineer — Part 4: The Model
Learnings from a Machine Learning Engineer — Part 4: The Model In this latest part of my series, I will share what I have learned on selecting a model for Image Classification and how to fine tune that model. I will also show how you can leverage the model to accelerate your labelling process, and…
-
SNAP: Sequential Non-Ancestor Pruning for Targeted Causal Effect Estimation With an Unknown Graph
SNAP: Sequential Non-Ancestor Pruning for Targeted Causal Effect Estimation With an Unknown Graph arXiv:2502.07857v1 Announce Type: new Abstract: Causal discovery can be computationally demanding for large numbers of variables. If we only wish to estimate the causal effects on a small subset of target variables, we might not need to learn the causal graph for…
-
Discrete Markov Probabilistic Models
Discrete Markov Probabilistic Models arXiv:2502.07939v1 Announce Type: new Abstract: This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in the space of bits ${0,1}^d$, where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels…
-
The Observational Partial Order of Causal Structures with Latent Variables
The Observational Partial Order of Causal Structures with Latent Variables arXiv:2502.07891v1 Announce Type: new Abstract: For two causal structures with the same set of visible variables, one is said to observationally dominate the other if the set of distributions over the visible variables realizable by the first contains the set of distributions over the visible…
-
Optimizing Likelihoods via Mutual Information: Bridging Simulation-Based Inference and Bayesian Optimal Experimental Design
Optimizing Likelihoods via Mutual Information: Bridging Simulation-Based Inference and Bayesian Optimal Experimental Design arXiv:2502.08004v1 Announce Type: new Abstract: Simulation-based inference (SBI) is a method to perform inference on a variety of complex scientific models with challenging inference (inverse) problems. Bayesian Optimal Experimental Design (BOED) aims to efficiently use experimental resources to make better inferences. Various…
-
Multi-View Oriented GPLVM: Expressiveness and Efficiency
Multi-View Oriented GPLVM: Expressiveness and Efficiency arXiv:2502.08253v1 Announce Type: new Abstract: The multi-view Gaussian process latent variable model (MV-GPLVM) aims to learn a unified representation from multi-view data but is hindered by challenges such as limited kernel expressiveness and low computational efficiency. To overcome these issues, we first introduce a new duality between the spectral…
-
Method of Moments Estimation with Python Code
Method of Moments Estimation with Python Code Let’s say you are in a customer care center, and you would like to know the probability distribution of the number of calls per minute, or in other words, you want to answer the question: what is the probability of receiving zero, one, two, … etc., calls per…
-
Should Data Scientists Care About Quantum Computing?
Should Data Scientists Care About Quantum Computing? I am sure the quantum hype has reached every person in tech (and outside it, most probably). With some over-the-top claims, like “some company has proved quantum supremacy,” “the quantum revolution is here,” or my favorite, “quantum computers are here, and it will make classical computers obsolete.” I…
-
How to Measure the Reliability of a Large Language Model’s Response
How to Measure the Reliability of a Large Language Model’s Response The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability turns out to be incredibly sophisticated when it…
-
Manage Environment Variables with Pydantic
Manage Environment Variables with Pydantic Introduction Developers work on applications that are supposed to be deployed on some server in order to allow anyone to use those. Typically in the machine where these apps live, developers set up environment variables that allow the app to run. These variables can be API keys of external services,…
-
Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets
Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1GB if you have a large RAM. Within these size limits, it is also…
-
Confidence Intervals for Evaluation of Data Mining
Confidence Intervals for Evaluation of Data Mining arXiv:2502.07016v1 Announce Type: new Abstract: In data mining, when binary prediction rules are used to predict a binary outcome, many performance measures are used in a vast array of literature for the purposes of evaluation and comparison. Some examples include classification accuracy, precision, recall, F measures, and Jaccard…
-
Epistemic Uncertainty in Conformal Scores: A Unified Approach
Epistemic Uncertainty in Conformal Scores: A Unified Approach arXiv:2502.06995v1 Announce Type: new Abstract: Conformal prediction methods create prediction bands with distribution-free guarantees but do not explicitly capture epistemic uncertainty, which can lead to overconfident predictions in data-sparse regions. Although recent conformal scores have been developed to address this limitation, they are typically designed for specific…
-
Generative Distribution Prediction: A Unified Approach to Multimodal Learning
Generative Distribution Prediction: A Unified Approach to Multimodal Learning arXiv:2502.07090v1 Announce Type: new Abstract: Accurate prediction with multimodal data-encompassing tabular, textual, and visual inputs or outputs-is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a…
-
Online Covariance Matrix Estimation in Sketched Newton Methods
Online Covariance Matrix Estimation in Sketched Newton Methods arXiv:2502.07114v1 Announce Type: new Abstract: Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched Newton method that leverages a randomized sketching technique…
-
Riemannian Proximal Sampler for High-accuracy Sampling on Manifolds
Riemannian Proximal Sampler for High-accuracy Sampling on Manifolds arXiv:2502.07265v1 Announce Type: new Abstract: We introduce the Riemannian Proximal Sampler, a method for sampling from densities defined on Riemannian manifolds. The performance of this sampler critically depends on two key oracles: the Manifold Brownian Increments (MBI) oracle and the Riemannian Heat-kernel (RHK) oracle. We establish high-accuracy…
-
Build a Decision Tree in Polars from Scratch
Build a Decision Tree in Polars from Scratch Decision Tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as sklearn, Lightgbm, xgboost and catboost have done a very good job…
-
Virtualization & Containers for Data Science Newbies
Virtualization & Containers for Data Science Newbies Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak. Many cloud services rely on virtualization. But other technologies, such…
-
Understanding Model Calibration: A Gentle Introduction & Visual Exploration
Understanding Model Calibration: A Gentle Introduction & Visual Exploration How Reliable Are Your Predictions? About To be considered reliable, a model must be calibrated so that its confidence in each decision closely reflects its true outcome. In this blog post we’ll take a look at the most commonly used definition for calibration and then dive…
-
Data vs. Business Strategy
Data vs. Business Strategy There seems to be a consensus that leveraging data, analytics, and AI to create a data-driven organization requires a clear strategic approach. However, there is less clarity and agreement on exactly what this strategic approach should look like in practice. This article provides a short overview of what strategy work I…
-
Online Covariance Estimation in Nonsmooth Stochastic Approximation
Online Covariance Estimation in Nonsmooth Stochastic Approximation arXiv:2502.05305v1 Announce Type: new Abstract: We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of H’ajek and Le Cam.…
-
On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers
On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers arXiv:2502.05672v1 Announce Type: new Abstract: This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks,…
-
dynoGP: Deep Gaussian Processes for dynamic system identification
dynoGP: Deep Gaussian Processes for dynamic system identification arXiv:2502.05620v1 Announce Type: new Abstract: In this work, we present a novel approach to system identification for dynamical systems, based on a specific class of Deep Gaussian Processes (Deep GPs). These models are constructed by interconnecting linear dynamic GPs (equivalent to stochastic linear time-invariant dynamical systems) and…
-
Generalized Venn and Venn-Abers Calibration with Applications in Conformal Prediction
Generalized Venn and Venn-Abers Calibration with Applications in Conformal Prediction arXiv:2502.05676v1 Announce Type: new Abstract: Ensuring model calibration is critical for reliable predictions, yet popular distribution-free methods, such as histogram binning and isotonic regression, provide only asymptotic guarantees. We introduce a unified framework for Venn and Venn-Abers calibration, generalizing Vovk’s binary classification approach to arbitrary…
-
TD(0) Learning converges for Polynomial mixing and non-linear functions
TD(0) Learning converges for Polynomial mixing and non-linear functions arXiv:2502.05706v1 Announce Type: new Abstract: Theoretical work on Temporal Difference (TD) learning has provided finite-sample and high-probability guarantees for data generated from Markov chains. However, these bounds typically require linear function approximation, instance-dependent step sizes, algorithmic modifications, and restrictive mixing rates. We present theoretical findings for…
-
Six Ways to Control Style and Content in Diffusion Models
Six Ways to Control Style and Content in Diffusion Models Stable Diffusion 1.5/2.0/2.1/XL 1.0, DALL-E, Imagen… In the past years, Diffusion Models have showcased stunning quality in image generation. However, while producing great quality on generic concepts, these struggle to generate high quality for more specialised queries, for example generating images in a specific style,…
-
Sparsity-Based Interpolation of External, Internal and Swap Regret
Sparsity-Based Interpolation of External, Internal and Swap Regret arXiv:2502.04543v1 Announce Type: new Abstract: Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $phi$-regret minimization, which measures the performance of an algorithm by its regret with respect to an arbitrary action modification rule $phi$. With $d$ experts…
-
Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond
Complexity Analysis of Normalizing Constant Estimation: from Jarzynski Equality to Annealed Importance Sampling and beyond arXiv:2502.04575v1 Announce Type: new Abstract: Given an unnormalized probability density $piproptomathrm{e}^{-V}$, estimating its normalizing constant $Z=int_{mathbb{R}^d}mathrm{e}^{-V(x)}mathrm{d}x$ or free energy $F=-log Z$ is a crucial problem in Bayesian statistics, statistical mechanics, and machine learning. It is challenging especially in high dimensions…
-
Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect
Optimistic Algorithms for Adaptive Estimation of the Average Treatment Effect arXiv:2502.04673v1 Announce Type: new Abstract: Estimation and inference for the Average Treatment Effect (ATE) is a cornerstone of causal inference and often serves as the foundation for developing procedures for more complicated settings. Although traditionally analyzed in a batch setting, recent advances in martingale theory…
-
PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders
PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders arXiv:2502.04730v1 Announce Type: new Abstract: Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack…
-
Weekly Entering & Transitioning – Thread 10 Feb, 2025 – 17 Feb, 2025
Weekly Entering & Transitioning – Thread 10 Feb, 2025 – 17 Feb, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…
-
The Gamma Hurdle Distribution
The Gamma Hurdle Distribution Which Outcome Matters? Here is a common scenario : An A/B test was conducted, where a random sample of units (e.g. customers) were selected for a campaign and they received Treatment A. Another sample was selected to receive Treatment B. “A” could be a communication or offer and “B” could be…
-
Triangle Forecasting: Why Traditional Impact Estimates Are Inflated (And How to Fix Them)
Triangle Forecasting: Why Traditional Impact Estimates Are Inflated (And How to Fix Them) Accurate impact estimations can make or break your business case. Yet, despite its importance, most teams use oversimplified calculations that can lead to inflated projections. These shot-in-the-dark numbers not only destroy credibility with stakeholders but can also result in misallocation of resources and…
-
I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms
I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms Recently, DeepSeek announced their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also, too…
-
Synthetic Data Generation with LLMs
Synthetic Data Generation with LLMs Popularity of RAG Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value. Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation…
-
The Method of Moments Estimator for Gaussian Mixture Models
The Method of Moments Estimator for Gaussian Mixture Models Audio Processing is one of the most important application domains of digital signal processing (DSP) and machine learning. Modeling acoustic environments is an essential step in developing digital audio processing systems such as: speech recognition, speech enhancement, acoustic echo cancellation, etc. Acoustic environments are filled with background…
-
Two in context learning tasks with complex functions
Two in context learning tasks with complex functions arXiv:2502.03503v1 Announce Type: new Abstract: We examine two in context learning (ICL) tasks with mathematical functions in several train and test settings for transformer models. Our study generalizes work on linear functions by showing that small transformers, even models with attention layers only, can approximate arbitrary polynomial…
-
Multivariate Conformal Prediction using Optimal Transport
Multivariate Conformal Prediction using Optimal Transport arXiv:2502.03609v1 Announce Type: new Abstract: Conformal prediction (CP) quantifies the uncertainty of machine learning models by constructing sets of plausible outputs. These sets are constructed by leveraging a so-called conformity score, a quantity computed using the input point of interest, a prediction model, and past observations. CP sets are…
-
Online Learning Algorithms in Hilbert Spaces with $beta-$ and $phi-$Mixing Sequences
Online Learning Algorithms in Hilbert Spaces with $beta-$ and $phi-$Mixing Sequences arXiv:2502.03551v1 Announce Type: new Abstract: In this paper, we study an online algorithm in a reproducing kernel Hilbert spaces (RKHS) based on a class of dependent processes, called the mixing process. For such a process, the degree of dependence is measured by various mixing…
-
Rule-based Evolving Fuzzy System for Time Series Forecasting: New Perspectives Based on Type-2 Fuzzy Sets Measures Approach
Rule-based Evolving Fuzzy System for Time Series Forecasting: New Perspectives Based on Type-2 Fuzzy Sets Measures Approach arXiv:2502.03650v1 Announce Type: new Abstract: Real-world data contain uncertainty and variations that can be correlated to external variables, known as randomness. An alternative cause of randomness is chaos, which can be an important component of chaotic time series.…
-
Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints
Guiding Two-Layer Neural Network Lipschitzness via Gradient Descent Learning Rate Constraints arXiv:2502.03792v1 Announce Type: new Abstract: We demonstrate that applying an eventual decay to the learning rate (LR) in empirical risk minimization (ERM), where the mean-squared-error loss is minimized using standard gradient descent (GD) for training a two-layer neural network with Lipschitz activation functions, ensures…
-
How to Create Network Graph Visualizations in Microsoft PowerBI
How to Create Network Graph Visualizations in Microsoft PowerBI Microsoft PowerBI is a one of the most popular Business Intelligence (BI) tools, and while it has all the features you need to create dynamic analytic reporting for stakeholders across the business, creating some advanced data visualizations is more challenging. This article will walk through how…
-
Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics
Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, Metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the…
-
Introduction to Minimum Cost Flow Optimization in Python
Introduction to Minimum Cost Flow Optimization in Python Minimum cost flow optimization minimizes the cost of moving flow through a network of nodes and edges. Nodes include sources (supply) and sinks (demand), with different costs and capacity limits. The aim is to find the least costly way to move volume from sources to sinks while…
-
A Visual Guide to How Diffusion Models Work
A Visual Guide to How Diffusion Models Work This article is aimed at those who want to understand exactly how Diffusion Models work, with no prior knowledge expected. I’ve tried to use illustrations wherever possible to provide visual intuitions on each part of these models. I’ve kept mathematical notation and equations to a minimum, and where…
-
Networks with Finite VC Dimension: Pro and Contra
Networks with Finite VC Dimension: Pro and Contra arXiv:2502.02679v1 Announce Type: new Abstract: Approximation and learning of classifiers of large data sets by neural networks in terms of high-dimensional geometry and statistical learning theory are investigated. The influence of the VC dimension of sets of input-output functions of networks on approximation capabilities is compared with…
-
Achievable distributional robustness when the robust risk is only partially identified
Achievable distributional robustness when the robust risk is only partially identified arXiv:2502.02710v1 Announce Type: new Abstract: In safety-critical applications, machine learning models should generalize well under worst-case distribution shifts, that is, have a small robust risk. Invariance-based algorithms can provably take advantage of structural assumptions on the shifts when the training distributions are heterogeneous enough…
-
Algorithms with Calibrated Machine Learning Predictions
Algorithms with Calibrated Machine Learning Predictions arXiv:2502.02861v1 Announce Type: new Abstract: The field of algorithms with predictions incorporates machine learning advice in the design of online algorithms to improve real-world performance. While this theoretical framework often assumes uniform reliability across all predictions, modern machine learning models can now provide instance-level uncertainty estimates. In this paper,…
-
Gap-Dependent Bounds for Federated $Q$-learning
Gap-Dependent Bounds for Federated $Q$-learning arXiv:2502.02859v1 Announce Type: new Abstract: We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $sqrt{T}$-type regret bounds and communication cost bounds with a $log T$ term scaling…
-
Uncertainty Quantification with the Empirical Neural Tangent Kernel
Uncertainty Quantification with the Empirical Neural Tangent Kernel arXiv:2502.02870v1 Announce Type: new Abstract: While neural networks have demonstrated impressive performance across various tasks, accurately quantifying uncertainty in their predictions is essential to ensure their trustworthiness and enable widespread adoption in critical systems. Several Bayesian uncertainty quantification (UQ) methods exist that are either cheap or reliable,…
-
Myths vs. Data: Does an Apple a Day Keep the Doctor Away?
Myths vs. Data: Does an Apple a Day Keep the Doctor Away? Introduction “Money can’t buy happiness.” “You can’t judge a book by its cover.” “An apple a day keeps the doctor away.” You’ve probably heard these sayings several times, but do they actually hold up when we look at the data? In this article series,…
-
Training Large Language Models: From TRPO to GRPO
Training Large Language Models: From TRPO to GRPO Deepseek has recently made quite a buzz in the AI community, thanks to its impressive performance at relatively low costs. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. In this article, we will focus on the Reinforcement Learning…
-
Supercharge Your RAG with Multi-Agent Self-RAG
Supercharge Your RAG with Multi-Agent Self-RAG Introduction Many of us might have tried to build a RAG application and noticed it falls significantly short of addressing real-life needs. Why is that? It’s because many real-world problems require multiple steps of information retrieval and reasoning. We need our agent to perform those as humans normally do,…
-
Doubly Robust Monte Carlo Tree Search
Doubly Robust Monte Carlo Tree Search arXiv:2502.01672v1 Announce Type: new Abstract: We present Doubly Robust Monte Carlo Tree Search (DR-MCTS), a novel algorithm that integrates Doubly Robust (DR) off-policy estimation into Monte Carlo Tree Search (MCTS) to enhance sample efficiency and decision quality in complex environments. Our approach introduces a hybrid estimator that combines MCTS…
-
Graph Canonical Correlation Analysis
Graph Canonical Correlation Analysis arXiv:2502.01780v1 Announce Type: new Abstract: Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of multi-dimensional variables. Recent advancements in CCA methods have expanded their application to decipher the interactions of multiomics datasets, imaging-omics datasets, and more. However, conventional CCA methods are limited in their…
-
Poisson Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features-With Indications for Microbiome Species Sampling Models
Poisson Hierarchical Indian Buffet Processes for Within and Across Group Sharing of Latent Features-With Indications for Microbiome Species Sampling Models arXiv:2502.01919v1 Announce Type: new Abstract: In this work, we present a comprehensive Bayesian posterior analysis of what we term Poisson Hierarchical Indian Buffet Processes, designed for complex random sparse count species sampling models that allow…
-
Local minima of the empirical risk in high dimension: General theorems and convex examples
Local minima of the empirical risk in high dimension: General theorems and convex examples arXiv:2502.01953v1 Announce Type: new Abstract: We consider a general model for high-dimensional empirical risk minimization whereby the data $mathbf{x}_i$ are $d$-dimensional isotropic Gaussian vectors, the model is parametrized by $mathbf{Theta}inmathbb{R}^{dtimes k}$, and the loss depends on the data via the projection…
-
Theoretical and Practical Analysis of Fr’echet Regression via Comparison Geometry
Theoretical and Practical Analysis of Fr’echet Regression via Comparison Geometry arXiv:2502.01995v1 Announce Type: new Abstract: Fr’echet regression extends classical regression methods to non-Euclidean metric spaces, enabling the analysis of data relationships on complex structures such as manifolds and graphs. This work establishes a rigorous theoretical analysis for Fr’echet regression through the lens of comparison geometry…
-
From Resume to Cover Letter Using AI and LLM, with Python and Streamlit
From Resume to Cover Letter Using AI and LLM, with Python and Streamlit DISCLAIMER: The idea of doing Cover Letter or even Resume with AI does not obviously start with me. A lot of people have done this before (very successfully) and have built websites and even companies from the idea. This is just a…
-
ML Feature Management: A Practical Evolution Guide
ML Feature Management: A Practical Evolution Guide In the world of machine learning, we obsess over model architectures, training pipelines, and hyper-parameter tuning, yet often overlook a fundamental aspect: how our features live and breathe throughout their lifecycle. From in-memory calculations that vanish after each prediction to the challenge of reproducing exact feature values months…
-
Learning Difference-of-Convex Regularizers for Inverse Problems: A Flexible Framework with Theoretical Guarantees
Learning Difference-of-Convex Regularizers for Inverse Problems: A Flexible Framework with Theoretical Guarantees arXiv:2502.00240v1 Announce Type: new Abstract: Learning effective regularization is crucial for solving ill-posed inverse problems, which arise in a wide range of scientific and engineering applications. While data-driven methods that parameterize regularizers using deep neural networks have demonstrated strong empirical performance, they often…
-
Supervised Quadratic Feature Analysis: An Information Geometry Approach to Dimensionality Reduction
Supervised Quadratic Feature Analysis: An Information Geometry Approach to Dimensionality Reduction arXiv:2502.00168v1 Announce Type: new Abstract: Supervised dimensionality reduction aims to map labeled data to a low-dimensional feature space while maximizing class discriminability. Despite the availability of methods for learning complex non-linear features (e.g. Deep Learning), there is an enduring demand for dimensionality reduction methods…
-
Decentralized Inference for Distributed Geospatial Data Using Low-Rank Models
Decentralized Inference for Distributed Geospatial Data Using Low-Rank Models arXiv:2502.00309v1 Announce Type: new Abstract: Advancements in information technology have enabled the creation of massive spatial datasets, driving the need for scalable and efficient computational methodologies. While offering viable solutions, centralized frameworks are limited by vulnerabilities such as single-point failures and communication bottlenecks. This paper presents…
-
Variance Reduction via Resampling and Experience Replay
Variance Reduction via Resampling and Experience Replay arXiv:2502.00520v1 Announce Type: new Abstract: Experience replay is a foundational technique in reinforcement learning that enhances learning stability by storing past experiences in a replay buffer and reusing them during training. Despite its practical success, its theoretical properties remain underexplored. In this paper, we present a theoretical framework…
-
Towards Data Science is Launching as an Independent Publication
Towards Data Science is Launching as an Independent Publication Since founding Towards Data Science in 2016, we’ve built the largest publication on Medium with a dedicated community of readers and contributors focused on data science, machine learning, and AI. Medium built a fantastic platform, and we wouldn’t have been able to reach our audience without…
-
Show and Tell
Show and Tell Photo by Ståle Grut on Unsplash Introduction Natural Language Processing and Computer Vision used to be two completely different fields. Well, at least back when I started to learn machine learning and deep learning, I feel like there are multiple paths to follow, and each of them, including NLP and Computer Vision,…
-
Neural Networks – Intuitively and Exhaustively Explained
Neural Networks – Intuitively and Exhaustively Explained An in-depth exploration of the most fundamental architecture in modern AI “The Thinking Part” by Daniel Warfield using MidJourney. All images by the author unless otherwise specified. Article originally made available on Intuitively and Exhaustively Explained. In this article we’ll form a thorough understanding of the neural network,…
-
How to Get Promoted as a Data Scientist
How to Get Promoted as a Data Scientist Image artificially generated using Grok 2. Introduction I have been working as a Data Scientist since 2017, and during that time I have been promoted from a junior/mid-level to a senior, and most recently to a Lead Data Scientist. There is a lot of content online regarding…
-
How to Find Seasonality Patterns in Time Series
How to Find Seasonality Patterns in Time Series Using Fourier Transforms to detect seasonal components In my professional life as a data scientist, I have encountered time series multiple times. Most of my knowledge comes from my academic experience, specifically my courses in Econometrics (I have a degree in Economics), where we studied statistical properties…
-
Adaptivity and Convergence of Probability Flow ODEs in Diffusion Generative Models
Adaptivity and Convergence of Probability Flow ODEs in Diffusion Generative Models arXiv:2501.18863v1 Announce Type: new Abstract: Score-based generative models, which transform noise into data by learning to reverse a diffusion process, have become a cornerstone of modern generative AI. This paper contributes to establishing theoretical guarantees for the probability flow ODE, a widely used diffusion-based…
-
A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization
A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization arXiv:2501.18756v1 Announce Type: new Abstract: Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to reduce uncertainty about the function’s optimum and…
-
Trustworthy Evaluation of Generative AI Models
Trustworthy Evaluation of Generative AI Models arXiv:2501.18897v1 Announce Type: new Abstract: Generative AI (GenAI) models have recently achieved remarkable empirical performance in various applications, however, their evaluations yet lack uncertainty quantification. In this paper, we propose a method to compare two generative models based on an unbiased estimator of their relative performance gap. Statistically, our…
-
Optimizing Through Change: Bounds and Recommendations for Time-Varying Bayesian Optimization Algorithms
Optimizing Through Change: Bounds and Recommendations for Time-Varying Bayesian Optimization Algorithms arXiv:2501.18963v1 Announce Type: new Abstract: Time-Varying Bayesian Optimization (TVBO) is the go-to framework for optimizing a time-varying, expensive, noisy black-box function. However, most of the solutions proposed so far either rely on unrealistic assumptions on the nature of the objective function or do not…
-
Optimal Transport-based Conformal Prediction
Optimal Transport-based Conformal Prediction arXiv:2501.18991v1 Announce Type: new Abstract: Conformal Prediction (CP) is a principled framework for quantifying uncertainty in blackbox learning models, by constructing prediction sets with finite-sample coverage guarantees. Traditional approaches rely on scalar nonconformity scores, which fail to fully exploit the geometric structure of multivariate outputs, such as in multi-output regression or…
-
Weekly Entering & Transitioning – Thread 03 Feb, 2025 – 10 Feb, 2025
Weekly Entering & Transitioning – Thread 03 Feb, 2025 – 10 Feb, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…
-
Awesome Plotly with code series (Part 9): To dot, to slope or to stack?
Awesome Plotly with code series (Part 9): To dot, to slope or to stack? Simple methods to replace cluttered bar charts with crisp, reader-friendly visuals. Continue reading on Towards Data Science » Jose Parreño Go to original source
-
5 Essential Tips Learned from My Data Science Journey
5 Essential Tips Learned from My Data Science Journey Personal reflections on my 10-year data odyssey Continue reading on Towards Data Science » Federico Rucci Go to original source
-
The Cultural Backlash Against Generative AI
The Cultural Backlash Against Generative AI What’s making many people resent generative AI, and what impact does that have on the companies responsible? Photo by Joshua Hoehne on Unsplash The recent reveal of DeepSeek-R1, the large scale LLM developed by a Chinese company (also named DeepSeek), has been a very interesting event for those of us…
-
How to Make a Data Science Portfolio That Stands Out
How to Make a Data Science Portfolio That Stands Out Create a data science portfolio with Cloud-flare and HUGO Continue reading on Towards Data Science » Egor Howell Go to original source
-
Improving Agent Systems & AI Reasoning
Improving Agent Systems & AI Reasoning DeepSeek-R1, OpenAI o1 & o3, Test-Time Compute Scaling, Model Post-Training and the Transition to Reasoning Language Models (RLMs) Image by author and GPT-4o meant to represent DeepSeek and other competitive GenAI model providers Introduction Over the past year generative AI adoption and AI Agent development have skyrocketed. Reports from LangChain…
-
Sparse AutoEncoder: from Superposition to interpretable features
Sparse AutoEncoder: from Superposition to interpretable features Disentangle features in complex Neural Network with superpositions Complex neural networks, such as Large Language Models (LLMs), suffer quite often from interpretability challenges. One of the most important reasons for such difficulty is superposition — a phenomenon of the neural network having fewer dimensions than the number of features it…
-
Injecting domain expertise into your AI system
Injecting domain expertise into your AI system How to connect the dots between AI technology and real life (Source: Getty Images) When starting their AI initiatives, many companies are trapped in silos and treat AI as a purely technical enterprise, sidelining domain experts or involving them too late. They end up with generic AI applications that miss…
-
Are Data Scientists at Risk in 2025?
Are Data Scientists at Risk in 2025? The impact of AI on data science jobs. Continue reading on Towards Data Science » Natassha Selvaraj Go to original source
-
Rapid Data Visualization with Copilot and Plotly
Rapid Data Visualization with Copilot and Plotly Code visualizations quickly and efficiently with Copilot, Plotly, and Streamlit Continue reading on Towards Data Science » Alan Jones Go to original source
-
DeepSeek V3: A New Contender in AI-Powered Data Science
DeepSeek V3: A New Contender in AI-Powered Data Science How DeepSeek’s budget-friendly AI model stacks up against ChatGPT, Claude, and Gemini in SQL, EDA, and machine learning Continue reading on Towards Data Science » Yu Dong Go to original source
-
Fine-tuning Multimodal Embedding Models
Fine-tuning Multimodal Embedding Models Adapting CLIP to YouTube Data (with Python Code) This is the 4th article in a larger series on multimodal AI. In the previous post, we discussed multimodal RAG systems, which can retrieve and synthesize information from different data modalities (e.g. text, images, audio). There, we saw how we could implement such a…
-
How Likely Is a Six Nations Grand Slam in 2025?
How Likely Is a Six Nations Grand Slam in 2025? Quantifying uncertainty in sports fixtures Photo by Thomas Serer on Unsplash Introduction For rugby fans the long wait is nearly over, like Christmas the Six Nations comes once a year to lift our spirits in the cold winter months. If you’re not very familiar with rugby, the…
-
Can Machines Dream? On the Creativity of Large Language Models
Can Machines Dream? On the Creativity of Large Language Models Exploring the Role of Hallucinations, Dependencies, and Imagination in AI Creativity Continue reading on Towards Data Science » Salvatore Raieli Go to original source
-
Inequality in Practice: E-commerce Portfolio Analysis
Inequality in Practice: E-commerce Portfolio Analysis From Mathematical Theory to Actionable Insights: A 6-Year Shopify Case Study Image generated by DALL-E, based on author’s prompt, inspired by “The Bremen Town Musicians” Are your top-selling products making or breaking your business? It’s terrifying to think your entire revenue might collapse if one or two products fall out…
-
2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy
2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU Continue reading on Towards Data Science » Benjamin Marie Go to original source
-
Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection
Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection arXiv:2501.17889v1 Announce Type: new Abstract: Variable selection plays a crucial role in enhancing modeling effectiveness across diverse fields, addressing the challenges posed by high-dimensional datasets of correlated variables. This work introduces a novel approach namely Knockoff with over-parameterization (Knoop) to enhance Knockoff filters for variable…
-
Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks
Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks arXiv:2501.17882v1 Announce Type: new Abstract: We consider a multi-player multi-armed bandit setting in the presence of adversaries that attempt to negatively affect the rewards received by the players in the system. The reward distributions for any given arm are heterogeneous across the players. In the event of…
-
U-aggregation: Unsupervised Aggregation of Multiple Learning Algorithms
U-aggregation: Unsupervised Aggregation of Multiple Learning Algorithms arXiv:2501.18084v1 Announce Type: new Abstract: Across various domains, the growing advocacy for open science and open-source machine learning has made an increasing number of models publicly available. These models allow practitioners to integrate them into their own contexts, reducing the need for extensive data labeling, training, and calibration.…
-
Optimal Survey Design for Private Mean Estimation
Optimal Survey Design for Private Mean Estimation arXiv:2501.18121v1 Announce Type: new Abstract: This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation,…
-
Random Feature Representation Boosting
Random Feature Representation Boosting arXiv:2501.18283v1 Announce Type: new Abstract: We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits…
-
Distributed Tracing: A Powerful Approach to Debugging Complex Systems
Distributed Tracing: A Powerful Approach to Debugging Complex Systems Why distributed tracing is the key to resolving performance issues Continue reading on Towards Data Science » Hareesha Dandamudi Go to original source