Category: aimlds
-
Unraveling Large Language Model Hallucinations
Unraveling Large Language Model Hallucinations Introduction In a YouTube video titled Deep Dive into LLMs like ChatGPT, Andrej Karpathy, former Senior Director of AI at Tesla, discusses the psychology of Large Language Models (LLMs) as emergent cognitive effects of the training pipeline. This article is inspired by his explanation of LLM hallucinations and the information presented in the…
-
Vision Transformers (ViT) Explained: Are They Better Than CNNs?
Vision Transformers (ViT) Explained: Are They Better Than CNNs? 1. Introduction Ever since the introduction of the self-attention mechanism, Transformers have been the top choice when it comes to Natural Language Processing (NLP) tasks. Self-attention-based models are highly parallelizable and require substantially fewer parameters, making them much more computationally efficient, less prone to overfitting, and…
-
I Won’t Change Unless You Do
I Won’t Change Unless You Do In game theory, how can players ever settle on a decision if a better option might still be available? Maybe one player still wants to change their decision. But if they do, maybe the other player wants to change too. How can they ever hope to…
-
Practical Evaluation of Copula-based Survival Metrics: Beyond the Independent Censoring Assumption
Practical Evaluation of Copula-based Survival Metrics: Beyond the Independent Censoring Assumption arXiv:2502.19460v1 Announce Type: new Abstract: Conventional survival metrics, such as Harrell’s concordance index and the Brier Score, rely on the independent censoring assumption for valid inference in the presence of right-censored data. However, when instances are censored for reasons related to the event of…
-
Advancing calibration for stochastic agent-based models in epidemiology with Stein variational inference and Gaussian process surrogates
Advancing calibration for stochastic agent-based models in epidemiology with Stein variational inference and Gaussian process surrogates arXiv:2502.19550v1 Announce Type: new Abstract: Accurate calibration of stochastic agent-based models (ABMs) in epidemiology is crucial to make them useful in public health policy decisions and interventions. Traditional calibration methods, e.g., Markov Chain Monte Carlo (MCMC), that yield a…
-
Fast Debiasing of the LASSO Estimator
Fast Debiasing of the LASSO Estimator arXiv:2502.19825v1 Announce Type: new Abstract: In high-dimensional sparse regression, the Lasso estimator offers excellent theoretical guarantees but is well known to produce biased estimates. To address this, \cite{Javanmard2014} introduced a method to “debias” the Lasso estimates for a random sub-Gaussian sensing matrix $\boldsymbol{A}$. Their approach relies on computing an “approximate…
-
Multiple Linked Tensor Factorization
Multiple Linked Tensor Factorization arXiv:2502.20286v1 Announce Type: new Abstract: In biomedical research and other fields, it is now common to generate high content data that are both multi-source and multi-way. Multi-source data are collected from different high-throughput technologies while multi-way data are collected over multiple dimensions, yielding multiple tensor arrays. Integrative analysis of these data…
-
Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula
Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula arXiv:2502.20003v1 Announce Type: new Abstract: The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO,…
-
Write for Towards Data Science
Write for Towards Data Science Quick Links: Submission Guidelines How To Submit Your Work How to get your article ready for publication! Adding and using images Longform posts, columns, and online books FAQ Why become a contributor? We are looking for writers to propose up-to-date content focused on data science, machine learning, artificial intelligence and…
-
Debugging the Dreaded NaN
Debugging the Dreaded NaN You are training your latest AI model, anxiously watching as the loss steadily decreases when suddenly — boom! Your logs are flooded with NaNs (Not a Number) — your model is irreparably corrupted and you’re left staring at your screen in despair. To make matters worse, the NaNs don’t appear consistently.…
-
How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo
How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first. Previously, we covered the first two major stages of training an LLM: Pre-training — Learning from massive datasets to form a base…
-
Applications of Statistical Field Theory in Deep Learning
Applications of Statistical Field Theory in Deep Learning arXiv:2502.18553v1 Announce Type: new Abstract: Deep learning algorithms have made incredible strides in the past decade yet due to the complexity of these algorithms, the science of deep learning remains in its early stages. Being an experimentally driven field, it is natural to seek a theory of…
-
Learning and Computation of $\Phi$-Equilibria at the Frontier of Tractability
Learning and Computation of $\Phi$-Equilibria at the Frontier of Tractability arXiv:2502.18582v1 Announce Type: new Abstract: $\Phi$-equilibria — and the associated notion of $\Phi$-regret — are a powerful and flexible framework at the heart of online learning and game theory, whereby enriching the set of deviations $\Phi$ begets stronger notions of rationality. Recently, Daskalakis, Farina, Fishelson,…
-
Forecasting intermittent time series with Gaussian Processes and Tweedie likelihood
Forecasting intermittent time series with Gaussian Processes and Tweedie likelihood arXiv:2502.19086v1 Announce Type: new Abstract: We introduce the use of Gaussian Processes (GPs) for the probabilistic forecasting of intermittent time series. The model is trained in a Bayesian framework that accounts for the uncertainty about the latent function and marginalizes it out when making predictions.…
-
Nonlinear Sparse Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data
Nonlinear Sparse Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data arXiv:2502.18756v1 Announce Type: new Abstract: Motivation: Biomedical studies increasingly produce multi-view high-dimensional datasets (e.g., multi-omics) that demand integrative analysis. Existing canonical correlation analysis (CCA) and generalized CCA methods address at most two of the following three key aspects simultaneously: (i) nonlinear dependence, (ii) sparsity for…
-
Enhancing Gradient-based Discrete Sampling via Parallel Tempering
Enhancing Gradient-based Discrete Sampling via Parallel Tempering arXiv:2502.19240v1 Announce Type: new Abstract: While gradient-based discrete samplers are effective in sampling from complex distributions, they are susceptible to getting trapped in local minima, particularly in high-dimensional, multimodal discrete distributions, owing to the discontinuities inherent in these landscapes. To circumvent this issue, we combine parallel tempering, also…
-
The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines
The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines “You don’t have to be an expert to deceive someone, though you might need some expertise to reliably recognize when you are being deceived.” When my co-instructor and I start our quarterly lesson on deceptive visualizations for the data visualization course we teach at the University…
-
Nine Rules for SIMD Acceleration of Your Rust Code (Part 1)
Nine Rules for SIMD Acceleration of Your Rust Code (Part 1) Thanks to Ben Lichtman (B3NNY) at the Seattle Rust Meetup for pointing me in the right direction on SIMD. SIMD (Single Instruction, Multiple Data) operations have been a feature of Intel/AMD and ARM CPUs since the early 2000s. These operations enable you to, for example,…
-
LLaDA: The Diffusion Model That Could Redefine Language Generation
LLaDA: The Diffusion Model That Could Redefine Language Generation Introduction What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them? This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to…
-
Are GNNs doomed by the topology of their input graph?
Are GNNs doomed by the topology of their input graph? arXiv:2502.17739v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable success in learning from graph-structured data. However, the influence of the input graph’s topology on GNN behavior remains poorly understood. In this work, we explore whether GNNs are inherently limited by the structure…
-
An Overview of Large Language Models for Statisticians
An Overview of Large Language Models for Statisticians arXiv:2502.17814v1 Announce Type: new Abstract: Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI), exhibiting remarkable capabilities across diverse tasks such as text generation, reasoning, and decision-making. While their success has primarily been driven by advances in computational power and deep learning architectures,…
-
Conformal Prediction Under Generalized Covariate Shift with Posterior Drift
Conformal Prediction Under Generalized Covariate Shift with Posterior Drift arXiv:2502.17744v1 Announce Type: new Abstract: In many real applications of statistical learning, collecting sufficiently many training data is often expensive, time-consuming, or even unrealistic. In this case, a transfer learning approach, which aims to leverage knowledge from a related source domain to improve the learning performance…
-
Golden Ratio Mixing of Real and Synthetic Data for Stabilizing Generative Model Training
Golden Ratio Mixing of Real and Synthetic Data for Stabilizing Generative Model Training arXiv:2502.18049v1 Announce Type: new Abstract: Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies…
-
Near-Optimal Approximations for Bayesian Inference in Function Space
Near-Optimal Approximations for Bayesian Inference in Function Space arXiv:2502.18279v1 Announce Type: new Abstract: We propose a scalable inference algorithm for Bayes posteriors defined on a reproducing kernel Hilbert space (RKHS). Given a likelihood function and a Gaussian random element representing the prior, the corresponding Bayes posterior measure $\Pi_{\text{B}}$ can be obtained as the stationary distribution…
-
When Optimal is the Enemy of Good: High-Budget Differential Privacy for Medical AI
When Optimal is the Enemy of Good: High-Budget Differential Privacy for Medical AI Imagine you’re building your dream home. Just about everything is ready. All that’s left to do is pick out a front door. Since the neighborhood has a low crime rate, you decide you want a door with a standard lock — nothing too fancy,…
-
Efficient Data Handling in Python with Arrow
Efficient Data Handling in Python with Arrow 1. Introduction We’re all used to working with CSVs, JSON files… With the traditional libraries and for large datasets, these can be extremely slow to read, write and operate on, leading to performance bottlenecks (been there). It’s precisely with big amounts of data that being efficient handling the…
-
Breaking the Bottleneck: GPU-Optimised Video Processing for Deep Learning
Breaking the Bottleneck: GPU-Optimised Video Processing for Deep Learning Deep Learning (DL) applications often require processing video data for tasks such as object detection, classification, and segmentation. However, conventional video processing pipelines are typically inefficient for deep learning inference, leading to performance bottlenecks. In this post, we will leverage PyTorch and FFmpeg with NVIDIA hardware acceleration…
-
Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements
Exact Recovery of Sparse Binary Vectors from Generalized Linear Measurements arXiv:2502.16008v1 Announce Type: new Abstract: We consider the problem of exact recovery of a $k$-sparse binary vector from generalized linear measurements (such as logistic regression). We analyze the linear estimation algorithm (Plan, Vershynin, Yudovina, 2017), and also show information theoretic lower bounds on the number…
-
A Review of Causal Decision Making
A Review of Causal Decision Making arXiv:2502.16156v1 Announce Type: new Abstract: To make effective decisions, it is important to have a thorough understanding of the causal relationships among actions, environments, and outcomes. This review aims to surface three crucial aspects of decision-making through a causal lens: 1) the discovery of causal relationships through causal structure…
-
Rectifying Conformity Scores for Better Conditional Coverage
Rectifying Conformity Scores for Better Conditional Coverage arXiv:2502.16336v1 Announce Type: new Abstract: We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of…
-
Statistical Inference in Reinforcement Learning: A Selective Survey
Statistical Inference in Reinforcement Learning: A Selective Survey arXiv:2502.16195v1 Announce Type: new Abstract: Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to maximize the cumulative reward they receive. In healthcare, applying RL algorithms could assist patients in improving their health status. In ride-sharing platforms, applying RL algorithms could…
-
Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness
Subspace Recovery in Winsorized PCA: Insights into Accuracy and Robustness arXiv:2502.16391v1 Announce Type: new Abstract: In this paper, we explore the theoretical properties of subspace recovery using Winsorized Principal Component Analysis (WPCA), utilizing a common data transformation technique that caps extreme values to mitigate the impact of outliers. Despite the widespread use of winsorization in…
-
Enhancing RAG: Beyond Vanilla Approaches
Enhancing RAG: Beyond Vanilla Approaches Retrieval-Augmented Generation (RAG) is a powerful technique that enhances language models by incorporating external information retrieval mechanisms. While standard RAG implementations improve response relevance, they often struggle in complex retrieval scenarios. This article explores the limitations of a vanilla RAG setup and introduces advanced techniques to enhance its accuracy and…
-
6 Common LLM Customization Strategies Briefly Explained
6 Common LLM Customization Strategies Briefly Explained Why Customize LLMs? Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning. They require vast amounts of training data and training time, and they hold a large number of parameters. LLMs have revolutionized natural language processing, especially in the last 2 years, demonstrating remarkable…
-
Modifying Final Splits of Classification Tree for Fine-tuning Subpopulation Target in Policy Making
Modifying Final Splits of Classification Tree for Fine-tuning Subpopulation Target in Policy Making arXiv:2502.15072v1 Announce Type: new Abstract: Policymakers often use Classification and Regression Trees (CART) to partition populations based on binary outcomes and target subpopulations whose probability of the binary event exceeds a threshold. However, classic CART and knowledge distillation method whose student model…
-
Variational phylogenetic inference with products over bipartitions
Variational phylogenetic inference with products over bipartitions arXiv:2502.15110v1 Announce Type: new Abstract: Bayesian phylogenetics requires accurate and efficient approximation of posterior distributions over trees. In this work, we develop a variational Bayesian approach for ultrametric phylogenetic trees. We present a novel variational family based on coalescent times of a single-linkage clustering and derive a closed-form…
-
Tensor Product Neural Networks for Functional ANOVA Model
Tensor Product Neural Networks for Functional ANOVA Model arXiv:2502.15215v1 Announce Type: new Abstract: Interpretability for machine learning models is becoming more and more important as machine learning models become more complex. The functional ANOVA model, which decomposes a high-dimensional function into a sum of lower-dimensional functions, so-called components, is one of the most…
-
Fréchet Cumulative Covariance Net for Deep Nonlinear Sufficient Dimension Reduction with Random Objects
Fréchet Cumulative Covariance Net for Deep Nonlinear Sufficient Dimension Reduction with Random Objects arXiv:2502.15374v1 Announce Type: new Abstract: Nonlinear sufficient dimension reduction \citep{libing_generalSDR}, which constructs nonlinear low-dimensional representations to summarize essential features of high-dimensional data, is an important branch of representation learning. However, most existing methods are not applicable when the response variables are complex non-Euclidean…
-
Jeffrey’s update rule as a minimizer of Kullback-Leibler divergence
Jeffrey’s update rule as a minimizer of Kullback-Leibler divergence arXiv:2502.15504v1 Announce Type: new Abstract: In this paper, we show a more concise and high level proof than the original one, derived by researcher Bart Jacobs, for the following theorem: in the context of Bayesian update rules for learning or updating internal states that produce predictions,…
-
Weekly Entering & Transitioning – Thread 24 Feb, 2025 – 03 Mar, 2025
Weekly Entering & Transitioning – Thread 24 Feb, 2025 – 03 Mar, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…
-
The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data
The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data What is synthetic data? Data created by a computer intended to replicate or augment existing data. Why is it useful? We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society…
-
Do European M&Ms Actually Taste Better than American M&Ms?
Do European M&Ms Actually Taste Better than American M&Ms? (Oh, I am the only one who’s been asking this question…? Hm. Well, if you have a minute, please enjoy this exploratory Data Analysis — featuring experimental design, statistics, and interactive visualization — applied a bit too earnestly to resolve an international debate.) 1. Introduction 1.1…
-
Talking about Games
Talking about Games Game theory is a field of research that is quite prominent in Economics but rather unpopular in other scientific disciplines. However, the concepts used in game theory can be of interest to a wider audience, including data scientists, statisticians, computer scientists or psychologists, to name just a few. This article is the…
-
Towards a perturbation-based explanation for medical AI as differentiable programs
Towards a perturbation-based explanation for medical AI as differentiable programs arXiv:2502.14001v1 Announce Type: new Abstract: Recent advancement in machine learning algorithms reaches a point where medical devices can be equipped with artificial intelligence (AI) models for diagnostic support and routine automation in clinical settings. In medicine and healthcare, there is a particular demand for sufficient…
-
New Lower Bounds for Stochastic Non-Convex Optimization through Divergence Composition
New Lower Bounds for Stochastic Non-Convex Optimization through Divergence Composition arXiv:2502.14060v1 Announce Type: new Abstract: We study fundamental limits of first-order stochastic optimization in a range of nonconvex settings, including L-smooth functions satisfying Quasar-Convexity (QC), Quadratic Growth (QG), and Restricted Secant Inequalities (RSI). While the convergence properties of standard algorithms are well-understood in deterministic regimes,…
-
Multi-Objective Bayesian Optimization for Networked Black-Box Systems: A Path to Greener Profits and Smarter Designs
Multi-Objective Bayesian Optimization for Networked Black-Box Systems: A Path to Greener Profits and Smarter Designs arXiv:2502.14121v1 Announce Type: new Abstract: Designing modern industrial systems requires balancing several competing objectives, such as profitability, resilience, and sustainability, while accounting for complex interactions between technological, economic, and environmental factors. Multi-objective optimization (MOO) methods are commonly used to navigate…
-
Conformal Prediction under Lévy-Prokhorov Distribution Shifts: Robustness to Local and Global Perturbations
Conformal Prediction under Lévy-Prokhorov Distribution Shifts: Robustness to Local and Global Perturbations arXiv:2502.14105v1 Announce Type: new Abstract: Conformal prediction provides a powerful framework for constructing prediction intervals with finite-sample guarantees, yet its robustness under distribution shifts remains a significant challenge. This paper addresses this limitation by modeling distribution shifts using Lévy-Prokhorov (LP) ambiguity sets, which…
-
Prediction-Powered Adaptive Shrinkage Estimation
Prediction-Powered Adaptive Shrinkage Estimation arXiv:2502.14166v1 Announce Type: new Abstract: Prediction-Powered Inference (PPI) is a powerful framework for enhancing statistical estimates by combining limited gold-standard data with machine learning (ML) predictions. While prior work has demonstrated PPI’s benefits for individual statistical tasks, modern applications require answering numerous parallel statistical questions. We introduce Prediction-Powered Adaptive Shrinkage (PAS),…
-
Unraveling Spatially Variable Genes: A Statistical Perspective on Spatial Transcriptomics
Unraveling Spatially Variable Genes: A Statistical Perspective on Spatial Transcriptomics The article was written by Guanao Yan, Ph.D. student of Statistics and Data Science at UCLA. Guanao is the first author of the Nature Communications review article [1]. Spatially resolved transcriptomics (SRT) is revolutionizing genomics by enabling the high-throughput measurement of gene expression while…
-
Reinforcement Learning with PDEs
Reinforcement Learning with PDEs Previously we discussed applying reinforcement learning to Ordinary Differential Equations (ODEs) by integrating ODEs within gymnasium. ODEs are a powerful tool that can describe a wide range of systems but are limited to a single independent variable. Partial Differential Equations (PDEs) are differential equations involving derivatives with respect to multiple variables that can cover…
-
How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API
How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API For a long time, one of the common ways to start new Node.js projects was using boilerplate templates. These templates help developers reuse familiar code structures and implement standard features, such as access to cloud file storage. With the latest developments in LLM,…
-
Don’t Let Conda Eat Your Hard Drive
Don’t Let Conda Eat Your Hard Drive If you’re an Anaconda user, you know that conda environments help you manage package dependencies, avoid compatibility conflicts, and share your projects with others. Unfortunately, they can also take over your computer’s hard drive. I write lots of computer tutorials and to keep them organized, each has a dedicated folder…
-
AI Agents from Zero to Hero – Part 1
AI Agents from Zero to Hero – Part 1 Intro AI Agents are autonomous programs that perform tasks, make decisions, and communicate with others. Normally, they use a set of tools to help complete tasks. In GenAI applications, these Agents process sequential reasoning and can use external tools (like web searches or database queries) when…
-
Model selection for behavioral learning data and applications to contextual bandits
Model selection for behavioral learning data and applications to contextual bandits arXiv:2502.13186v1 Announce Type: new Abstract: Learning for animals or humans is the process that leads to behaviors better adapted to the environment. This process highly depends on the individual that learns and is usually observed only through the individual’s actions. This article presents ways…
-
Task Shift: From Classification to Regression in Overparameterized Linear Models
Task Shift: From Classification to Regression in Overparameterized Linear Models arXiv:2502.13285v1 Announce Type: new Abstract: Modern machine learning methods have recently demonstrated remarkable capability to generalize under task shift, where latent knowledge is transferred to a different, often more difficult, task under a similar data distribution. We investigate this phenomenon in an overparameterized linear regression…
-
An Efficient Permutation-Based Kernel Two-Sample Test
An Efficient Permutation-Based Kernel Two-Sample Test arXiv:2502.13570v1 Announce Type: new Abstract: Two-sample hypothesis testing, that is, determining whether two sets of data are drawn from the same distribution, is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due…
-
Identifying metric structures of deep latent variable models
Identifying metric structures of deep latent variable models arXiv:2502.13757v1 Announce Type: new Abstract: Deep latent variable models learn condensed representations of data that, hopefully, reflect the inner workings of the studied phenomena. Unfortunately, these latent representations are not statistically identifiable, meaning they cannot be uniquely determined. Domain experts, therefore, need to tread carefully when interpreting…
-
Graph Signal Inference by Learning Narrowband Spectral Kernels
Graph Signal Inference by Learning Narrowband Spectral Kernels arXiv:2502.13686v1 Announce Type: new Abstract: While a common assumption in graph signal analysis is the smoothness of the signals or the band-limitedness of their spectrum, in many instances the spectrum of real graph data may be concentrated at multiple regions of the spectrum, possibly including mid-to-high-frequency components.…
-
Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge
Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge “I train models, analyze data and create dashboards — why should I care about Containers?” Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on…
-
Advanced Time Intelligence in DAX with Performance in Mind
Advanced Time Intelligence in DAX with Performance in Mind We all know the usual Time Intelligence functions based on years, quarters, months, and days. But sometimes, we need to perform more exotic time intelligence calculations. And we should not forget to consider performance while programming the measures. Introduction There are many DAX functions in Power BI…
-
Multimodal Search Engine Agents Powered by BLIP-2 and Gemini
Multimodal Search Engine Agents Powered by BLIP-2 and Gemini This post was co-authored with Rafael Guedes. Introduction Traditional models can only process a single type of data, such as text, images, or tabular data. Multimodality is a trending concept in the AI research community, referring to a model’s ability to learn from multiple types of…
-
Formulation of Feature Circuits with Sparse Autoencoders in LLM
Formulation of Feature Circuits with Sparse Autoencoders in LLM Large Language Models (LLMs) have made impressive progress, and these large models can do a variety of tasks, from generating human-like text to answering questions. However, understanding how these models work remains challenging, especially due to a phenomenon called superposition where features are mixed into one…
-
Zero Human Code: What I Learned from Forcing AI to Build (and Fix) Its Own Code for 27 Straight Days
Zero Human Code: What I Learned from Forcing AI to Build (and Fix) Its Own Code for 27 Straight Days 27 days, 1,700+ commits, 99.9% AI-generated code The narrative around AI development tools has become increasingly detached from reality. YouTube is filled with claims of building complex applications in hours using AI assistants. The…
-
Suboptimal Shapley Value Explanations
Suboptimal Shapley Value Explanations arXiv:2502.12209v1 Announce Type: new Abstract: Deep Neural Networks (DNNs) have demonstrated strong capacity in supporting a wide variety of applications. Shapley value has emerged as a prominent tool to analyze feature importance to help people understand the inference process of deep neural models. Computing Shapley value function requires choosing a baseline…
-
The Majority Vote Paradigm Shift: When Popular Meets Optimal
The Majority Vote Paradigm Shift: When Popular Meets Optimal arXiv:2502.12581v1 Announce Type: new Abstract: Reliably labelling data typically requires annotations from multiple human workers. However, humans are far from being perfect. Hence, it is a common practice to aggregate labels gathered from multiple annotators to make a more confident estimate of the true label. Among…
-
Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation
Generalized Kernel Inducing Points by Duality Gap for Dataset Distillation arXiv:2502.12607v1 Announce Type: new Abstract: We propose Duality Gap KIP (DGKIP), an extension of the Kernel Inducing Points (KIP) method for dataset distillation. While existing dataset distillation methods often rely on bi-level optimization, DGKIP eliminates the need for such optimization by leveraging duality theory in…
-
Green LIME: Improving AI Explainability through Design of Experiments
Green LIME: Improving AI Explainability through Design of Experiments arXiv:2502.12753v1 Announce Type: new Abstract: In artificial intelligence (AI), the complexity of many models and processes often surpasses human interpretability, making it challenging to understand why a specific prediction is made. This lack of transparency is particularly problematic in critical fields like healthcare, where trust in…
-
Federated Variational Inference for Bayesian Mixture Models
Federated Variational Inference for Bayesian Mixture Models arXiv:2502.12684v1 Announce Type: new Abstract: We present a federated learning approach for Bayesian model-based clustering of large-scale binary and categorical datasets. We introduce a principled ‘divide and conquer’ inference procedure using variational inference with local merge and delete moves within batches of the data in parallel, followed by…
-
How to Fine-Tune DistilBERT for Emotion Classification
How to Fine-Tune DistilBERT for Emotion Classification The customer support teams were drowning in an overwhelming volume of customer inquiries at every company I’ve worked at. Have you had similar experiences? What if I told you that you could use AI to automatically identify, categorize, and even resolve the most common issues? By fine-tuning a…
-
Learning How to Play Atari Games Through Deep Neural Networks
Learning How to Play Atari Games Through Deep Neural Networks In July 1959, Arthur Samuel developed one of the first agents to play the game of checkers. What constitutes an agent that plays checkers can be best described in Samuel’s own words, “…a computer [that] can be programmed so that it will learn to play…
-
Honestly Uncertain
Honestly Uncertain Ethical issues aside, should you be honest when asked how certain you are about some belief? Of course, it depends. In this blog post, you’ll learn on what. Different ways of evaluating probabilistic predictions come with dramatically different degrees of “optimal honesty”. Perhaps surprisingly, the linear function that assigns +1 to true and fully…
-
How LLMs Work: Pre-Training to Post-Training, Neural Networks, Hallucinations, and Inference
How LLMs Work: Pre-Training to Post-Training, Neural Networks, Hallucinations, and Inference With the recent explosion of interest in large language models (LLMs), they often seem almost magical. But let’s demystify them. I wanted to step back and unpack the fundamentals — breaking down how LLMs are built, trained, and fine-tuned to become the AI systems we interact…
-
The Future of Data: How Decision Intelligence is Revolutionizing Data
The Future of Data: How Decision Intelligence is Revolutionizing Data In the past few years, technology and AI have evolved more than ever. As I read about the new concepts in tech and learn new skills and techniques each day, I feel in a state of limbo — there is so much content to consume and yet,…
-
Forecasting time series with constraints
Forecasting time series with constraints arXiv:2502.10485v1 Announce Type: new Abstract: Time series forecasting presents unique challenges that limit the effectiveness of traditional machine learning algorithms. To address these limitations, various approaches have incorporated linear constraints into learning algorithms, such as generalized additive models and hierarchical forecasting. In this paper, we propose a unified framework for…
-
Weighted quantization using MMD: From mean field to mean shift via gradient flows
Weighted quantization using MMD: From mean field to mean shift via gradient flows arXiv:2502.10600v1 Announce Type: new Abstract: Approximating a probability distribution using a set of particles is a fundamental problem in machine learning and statistics, with applications including clustering and quantization. Formally, we seek a finite weighted mixture of Dirac measures that best approximates…
-
Generative Adversarial Networks for High-Dimensional Item Factor Analysis: A Deep Adversarial Learning Algorithm
Generative Adversarial Networks for High-Dimensional Item Factor Analysis: A Deep Adversarial Learning Algorithm arXiv:2502.10650v1 Announce Type: new Abstract: Advances in deep learning and representation learning have transformed item factor analysis (IFA) in the item response theory (IRT) literature by enabling more efficient and accurate parameter estimation. Variational Autoencoders (VAEs) have been one of the most…
-
Batch-Adaptive Annotations for Causal Inference with Complex-Embedded Outcomes
Batch-Adaptive Annotations for Causal Inference with Complex-Embedded Outcomes arXiv:2502.10605v1 Announce Type: new Abstract: Estimating the causal effects of an intervention on outcomes is crucial. But often in domains such as healthcare and social services, this critical information about outcomes is documented by unstructured text, e.g. clinical notes in healthcare or case notes in social services.…
-
Dynamic Influence Tracker: Measuring Time-Varying Sample Influence During Training
Dynamic Influence Tracker: Measuring Time-Varying Sample Influence During Training arXiv:2502.10793v1 Announce Type: new Abstract: Existing methods for measuring training sample influence on models only provide static, overall measurements, overlooking how sample influence changes during training. We propose Dynamic Influence Tracker (DIT), which captures the time-varying sample influence across arbitrary time windows during training. DIT offers…
-
Tutorial: Semantic Clustering of User Messages with LLM Prompts
Tutorial: Semantic Clustering of User Messages with LLM Prompts As a Developer Advocate, it’s challenging to keep up with user forum messages and understand the big picture of what users are saying. There’s plenty of valuable content — but how can you quickly spot the key conversations? In this tutorial, I’ll show you an AI…
-
On-Device Machine Learning in Spatial Computing
On-Device Machine Learning in Spatial Computing The landscape of computing is undergoing a profound transformation with the emergence of spatial computing platforms (VR and AR). As we step into this new era, the intersection of virtual reality, augmented reality, and on-device machine learning presents unprecedented opportunities for developers to create experiences that seamlessly blend digital content…
-
Algorithmic contiguity from low-degree conjecture and applications in correlated random graphs
Algorithmic contiguity from low-degree conjecture and applications in correlated random graphs arXiv:2502.09832v1 Announce Type: new Abstract: In this paper, assuming a natural strengthening of the low-degree conjecture, we provide evidence of computational hardness for two problems: (1) the (partial) matching recovery problem in the sparse correlated Erdős-Rényi graphs $\mathcal G(n,q;\rho)$ when the edge-density $q=n^{-1+o(1)}$ and…
-
On Volume Minimization in Conformal Regression
On Volume Minimization in Conformal Regression arXiv:2502.09985v1 Announce Type: new Abstract: We study the question of volume optimality in split conformal regression, a topic still poorly understood in comparison to coverage control. Using the fact that the calibration step can be seen as an empirical volume minimization problem, we first derive a finite-sample upper-bound on…
-
Estimation of the Learning Coefficient Using Empirical Loss
Estimation of the Learning Coefficient Using Empirical Loss arXiv:2502.09998v1 Announce Type: new Abstract: The learning coefficient plays a crucial role in analyzing the performance of information criteria, such as the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC), which Sumio Watanabe developed to assess model generalization ability. In regular statistical…
-
Improved Online Confidence Bounds for Multinomial Logistic Bandits
Improved Online Confidence Bounds for Multinomial Logistic Bandits arXiv:2502.10020v1 Announce Type: new Abstract: In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly…
-
Combinatorial Reinforcement Learning with Preference Feedback
Combinatorial Reinforcement Learning with Preference Feedback arXiv:2502.10158v1 Announce Type: new Abstract: In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action (an assortment of multiple items) to a user, whose preference feedback follows a multinomial logistic (MNL) model. This framework allows us to model real-world scenarios, particularly those…
-
Weekly Entering & Transitioning – Thread 17 Feb, 2025 – 24 Feb, 2025
Weekly Entering & Transitioning – Thread 17 Feb, 2025 – 24 Feb, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…
-
How I Became A Machine Learning Engineer (No CS Degree, No Bootcamp)
How I Became A Machine Learning Engineer (No CS Degree, No Bootcamp) Machine learning and AI are among the most popular topics nowadays, especially within the tech space. I am fortunate enough to work and develop with these technologies every day as a machine learning engineer! In this article, I will walk you through my…
-
➡️ Start Asking Your Data ‘Why?’ — A Gentle Intro To Causality
➡️ Start Asking Your Data ‘Why?’ — A Gentle Intro To Causality Correlation does not imply causation. It turns out, however, that with some simple ingenious tricks one can, potentially, unveil causal relationships within standard observational data, without having to resort to expensive randomised control trials. This post is targeted towards anyone making data driven…
-
Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning
Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning Introduction Data science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer…
-
Publish Interactive Data Visualizations for Free with Python and Marimo
Publish Interactive Data Visualizations for Free with Python and Marimo Working in Data Science, it can be hard to share insights from complex datasets using only static figures. All the facets that describe the shape and meaning of interesting data are not always captured in a handful of pre-generated figures. While we have powerful technologies…
-
A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection
A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection arXiv:2502.08695v1 Announce Type: new Abstract: Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we…
-
Optimal Algorithms in Linear Regression under Covariate Shift: On the Importance of Precondition
Optimal Algorithms in Linear Regression under Covariate Shift: On the Importance of Precondition arXiv:2502.09047v1 Announce Type: new Abstract: A common pursuit in modern statistical learning is to attain satisfactory generalization out of the source data distribution (OOD). In theory, the challenge remains unsolved even under the canonical setting of covariate shift for the linear model.…
-
Off-Policy Evaluation for Recommendations with Missing-Not-At-Random Rewards
Off-Policy Evaluation for Recommendations with Missing-Not-At-Random Rewards arXiv:2502.08993v1 Announce Type: new Abstract: Unbiased recommender learning (URL) and off-policy evaluation/learning (OPE/L) techniques are effective in addressing the data bias caused by display position and logging policies, thereby consistently improving the performance of recommendations. However, when both biases exist in the logged data, these estimators may suffer…
-
Non-asymptotic Analysis of Diffusion Annealed Langevin Monte Carlo for Generative Modelling
Non-asymptotic Analysis of Diffusion Annealed Langevin Monte Carlo for Generative Modelling arXiv:2502.09306v1 Announce Type: new Abstract: We investigate the theoretical properties of general diffusion (interpolation) paths and their Langevin Monte Carlo implementation, referred to as diffusion annealed Langevin Monte Carlo (DALMC), under weak conditions on the data distribution. Specifically, we analyse and provide non-asymptotic error…
-
A Differentiable Rank-Based Objective For Better Feature Learning
A Differentiable Rank-Based Objective For Better Feature Learning arXiv:2502.09445v1 Announce Type: new Abstract: In this paper, we leverage existing statistical methods to better understand feature learning from data. We tackle this by modifying the model-free variable selection method, Feature Ordering by Conditional Independence (FOCI), which is introduced in \cite{azadkia2021simple}. While FOCI is based on a…
-
Building a Data Engineering Center of Excellence
Building a Data Engineering Center of Excellence As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice…
-
Learnings from a Machine Learning Engineer — Part 5: The Training
Learnings from a Machine Learning Engineer — Part 5: The Training In this fifth part of my series, I will outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment. AI/ML engineers would prefer to focus on model training and data engineering, but the reality…
-
Learnings from a Machine Learning Engineer — Part 3: The Evaluation
Learnings from a Machine Learning Engineer — Part 3: The Evaluation In this third part of my series, I will explore the evaluation process which is a critical piece that will lead to a cleaner data set and elevate your model performance. We will see the difference between evaluation of a trained model (one not yet in…