Category: aimlds

Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code

Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code Valuable tips to reduce your DAGs’ parse time and save resources. Photo by Dan Roizer on Unsplash Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows…

January 31, 2025
Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data How much data does AI really need? TLDR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST¹ to classify handwritten digits. Best runs for “furthest-from-centroid” selection compared to full dataset. Image by author. What if I told you…

January 31, 2025
Actually, Being a Data Scientist is Awesome

Actually, Being a Data Scientist is Awesome Don’t let the doom and gloom get to you Continue reading on Towards Data Science » Marina Wyss – Gratitude Driven Go to original source

January 31, 2025
Navigating Data Science Content: Recognizing Common Pitfalls, Part 1

Navigating Data Science Content: Recognizing Common Pitfalls, Part 1 Uncovering and correcting misconceptions in online data science content to help you learn more effectively Continue reading on Towards Data Science » Geremie Yeo Go to original source

January 31, 2025
Near-Optimal Algorithms for Omniprediction

Near-Optimal Algorithms for Omniprediction arXiv:2501.17205v1 Announce Type: new Abstract: Omnipredictors are simple prediction functions that encode loss-minimizing predictions with respect to a hypothesis class $H$, simultaneously for every loss function within a class of losses $L$. In this work, we give near-optimal learning algorithms for omniprediction, in both the online and offline settings. To begin,…

January 30, 2025
Testing Conditional Mean Independence Using Generative Neural Networks

Testing Conditional Mean Independence Using Generative Neural Networks arXiv:2501.17345v1 Announce Type: new Abstract: Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional…

January 30, 2025
A Survey on Cluster-based Federated Learning

A Survey on Cluster-based Federated Learning arXiv:2501.17512v1 Announce Type: new Abstract: As the industrial and commercial use of Federated Learning (FL) has expanded, so has the need for optimized algorithms. In settings where FL clients’ data is non-independently and identically distributed (non-IID) and with highly heterogeneous distributions, the baseline FL approach seems to fall short.…

January 30, 2025
Exact characterization of $\epsilon$-Safe Decision Regions for exponential family distributions and Multi Cost SVM approximation

Exact characterization of $\epsilon$-Safe Decision Regions for exponential family distributions and Multi Cost SVM approximation arXiv:2501.17731v1 Announce Type: new Abstract: Probabilistic guarantees on the prediction of data-driven classifiers are necessary to define models that can be considered reliable. This is a key requirement for modern machine learning in which the goodness of a system is…

January 30, 2025
Sequential Learning of the Pareto Front for Multi-objective Bandits

Sequential Learning of the Pareto Front for Multi-objective Bandits arXiv:2501.17513v1 Announce Type: new Abstract: We study the problem of sequential learning of the Pareto front in multi-objective multi-armed bandits. An agent is faced with K possible arms to pull. At each turn she picks one, and receives a vector-valued reward. When she thinks she has…

January 30, 2025
Great Books for AI Engineering

Great Books for AI Engineering 10 books with valuable insights about AI science and engineering Great books for AI Engineering — Plus ‘Brave New Words’ (Image is Author’s own work) A few years ago I recommended 21 books in Great Books for Data Science and Great Books for Data Science 2. Since then a lot has changed. While…

January 30, 2025
AI Ethics for the Everyday User — Why Should You Care?

AI Ethics for the Everyday User — Why Should You Care? A beginner’s guide to understanding the importance of ethics in artificial intelligence Continue reading on Towards Data Science » Murtaza Ali Go to original source

January 30, 2025
NLP Illustrated, Part 3: Word2Vec

NLP Illustrated, Part 3: Word2Vec An exhaustive and illustrated guide to Word2Vec with code! Continue reading on Towards Data Science » Shreya Rao Go to original source

January 30, 2025
The Challenges and Realities of Being a Data Scientist

The Challenges and Realities of Being a Data Scientist Some harsh truths behind the field of data science Continue reading on Towards Data Science » Egor Howell Go to original source

January 30, 2025
Machine Learning Incidents in AdTech

Machine Learning Incidents in AdTech Source: https://unsplash.com/photos/a-couple-of-signs-that-are-on-a-fence-xXbQIrWH2_A Challenges with deep learning in production One of the biggest challenges I encountered in my career as a data scientist was migrating the core algorithms in a mobile AdTech platform from classic machine learning models to deep learning. I worked on a Demand Side Platform (DSP) for user…

January 30, 2025
Nonparametric Sparse Online Learning of the Koopman Operator

Nonparametric Sparse Online Learning of the Koopman Operator arXiv:2501.16489v1 Announce Type: new Abstract: The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman…

January 29, 2025
Variational Schrödinger Momentum Diffusion

Variational Schrödinger Momentum Diffusion arXiv:2501.16675v1 Announce Type: new Abstract: The momentum Schrödinger Bridge (mSB) has emerged as a leading method for accelerating generative diffusion processes and reducing transport costs. However, the lack of simulation-free properties inevitably results in high training costs and affects scalability. To obtain a trade-off between transport properties and scalability, we introduce…

January 29, 2025
Exponential Family Attention

Exponential Family Attention arXiv:2501.16790v1 Announce Type: new Abstract: The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial,…

January 29, 2025
Towards the Generalization of Multi-view Learning: An Information-theoretical Analysis

Towards the Generalization of Multi-view Learning: An Information-theoretical Analysis arXiv:2501.16768v1 Announce Type: new Abstract: Multiview learning has drawn widespread attention for its efficacy in leveraging cross-view consensus and complementarity information to achieve a comprehensive representation of data. While multi-view learning has undergone vigorous development and achieved remarkable success, the theoretical understanding of its generalization behavior…

January 29, 2025
Marginal and Conditional Importance Measures from Machine Learning Models and Their Relationship with Conditional Average Treatment Effect

Marginal and Conditional Importance Measures from Machine Learning Models and Their Relationship with Conditional Average Treatment Effect arXiv:2501.16988v1 Announce Type: new Abstract: Interpreting black-box machine learning models is challenging due to their strong dependence on data and inherently non-parametric nature. This paper reintroduces the concept of importance through “Marginal Variable Importance Metric” (MVIM), a model-agnostic…

January 29, 2025
Analyze Tornado Data with Python and GeoPandas

Analyze Tornado Data with Python and GeoPandas Insights from NOAA’s public domain database Continue reading on Towards Data Science » Lee Vaughan Go to original source

January 29, 2025
Basics of Probability Notations

Basics of Probability Notations Union, Intersection, Independence, Disjoint, Complement: Advanced Probability for Data Science Series (1) Continue reading on Towards Data Science » Sunghyun Ahn Go to original source

January 29, 2025
How GenAI Tools Have Changed My Work as a Data Scientist

How GenAI Tools Have Changed My Work as a Data Scientist An overview of the 4 use cases and 6 GenAI tools I use Continue reading on Towards Data Science » Jonte Dancker Go to original source

January 29, 2025
How to do Date calculations in DAX

How to do Date calculations in DAX Moving back and forth in time is a common task for Time Intelligence in DAX. Let’s take a deeper look at how DATEADD() works. Continue reading on Towards Data Science » Salvatore Cagliari Go to original source

January 29, 2025
Who is Right? The Dean or the Students?

Who is Right? The Dean or the Students? A cautionary tale on two perspectives on averaging Continue reading on Towards Data Science » Paolo Molignini, PhD Go to original source

January 29, 2025
ED-Filter: Dynamic Feature Filtering for Eating Disorder Classification

ED-Filter: Dynamic Feature Filtering for Eating Disorder Classification arXiv:2501.14785v1 Announce Type: new Abstract: Eating disorders (ED) are critical psychiatric problems that have alarmed the mental health community. Mental health professionals are increasingly recognizing the utility of data derived from social media platforms such as Twitter. However, high dimensionality and extensive feature sets of Twitter data…

January 28, 2025
Explaining Categorical Feature Interactions Using Graph Covariance and LLMs

Explaining Categorical Feature Interactions Using Graph Covariance and LLMs arXiv:2501.14932v1 Announce Type: new Abstract: Modern datasets often consist of numerous samples with abundant features and associated timestamps. Analyzing such datasets to uncover underlying events typically requires complex statistical methods and substantial domain expertise. A notable example, and the primary data focus of this paper, is…

January 28, 2025
Median of Forests for Robust Density Estimation

Median of Forests for Robust Density Estimation arXiv:2501.15157v1 Announce Type: new Abstract: Robust density estimation refers to the consistent estimation of the density function even when the data is contaminated by outliers. We find that existing forest density estimation at a certain point is inherently resistant to the outliers outside the cells containing the point,…

January 28, 2025
Conformal Inference of Individual Treatment Effects Using Conditional Density Estimates

Conformal Inference of Individual Treatment Effects Using Conditional Density Estimates arXiv:2501.14933v1 Announce Type: new Abstract: In an era where diverse and complex data are increasingly accessible, the precise prediction of individual treatment effects (ITE) becomes crucial across fields such as healthcare, economics, and public policy. Current state-of-the-art approaches, while providing valid prediction intervals through Conformal…

January 28, 2025
A Review on Self-Supervised Learning for Time Series Anomaly Detection: Recent Advances and Open Challenges

A Review on Self-Supervised Learning for Time Series Anomaly Detection: Recent Advances and Open Challenges arXiv:2501.15196v1 Announce Type: new Abstract: Time series anomaly detection presents various challenges due to the sequential and dynamic nature of time-dependent data. Traditional unsupervised methods frequently encounter difficulties in generalization, often overfitting to known normal patterns observed during training and…

January 28, 2025
Building a Regression Model: Delivery Duration Prediction

Building a Regression Model: Delivery Duration Prediction Building a Regression Model to Predict Delivery Durations: A Practical Guide E2E walkthrough for approaching a regression modeling task In this article, we’re going to walk through the process of building a regression model — from dataset cleaning & preparation, to model training & evaluation. The specific regression task we will…

January 28, 2025
Beyond Causal Language Modeling

Beyond Causal Language Modeling A deep dive into “Not All Tokens Are What You Need for Pretraining” Introduction A few days ago, I had the chance to present at a local reading group that focused on some of the most exciting and insightful papers from NeurIPS 2024. As a presenter, I selected a paper titled…

January 28, 2025
Build a Decision Tree in Polars from Scratch

Build a Decision Tree in Polars from Scratch Explore decision trees with polars backend Photo by Leonard Laub on Unsplash Decision tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as sklearn,…

January 28, 2025
How to Implement Guardrails for Your AI Agents with CrewAI

How to Implement Guardrails for Your AI Agents with CrewAI LLM Agents are non-deterministic by nature: implement proper guardrails for your AI Application. Continue reading on Towards Data Science » Alessandro Romano Go to original source

January 28, 2025
Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus

Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus Why descriptive statistics aren’t enough and plotting your data is always essential Continue reading on Towards Data Science » Maria Mouschoutzi, PhD Go to original source

January 28, 2025
Distributionally Robust Coreset Selection under Covariate Shift

Distributionally Robust Coreset Selection under Covariate Shift arXiv:2501.14253v1 Announce Type: new Abstract: Coreset selection, which involves selecting a small subset from an existing training dataset, is an approach to reducing training data, and various approaches have been proposed for this method. In practical situations where these methods are employed, it is often the case that…

January 27, 2025
EFiGP: Eigen-Fourier Physics-Informed Gaussian Process for Inference of Dynamic Systems

EFiGP: Eigen-Fourier Physics-Informed Gaussian Process for Inference of Dynamic Systems arXiv:2501.14107v1 Announce Type: new Abstract: Parameter estimation and trajectory reconstruction for data-driven dynamical systems governed by ordinary differential equations (ODEs) are essential tasks in fields such as biology, engineering, and physics. These inverse problems — estimating ODE parameters from observational data — are particularly challenging…

January 27, 2025
Statistical Verification of Linear Classifiers

Statistical Verification of Linear Classifiers arXiv:2501.14430v1 Announce Type: new Abstract: We propose a homogeneity test closely related to the concept of linear separability between two samples. Using the test one can answer the question whether a linear classifier is merely “random” or effectively captures differences between two classes. We focus on establishing upper bounds for…

January 27, 2025
coverforest: Conformal Predictions with Random Forest in Python

coverforest: Conformal Predictions with Random Forest in Python arXiv:2501.14570v1 Announce Type: new Abstract: Conformal prediction provides a framework for uncertainty quantification, specifically in the forms of prediction intervals and sets with distribution-free guaranteed coverage. While recent cross-conformal techniques such as CV+ and Jackknife+-after-bootstrap achieve better data efficiency than traditional split conformal methods, they incur substantial…

January 27, 2025
Optimal Transport Barycenter via Nonconvex-Concave Minimax Optimization

Optimal Transport Barycenter via Nonconvex-Concave Minimax Optimization arXiv:2501.14635v1 Announce Type: new Abstract: The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the unregularized barycenter for discretized probability distributions on point clouds is a challenging task when…

January 27, 2025
Weekly Entering & Transitioning – Thread 27 Jan, 2025 – 03 Feb, 2025

Weekly Entering & Transitioning – Thread 27 Jan, 2025 – 03 Feb, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

January 27, 2025
Your Neural Network Can’t Explain This. TMLE to the Rescue!

Your Neural Network Can’t Explain This. TMLE to the Rescue! Targeted Maximum Likelihood Estimation (TMLE) helps you explain patterns where other techniques fall short Continue reading on Towards Data Science » Ari Joury, PhD Go to original source

January 27, 2025
Optimising Budgets With Marketing Mix Models In Python

Optimising Budgets With Marketing Mix Models In Python Part 3 of a hands-on guide to help you master MMM in pymc Photo by Towfiqu barbhuiya on Unsplash What is this series about? Welcome to part 3 of my series on marketing mix modelling (MMM), a hands-on guide to help you master MMM. Throughout this series, we’ll cover key…

January 27, 2025
[Official] 2024 End of Year Salary Sharing thread

[Official] 2024 End of Year Salary Sharing thread This is the official thread for sharing your current salaries (or recent offers). See last year’s Salary Sharing thread here. There was also an unofficial one from an hour ago here. Please only post salaries/offers if you’re including hard numbers, but feel free to use a throwaway…

January 26, 2025
How Cheap Mortgages Transformed Poland’s Real Estate Market

How Cheap Mortgages Transformed Poland’s Real Estate Market Insights from a synthetic control group Continue reading on Towards Data Science » Lukasz Szubelak Go to original source

January 26, 2025
Choosing Classification Model Evaluation Criteria

Choosing Classification Model Evaluation Criteria Is Recall / Precision better than Sensitivity / Specificity? Continue reading on Towards Data Science » Viyaleta Apgar Go to original source

January 26, 2025
Understanding Emergent Capabilities in LLMs: Lessons from Biological Systems

Understanding Emergent Capabilities in LLMs: Lessons from Biological Systems How the fundamental laws of natural systems help explain AI’s unexpected abilities Continue reading on Towards Data Science » Javier Marin Go to original source

January 25, 2025
Deep Learning for Click Prediction in Mobile AdTech

Deep Learning for Click Prediction in Mobile AdTech Source: https://pixabay.com/illustrations/rays-stars-light-explosion-galaxy-9350519/ Machine Learning for Real-Time Bidding The past few years were a revolution for the mobile advertising and gaming industries, with the broad adoption of neural networks for advertising tasks, including click prediction. This migration occurred prior to the success of Large Language Models (LLMs) and…

January 25, 2025
Multi-Headed Cross Attention — By Hand

Multi-Headed Cross Attention — By Hand Hand computing a fundamental component of multimodal models Continue reading on Towards Data Science » Daniel Warfield Go to original source

January 25, 2025
Does It Matter That Online Experiments Interact?

Does It Matter That Online Experiments Interact? What interactions do, why they are just like any other change in the environment post-experiment, and some reassurance Photo by Uriel Soberanes on Unsplash Experiments do not run one at a time. At any moment, hundreds to thousands of experiments run on a mature website. The question comes up:…

January 25, 2025
Avoid These Easily Missed Mistakes in Machine Learning Workflows — Part 2

Avoid These Easily Missed Mistakes in Machine Learning Workflows — Part 2 Using Unavailable Data at Prediction Time and Mixing Magic Numbers with Real Numbers Continue reading on Towards Data Science » Thomas A Dorfer Go to original source

January 25, 2025
Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data

Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data arXiv:2501.13483v1 Announce Type: new Abstract: Neural amortized Bayesian inference (ABI) can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, neural ABI is not yet sufficiently robust for widespread and safe applicability. In particular, when performing inference on observations outside of the…

January 24, 2025
LITE: Efficiently Estimating Gaussian Probability of Maximality

LITE: Efficiently Estimating Gaussian Probability of Maximality arXiv:2501.13535v1 Announce Type: new Abstract: We consider the problem of computing the probability of maximality (PoM) of a Gaussian random vector, i.e., the probability for each dimension to be maximal. This is a key challenge in applications ranging from Bayesian optimization to reinforcement learning, where the PoM not…

January 24, 2025
Learning under Commission and Omission Event Outliers

Learning under Commission and Omission Event Outliers arXiv:2501.13599v1 Announce Type: new Abstract: Event stream is an important data format in real life. The events are usually expected to follow some regular patterns over time. However, the patterns could be contaminated by unexpected absences or occurrences of events. In this paper, we adopt the temporal point…

January 24, 2025
Bayesian Model Parameter Learning in Linear Inverse Problems with Application in EEG Focal Source Imaging

Bayesian Model Parameter Learning in Linear Inverse Problems with Application in EEG Focal Source Imaging arXiv:2501.13109v1 Announce Type: cross Abstract: Inverse problems can be described as limited-data problems in which the signal of interest cannot be observed directly. A physics-based forward model that relates the signal with the observations is typically needed. Unfortunately, unknown model…

January 24, 2025
A dimensionality reduction technique based on the Gromov-Wasserstein distance

A dimensionality reduction technique based on the Gromov-Wasserstein distance arXiv:2501.13732v1 Announce Type: new Abstract: Analyzing relationships between objects is a pivotal problem within data science. In this context, dimensionality reduction (DR) techniques are employed to generate smaller and more manageable data representations. This paper proposes a new method for dimensionality reduction, based on optimal transportation…

January 24, 2025
A Derivation and Application of Restricted Boltzmann Machines (2024 Nobel Prize)

A Derivation and Application of Restricted Boltzmann Machines (2024 Nobel Prize) Investigating Geoffrey Hinton’s Nobel Prize-winning work and building it from scratch using PyTorch One recipient of the 2024 Nobel Prize in Physics was Geoffrey Hinton for his contributions in the field of AI and machine learning. A lot of people know he worked on neural…

January 24, 2025
Apollo and Design Choices of Video Large Multimodal Models (LMMs)

Apollo and Design Choices of Video Large Multimodal Models (LMMs) Let’s Explore Major Design Choices from Meta’s Apollo Paper Continue reading on Towards Data Science » Matthew Gunton Go to original source

January 24, 2025
On a Time Crunch but Still Want to Learn to Develop Multi-Agent AI?

On a Time Crunch but Still Want to Learn to Develop Multi-Agent AI? These 3 starter projects only take a weekend (and a few cups of coffee, maybe) Continue reading on Towards Data Science » Thuwarakesh Murallie Go to original source

January 24, 2025
The Basics you Must Master Before Diving into Marketing & Product Analytics

The Basics you Must Master Before Diving into Marketing & Product Analytics Things that still confuse many Data Analysts Recently, I gave a presentation on a specific topic: how to investigate drop-offs in conversion funnels within the context of marketing and product analysis. What surprised me? The incredible engagement from the audience. The questions were varied…

January 24, 2025
The Solar Cycle(s): history, data analysis and trend forecasting.

The Solar Cycle(s): history, data analysis and trend forecasting. A brief article on the Solar Cycles, the history behind their observation, data analysis and time series forecasting for the incoming solar maximum in 2025–2026 and the next decades You have probably heard about the 11-year Solar Cycle…

January 24, 2025
Ultralow-dimensionality reduction for identifying critical transitions by spatial-temporal PCA

Ultralow-dimensionality reduction for identifying critical transitions by spatial-temporal PCA arXiv:2501.12582v1 Announce Type: new Abstract: Discovering dominant patterns and exploring dynamic behaviors especially critical state transitions and tipping points in high-dimensional time-series data are challenging tasks in study of real-world complex systems, which demand interpretable data representations to facilitate comprehension of both spatial and temporal information…

January 23, 2025
Sequential Change Point Detection via Denoising Score Matching

Sequential Change Point Detection via Denoising Score Matching arXiv:2501.12667v1 Announce Type: new Abstract: Sequential change-point detection plays a critical role in numerous real-world applications, where timely identification of distributional shifts can greatly mitigate adverse outcomes. Classical methods commonly rely on parametric density assumptions of pre- and post-change distributions, limiting their effectiveness for high-dimensional, complex data…

January 23, 2025
On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration

On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration arXiv:2501.12785v1 Announce Type: new Abstract: This paper tackles the efficiency and stability issues in learning from observations (LfO). We commence by investigating how reward functions and policies generalize in LfO. Subsequently, the built-in reinforcement learning (RL) approach in generative adversarial imitation from observation (GAIfO)…

January 23, 2025
Singular learning coefficients and efficiency in learning theory

Singular learning coefficients and efficiency in learning theory arXiv:2501.12747v1 Announce Type: new Abstract: Singular learning models with non-positive Fisher information matrices include neural networks, reduced-rank regression, Boltzmann machines, normal mixture models, and others. These models have been widely used in the development of learning machines. However, theoretical analysis is still in its early stages. In…

January 23, 2025
Fixed-Budget Change Point Identification in Piecewise Constant Bandits

Fixed-Budget Change Point Identification in Piecewise Constant Bandits arXiv:2501.12957v1 Announce Type: new Abstract: We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space $[0,1]$ and the learner’s aim is to locate the change point. Under the assumption of a fixed…

January 23, 2025
Harmonizing and Pooling Datasets for Health Research in R

Harmonizing and Pooling Datasets for Health Research in R R code to extract data from unique datasets and combine them in one harmonized dataset ready for seamless analysis Continue reading on Towards Data Science » Rodrigo M Carrillo Larco, MD, PhD Go to original source

January 23, 2025
Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code

Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code A comparison of two cutting-edge dynamic topic models solving consumer complaints classification exercise Continue reading on Towards Data Science » Petr Korab Go to original source

January 23, 2025
Optimize the dbt Doc Function with a CI

Optimize the dbt Doc Function with a CI How to set an automated check to improve your dbt documentation Image by the author (generated with ChatGPT) In large dbt projects, maintaining consistent and up-to-date documentation can be a challenge. Although dbt’s {{ doc() }} function allows you to store and reuse descriptions for the columns of…

January 23, 2025
How to Evaluate LLM Summarization

How to Evaluate LLM Summarization A practical and effective guide for evaluating AI summaries Image from Unsplash Summarization is one of the most practical and convenient tasks enabled by LLMs. However, compared to other LLM tasks like question-asking or classification, evaluating LLMs on summarization is far more challenging. And so I myself have neglected evals for…

January 23, 2025
How to Utilize ModernBERT and Synthetic Data for Robust Text Classification

How to Utilize ModernBERT and Synthetic Data for Robust Text Classification Learn how to fine-tune ModernBERT and create augmentations of text samples Continue reading on Towards Data Science » Eivind Kjosbakken Go to original source

January 23, 2025
Extension of Symmetrized Neural Network Operators with Fractional and Mixed Activation Functions

Extension of Symmetrized Neural Network Operators with Fractional and Mixed Activation Functions arXiv:2501.10496v1 Announce Type: new Abstract: We propose a novel extension to symmetrized neural network operators by incorporating fractional and mixed activation functions. This study addresses the limitations of existing models in approximating higher-order smooth functions, particularly in complex and high-dimensional spaces. Our framework…

January 22, 2025
Simulation of Random LR Fuzzy Intervals

Simulation of Random LR Fuzzy Intervals arXiv:2501.10482v1 Announce Type: new Abstract: Random fuzzy variables join the modeling of the impreciseness (due to their “fuzzy part”) and randomness. Statistical samples of such objects are widely used, and their direct, numerically effective generation is therefore necessary. Usually, these samples consist of triangular or trapezoidal fuzzy numbers. In…

January 22, 2025
Multi-Output Conformal Regression: A Unified Comparative Study with New Conformity Scores

Multi-Output Conformal Regression: A Unified Comparative Study with New Conformity Scores arXiv:2501.10533v1 Announce Type: new Abstract: Quantifying uncertainty in multivariate regression is essential in many real-world applications, yet existing methods for constructing prediction regions often face limitations such as the inability to capture complex dependencies, lack of coverage guarantees, or high computational cost. Conformal prediction…

January 22, 2025
DPERC: Direct Parameter Estimation for Mixed Data

DPERC: Direct Parameter Estimation for Mixed Data arXiv:2501.10540v1 Announce Type: new Abstract: The covariance matrix is a foundation in numerous statistical and machine-learning applications such as Principal Component Analysis, Correlation Heatmap, etc. However, missing values within datasets present a formidable obstacle to accurately estimating this matrix. While imputation methods offer one avenue for addressing this…

January 22, 2025
Model-Robust and Adaptive-Optimal Transfer Learning for Tackling Concept Shifts in Nonparametric Regression

Model-Robust and Adaptive-Optimal Transfer Learning for Tackling Concept Shifts in Nonparametric Regression arXiv:2501.10870v1 Announce Type: new Abstract: When concept shifts and sample scarcity are present in the target domain of interest, nonparametric regression learners often struggle to generalize effectively. The technique of transfer learning remedies these issues by leveraging data or pre-trained models from similar…

January 22, 2025
Large Language Models: A Short Introduction

Large Language Models: A Short Introduction And why you should care about LLMs Image by author. There’s an acronym you’ve probably heard non-stop for the past few years: LLM, which stands for Large Language Model. In this article we’re going to take a brief look at what LLMs are, why they’re an extremely exciting piece of technology, why…

January 22, 2025
Data-Driven Decision Making with Sentiment Analysis in R

Data-Driven Decision Making with Sentiment Analysis in R Leveraging the Quanteda, Textstem and Sentimentr Packages to Extract Customer Insights and Enhance Business Strategy. Devashree Madhugiri, Towards Data Science

January 22, 2025
LyRec: A Song Recommender That Reads Between the Lyrics

LyRec: A Song Recommender That Reads Between the Lyrics This is how I built an emotionally intelligent LLM-powered song recommendation system. Photo by David Pupăză on Unsplash Do you remember the last time you found yourself obsessing over a song? Maybe it was the raw emotion that resonated with you, or perhaps it was the lyrics…

January 22, 2025
Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT

Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT Mastering the art of fine-tuning: Learnings for training your own LLMs. (Image from Unsplash) This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs. It is practical in the…

January 22, 2025
Fighting Fraud Fairly: Upgrade Your AI Toolkit

Fighting Fraud Fairly: Upgrade Your AI Toolkit A practical approach to address bias in AI systems Photo by the author As sophisticated AI systems are increasingly used in decision-making, ensuring fairness has become a priority, with a growing need to prevent algorithms from disproportionately affecting vulnerable groups in sensitive areas like the justice or educational system. One…

January 22, 2025
Advancing AI Reasoning: Meta-CoT and System 2 Thinking

Advancing AI Reasoning: Meta-CoT and System 2 Thinking How Meta-CoT enhances system 2 reasoning for complex AI challenges. Kaushik Rajan, Towards Data Science

January 21, 2025
Modern Data And Application Engineering Breaks the Loss of Business Context

Modern Data And Application Engineering Breaks the Loss of Business Context Here’s how your data retains its business relevance as it travels through your enterprise. Bernd Wessely, Towards Data Science

January 21, 2025
Why LLMs Suck at ASCII Art

Why LLMs Suck at ASCII Art How being bad at art can be so dangerous Large Language Models have been doing a pretty good job of knocking down challenge after challenge in areas both expected and not. From writing poetry to generating entire websites from questionably… drawn images, these models seem almost unstoppable (and dire…

January 21, 2025
Building a Data Dashboard

Building a Data Dashboard Using the streamlit Python library. Thomas Reid, Towards Data Science

January 21, 2025
Why Generative-AI Apps’ Quality Often Sucks and What to Do About It

Why Generative-AI Apps’ Quality Often Sucks and What to Do About It How to get from PoCs to tested high-quality applications in production Image licensed from elements.envato.com, edit by Marcel Müller, 2025 The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient,…

January 21, 2025
SBAMDT: Bayesian Additive Decision Trees with Adaptive Soft Semi-multivariate Split Rules

SBAMDT: Bayesian Additive Decision Trees with Adaptive Soft Semi-multivariate Split Rules arXiv:2501.09900v1 Announce Type: new Abstract: Bayesian Additive Regression Trees [BART, Chipman et al., 2010] have gained significant popularity due to their remarkable predictive performance and ability to quantify uncertainty. However, standard decision tree models rely on recursive data splits at each decision node, using…

January 20, 2025
Tracking student skills real-time through a continuous-variable dynamic Bayesian network

Tracking student skills real-time through a continuous-variable dynamic Bayesian network arXiv:2501.10050v1 Announce Type: new Abstract: The field of Knowledge Tracing is focused on predicting the success rate of a student for a given skill. Modern methods like Deep Knowledge Tracing provide accurate estimates given enough data, but being based on neural networks they struggle to…

January 20, 2025
Statistical Inference for Sequential Feature Selection after Domain Adaptation

Statistical Inference for Sequential Feature Selection after Domain Adaptation arXiv:2501.09933v1 Announce Type: new Abstract: In high-dimensional regression, feature selection methods, such as sequential feature selection (SeqFS), are commonly used to identify relevant features. When data is limited, domain adaptation (DA) becomes crucial for transferring knowledge from a related source domain to a target domain, improving…

January 20, 2025
Contributions to the Decision Theoretic Foundations of Machine Learning and Robust Statistics under Weakly Structured Information

Contributions to the Decision Theoretic Foundations of Machine Learning and Robust Statistics under Weakly Structured Information arXiv:2501.10195v1 Announce Type: new Abstract: This habilitation thesis is cumulative and, therefore, is collecting and connecting research that I (together with several co-authors) have conducted over the last few years. Thus, the absolute core of the work is formed…

January 20, 2025
Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach

Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach arXiv:2501.10202v1 Announce Type: new Abstract: This paper introduces a novel method, Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE), which transforms a classifier into an abstaining classifier, offering provable protection against out-of-distribution and adversarial samples. The approach is based on a…

January 20, 2025
Weekly Entering & Transitioning – Thread 20 Jan, 2025 – 27 Jan, 2025

Weekly Entering & Transitioning – Thread 20 Jan, 2025 – 27 Jan, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

January 20, 2025
Anyone ever feel like working as a data scientist at Hinge?

Anyone ever feel like working as a data scientist at Hinge? Need to figure out what that damn algorithm is doing to keep me from getting matches lol. On a serious note I have read about some interesting algorithmic work at dating app companies. Any data scientists here ever worked for a dating app company?…

January 20, 2025
Influential Time-Series Forecasting Papers of 2023-2024: Part 1

Influential Time-Series Forecasting Papers of 2023-2024: Part 1 This article explores some of the latest advancements in time-series forecasting. You can find the article here. Edit: If you know of any other interesting papers, please share them in the comments. Submitted by /u/nkafr

January 20, 2025
Should I Try to postpone my FAANG Interview?

Should I Try to postpone my FAANG Interview? So I got contacted by a FAANG Recruiter for a Data Scientist Role I applied for a month and a half ago. But as I have started to prep, I realize I am not ready and need 1 to 2 months before I would be able to…

January 20, 2025
Where to Start when Data is Limited: A Guide

Where to Start when Data is Limited: A Guide Hey, I’ve put together an article on my thoughts and some research around how to get the most out of small datasets when performance requirements mean conventional analysis isn’t enough. It’s aimed at helping people get started with new projects who have already started with the…

January 20, 2025
The Concepts Data Professionals Should Know in 2025: Part 1

The Concepts Data Professionals Should Know in 2025: Part 1 From Data Lakehouses to Event-Driven Architecture: Master 12 data concepts and turn them into simple projects to stay ahead in IT. Sarah Lea, Towards Data Science

January 20, 2025
Zero-Shot Player Tracking in Tennis with Kalman Filtering

Zero-Shot Player Tracking in Tennis with Kalman Filtering Automated tennis tracking without labels: GroundingDINO, Kalman filtering, and court homography With the recent surge in sports tracking projects, many inspired by Skalski’s popular soccer tracking project, there’s been a notable shift towards using automated player tracking for sport hobbyists. Most of these approaches follow a…

January 20, 2025
How to Log Your Data with MLflow

How to Log Your Data with MLflow MLflow, MLOps, Data Science Mastering data logging in MLOps for your AI workflow Photo by Chris Liverani on Unsplash Preface Data is one of the most critical components of the machine learning process. In fact, the quality of the data used in training a model often determines the success or failure…

January 20, 2025
How to Pick Between Data Science, Data Analytics, Data Engineering, ML Engineering, and SW…

How to Pick Between Data Science, Data Analytics, Data Engineering, ML Engineering, and SW… Make the right choice for YOU. Marina Wyss – Gratitude Driven, Towards Data Science

January 20, 2025
Showcasing Soaring Wildfire Counts With Streamlit and Python: A Powerful Approach

Showcasing Soaring Wildfire Counts With Streamlit and Python: A Powerful Approach Analyzing historical wildfire trends in Canada with public data. John Loewen, PhD, Towards Data Science

January 19, 2025