Category: aimlds

  • Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code

    Valuable tips to reduce your DAGs’ parse time and save resources. Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows…

  • Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

    How much data does AI really need? TL;DR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. (Figure: best runs for “furthest-from-centroid” selection compared to the full dataset.) What if I told you…

  • Actually, Being a Data Scientist is Awesome

    Don’t let the doom and gloom get to you. By Marina Wyss – Gratitude Driven, Towards Data Science.

  • Navigating Data Science Content: Recognizing Common Pitfalls, Part 1

    Uncovering and correcting misconceptions in online data science content to help you learn more effectively. By Geremie Yeo, Towards Data Science.

  • Near-Optimal Algorithms for Omniprediction

    arXiv:2501.17205v1 (announce type: new). Abstract: Omnipredictors are simple prediction functions that encode loss-minimizing predictions with respect to a hypothesis class $H$, simultaneously for every loss function within a class of losses $L$. In this work, we give near-optimal learning algorithms for omniprediction, in both the online and offline settings. To begin,…

  • Testing Conditional Mean Independence Using Generative Neural Networks

    arXiv:2501.17345v1 (announce type: new). Abstract: Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional…

  • A Survey on Cluster-based Federated Learning

    arXiv:2501.17512v1 (announce type: new). Abstract: As the industrial and commercial use of Federated Learning (FL) has expanded, so has the need for optimized algorithms. In settings where FL clients’ data is non-independently and identically distributed (non-IID) and with highly heterogeneous distributions, the baseline FL approach seems to fall short.…

  • Exact characterization of $\epsilon$-Safe Decision Regions for exponential family distributions and Multi Cost SVM approximation

    arXiv:2501.17731v1 (announce type: new). Abstract: Probabilistic guarantees on the prediction of data-driven classifiers are necessary to define models that can be considered reliable. This is a key requirement for modern machine learning in which the goodness of a system is…

  • Sequential Learning of the Pareto Front for Multi-objective Bandits

    arXiv:2501.17513v1 (announce type: new). Abstract: We study the problem of sequential learning of the Pareto front in multi-objective multi-armed bandits. An agent is faced with K possible arms to pull. At each turn she picks one, and receives a vector-valued reward. When she thinks she has…

  • Great Books for AI Engineering

    10 books with valuable insights about AI science and engineering. A few years ago I recommended 21 books in Great Books for Data Science and Great Books for Data Science 2. Since then a lot has changed. While…

  • AI Ethics for the Everyday User — Why Should You Care?

    A beginner’s guide to understanding the importance of ethics in artificial intelligence. By Murtaza Ali, Towards Data Science.

  • NLP Illustrated, Part 3: Word2Vec

    An exhaustive and illustrated guide to Word2Vec with code! By Shreya Rao, Towards Data Science.

  • The Challenges and Realities of Being a Data Scientist

    Some harsh truths behind the field of data science. By Egor Howell, Towards Data Science.

  • Machine Learning Incidents in AdTech

    Challenges with deep learning in production. One of the biggest challenges I encountered in my career as a data scientist was migrating the core algorithms in a mobile AdTech platform from classic machine learning models to deep learning. I worked on a Demand Side Platform (DSP) for user…

  • Nonparametric Sparse Online Learning of the Koopman Operator

    arXiv:2501.16489v1 (announce type: new). Abstract: The Koopman operator provides a powerful framework for representing the dynamics of general nonlinear dynamical systems. Data-driven techniques to learn the Koopman operator typically assume that the chosen function space is closed under system dynamics. In this paper, we study the Koopman…

  • Variational Schrödinger Momentum Diffusion

    arXiv:2501.16675v1 (announce type: new). Abstract: The momentum Schrödinger Bridge (mSB) has emerged as a leading method for accelerating generative diffusion processes and reducing transport costs. However, the lack of simulation-free properties inevitably results in high training costs and affects scalability. To obtain a trade-off between transport properties and scalability, we introduce…

  • Exponential Family Attention

    arXiv:2501.16790v1 (announce type: new). Abstract: The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial,…

  • Towards the Generalization of Multi-view Learning: An Information-theoretical Analysis

    arXiv:2501.16768v1 (announce type: new). Abstract: Multiview learning has drawn widespread attention for its efficacy in leveraging cross-view consensus and complementarity information to achieve a comprehensive representation of data. While multi-view learning has undergone vigorous development and achieved remarkable success, the theoretical understanding of its generalization behavior…

  • Marginal and Conditional Importance Measures from Machine Learning Models and Their Relationship with Conditional Average Treatment Effect

    arXiv:2501.16988v1 (announce type: new). Abstract: Interpreting black-box machine learning models is challenging due to their strong dependence on data and inherently non-parametric nature. This paper reintroduces the concept of importance through “Marginal Variable Importance Metric” (MVIM), a model-agnostic…

  • Analyze Tornado Data with Python and GeoPandas

    Insights from NOAA’s public domain database. By Lee Vaughan, Towards Data Science.

  • Basics of Probability Notations

    Union, Intersection, Independence, Disjoint, Complement: Advanced Probability for Data Science Series (1). By Sunghyun Ahn, Towards Data Science.

  • How GenAI Tools Have Changed My Work as a Data Scientist

    An overview of the 4 use cases and 6 GenAI tools I use. By Jonte Dancker, Towards Data Science.

  • How to do Date calculations in DAX

    Moving back and forth in time is a common task for Time Intelligence in DAX. Let’s take a deeper look at how DATEADD() works. By Salvatore Cagliari, Towards Data Science.

  • Who is Right? The Dean or the Students?

    A cautionary tale on two perspectives on averaging. By Paolo Molignini, PhD, Towards Data Science.

  • ED-Filter: Dynamic Feature Filtering for Eating Disorder Classification

    arXiv:2501.14785v1 (announce type: new). Abstract: Eating disorders (ED) are critical psychiatric problems that have alarmed the mental health community. Mental health professionals are increasingly recognizing the utility of data derived from social media platforms such as Twitter. However, high dimensionality and extensive feature sets of Twitter data…

  • Explaining Categorical Feature Interactions Using Graph Covariance and LLMs

    arXiv:2501.14932v1 (announce type: new). Abstract: Modern datasets often consist of numerous samples with abundant features and associated timestamps. Analyzing such datasets to uncover underlying events typically requires complex statistical methods and substantial domain expertise. A notable example, and the primary data focus of this paper, is…

  • Median of Forests for Robust Density Estimation

    arXiv:2501.15157v1 (announce type: new). Abstract: Robust density estimation refers to the consistent estimation of the density function even when the data is contaminated by outliers. We find that existing forest density estimation at a certain point is inherently resistant to the outliers outside the cells containing the point,…

  • Conformal Inference of Individual Treatment Effects Using Conditional Density Estimates

    arXiv:2501.14933v1 (announce type: new). Abstract: In an era where diverse and complex data are increasingly accessible, the precise prediction of individual treatment effects (ITE) becomes crucial across fields such as healthcare, economics, and public policy. Current state-of-the-art approaches, while providing valid prediction intervals through Conformal…

  • A Review on Self-Supervised Learning for Time Series Anomaly Detection: Recent Advances and Open Challenges

    arXiv:2501.15196v1 (announce type: new). Abstract: Time series anomaly detection presents various challenges due to the sequential and dynamic nature of time-dependent data. Traditional unsupervised methods frequently encounter difficulties in generalization, often overfitting to known normal patterns observed during training and…

  • Building a Regression Model: Delivery Duration Prediction

    Building a Regression Model to Predict Delivery Durations: A Practical Guide. E2E walkthrough for approaching a regression modeling task. In this article, we’re going to walk through the process of building a regression model, from dataset cleaning and preparation to model training and evaluation. The specific regression task we will…

  • Beyond Causal Language Modeling

    A deep dive into “Not All Tokens Are What You Need for Pretraining”. A few days ago, I had the chance to present at a local reading group that focused on some of the most exciting and insightful papers from NeurIPS 2024. As a presenter, I selected a paper titled…

  • Build a Decision Tree in Polars from Scratch

    Explore decision trees with a Polars backend. Decision tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as sklearn,…

  • How to Implement Guardrails for Your AI Agents with CrewAI

    LLM agents are non-deterministic by nature: implement proper guardrails for your AI application. By Alessandro Romano, Towards Data Science.

  • Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus

    Why descriptive statistics aren’t enough and plotting your data is always essential. By Maria Mouschoutzi, PhD, Towards Data Science.

  • Distributionally Robust Coreset Selection under Covariate Shift

    arXiv:2501.14253v1 (announce type: new). Abstract: Coreset selection, which involves selecting a small subset from an existing training dataset, is an approach to reducing training data, and various approaches have been proposed for this method. In practical situations where these methods are employed, it is often the case that…

  • EFiGP: Eigen-Fourier Physics-Informed Gaussian Process for Inference of Dynamic Systems

    arXiv:2501.14107v1 (announce type: new). Abstract: Parameter estimation and trajectory reconstruction for data-driven dynamical systems governed by ordinary differential equations (ODEs) are essential tasks in fields such as biology, engineering, and physics. These inverse problems (estimating ODE parameters from observational data) are particularly challenging…

  • Statistical Verification of Linear Classifiers

    arXiv:2501.14430v1 (announce type: new). Abstract: We propose a homogeneity test closely related to the concept of linear separability between two samples. Using the test one can answer the question whether a linear classifier is merely “random” or effectively captures differences between two classes. We focus on establishing upper bounds for…

  • coverforest: Conformal Predictions with Random Forest in Python

    arXiv:2501.14570v1 (announce type: new). Abstract: Conformal prediction provides a framework for uncertainty quantification, specifically in the forms of prediction intervals and sets with distribution-free guaranteed coverage. While recent cross-conformal techniques such as CV+ and Jackknife+-after-bootstrap achieve better data efficiency than traditional split conformal methods, they incur substantial…

  • Optimal Transport Barycenter via Nonconvex-Concave Minimax Optimization

    arXiv:2501.14635v1 (announce type: new). Abstract: The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the unregularized barycenter for discretized probability distributions on point clouds is a challenging task when…

  • Weekly Entering & Transitioning – Thread 27 Jan, 2025 – 03 Feb, 2025

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

  • Your Neural Network Can’t Explain This. TMLE to the Rescue!

    Targeted Maximum Likelihood Estimation (TMLE) helps you explain patterns where other techniques fall short. By Ari Joury, PhD, Towards Data Science.

  • Optimising Budgets With Marketing Mix Models In Python

    Part 3 of a hands-on guide to help you master MMM in pymc. What is this series about? Welcome to part 3 of my series on marketing mix modelling (MMM), a hands-on guide to help you master MMM. Throughout this series, we’ll cover key…

  • [Official] 2024 End of Year Salary Sharing thread

    This is the official thread for sharing your current salaries (or recent offers). See last year’s Salary Sharing thread here. There was also an unofficial one from an hour ago here. Please only post salaries/offers if you’re including hard numbers, but feel free to use a throwaway…

  • How Cheap Mortgages Transformed Poland’s Real Estate Market

    Insights from a synthetic control group. By Lukasz Szubelak, Towards Data Science.

  • Choosing Classification Model Evaluation Criteria

    Is Recall / Precision better than Sensitivity / Specificity? By Viyaleta Apgar, Towards Data Science.

  • Understanding Emergent Capabilities in LLMs: Lessons from Biological Systems

    How the fundamental laws of natural systems help explain AI’s unexpected abilities. By Javier Marin, Towards Data Science.

  • Deep Learning for Click Prediction in Mobile AdTech

    Machine Learning for Real-Time Bidding. The past few years were a revolution for the mobile advertising and gaming industries, with the broad adoption of neural networks for advertising tasks, including click prediction. This migration occurred prior to the success of Large Language Models (LLMs) and…

  • Multi-Headed Cross Attention — By Hand

    Hand computing a fundamental component of multimodal models. By Daniel Warfield, Towards Data Science.

  • Does It Matter That Online Experiments Interact?

    What interactions do, why they are just like any other change in the environment post-experiment, and some reassurance. Experiments do not run one at a time. At any moment, hundreds to thousands of experiments run on a mature website. The question comes up:…

  • Avoid These Easily Missed Mistakes in Machine Learning Workflows — Part 2

    Using Unavailable Data at Prediction Time and Mixing Magic Numbers with Real Numbers. By Thomas A Dorfer, Towards Data Science.

  • Robust Amortized Bayesian Inference with Self-Consistency Losses on Unlabeled Data

    arXiv:2501.13483v1 (announce type: new). Abstract: Neural amortized Bayesian inference (ABI) can solve probabilistic inverse problems orders of magnitude faster than classical methods. However, neural ABI is not yet sufficiently robust for widespread and safe applicability. In particular, when performing inference on observations outside of the…

  • LITE: Efficiently Estimating Gaussian Probability of Maximality

    arXiv:2501.13535v1 (announce type: new). Abstract: We consider the problem of computing the probability of maximality (PoM) of a Gaussian random vector, i.e., the probability for each dimension to be maximal. This is a key challenge in applications ranging from Bayesian optimization to reinforcement learning, where the PoM not…

  • Learning under Commission and Omission Event Outliers

    arXiv:2501.13599v1 (announce type: new). Abstract: Event stream is an important data format in real life. The events are usually expected to follow some regular patterns over time. However, the patterns could be contaminated by unexpected absences or occurrences of events. In this paper, we adopt the temporal point…

  • Bayesian Model Parameter Learning in Linear Inverse Problems with Application in EEG Focal Source Imaging

    arXiv:2501.13109v1 (announce type: cross). Abstract: Inverse problems can be described as limited-data problems in which the signal of interest cannot be observed directly. A physics-based forward model that relates the signal with the observations is typically needed. Unfortunately, unknown model…

  • A dimensionality reduction technique based on the Gromov-Wasserstein distance

    arXiv:2501.13732v1 (announce type: new). Abstract: Analyzing relationships between objects is a pivotal problem within data science. In this context, dimensionality reduction (DR) techniques are employed to generate smaller and more manageable data representations. This paper proposes a new method for dimensionality reduction, based on optimal transportation…

  • A Derivation and Application of Restricted Boltzmann Machines (2024 Nobel Prize)

    Investigating Geoffrey Hinton’s Nobel Prize-winning work and building it from scratch using PyTorch. One recipient of the 2024 Nobel Prize in Physics was Geoffrey Hinton for his contributions in the field of AI and machine learning. A lot of people know he worked on neural…

  • Apollo and Design Choices of Video Large Multimodal Models (LMMs)

    Let’s Explore Major Design Choices from Meta’s Apollo Paper. By Matthew Gunton, Towards Data Science.

  • On a Time Crunch but Still Want to Learn to Develop Multi-Agent AI?

    These 3 starter projects only take a weekend (and a few cups of coffee, maybe). By Thuwarakesh Murallie, Towards Data Science.

  • The Basics you Must Master Before Diving into Marketing & Product Analytics

    Things that still confuse many Data Analysts. Recently, I gave a presentation on a specific topic: how to investigate drop-offs in conversion funnels within the context of marketing and product analysis. What surprised me? The incredible engagement from the audience. The questions were varied…

  • The Solar Cycle(s): history, data analysis and trend forecasting.

    A brief article on the Solar Cycles, the history behind their observation, data analysis and time series forecasting for the incoming solar maximum in 2025–2026 and the next decades. You have probably heard about the 11-year Solar Cycle…

  • Ultralow-dimensionality reduction for identifying critical transitions by spatial-temporal PCA

    arXiv:2501.12582v1 (announce type: new). Abstract: Discovering dominant patterns and exploring dynamic behaviors especially critical state transitions and tipping points in high-dimensional time-series data are challenging tasks in study of real-world complex systems, which demand interpretable data representations to facilitate comprehension of both spatial and temporal information…

  • Sequential Change Point Detection via Denoising Score Matching

    arXiv:2501.12667v1 (announce type: new). Abstract: Sequential change-point detection plays a critical role in numerous real-world applications, where timely identification of distributional shifts can greatly mitigate adverse outcomes. Classical methods commonly rely on parametric density assumptions of pre- and post-change distributions, limiting their effectiveness for high-dimensional, complex data…

  • Singular leaning coefficients and efficiency in learning theory

    arXiv:2501.12747v1 (announce type: new). Abstract: Singular learning models with non-positive Fisher information matrices include neural networks, reduced-rank regression, Boltzmann machines, normal mixture models, and others. These models have been widely used in the development of learning machines. However, theoretical analysis is still in its early stages. In…

  • On Generalization and Distributional Update for Mimicking Observations with Adequate Exploration

    arXiv:2501.12785v1 (announce type: new). Abstract: This paper tackles the efficiency and stability issues in learning from observations (LfO). We commence by investigating how reward functions and policies generalize in LfO. Subsequently, the built-in reinforcement learning (RL) approach in generative adversarial imitation from observation (GAIfO)…

  • Fixed-Budget Change Point Identification in Piecewise Constant Bandits

    arXiv:2501.12957v1 (announce type: new). Abstract: We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space $[0,1]$ and the learner’s aim is to locate the change point. Under the assumption of a fixed…

  • Harmonizing and Pooling Datasets for Health Research in R

    R code to extract data from unique datasets and combine them in one harmonized dataset ready for seamless analysis. By Rodrigo M Carrillo Larco, MD, PhD, Towards Data Science.

  • Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code

    A comparison of two cutting-edge dynamic topic models solving a consumer complaints classification exercise. By Petr Korab, Towards Data Science.

  • Optimize the dbt Doc Function with a CI

    How to set an automated check to improve your dbt documentation. In large dbt projects, maintaining consistent and up-to-date documentation can be a challenge. Although dbt’s {{ doc() }} function allows you to store and reuse descriptions for the columns of…

  • How to Evaluate LLM Summarization

    A practical and effective guide for evaluating AI summaries. Summarization is one of the most practical and convenient tasks enabled by LLMs. However, compared to other LLM tasks like question-asking or classification, evaluating LLMs on summarization is far more challenging. And so I myself have neglected evals for…

  • How to Utilize ModernBERT and Synthetic Data for Robust Text Classification

    Learn how to fine-tune ModernBERT and create augmentations of text samples. By Eivind Kjosbakken, Towards Data Science.

  • Extension of Symmetrized Neural Network Operators with Fractional and Mixed Activation Functions

    arXiv:2501.10496v1 (announce type: new). Abstract: We propose a novel extension to symmetrized neural network operators by incorporating fractional and mixed activation functions. This study addresses the limitations of existing models in approximating higher-order smooth functions, particularly in complex and high-dimensional spaces. Our framework…

  • Simulation of Random LR Fuzzy Intervals

    arXiv:2501.10482v1 (announce type: new). Abstract: Random fuzzy variables join the modeling of the impreciseness (due to their “fuzzy part”) and randomness. Statistical samples of such objects are widely used, and their direct, numerically effective generation is therefore necessary. Usually, these samples consist of triangular or trapezoidal fuzzy numbers. In…

  • Multi-Output Conformal Regression: A Unified Comparative Study with New Conformity Scores

    arXiv:2501.10533v1 (announce type: new). Abstract: Quantifying uncertainty in multivariate regression is essential in many real-world applications, yet existing methods for constructing prediction regions often face limitations such as the inability to capture complex dependencies, lack of coverage guarantees, or high computational cost. Conformal prediction…

  • DPERC: Direct Parameter Estimation for Mixed Data

    arXiv:2501.10540v1 (announce type: new). Abstract: The covariance matrix is a foundation in numerous statistical and machine-learning applications such as Principal Component Analysis, Correlation Heatmap, etc. However, missing values within datasets present a formidable obstacle to accurately estimating this matrix. While imputation methods offer one avenue for addressing this…

  • Model-Robust and Adaptive-Optimal Transfer Learning for Tackling Concept Shifts in Nonparametric Regression

    arXiv:2501.10870v1 (announce type: new). Abstract: When concept shifts and sample scarcity are present in the target domain of interest, nonparametric regression learners often struggle to generalize effectively. The technique of transfer learning remedies these issues by leveraging data or pre-trained models from similar…

  • Large Language Models: A Short Introduction

    And why you should care about LLMs. There’s an acronym you’ve probably heard non-stop for the past few years: LLM, which stands for Large Language Model. In this article we’re going to take a brief look at what LLMs are, why they’re an extremely exciting piece of technology, why…

  • Data-Driven Decision Making with Sentiment Analysis in R

    Data-Driven Decision Making with Sentiment Analysis in R Leveraging the Quanteda, Textstem and Sentimentr Packages to Extract Customer Insights and Enhance Business Strategy. By Devashree Madhugiri, on Towards Data Science.

  • LyRec: A Song Recommender That Reads Between the Lyrics

    LyRec: A Song Recommender That Reads Between the Lyrics This is how I built an emotionally intelligent LLM-powered song recommendation system. Do you remember the last time you found yourself obsessing over a song? Maybe it was the raw emotion that resonated with you, or perhaps it was the lyrics…

  • Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT

    Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT Mastering the art of fine-tuning: Learnings for training your own LLMs. (Image from Unsplash) This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs. It is practical in the…

  • Fighting Fraud Fairly: Upgrade Your AI Toolkit

    Fighting Fraud Fairly: Upgrade Your AI Toolkit A practical approach to address bias in AI systems. As sophisticated AI systems are increasingly used in decision-making, ensuring fairness has become a priority, with a growing need to prevent algorithms from disproportionately affecting vulnerable groups in sensitive areas like the justice or educational system. One…

  • Advancing AI Reasoning: Meta-CoT and System 2 Thinking

    Advancing AI Reasoning: Meta-CoT and System 2 Thinking How Meta-CoT enhances system 2 reasoning for complex AI challenges. By Kaushik Rajan, on Towards Data Science.

  • Modern Data And Application Engineering Breaks the Loss of Business Context

    Modern Data And Application Engineering Breaks the Loss of Business Context Here’s how your data retains its business relevance as it travels through your enterprise. By Bernd Wessely, on Towards Data Science.

  • Why LLMs Suck at ASCII Art

    Why LLMs Suck at ASCII Art How being bad at art can be so dangerous Large Language Models have been doing a pretty good job of knocking down challenge after challenge in areas both expected and not. From writing poetry to generating entire websites from questionably… drawn images, these models seem almost unstoppable (and dire…

  • Building a Data Dashboard

    Building a Data Dashboard Using the streamlit Python library. By Thomas Reid, on Towards Data Science.

  • Why Generative-AI Apps’ Quality Often Sucks and What to Do About It

    Why Generative-AI Apps’ Quality Often Sucks and What to Do About It How to get from PoCs to tested high-quality applications in production. The generative AI hype has rolled through the business world in the past two years. This technology can make business process executions more efficient,…

  • SBAMDT: Bayesian Additive Decision Trees with Adaptive Soft Semi-multivariate Split Rules

    SBAMDT: Bayesian Additive Decision Trees with Adaptive Soft Semi-multivariate Split Rules arXiv:2501.09900v1 Announce Type: new Abstract: Bayesian Additive Regression Trees [BART, Chipman et al., 2010] have gained significant popularity due to their remarkable predictive performance and ability to quantify uncertainty. However, standard decision tree models rely on recursive data splits at each decision node, using…

  • Tracking student skills real-time through a continuous-variable dynamic Bayesian network

    Tracking student skills real-time through a continuous-variable dynamic Bayesian network arXiv:2501.10050v1 Announce Type: new Abstract: The field of Knowledge Tracing is focused on predicting the success rate of a student for a given skill. Modern methods like Deep Knowledge Tracing provide accurate estimates given enough data, but being based on neural networks they struggle to…

  • Statistical Inference for Sequential Feature Selection after Domain Adaptation

    Statistical Inference for Sequential Feature Selection after Domain Adaptation arXiv:2501.09933v1 Announce Type: new Abstract: In high-dimensional regression, feature selection methods, such as sequential feature selection (SeqFS), are commonly used to identify relevant features. When data is limited, domain adaptation (DA) becomes crucial for transferring knowledge from a related source domain to a target domain, improving…

  • Contributions to the Decision Theoretic Foundations of Machine Learning and Robust Statistics under Weakly Structured Information

    Contributions to the Decision Theoretic Foundations of Machine Learning and Robust Statistics under Weakly Structured Information arXiv:2501.10195v1 Announce Type: new Abstract: This habilitation thesis is cumulative and, therefore, is collecting and connecting research that I (together with several co-authors) have conducted over the last few years. Thus, the absolute core of the work is formed…

  • Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach

    Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach arXiv:2501.10202v1 Announce Type: new Abstract: This paper introduces a novel method, Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE), which transforms a classifier into an abstaining classifier, offering provable protection against out-of-distribution and adversarial samples. The approach is based on a…

  • Weekly Entering & Transitioning – Thread 20 Jan, 2025 – 27 Jan, 2025

    Weekly Entering & Transitioning – Thread 20 Jan, 2025 – 27 Jan, 2025 Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: Learning resources (e.g. books, tutorials, videos) Traditional education (e.g. schools, degrees, electives) Alternative education (e.g.…

  • Anyone ever feel like working as a data scientist at Hinge?

    Anyone ever feel like working as a data scientist at Hinge? Need to figure out what that damn algorithm is doing to keep me from getting matches lol. On a serious note I have read about some interesting algorithmic work at dating app companies. Any data scientists here ever worked for a dating app company?…

  • Influential Time-Series Forecasting Papers of 2023-2024: Part 1

    Influential Time-Series Forecasting Papers of 2023-2024: Part 1 This article explores some of the latest advancements in time-series forecasting. You can find the article here. Edit: If you know of any other interesting papers, please share them in the comments. Submitted by /u/nkafr.

  • Should I Try to postpone my FAANG Interview?

    Should I Try to postpone my FAANG Interview? So I got contacted by a FAANG Recruiter for a Data Scientist Role I applied for a month and a half ago. But as I have started to prep, I realize I am not ready and need 1 to 2 months before I would be able to…

  • Where to Start when Data is Limited: A Guide

    Where to Start when Data is Limited: A Guide Hey, I’ve put together an article on my thoughts and some research around how to get the most out of small datasets when performance requirements mean conventional analysis isn’t enough. It’s aimed at helping people get started with new projects who have already started with the…

  • The Concepts Data Professionals Should Know in 2025: Part 1

    The Concepts Data Professionals Should Know in 2025: Part 1 From Data Lakehouses to Event-Driven Architecture: master 12 data concepts and turn them into simple projects to stay ahead in IT. By Sarah Lea, on Towards Data Science.

  • Zero-Shot Player Tracking in Tennis with Kalman Filtering

    Zero-Shot Player Tracking in Tennis with Kalman Filtering Automated tennis tracking without labels: GroundingDINO, Kalman filtering, and court homography. With the recent surge in sports tracking projects, many inspired by Skalski’s popular soccer tracking project, there’s been a notable shift towards using automated player tracking for sport hobbyists. Most of these approaches follow a…

  • How to Log Your Data with MLflow

    How to Log Your Data with MLflow Mastering data logging in MLOps for your AI workflow. Preface: Data is one of the most critical components of the machine learning process. In fact, the quality of the data used in training a model often determines the success or failure…

  • How to Pick Between Data Science, Data Analytics, Data Engineering, ML Engineering, and SW…

    How to Pick Between Data Science, Data Analytics, Data Engineering, ML Engineering, and SW… Make the right choice for YOU. By Marina Wyss – Gratitude Driven, on Towards Data Science.

  • Showcasing Soaring Wildfire Counts With Streamlit and Python: A Powerful Approach

    Showcasing Soaring Wildfire Counts With Streamlit and Python: A Powerful Approach Analyzing historical wildfire trends in Canada with public data. By John Loewen, PhD, on Towards Data Science.