Category: aimlds

  • Contrastive representations of high-dimensional, structured treatments

    arXiv:2411.19245v1 Announce Type: new Abstract: Estimating causal effects is vital for decision making. In standard causal effect estimation, treatments are usually binary- or continuous-valued. However, in many important real-world settings, treatments can be structured, high-dimensional objects, such as text, video, or audio. This provides a challenge to traditional causal…

  • Weekly Entering & Transitioning – Thread 02 Dec, 2024 – 09 Dec, 2024

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

  • F5-TTS is highly underrated for Audio Cloning!

    submitted by /u/mehul_gupta1997

  • Daily averaged time series comparison - Linking plankton and aerosol emissions?

    Hi everyone, so we have this dataset of daily averaged phytoplankton time series over a full year: coccolithophores, chlorophytes, cyanobacteria, diatoms, dinoflagellates, Phaeocystis, zooplankton. Then we have atmospheric measurements on the same time intervals of a few aerosol species: methanesulphonic acid, carboxylic acids, aliphatics, sulphates,…

  • Need help gathering data

    Hello! I’m currently analysing data from politicians across the world and I would like to know if there’s a database with data like years in charge, studies they had, age, gender and some other relevant topics. Please, if you have any links I’ll be glad to check them all. *Need help,…

  • Feature creation out of two features.

    I have been working on a project that tries to identify interactions between variables. What is a good way to capture these interactions by creating features? What are good mathematical expressions to capture interaction beyond multiplication and division? Note that I have nulls and cannot change that. submitted…

  • Smaller is smarter

    Concerns about the environmental impacts of Large Language Models (LLMs) are growing. Although detailed information about the actual costs of LLMs can be difficult to find, let’s attempt to gather some facts to understand the scale. Since comprehensive data on ChatGPT-4 is not readily available, we can consider Llama 3.1…

  • Why “Statistical Significance” Is Pointless

    Here’s a better framework for data-driven decision-making. By Samuele Mazzanti, Towards Data Science.

  • The Lead, Shadow, and Sparring Roles in New Data Settings

    From data engineer to domain expert: what it takes to build a new data platform. By Marina Tosic, Towards Data Science.

  • How to Solve a Simple Problem With Machine Learning

    A technical walkthrough of lesson one. By Oscar Leo, Towards Data Science.

  • When Not to Use the Streamlit AgGrid Component

    Streamlit-AgGrid is amazing. But there are 2 scenarios where its use is not recommended. By Jose Parreño, Towards Data Science.

  • Making News Recommendations Explainable with Large Language Models

    A prompt-based experiment to improve both accuracy and transparent reasoning in content personalization. Deliver relevant content to readers at the right time. At DER SPIEGEL, we are continually exploring ways to improve how we recommend news articles to our readers. In our latest (offline) experiment,…

  • Grokking Behavioral Interviews

    Master the art of behavioral interviews and land your dream job. By Mina Ghashami, Towards Data Science.

  • Model Validation Techniques, Explained: A Visual Guide with Code Examples

    MODEL EVALUATION & OPTIMIZATION. 12 must-know methods to validate your machine learning. Every day, machines make millions of predictions, from detecting objects in photos to helping doctors find diseases. But before trusting these predictions, we need to know if they’re any good. After all, no one would…

  • Dunder Methods: The Hidden Gems of Python

    Real-world examples on how actively using special methods can simplify coding and improve readability. Dunder methods, though possibly a basic topic in Python, are something I have often noticed being understood only superficially, even by people who have been coding for quite some time. Disclaimer: This is a forgivable…

  • How Did Open Food Facts Fix OCR-Extracted Ingredients Using Open-Source LLMs?

    Delve into an end-to-end Machine Learning project to improve the quality of the Open Food Facts database. Open Food Facts’ purpose is to create the largest open-source food database in the world. To this day, it has collected over 3 million products…

  • Effortless Data Handling: Find Variables Across Multiple Data Files with R

    A practical solution with code and workflow. Lost in a maze of datasets and endless data dictionaries? Say goodbye to tedious variable hunting! Discover how to quickly identify and extract the variables you need from multiple SAS files using two simple R functions. Streamline your…

  • Why Internal Company Chatbots Fail and How to Use Generative AI in Enterprise with Impact

    Start with the problem and not with the solution. The most common disillusion that many organizations have is the following: They get excited about generative AI with ChatGPT or Microsoft Co-Pilot, read some…

  • Think you Know Excel? Take Your Analytics Skills to the Next Level with Power Query!

    5 practical use cases that prove Power Query is worth exploring. I have a confession to make: I’ve been living under a rock 🪨. Not literally, but how else can I explain not discovering Power Query in Excel until now? Imagine…

  • Water Cooler Small Talk: Simpson’s Paradox

    Is your data tricking you? What can you do about it? By Maria Mouschoutzi, PhD, Towards Data Science.

  • The Intuition behind Concordance Index — Survival Analysis

    Ranking accuracy versus absolute accuracy. “Be thankful for what you have. Be fearless for what you want.” How long would you keep your gym membership before you decide to cancel it? Or Netflix, if you are a series…

  • Complete MLOPS Cycle for a Computer Vision Project

    These days, we encounter (and maybe produce on our own) many computer vision projects, where AI is the hottest topic for new technologies… By Yağmur Çiğdem Aktaş, Towards Data Science.

  • A quick guide to Network Science

    For those who would like to learn about complex connections, from theory to practice in Python. By Milan Janosov, Towards Data Science.

  • The Most Expensive Data Science Mistake I’ve Witnessed in My Career

    Why true success in machine learning goes beyond optimizing a single metric. By Claudia Ng, Towards Data Science.

  • Five Reasons You Cannot Afford Not Knowing Probability Proportional to Size (PPS) Sampling

    Data Science. Simple Random Sampling (SRS) works, but if you do not know Probability Proportional to Size (PPS) sampling, you risk making some critical statistical mistakes. Learn why, when, and how you can use PPS sampling here!…

  • On the ERM Principle in Meta-Learning

    arXiv:2411.17898v1 Announce Type: new Abstract: Classic supervised learning involves algorithms trained on $n$ labeled examples to produce a hypothesis $h \in \mathcal{H}$ aimed at performing well on unseen examples. Meta-learning extends this by training across $n$ tasks, with $m$ examples per task, producing a hypothesis class $\mathcal{H}$ within some…

  • A Flexible Defense Against the Winner’s Curse

    arXiv:2411.18569v1 Announce Type: new Abstract: Across science and policy, decision-makers often need to draw conclusions about the best candidate among competing alternatives. For instance, researchers may seek to infer the effectiveness of the most successful treatment or determine which demographic group benefits most from a specific treatment. Similarly,…

  • Isometry pursuit

    arXiv:2411.18502v1 Announce Type: new Abstract: Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying…

  • Functional relevance based on the continuous Shapley value

    arXiv:2411.18575v1 Announce Type: new Abstract: The presence of Artificial Intelligence (AI) in our society is increasing, which brings with it the need to understand the behaviour of AI mechanisms, including machine learning predictive algorithms fed with tabular data, text, or images, among other types of data. This…

  • When Is Heterogeneity Actionable for Personalization?

    arXiv:2411.16552v1 Announce Type: cross Abstract: Targeting and personalization policies can be used to improve outcomes beyond the uniform policy that assigns the best performing treatment in an A/B test to everyone. Personalization relies on the presence of heterogeneity of treatment effects, yet, as we show in this paper, heterogeneity…

  • A Story of Long Tails: Why Uncertainty in Marketing Mix Modelling is Important

    “Details matter. It’s worth waiting to get it right.” (Steve Jobs) By Javier Marin, Towards Data Science.

  • How to Transition from Engineering to Data Science

    AI for engineers: experience of an engineering graduate. By Dan Pietrow, Towards Data Science.

  • How to Prune LLaMA 3.2 and Similar Large Language Models

    This article explores a structured pruning technique for state-of-the-art models that use a GLU architecture, enabling the creation of… By Pere Martra, Towards Data Science.

  • Level Up Your Coding Skills with Python Threading

    Learn how to use queues, daemon threads, and events in a Machine Learning project. By Marcello Politi, Towards Data Science.

  • How to Develop an Effective AI-Powered Legal Assistant

    Create a machine-learning-based search into legal decisions. By Eivind Kjosbakken, Towards Data Science.

  • 170  |  Formalizing Design with Gabrielle Mérite and Alan Wilson

    Data design systems and styleguides are currently a huge trend in the data design world. Moritz is joined by Gabrielle Mérite and Alan Wilson and together we exchange experiences in this emerging space, from designing dataviz components as part of Adobe Spectrum, the styleguide for Deloitte’s Insights…

  • 169  |  Data Conversations with Vidya Setlur

    We have Vidya Setlur on the show to talk about the role language and natural language processing (NLP) play in data visualization and analytics. Vidya is the director of research at Tableau and has a background in natural language processing and visualization. She is one of the main drivers behind…

  • 168  |  Highlights from IEEE VIS’22 with Tamara Munzner

    Finally, this year we managed to record another classic episode from the IEEE VIS Conference (we recorded a total of 10 with this one!). We have Data Stories’ friend Prof. Tamara Munzner with us to talk about the conference and to highlight a few things she picked from…

  • 167  |  Visualization and Statistics with Andrew Gelman and Jessica Hullman

    In this new episode, we talk about the interplay between statistics and data visualization. We do that with Andrew Gelman, Professor of Statistics and Political Science at Columbia University, and Jessica Hullman, Professor of Computer Science at Northwestern University. Andrew started the popular blog “Statistical Modeling,…

  • 166  |  Catching up with Amanda Makulec

    Hey all, we are back! In this episode, we have Amanda Makulec to catch up on what happened during this whole period of time. Amanda is a public health and data visualization expert and she is the Executive Director of the Data Visualization Society. In the episode, we talk about…

  • What’s going on everybody?

    sentdex

  • Visualizing Neural Network Internals

    sentdex

  • Building an LLM fine-tuning Dataset

    sentdex

  • Getting Back on Grid

    sentdex

  • Open Source AI Inference API w/ Together

    sentdex

  • Conformalised Conditional Normalising Flows for Joint Prediction Regions in time series

    arXiv:2411.17042v1 Announce Type: new Abstract: Conformal Prediction offers a powerful framework for quantifying uncertainty in machine learning models, enabling the construction of prediction sets with finite-sample validity guarantees. While easily adaptable to non-probabilistic models, applying conformal prediction to probabilistic generative models, such as Normalising…

  • Fast, Precise Thompson Sampling for Bayesian Optimization

    arXiv:2411.17071v1 Announce Type: new Abstract: Thompson sampling (TS) has optimal regret and excellent empirical performance in multi-armed bandit problems. Yet, in Bayesian optimization, TS underperforms popular acquisition functions (e.g., EI, UCB). TS samples arms according to the probability that they are optimal. A recent algorithm, P-Star Sampler (PSS),…

  • Spatio-Temporal Conformal Prediction for Power Outage Data

    arXiv:2411.17099v1 Announce Type: new Abstract: In recent years, increasingly unpredictable and severe global weather patterns have frequently caused long-lasting power outages. Building resilience, the ability to withstand, adapt to, and recover from major disruptions, has become crucial for the power industry. To enable rapid recovery, accurately predicting future…

  • Training a neural network for data reduction and better generalization

    arXiv:2411.17180v1 Announce Type: new Abstract: The motivation for sparse learners is to compress the inputs (features) by selecting only the ones needed for good generalization. Linear models with LASSO-type regularization achieve this by setting the weights of irrelevant features to zero, effectively identifying and ignoring…

  • A Generalized Unified Skew-Normal Process with Neural Bayes Inference

    arXiv:2411.17400v1 Announce Type: new Abstract: In recent decades, statisticians have been increasingly encountering spatial data that exhibit non-Gaussian behaviors such as asymmetry and heavy-tailedness. As a result, the assumptions of symmetry and fixed tail weight in Gaussian processes have become restrictive and may fail to capture…

  • Weekly Entering & Transitioning – Thread 25 Nov, 2024 – 02 Dec, 2024

    Welcome to this week’s entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: learning resources (e.g. books, tutorials, videos); traditional education (e.g. schools, degrees, electives); alternative education (e.g.…

  • Just spent the afternoon chatting with ChatGPT about a work problem. Now I am a convert.

    I have to build an optimization algorithm in a domain I have not worked in before (price-sensitivity-based revenue optimization). Well, instead of googling around, I asked ChatGPT, which we do have available at work. And it was…

  • Should I try to become a Data scientist or AI engineer

    Background: I’m a 25M with 2.5 years of experience as an analyst. (Soon enrolling in a master’s program in CS.) There are a few career possibilities for me, but I’m confused as to whether I should try to become a general data scientist or ai…

  • Have you ever presented an analysis or shipped a model just because someone demanded it, even when you knew it was wrong, just to save your ass?

    This has been quite common in my career. Execs demand a model X, we barely have good data to build it, nor does the model turn out well, but telling…

  • I Wrote a Guide to Simulation in Python with SimPy

    Hi folks, I wrote a guide on discrete-event simulation with SimPy, designed to help you learn how to build simulations using Python. Kind of like the official documentation but on steroids. I have used SimPy personally in my own career for over a decade, it…

  • Neuromorphic Computing — an Edgier, Greener AI

    Why computer hardware and AI algorithms are being reinvented using inspiration from the brain. Neuromorphic computing might not just help bring AI to the edge, but also reduce carbon emissions at data centers. There are periodic proclamations of the coming neuromorphic computing…

  • NLP Illustrated, Part 2: Word Embeddings

    An illustrated and intuitive guide to word embeddings. By Shreya Rao, Towards Data Science.

  • Addressing Missing Data

    Understand missing data patterns (MCAR, MNAR, MAR) for better model performance with Missingno. By Gizem Kaya, Towards Data Science.

  • Optimizing Transformer Models for Variable-Length Input Sequences

    How PyTorch NestedTensors, FlashAttention2, and xFormers can Boost Performance and Reduce AI Costs. As generative AI (genAI) models grow in both popularity and scale, so do the computational demands and costs associated with their training and deployment. Optimizing these models is crucial for enhancing…

  • Mistral 7B Explained: Towards More Efficient Language Models

    RMS Norm, RoPE, GQA, SWA, KV Cache, and more! Part 5 in the “LLMs from Scratch” series, a complete guide to understanding and building Large Language Models. If you are interested in learning more about how these models work, I encourage you to read: Part 1: Tokenization - A Complete Guide Part 2:…