Category: data-science

Data Science: From School to Work, Part III

Writing code is about solving problems, but not every problem is predictable. In the real world, your software will encounter unexpected situations: missing files, invalid user inputs, network timeouts, or even hardware failures. This is why handling errors isn’t just a nice-to-have; it’s a critical part…

March 28, 2025
Automate Supply Chain Analytics Workflows with AI Agents using n8n

Why build things the hard way when you can design them the smart way? As a Supply Chain Data Scientist, I’ve explored various frameworks like LangChain and LangGraph to build AI agents using Python. Leveraging LLMs with LangChain for Supply Chain Analytics — A Control Tower Powered by…

March 27, 2025
Uncertainty Quantification in Machine Learning with an Easy Python Interface

Uncertainty quantification (UQ) in a Machine Learning (ML) model allows one to estimate the precision of its predictions. This is extremely important for utilizing its predictions in real-world tasks. For instance, if a machine learning model is trained to predict a property of a material,…

March 27, 2025
The Ultimate AI/ML Roadmap For Beginners

AI is transforming the way businesses operate, and nearly every company is exploring how to leverage this technology. As a result, the demand for AI and machine learning skills has skyrocketed in recent years. With nearly four years of experience in AI/ML, I’ve decided to create the ultimate guide…

March 26, 2025
Data-Driven March Madness Predictions

March Madness is infamously unpredictable, a perfect storm where favorites tumble and underdogs rise to do the impossible. Every March, 64 men’s and 64 women’s college basketball teams battle for glory, while millions of fans, analysts, and betting markets scramble to predict the outcomes. But the odds of picking a perfect…

March 26, 2025
Evolving Product Operating Models in the Age of AI

In a previous article on organizing for AI, we looked at how the interplay between three key dimensions (ownership of outcomes, outsourcing of staff, and the geographical proximity of team members) can yield a variety of organizational archetypes for implementing strategic AI initiatives,…

March 22, 2025
No More Tableau Downtime: Metadata API for Proactive Data Health

In today’s world, the reliability of data solutions is everything. When we build dashboards and reports, we expect the numbers reflected there to be correct and up-to-date. Based on these numbers, insights are drawn and actions are taken. If, for any unforeseen reason, the dashboards are…

March 22, 2025
What Germany Currently Is Up To, Debt-Wise

€1,600 per second. That’s how much interest Germany has to pay for its debts. In total, the German state has debts ranging into the trillions, more than a thousand billion euros. And the government is planning to take on even more: up to one trillion in additional debt is…

March 22, 2025
Google’s Data Science Agent: Can It Really Do Your Job?

On March 3rd, Google officially rolled out its Data Science Agent to most Colab users for free. This is not something brand new: it was first announced in December last year, but it is now integrated into Colab and made widely accessible. Google says…

March 22, 2025
Mastering the Poisson Distribution: Intuition and Foundations

You’ve probably used the normal distribution one or two times too many. We all have; it’s a true workhorse. But sometimes, we run into problems. For instance, when predicting or forecasting values, simulating data given a particular data-generating process, or when we try to visualise model output…

March 21, 2025
Six Organizational Models for Data Science

Data science teams can operate in myriad ways within a company. These organizational models influence the type of work that the team does, but also the team’s culture, goals, impact, and overall value to the company. Adopting the wrong organizational model can limit impact, cause delays, and compromise…

March 21, 2025
The Impact of GenAI and Its Implications for Data Scientists

GenAI systems affect how we work. This general notion is well known. However, we are still unaware of the exact impact of GenAI. For example, how much do these tools affect our work? Do they have a larger impact on certain tasks? What does this…

March 15, 2025
Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and…

March 15, 2025
Forget About Cloud Computing. On-Premises Is All the Rage Again

Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins. The tides are turning, though. As much…

March 15, 2025
Anatomy of a Parquet File

In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: faster query execution when only a subset of columns is being processed; quick calculation of statistics across all data; and reduced storage volume thanks to efficient compression. When combined…

March 14, 2025
Fourier Transform Applications in Literary Analysis

Poetry is often seen as a pure art form, ranging from the rigid structure of a haiku to the fluid, unconstrained nature of free-verse poetry. In analysing these works, though, to what extent can mathematics and data analysis be used to glean meaning from this free-flowing literature? Of course,…

March 14, 2025
Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components: HDFS for storage, MapReduce for processing,…

March 14, 2025
7 Powerful DBeaver Tips and Tricks to Improve Your SQL Workflow

DBeaver is the most powerful open-source SQL IDE, but there are several features people don’t know about. In this post, I will share with you several features to speed up your workflow, with zero fluff. I’ve learned these as I’m currently digging deeper into…

March 12, 2025
How to Switch from Data Analyst to Data Scientist

Are you a Data Analyst looking to break into data science? If so, this post is for you. Many people start in analytics because it generally has a lower barrier to entry, but as they gain experience, they realize they want to take on more technical…

March 12, 2025
Experiments Illustrated: Can $1 Change Behavior More Than $100?

I currently lead a small data team at a small tech company. With everything small, we have a lot of autonomy over what, when, and how we run experiments. In this series, I’m opening the vault from our years of experimenting, each story highlighting a key…

March 12, 2025
Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies

Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become…

March 12, 2025
How to Develop Complex DAX Expressions

At some point or another, any Power BI developer must write complex DAX expressions to analyze data. But nobody tells you how to do it. What’s the process for doing it? What is the best way to do it, and how supportive can a development process be? These are the questions…

March 12, 2025
Platform-Mesh, Hub and Spoke, and Centralised | 3 Types of data team

In the “ever rapidly changing landscape of Data and AI” (!), understanding data and AI architecture has never been more critical. However, something many leaders overlook is the importance of data team structure. While many of you reading this probably identify as the data…

March 11, 2025
Linear Regression in Time Series: Sources of Spurious Regression

It’s pretty clear that most of our work will be automated by AI in the future. This will be possible because many researchers and professionals are working hard to make their work available online. These contributions not only help us understand fundamental concepts but…

March 11, 2025
Experiments Illustrated: How Random Assignment Saved Us $1M in Marketing Spend

Running cool experiments is easily one of my favorite parts of working in data science. Most experiments don’t deliver big wins, so the winners make for fun stories. We’ve had a few of these at IntelyCare, and I’m sharing each story in a way…

March 11, 2025
Experiments Illustrated: How We Optimized Premium Listings on Our Nursing Job Board

Running experiments is a task that often falls to data scientists. If that’s you, congrats! It can be a rewarding and high-impact area of work, but it also requires tools found outside the typical ML-heavy data science curriculum. Even with the best tools, only…

March 11, 2025
When You Just Can’t Decide on a Single Action

In game theory, the players typically have to make assumptions about the other players’ actions. What will the other player do? Will he use rock, paper or scissors? You never know, but in some cases, you might have an idea of the probability of some actions…

March 8, 2025
One-Tailed Vs. Two-Tailed Tests

If you’ve ever analyzed data using built-in t-test functions, such as those in R or SciPy, here’s a question for you: have you ever adjusted the default setting for the alternative hypothesis? If your answer is no, or if you’re not even sure what this means, then this blog post is for…

March 6, 2025
Kubernetes — Understanding and Utilizing Probes Effectively

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits. Reducing deployment times, making your applications react better to scaling events, and managing the health of running pods all require fine-tuning your container…

March 6, 2025
Mastering 1:1s as a Data Scientist: From Status Updates to Career Growth

I have been a data team manager for six months, and my team has grown from three to five. I wrote about my initial manager experiences back in November. In this article, I want to talk about something that is more essential to…

March 5, 2025
Practical SQL Puzzles That Will Level Up Your Skill

There are some SQL patterns that, once you know them, you start seeing everywhere. The solutions to the puzzles that I will show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries…

March 5, 2025
Data Science: From School to Work, Part II

In my previous article, I highlighted the importance of effective project management in Python development. Now, let’s shift our focus to the code itself and explore how to write clean, maintainable code, an essential practice in professional and collaborative environments. Readability & Maintainability: Well-structured code is easier to…

March 4, 2025
I Won’t Change Unless You Do

In game theory, how can players ever come to an end if there still might be a better option to choose? Maybe one player still wants to change their decision. But if they do, maybe the other player wants to change too. How can they ever hope to…

March 1, 2025
Debugging the Dreaded NaN

You are training your latest AI model, anxiously watching as the loss steadily decreases when suddenly, boom! Your logs are flooded with NaNs (Not a Number); your model is irreparably corrupted and you’re left staring at your screen in despair. To make matters worse, the NaNs don’t appear consistently.…

February 28, 2025
The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines

“You don’t have to be an expert to deceive someone, though you might need some expertise to reliably recognize when you are being deceived.” When my co-instructor and I start our quarterly lesson on deceptive visualizations for the data visualization course we teach at the University…

February 27, 2025
Is Python Set to Surpass Its Competitors?

A soufflé is a baked egg dish that originated in France in the 18th century. The process of making an elegant and delicious French soufflé is complex, and in the past, it was typically only prepared by professional French pastry chefs. However, with pre-made soufflé mixes now widely…

February 26, 2025
Efficient Data Handling in Python with Arrow

We’re all used to working with CSVs, JSON files… With traditional libraries and large datasets, these can be extremely slow to read, write and operate on, leading to performance bottlenecks (been there). It’s precisely with big amounts of data that being efficient handling the…

February 26, 2025
The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data

What is synthetic data? Data created by a computer intended to replicate or augment existing data. Why is it useful? We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society…

February 22, 2025
Do European M&Ms Actually Taste Better than American M&Ms?

(Oh, I am the only one who’s been asking this question…? Hm. Well, if you have a minute, please enjoy this exploratory data analysis, featuring experimental design, statistics, and interactive visualization, applied a bit too earnestly to resolve an international debate.) …

February 22, 2025
Talking about Games

Game theory is a field of research that is quite prominent in economics but rather unpopular in other scientific disciplines. However, the concepts used in game theory can be of interest to a wider audience, including data scientists, statisticians, computer scientists or psychologists, to name just a few. This article is the…

February 22, 2025
Unraveling Spatially Variable Genes: A Statistical Perspective on Spatial Transcriptomics

[The article was written by Guanao Yan, Ph.D. student of Statistics and Data Science at UCLA. Guanao is the first author of the Nature Communications review article [1].] Spatially resolved transcriptomics (SRT) is revolutionizing genomics by enabling the high-throughput measurement of gene expression while…

February 21, 2025
Don’t Let Conda Eat Your Hard Drive

If you’re an Anaconda user, you know that conda environments help you manage package dependencies, avoid compatibility conflicts, and share your projects with others. Unfortunately, they can also take over your computer’s hard drive. I write lots of computer tutorials and to keep them organized, each has a dedicated folder…

February 21, 2025
Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge

“I train models, analyze data and create dashboards. Why should I care about containers?” Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on…

February 20, 2025
Advanced Time Intelligence in DAX with Performance in Mind

We all know the usual Time Intelligence functions based on years, quarters, months, and days. But sometimes, we need to perform more exotic time intelligence calculations, and we should not forget to consider performance while programming the measures. There are many DAX functions in Power BI…

February 20, 2025
Honestly Uncertain

Ethical issues aside, should you be honest when asked how certain you are about some belief? Of course, it depends. In this blog post, you’ll learn on what. Different ways of evaluating probabilistic predictions come with dramatically different degrees of “optimal honesty”. Perhaps surprisingly, the linear function that assigns +1 to true and fully…

February 19, 2025
The Future of Data: How Decision Intelligence is Revolutionizing Data

In the past few years, technology and AI have evolved more than ever. As I read about the new concepts in tech and learn new skills and techniques each day, I feel in a state of limbo: there is so much content to consume and yet,…

February 19, 2025
How I Became A Machine Learning Engineer (No CS Degree, No Bootcamp)

Machine learning and AI are among the most popular topics nowadays, especially within the tech space. I am fortunate enough to work and develop with these technologies every day as a machine learning engineer! In this article, I will walk you through my…

February 15, 2025
➡️ Start Asking Your Data ‘Why?’ — A Gentle Intro To Causality

Correlation does not imply causation. It turns out, however, that with some simple ingenious tricks one can, potentially, unveil causal relationships within standard observational data, without having to resort to expensive randomised controlled trials. This post is targeted towards anyone making data driven…

February 15, 2025
Roadmap to Becoming a Data Scientist, Part 4: Advanced Machine Learning

Data science is undoubtedly one of the most fascinating fields today. Following significant breakthroughs in machine learning about a decade ago, data science has surged in popularity within the tech community. Each year, we witness increasingly powerful tools that once seemed unimaginable. Innovations such as the Transformer…

February 15, 2025
Publish Interactive Data Visualizations for Free with Python and Marimo

Working in data science, it can be hard to share insights from complex datasets using only static figures. All the facets that describe the shape and meaning of interesting data are not always captured in a handful of pre-generated figures. While we have powerful technologies…

February 15, 2025
Building a Data Engineering Center of Excellence

As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice…

February 14, 2025
Learnings from a Machine Learning Engineer — Part 1: The Data

It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you…

February 14, 2025
Method of Moments Estimation with Python Code

Let’s say you are in a customer care center, and you would like to know the probability distribution of the number of calls per minute, or in other words, you want to answer the question: what is the probability of receiving zero, one, two, … etc., calls per…

February 13, 2025
Should Data Scientists Care About Quantum Computing?

I am sure the quantum hype has reached every person in tech (and outside it, most probably), with some over-the-top claims like “some company has proved quantum supremacy,” “the quantum revolution is here,” or my favorite, “quantum computers are here, and it will make classical computers obsolete.” I…

February 13, 2025
Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1 GB if you have plenty of RAM. Within these size limits, it is also…

February 13, 2025
Build a Decision Tree in Polars from Scratch

Decision tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as sklearn, LightGBM, XGBoost and CatBoost have done a very good job…

February 12, 2025
Virtualization & Containers for Data Science Newbies

Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak. Many cloud services rely on virtualization. But other technologies, such…

February 12, 2025
4-Dimensional Data Visualization: Time in Bubble Charts

Bubble Charts elegantly compress large amounts of information into a single visualization, with bubble size adding a third dimension. However, comparing “before” and “after” states is often crucial. To address this, we propose adding a transition between these states, creating an intuitive user experience. Since we couldn’t find…

February 12, 2025
Data vs. Business Strategy

There seems to be a consensus that leveraging data, analytics, and AI to create a data-driven organization requires a clear strategic approach. However, there is less clarity and agreement on exactly what this strategic approach should look like in practice. This article provides a short overview of what strategy work I…

February 12, 2025
The Gamma Hurdle Distribution

Which Outcome Matters? Here is a common scenario: an A/B test was conducted, where a random sample of units (e.g. customers) was selected for a campaign and received Treatment A. Another sample was selected to receive Treatment B. “A” could be a communication or offer and “B” could be…

February 8, 2025
Triangle Forecasting: Why Traditional Impact Estimates Are Inflated (And How to Fix Them)

Accurate impact estimations can make or break your business case. Yet, despite their importance, most teams use oversimplified calculations that can lead to inflated projections. These shot-in-the-dark numbers not only destroy credibility with stakeholders but can also result in misallocation of resources and…

February 8, 2025
Synthetic Data Generation with LLMs

Popularity of RAG. Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value. Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation…

February 8, 2025
The Method of Moments Estimator for Gaussian Mixture Models

Audio processing is one of the most important application domains of digital signal processing (DSP) and machine learning. Modeling acoustic environments is an essential step in developing digital audio processing systems such as speech recognition, speech enhancement, acoustic echo cancellation, etc. Acoustic environments are filled with background…

February 8, 2025
How to Create Network Graph Visualizations in Microsoft PowerBI

Microsoft PowerBI is one of the most popular Business Intelligence (BI) tools, and while it has all the features you need to create dynamic analytic reporting for stakeholders across the business, creating some advanced data visualizations is more challenging. This article will walk through how…

February 7, 2025
Introduction to Minimum Cost Flow Optimization in Python

Minimum cost flow optimization minimizes the cost of moving flow through a network of nodes and edges. Nodes include sources (supply) and sinks (demand), with different costs and capacity limits. The aim is to find the least costly way to move volume from sources to sinks while…

February 7, 2025
Myths vs. Data: Does an Apple a Day Keep the Doctor Away?

“Money can’t buy happiness.” “You can’t judge a book by its cover.” “An apple a day keeps the doctor away.” You’ve probably heard these sayings several times, but do they actually hold up when we look at the data? In this article series,…

February 6, 2025
Neural Networks – Intuitively and Exhaustively Explained

An in-depth exploration of the most fundamental architecture in modern AI. [Image: “The Thinking Part” by Daniel Warfield using MidJourney. All images by the author unless otherwise specified.] Article originally made available on Intuitively and Exhaustively Explained. In this article we’ll form a thorough understanding of the neural network,…

February 4, 2025
How to Get Promoted as a Data Scientist

[Image artificially generated using Grok 2.] I have been working as a Data Scientist since 2017, and during that time I have been promoted from a junior/mid-level to a senior, and most recently to a Lead Data Scientist. There is a lot of content online regarding…

February 4, 2025
How to Find Seasonality Patterns in Time Series

Using Fourier Transforms to detect seasonal components. In my professional life as a data scientist, I have encountered time series multiple times. Most of my knowledge comes from my academic experience, specifically my courses in Econometrics (I have a degree in Economics), where we studied statistical properties…

February 4, 2025
Awesome Plotly with code series (Part 9): To dot, to slope or to stack?

Simple methods to replace cluttered bar charts with crisp, reader-friendly visuals. By Jose Parreño.

February 3, 2025
5 Essential Tips Learned from My Data Science Journey

Personal reflections on my 10-year data odyssey. By Federico Rucci.

February 3, 2025
How to Make a Data Science Portfolio That Stands Out

Create a data science portfolio with Cloudflare and Hugo. By Egor Howell.

February 3, 2025
Sparse AutoEncoder: from Superposition to interpretable features

Disentangle features in complex neural networks with superposition. Complex neural networks, such as Large Language Models (LLMs), suffer quite often from interpretability challenges. One of the most important reasons for such difficulty is superposition, a phenomenon of the neural network having fewer dimensions than the number of features it…

February 2, 2025
Are Data Scientists at Risk in 2025?

The impact of AI on data science jobs. By Natassha Selvaraj.

February 2, 2025
DeepSeek V3: A New Contender in AI-Powered Data Science

How DeepSeek’s budget-friendly AI model stacks up against ChatGPT, Claude, and Gemini in SQL, EDA, and machine learning. By Yu Dong.

February 2, 2025
How Likely Is a Six Nations Grand Slam in 2025?

Quantifying uncertainty in sports fixtures. [Photo by Thomas Serer on Unsplash.] For rugby fans the long wait is nearly over: like Christmas, the Six Nations comes once a year to lift our spirits in the cold winter months. If you’re not very familiar with rugby, the…

February 1, 2025
2-Bit VPTQ: 6.5x Smaller LLMs While Preserving 95% Accuracy

Very accurate 2-bit quantization for running 70B LLMs on a 24 GB GPU. By Benjamin Marie.

February 1, 2025
Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

How much data does AI really need? TLDR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. [Figure: best runs for “furthest-from-centroid” selection compared to the full dataset. Image by author.] What if I told you…

January 31, 2025
Actually, Being a Data Scientist is Awesome

Actually, Being a Data Scientist is Awesome Don’t let the doom and gloom get to you. By Marina Wyss – Gratitude Driven

January 31, 2025
Navigating Data Science Content: Recognizing Common Pitfalls, Part 1

Navigating Data Science Content: Recognizing Common Pitfalls, Part 1 Uncovering and correcting misconceptions in online data science content to help you learn more effectively. By Geremie Yeo

January 31, 2025
Great Books for AI Engineering

Great Books for AI Engineering 10 books with valuable insights about AI science and engineering. A few years ago I recommended 21 books in Great Books for Data Science and Great Books for Data Science 2. Since then a lot has changed. While…

January 30, 2025
NLP Illustrated, Part 3: Word2Vec

NLP Illustrated, Part 3: Word2Vec An exhaustive and illustrated guide to Word2Vec with code! By Shreya Rao

January 30, 2025
The Challenges and Realities of Being a Data Scientist

The Challenges and Realities of Being a Data Scientist Some harsh truths behind the field of data science. By Egor Howell

January 30, 2025
Machine Learning Incidents in AdTech

Machine Learning Incidents in AdTech Challenges with deep learning in production. One of the biggest challenges I encountered in my career as a data scientist was migrating the core algorithms in a mobile AdTech platform from classic machine learning models to deep learning. I worked on a Demand Side Platform (DSP) for user…

January 30, 2025
Basics of Probability Notations

Basics of Probability Notations Union, Intersection, Independence, Disjoint, Complement: Advanced Probability for Data Science Series (1). By Sunghyun Ahn

January 29, 2025
How GenAI Tools Have Changed My Work as a Data Scientist

How GenAI Tools Have Changed My Work as a Data Scientist An overview of the 4 use cases and 6 GenAI tools I use. By Jonte Dancker

January 29, 2025
Who is Right? The Dean or the Students?

Who is Right? The Dean or the Students? A cautionary tale on two perspectives on averaging. By Paolo Molignini, PhD

January 29, 2025
Build a Decision Tree in Polars from Scratch

Build a Decision Tree in Polars from Scratch Explore decision trees with polars backend. Decision tree algorithms have always fascinated me. They are easy to implement and achieve good results on various classification and regression tasks. Combined with boosting, decision trees are still state-of-the-art in many applications. Frameworks such as sklearn,…

January 28, 2025
Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus

Water Cooler Small Talk, Ep 7: Anscombe’s Quartet and the Datasaurus Why descriptive statistics aren’t enough and plotting your data is always essential. By Maria Mouschoutzi, PhD

January 28, 2025
Your Neural Network Can’t Explain This. TMLE to the Rescue!

Your Neural Network Can’t Explain This. TMLE to the Rescue! Targeted Maximum Likelihood Estimation (TMLE) helps you explain patterns where other techniques fall short. By Ari Joury, PhD

January 27, 2025
Optimising Budgets With Marketing Mix Models In Python

Optimising Budgets With Marketing Mix Models In Python Part 3 of a hands-on guide to help you master MMM in pymc. What is this series about? Welcome to part 3 of my series on marketing mix modelling (MMM), a hands-on guide to help you master MMM. Throughout this series, we’ll cover key…

January 27, 2025
How Cheap Mortgages Transformed Poland’s Real Estate Market

How Cheap Mortgages Transformed Poland’s Real Estate Market Insights from a synthetic control group. By Lukasz Szubelak

January 26, 2025
Deep Learning for Click Prediction in Mobile AdTech

Deep Learning for Click Prediction in Mobile AdTech Machine Learning for Real-Time Bidding. The past few years were a revolution for the mobile advertising and gaming industries, with the broad adoption of neural networks for advertising tasks, including click prediction. This migration occurred prior to the success of Large Language Models (LLMs) and…

January 25, 2025
Multi-Headed Cross Attention — By Hand

Multi-Headed Cross Attention — By Hand Hand computing a fundamental component of multimodal models. By Daniel Warfield

January 25, 2025
Does It Matter That Online Experiments Interact?

Does It Matter That Online Experiments Interact? What interactions do, why they are just like any other change in the environment post-experiment, and some reassurance. Experiments do not run one at a time. At any moment, hundreds to thousands of experiments run on a mature website. The question comes up:…

January 25, 2025
Avoid These Easily Missed Mistakes in Machine Learning Workflows — Part 2

Avoid These Easily Missed Mistakes in Machine Learning Workflows — Part 2 Using Unavailable Data at Prediction Time and Mixing Magic Numbers with Real Numbers. By Thomas A Dorfer

January 25, 2025
A Derivation and Application of Restricted Boltzmann Machines (2024 Nobel Prize)

A Derivation and Application of Restricted Boltzmann Machines (2024 Nobel Prize) Investigating Geoffrey Hinton’s Nobel Prize-winning work and building it from scratch using PyTorch One recipient of the 2024 Nobel Prize in Physics was Geoffrey Hinton for his contributions in the field of AI and machine learning. A lot of people know he worked on neural…

January 24, 2025
On a Time Crunch but Still Want to Learn to Develop Multi-Agent AI?

On a Time Crunch but Still Want to Learn to Develop Multi-Agent AI? These 3 starter projects only take a weekend (and a few cups of coffee, maybe). By Thuwarakesh Murallie

January 24, 2025
The Solar Cycle(s): history, data analysis and trend forecasting.

The Solar Cycle(s): history, data analysis and trend forecasting. A brief article on the Solar Cycles, the history behind their observation, data analysis and time series forecasting for the incoming solar maximum in 2025–2026 and the next decades. You have probably heard about the 11-year Solar Cycle…

January 24, 2025
Harmonizing and Pooling Datasets for Health Research in R

Harmonizing and Pooling Datasets for Health Research in R R code to extract data from unique datasets and combine them in one harmonized dataset ready for seamless analysis. By Rodrigo M Carrillo Larco, MD, PhD

January 23, 2025