Tag: data

The Data Team’s Survival Guide for the Next Era of Data

The Data Team’s Survival Guide for the Next Era of Data 6 pillars to declutter your stack, escape the service trap, and build the missing foundations for the new primary data consumer: the AI agent. The post The Data Team’s Survival Guide for the Next Era of Data appeared first on Towards Data Science. Mahdi…

March 7, 2026
The Gap Between Junior and Senior Data Scientists Isn’t Code

The Gap Between Junior and Senior Data Scientists Isn’t Code Why my obsession with complex algorithms was actually holding my career back. The post The Gap Between Junior and Senior Data Scientists Isn’t Code appeared first on Towards Data Science. Benjamin Nweke Go to original source

February 28, 2026
Unsupervised Continual Learning for Amortized Bayesian Inference

Unsupervised Continual Learning for Amortized Bayesian Inference arXiv:2602.22884v1 Announce Type: new Abstract: Amortized Bayesian Inference (ABI) enables efficient posterior estimation using generative neural networks trained on simulated data, but often suffers from performance degradation under model misspecification. While self-consistency (SC) training on unlabeled empirical data can enhance network robustness, current approaches are limited to static,…

February 27, 2026
Designing Data and AI Systems That Hold Up in Production

Designing Data and AI Systems That Hold Up in Production A system-level perspective on architecture, agents, and responsible scale The post Designing Data and AI Systems That Hold Up in Production appeared first on Towards Data Science. TDS Editors Go to original source

February 27, 2026
Amortized Bayesian inference for actigraph time sheet data from mobile devices

Amortized Bayesian inference for actigraph time sheet data from mobile devices arXiv:2602.20611v1 Announce Type: new Abstract: Mobile data technologies use “actigraphs” to furnish information on health variables as a function of a subject’s movement. The advent of wearable devices and related technologies has propelled the creation of health databases consisting of human movement data to…

February 25, 2026
Is the AI and Data Job Market Dead?

Is the AI and Data Job Market Dead? What you should be doing in the current job market The post Is the AI and Data Job Market Dead? appeared first on Towards Data Science. Egor Howell Go to original source

February 24, 2026
AI in Multiple GPUs: Gradient Accumulation & Data Parallelism

AI in Multiple GPUs: Gradient Accumulation & Data Parallelism Learn and implement gradient accumulation and data parallelism from scratch in PyTorch The post AI in Multiple GPUs: Gradient Accumulation & Data Parallelism appeared first on Towards Data Science. Lorenzo Cesconetto Go to original source

February 24, 2026
Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget arXiv:2602.17894v1 Announce Type: new Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical…

February 23, 2026
Data Catalog Tool – Sanity Check

Data Catalog Tool – Sanity Check submitted by /u/FirCoat Go to original source

February 23, 2026
From Monolith to Contract-Driven Data Mesh

From Monolith to Contract-Driven Data Mesh A pragmatic journey using website analytics as a real-world example The post From Monolith to Contract-Driven Data Mesh appeared first on Towards Data Science. Corné POTGIETER Go to original source

February 21, 2026
Anti-causal domain generalization: Leveraging unlabeled data

Anti-causal domain generalization: Leveraging unlabeled data arXiv:2602.17187v1 Announce Type: new Abstract: The problem of domain generalization concerns learning predictive models that are robust to distribution shifts when deployed in new, previously unseen environments. Existing methods typically require labeled data from multiple training environments, limiting their applicability when labeled data are scarce. In this work, we…

February 20, 2026
The Missing Curriculum: Essential Concepts For Data Scientists in the Age of AI Coding Agents

The Missing Curriculum: Essential Concepts For Data Scientists in the Age of AI Coding Agents AI can write the code, but you have to steer the ship. Master the knowledge to keep you relevant in the age of AI. The post The Missing Curriculum: Essential Concepts For Data Scientists in the Age of AI Coding…

February 20, 2026
Why Every Analytics Engineer Needs to Understand Data Architecture

Why Every Analytics Engineer Needs to Understand Data Architecture Get the data architecture right, and everything else becomes easier. I know it sounds simple, but in reality, little nuances in designing your data architecture may have costly implications. This article provides a crash course on the architectures that shape your daily decisions – from relational…

February 19, 2026
LLMs for data pipelines without losing control (API → DuckDB in ~10 mins)

LLMs for data pipelines without losing control (API → DuckDB in ~10 mins) Hey folks, I’ve been doing data engineering long enough to believe that “real” pipelines meant writing every parser by hand, dealing with pagination myself, and debugging nested JSON until it finally stopped exploding. I’ve also been pretty skeptical of the “just prompt…

February 16, 2026
Best technique for training models on a sample of data?

Best technique for training models on a sample of data? Due to memory limits on my work computer I’m unable to train machine learning models on our entire analysis dataset. Given my data is highly imbalanced I’m under-sampling from the majority class of the binary outcome. What is the proper method to train ML models…

February 16, 2026
Your First 90 Days as a Data Scientist

Your First 90 Days as a Data Scientist A practical onboarding checklist for building trust, business fluency, and data intuition The post Your First 90 Days as a Data Scientist appeared first on Towards Data Science. Yu Dong Go to original source

February 15, 2026
Building an AI Agent to Detect and Handle Anomalies in Time-Series Data

Building an AI Agent to Detect and Handle Anomalies in Time-Series Data Combining statistical detection with agentic decision-making The post Building an AI Agent to Detect and Handle Anomalies in Time-Series Data appeared first on Towards Data Science. MADHURA RAUT Go to original source

February 12, 2026
Thoughts about going from Senior data scientist at company A to Senior Data Analyst at Company B

Thoughts about going from Senior data scientist at company A to Senior Data Analyst at Company B The senior data analyst at company B is significantly higher pay ($50k/year more) and the scope seems to be bigger, with more ownership. What kind of setback (if any) does losing the data scientist title have? submitted by /u/StatGoddess…

February 9, 2026
Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently

Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently The real value lies in writing clearer code and using your tools right The post Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently appeared first on Towards Data Science. Mike Huls Go to original source

February 7, 2026
Creating a Data Pipeline to Monitor Local Crime Trends

Creating a Data Pipeline to Monitor Local Crime Trends A walkthrough of creating an ETL pipeline to extract local crime data and visualize it in Metabase. The post Creating a Data Pipeline to Monitor Local Crime Trends appeared first on Towards Data Science. Jimin Kang Go to original source

February 4, 2026
Am I drifting away from Data Science, or building useful foundations? (2 YOE working in a startup, no coding)

Am I drifting away from Data Science, or building useful foundations? (2 YOE working in a startup, no coding) I’m looking for some career perspective and would really appreciate advice from people working in or around data science. I’m currently not sure where exactly my career is heading and want to start a business eventually…

February 2, 2026
What separates data scientists who earn a good living (100k-200k) from those who earn 300k+ at FAANG?

What separates data scientists who earn a good living (100k-200k) from those who earn 300k+ at FAANG? Is it just stock options and vesting? Or is it just that FAANG is a lot of work? Why do some data scientists deserve that much? I work at a Fortune 500 and the ceiling for IC data scientists…

February 2, 2026
TDS Newsletter: January Must-Reads on Data Platforms, Infinite Context, and More

TDS Newsletter: January Must-Reads on Data Platforms, Infinite Context, and More Don’t miss our most-read and -shared stories of the past month The post TDS Newsletter: January Must-Reads on Data Platforms, Infinite Context, and More appeared first on Towards Data Science. TDS Editors Go to original source

January 31, 2026
Optimizing Vector Search: Why You Should Flatten Structured Data

Optimizing Vector Search: Why You Should Flatten Structured Data An analysis of how flattening structured data can boost precision and recall by up to 20% The post Optimizing Vector Search: Why You Should Flatten Structured Data appeared first on Towards Data Science. Oleg Tereshin Go to original source

January 30, 2026
Data Science as Engineering: Foundations, Education, and Professional Identity

Data Science as Engineering: Foundations, Education, and Professional Identity Recognize data science as an engineering practice and structure education accordingly. The post Data Science as Engineering: Foundations, Education, and Professional Identity appeared first on Towards Data Science. Tom Narock Go to original source

January 28, 2026
Data-Driven Information-Theoretic Causal Bounds under Unmeasured Confounding

Data-Driven Information-Theoretic Causal Bounds under Unmeasured Confounding arXiv:2601.17160v1 Announce Type: new Abstract: We develop a data-driven information-theoretic framework for sharp partial identification of causal effects under unmeasured confounding. Existing approaches often rely on restrictive assumptions, such as bounded or discrete outcomes; require external inputs (for example, instrumental variables, proxies, or user-specified sensitivity parameters); necessitate full…

January 27, 2026
Boosting methods for interval-censored data with regression and classification

Boosting methods for interval-censored data with regression and classification arXiv:2601.17973v1 Announce Type: new Abstract: Boosting has garnered significant interest across both machine learning and statistical communities. Traditional boosting algorithms, designed for fully observed random samples, often struggle with real-world problems, particularly with interval-censored data. This type of data is common in survival analysis and time-to-event…

January 27, 2026
Causal ML for the Aspiring Data Scientist

Causal ML for the Aspiring Data Scientist An accessible introduction to causal inference and ML The post Causal ML for the Aspiring Data Scientist appeared first on Towards Data Science. Ross Lauterbach Go to original source

January 27, 2026
Air for Tomorrow: Mapping the Digital Air-Quality Landscape, from Repositories and Data Types to Starter Code

Air for Tomorrow: Mapping the Digital Air-Quality Landscape, from Repositories and Data Types to Starter Code Understand air quality: access the available data, interpret data types, and execute starter codes The post Air for Tomorrow: Mapping the Digital Air-Quality Landscape, from Repositories and Data Types to Starter Code appeared first on Towards Data Science. Prithviraj…

January 25, 2026
Optimizing Data Transfer in Distributed AI/ML Training Workloads

Optimizing Data Transfer in Distributed AI/ML Training Workloads A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 3 The post Optimizing Data Transfer in Distributed AI/ML Training Workloads appeared first on Towards Data Science. Chaim Rand Go to original source

January 24, 2026
Why SaaS Product Management Is the Best Domain for Data-Driven Professionals in 2026

Why SaaS Product Management Is the Best Domain for Data-Driven Professionals in 2026 How I use analytics, automation, and AI to build better SaaS The post Why SaaS Product Management Is the Best Domain for Data-Driven Professionals in 2026 appeared first on Towards Data Science. Yassin Zehar Go to original source

January 23, 2026
Large Data Limits of Laplace Learning for Gaussian Measure Data in Infinite Dimensions

Large Data Limits of Laplace Learning for Gaussian Measure Data in Infinite Dimensions arXiv:2601.14515v1 Announce Type: new Abstract: Laplace learning is a semi-supervised method, a solution for finding missing labels from a partially labeled dataset utilizing the geometry given by the unlabeled data points. The method minimizes a Dirichlet energy defined on a (discrete) graph…

January 22, 2026
Google Trends is Misleading You: How to Do Machine Learning with Google Trends Data

Google Trends is Misleading You: How to Do Machine Learning with Google Trends Data Google Trends is one of the most widely used tools for analysing human behaviour at scale. Journalists use it. Data scientists use it. Entire papers are built on it. But there is a fundamental property of Google Trends data that makes…

January 22, 2026
If You Want to Become a Data Scientist in 2026, Do This

If You Want to Become a Data Scientist in 2026, Do This Learn from my mistakes and fast track your data science career The post If You Want to Become a Data Scientist in 2026, Do This appeared first on Towards Data Science. Egor Howell Go to original source

January 22, 2026
Building a Self-Healing Data Pipeline That Fixes Its Own Python Errors

Building a Self-Healing Data Pipeline That Fixes Its Own Python Errors How I built a self-healing pipeline that automatically fixes bad CSVs, schema changes, and weird delimiters. The post Building a Self-Healing Data Pipeline That Fixes Its Own Python Errors appeared first on Towards Data Science. Benjamin Nweke Go to original source

January 22, 2026
Data Poisoning in Machine Learning: Why and How People Manipulate Training Data

Data Poisoning in Machine Learning: Why and How People Manipulate Training Data Do you know where your data has been? The post Data Poisoning in Machine Learning: Why and How People Manipulate Training Data appeared first on Towards Data Science. Stephanie Kirmer Go to original source

January 18, 2026
The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling

The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling Acquisitions, venture, and an increasingly competitive landscape all point to a market ceiling The post The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling appeared first on Towards Data Science. Hugo Lu Go to original source

January 17, 2026
The 2026 Goal Tracker: How I Built a Data-Driven Vision Board Using Python, Streamlit, and Neon

The 2026 Goal Tracker: How I Built a Data-Driven Vision Board Using Python, Streamlit, and Neon Designing a centralized system to track daily habits and long-term goals The post The 2026 Goal Tracker: How I Built a Data-Driven Vision Board Using Python, Streamlit, and Neon appeared first on Towards Data Science. Sabrine Bendimerad Go to…

January 16, 2026
Why Human-Centered Data Analytics Matters More Than Ever

Why Human-Centered Data Analytics Matters More Than Ever From optimizing metrics to designing meaning: putting people back into data-driven decisions The post Why Human-Centered Data Analytics Matters More Than Ever appeared first on Towards Data Science. Rashi Desai Go to original source

January 15, 2026
Topic Modeling Techniques for 2026: Seeded Modeling, LLM Integration, and Data Summaries

Topic Modeling Techniques for 2026: Seeded Modeling, LLM Integration, and Data Summaries Seeded topic modeling, integration with LLMs, and training on summarized data are the fresh parts of the NLP toolkit. The post Topic Modeling Techniques for 2026: Seeded Modeling, LLM Integration, and Data Summaries appeared first on Towards Data Science. Petr Koráb Go to…

January 15, 2026
Under the Uzès Sun: When Historical Data Reveals the Climate Change

Under the Uzès Sun: When Historical Data Reveals the Climate Change Longer summers, milder winters: analysis of temperature trends in Uzès, France, year after year. The post Under the Uzès Sun: When Historical Data Reveals the Climate Change appeared first on Towards Data Science. Marc Polizzi Go to original source

January 14, 2026
The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks

The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks arXiv:2601.06961v1 Announce Type: new Abstract: The success of deep neural networks largely depends on the statistical structure of the training data. While learning dynamics and generalization on isotropic data are well-established, the impact of pronounced anisotropy on these crucial…

January 13, 2026
Optimizing Data Transfer in Batched AI/ML Inference Workloads

Optimizing Data Transfer in Batched AI/ML Inference Workloads A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 2 The post Optimizing Data Transfer in Batched AI/ML Inference Workloads appeared first on Towards Data Science. Chaim Rand Go to original source

January 13, 2026
Multi-task Modeling for Engineering Applications with Sparse Data

Multi-task Modeling for Engineering Applications with Sparse Data arXiv:2601.05910v1 Announce Type: new Abstract: Modern engineering and scientific workflows often require simultaneous predictions across related tasks and fidelity levels, where high-fidelity data is scarce and expensive, while low-fidelity data is more abundant. This paper introduces a Multi-Task Gaussian Processes (MTGP) framework tailored for engineering systems characterized…

January 12, 2026
Federated Learning, Part 1: The Basics of Training Models Where the Data Lives

Federated Learning, Part 1: The Basics of Training Models Where the Data Lives Understanding the foundations of federated learning The post Federated Learning, Part 1: The Basics of Training Models Where the Data Lives appeared first on Towards Data Science. Parul Pandey Go to original source

January 11, 2026
Data Science Spotlight: Selected Problems from Advent of Code 2025

Data Science Spotlight: Selected Problems from Advent of Code 2025 Hands-on walkthroughs of problems and solution approaches that power real‑world data science use cases The post Data Science Spotlight: Selected Problems from Advent of Code 2025 appeared first on Towards Data Science. Chinmay Kakatkar Go to original source

January 10, 2026
Mastering Non-Linear Data: A Guide to Scikit-Learn’s SplineTransformer

Mastering Non-Linear Data: A Guide to Scikit-Learn’s SplineTransformer Forget stiff lines and wild polynomials. Discover why Splines are the “Goldilocks” of feature engineering, offering the perfect balance of flexibility and discipline for non-linear data using Scikit-Learn’s SplineTransformer. The post Mastering Non-Linear Data: A Guide to Scikit-Learn’s SplineTransformer appeared first on Towards Data Science. Gustavo Santos…

January 10, 2026
TDS Newsletter: December Must-Reads on GraphRAG, Data Contracts, and More

TDS Newsletter: December Must-Reads on GraphRAG, Data Contracts, and More Don’t miss our most popular articles of the previous month The post TDS Newsletter: December Must-Reads on GraphRAG, Data Contracts, and More appeared first on Towards Data Science. TDS Editors Go to original source

January 9, 2026
Why Supply Chain is the Best Domain for Data Scientists in 2026 (And How to Learn It)

Why Supply Chain is the Best Domain for Data Scientists in 2026 (And How to Learn It) My take after 10 years in Supply Chain on why this can be an excellent playground for data scientists who want to see their skills valued. The post Why Supply Chain is the Best Domain for Data Scientists in…

January 8, 2026
First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data

First Provably Optimal Asynchronous SGD for Homogeneous and Heterogeneous Data arXiv:2601.02523v1 Announce Type: cross Abstract: Artificial intelligence has advanced rapidly through large neural networks trained on massive datasets using thousands of GPUs or TPUs. Such training can occupy entire data centers for weeks and requires enormous computational and energy resources. Yet the optimization algorithms behind…

January 7, 2026
The Best Data Scientists Are Always Learning

The Best Data Scientists Are Always Learning Part 2: Avoiding burnout, learning strategies and the superpower of solitude The post The Best Data Scientists Are Always Learning appeared first on Towards Data Science. Jarom Hulet Go to original source

January 7, 2026
Stop Blaming the Data: A Better Way to Handle Covariance Shift

Stop Blaming the Data: A Better Way to Handle Covariance Shift Instead of using shift as an excuse for poor performance, use Inverse Probability Weighting to estimate how your model should perform in the new environment The post Stop Blaming the Data: A Better Way to Handle Covariance Shift appeared first on Towards Data Science.…

January 6, 2026
Active learning for data-driven reduced models of parametric differential systems with Bayesian operator inference

Active learning for data-driven reduced models of parametric differential systems with Bayesian operator inference arXiv:2601.00038v1 Announce Type: new Abstract: This work develops an active learning framework to intelligently enrich data-driven reduced-order models (ROMs) of parametric dynamical systems, which can serve as the foundation of virtual assets in a digital twin. Data-driven ROMs are explainable, computationally…

January 5, 2026
Tips for standing out in this market?

Tips for standing out in this market? Hey all, I just finished my master’s in data science last month and I want to see what it takes to break into a mid level DS role. I haven’t had a chance to sterilize my resume yet (2 young kids and a lot of recent travel), but…

January 5, 2026
Optimizing Data Transfer in AI/ML Workloads

Optimizing Data Transfer in AI/ML Workloads A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems The post Optimizing Data Transfer in AI/ML Workloads appeared first on Towards Data Science. Chaim Rand Go to original source

January 4, 2026
Off-Beat Careers That Are the Future Of Data

Off-Beat Careers That Are the Future Of Data The unconventional career paths you need to explore The post Off-Beat Careers That Are the Future Of Data appeared first on Towards Data Science. Rashi Desai Go to original source

January 3, 2026
The Real Challenge in Data Storytelling: Getting Buy-In for Simplicity

The Real Challenge in Data Storytelling: Getting Buy-In for Simplicity What happens when your clear dashboard meets stakeholders who want everything on one screen The post The Real Challenge in Data Storytelling: Getting Buy-In for Simplicity appeared first on Towards Data Science. Benjamin Nweke Go to original source

January 3, 2026
What Advent of Code Has Taught Me About Data Science

What Advent of Code Has Taught Me About Data Science Five key learnings that I discovered during a programming challenge and how they apply to data science The post What Advent of Code Has Taught Me About Data Science appeared first on Towards Data Science. Jasper Schroeder Go to original source

January 1, 2026
PhD microbiologist pivoting to GCC data analytics. Is a master’s needed or portfolio and projects sufficient?

PhD microbiologist pivoting to GCC data analytics. Is a master’s needed or portfolio and projects sufficient? I am finishing a wet-lab microbiology PhD. Over the last year I realised that I prefer data work. I use R, Excel and command line regularly and want to move toward analytics roles in industry rather than academic biology.…

December 29, 2025
Exploring TabPFN: A Foundation Model Built for Tabular Data

Exploring TabPFN: A Foundation Model Built for Tabular Data Understanding the architecture, training pipeline and implementing TabPFN in practice The post Exploring TabPFN: A Foundation Model Built for Tabular Data appeared first on Towards Data Science. Parul Pandey Go to original source

December 28, 2025
Learning from Neighbors with PHIBP: Predicting Infectious Disease Dynamics in Data-Sparse Environments

Learning from Neighbors with PHIBP: Predicting Infectious Disease Dynamics in Data-Sparse Environments arXiv:2512.21005v1 Announce Type: new Abstract: Modeling sparse count data, which arise across numerous scientific fields, presents significant statistical challenges. This chapter addresses these challenges in the context of infectious disease prediction, with a focus on predicting outbreaks in geographic regions that have historically…

December 25, 2025
Gaussian Process Assisted Meta-learning for Image Classification and Object Detection Models

Gaussian Process Assisted Meta-learning for Image Classification and Object Detection Models arXiv:2512.20021v1 Announce Type: new Abstract: Collecting operationally realistic data to inform machine learning models can be costly. Before collecting new data, it is helpful to understand where a model is deficient. For example, object detectors trained on images of rare objects may not be…

December 24, 2025
New Data Science Team Lead struggling with aggressive PM on timelines and model expectations

New Data Science Team Lead struggling with aggressive PM on timelines and model expectations I’m a data scientist who was recently promoted to be a data science team lead. Overall I enjoy the role, but I’m running into a recurring challenge with a very aggressive product manager (also a leader) that I’m not sure how…

December 22, 2025
4 Ways to Supercharge Your Data Science Workflow with Google AI Studio

4 Ways to Supercharge Your Data Science Workflow with Google AI Studio With concrete examples of using AI Studio Build mode to learn faster, prototype smarter, communicate clearer, and automate quicker. The post 4 Ways to Supercharge Your Data Science Workflow with Google AI Studio appeared first on Towards Data Science. Shuai Guo Go to…

December 19, 2025
Interval Fisher’s Discriminant Analysis and Visualisation

Interval Fisher’s Discriminant Analysis and Visualisation arXiv:2512.11945v1 Announce Type: new Abstract: In Data Science, entities are typically represented by single valued measurements. Symbolic Data Analysis extends this framework to more complex structures, such as intervals and histograms, that express internal variability. We propose an extension of multiclass Fisher’s Discriminant Analysis to interval-valued data, using Moore’s…

December 16, 2025
Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data

Efficient Level-Crossing Probability Calculation for Gaussian Process Modeled Data arXiv:2512.12442v1 Announce Type: new Abstract: Almost all scientific data have uncertainties originating from different sources. Gaussian process regression (GPR) models are a natural way to model data with Gaussian-distributed uncertainties. GPR also has the benefit of reducing I/O bandwidth and storage requirements for large scientific simulations.…

December 16, 2025
6 Technical Skills That Make You a Senior Data Scientist

6 Technical Skills That Make You a Senior Data Scientist Beyond writing code, these are the design-level decisions, trade-offs, and habits that quietly separate senior data scientists from everyone else. The post 6 Technical Skills That Make You a Senior Data Scientist appeared first on Towards Data Science. Piero Paialunga Go to original source

December 16, 2025
Geospatial exploratory data analysis with GeoPandas and DuckDB

Geospatial exploratory data analysis with GeoPandas and DuckDB In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through…

December 16, 2025
Data-Driven Model Reduction using WeldNet: Windowed Encoders for Learning Dynamics

Data-Driven Model Reduction using WeldNet: Windowed Encoders for Learning Dynamics arXiv:2512.11090v1 Announce Type: new Abstract: Many problems in science and engineering involve time-dependent, high dimensional datasets arising from complex physical processes, which are costly to simulate. In this work, we propose WeldNet: Windowed Encoders for Learning Dynamics, a data-driven nonlinear model reduction framework to build…

December 15, 2025
Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case

Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case Introduction If you work in data science, data engineering, or as a frontend/backend developer, you deal with JSON. For professionals, it’s basically only death, taxes, and JSON parsing that are inevitable. The issue is that parsing JSON is often a serious pain. Whether you are…

December 15, 2025
EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas

EDA in Public (Part 1): Cleaning and Exploring Sales Data with Pandas Hey everyone! Welcome to the start of a major data journey that I’m calling “EDA in Public.” For those who know me, I believe the best way to learn anything is to tackle a real-world problem and share the entire messy process — including mistakes, victories,…

December 13, 2025
7 Pandas Performance Tricks Every Data Scientist Should Know

7 Pandas Performance Tricks Every Data Scientist Should Know What I’ve learned about making Pandas faster after too many slow notebooks and frozen sessions The post 7 Pandas Performance Tricks Every Data Scientist Should Know appeared first on Towards Data Science. Benjamin Nweke Go to original source

December 12, 2025
Functional Random Forest with Adaptive Cost-Sensitive Splitting for Imbalanced Functional Data Classification

Functional Random Forest with Adaptive Cost-Sensitive Splitting for Imbalanced Functional Data Classification arXiv:2512.07888v1 Announce Type: new Abstract: Classification of functional data where observations are curves or trajectories poses unique challenges, particularly under severe class imbalance. Traditional Random Forest algorithms, while robust for tabular data, often fail to capture the intrinsic structure of functional observations and…

December 10, 2025
Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data arXiv:2512.05456v1 Announce Type: new Abstract: As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for…

December 8, 2025
Lost and Feel Like a Fraud

Lost and Feel Like a Fraud This might not be the appropriate place to say this, but I honestly feel like the biggest fraud ever. If I could go back, I don’t think I would have gone into data science. I did my undergraduate in biology, and then did a master’s in data science. I’ve…

December 8, 2025
How to Climb the Hidden Career Ladder of Data Science

How to Climb the Hidden Career Ladder of Data Science The behaviors that get you promoted The post How to Climb the Hidden Career Ladder of Data Science appeared first on Towards Data Science. Greg Rafferty

December 8, 2025
A Product Data Scientist’s Take on LinkedIn Games After 500 Days of Play

A Product Data Scientist’s Take on LinkedIn Games After 500 Days of Play What a simple puzzle game reveals about experimentation, product thinking, and data science The post A Product Data Scientist’s Take on LinkedIn Games After 500 Days of Play appeared first on Towards Data Science. Yu Dong

December 6, 2025
Bootstrap a Data Lakehouse in an Afternoon

Bootstrap a Data Lakehouse in an Afternoon Using Apache Iceberg on AWS with Athena, Glue/Spark and DuckDB The post Bootstrap a Data Lakehouse in an Afternoon appeared first on Towards Data Science. Thomas Reid

December 5, 2025
The Best Data Scientists are Always Learning

The Best Data Scientists are Always Learning Why continuous learning matters & how to come up with topics to study The post The Best Data Scientists are Always Learning appeared first on Towards Data Science. Jarom Hulet

December 5, 2025
Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch

Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch PyTorch Model Performance Analysis and Optimization — Part 11 The post Overcoming the Hidden Performance Traps of Variable-Shaped Tensors: Efficient Data Sampling in PyTorch appeared first on Towards Data Science. Chaim Rand Go to original source

December 4, 2025
How to Use Simple Data Contracts in Python for Data Scientists

How to Use Simple Data Contracts in Python for Data Scientists Stop your pipelines from breaking on Friday afternoons using simple, open-source validation with Pandera. The post How to Use Simple Data Contracts in Python for Data Scientists appeared first on Towards Data Science. Eirik Berge

December 3, 2025
DAISI: Data Assimilation with Inverse Sampling using Stochastic Interpolants

DAISI: Data Assimilation with Inverse Sampling using Stochastic Interpolants arXiv:2512.00252v1 Announce Type: new Abstract: Data assimilation (DA) is a cornerstone of scientific and engineering applications, combining model forecasts with sparse and noisy observations to estimate latent system states. Classical DA methods, such as the ensemble Kalman filter, rely on Gaussian approximations and heuristic tuning (e.g.,…

December 2, 2025
MSE-DS or OMSCS?

MSE-DS or OMSCS? I’ve gotten a lot of mixed responses about this on other subreddits, so I wanted to ask here. I was recently accepted to UPenn’s online part-time MSE-DS program. I graduated from college this past May from a top 20 school with a degree in data science. To be honest, I originally applied…

December 1, 2025
Data Science in 2026: Is It Still Worth It?

Data Science in 2026: Is It Still Worth It? An honest view from a 10-year AI Engineer The post Data Science in 2026: Is It Still Worth It? appeared first on Towards Data Science. Sabrine Bendimerad

November 29, 2025
Prequential posteriors

Prequential posteriors arXiv:2511.17721v1 Announce Type: new Abstract: Data assimilation is a fundamental task in updating forecasting models upon observing new data, with applications ranging from weather prediction to online reinforcement learning. Deep generative forecasting models (DGFMs) have shown excellent performance in these areas, but assimilating data into such models is challenging due to their intractable…

November 25, 2025
Struggling with Data Science? 5 Common Beginner Mistakes

Struggling with Data Science? 5 Common Beginner Mistakes Avoid these mistakes to fast track your data science career. The post Struggling with Data Science? 5 Common Beginner Mistakes appeared first on Towards Data Science. Egor Howell

November 25, 2025
Indeed’s Job Report Shows 13% YoY Drop in Data & Analytics Roles

Indeed’s Job Report Shows 13% YoY Drop in Data & Analytics Roles “Roles like business analyst, data analyst, data scientist, and BI developer are drawing large talent pools that outpace the number of job postings, creating a fiercely competitive market.” Do you agree with these findings – are data & analytics roles the hardest-hit in…

November 24, 2025
Natural Language Visualization and the Future of Data Analysis and Presentation

Natural Language Visualization and the Future of Data Analysis and Presentation Will conversational interaction replace SQL queries, KPI reports, and dashboards? The post Natural Language Visualization and the Future of Data Analysis and Presentation appeared first on Towards Data Science. Michal Szudejko

November 22, 2025
TDS Newsletter: How to Build Robust Data and AI Systems

TDS Newsletter: How to Build Robust Data and AI Systems Many practitioners like to jump headfirst into the nitty-gritty details of implementing AI-powered tools. We get it: tinkering your way into a solution can sometimes save you time, and it’s often a fun way to go about learning. As the articles we’re highlighting this week show,…

November 22, 2025
Data Visualization Explained (Part 5): Visualizing Time-Series Data in Python (Matplotlib, Plotly, and Altair)

Data Visualization Explained (Part 5): Visualizing Time-Series Data in Python (Matplotlib, Plotly, and Altair) An explanation of time-series visualization, including in-depth code examples in Matplotlib, Plotly, and Altair. The post Data Visualization Explained (Part 5): Visualizing Time-Series Data in Python (Matplotlib, Plotly, and Altair) appeared first on Towards Data Science. Murtaza Ali Go to original…

November 21, 2025
Latent space analysis and generalization to out-of-distribution data

Latent space analysis and generalization to out-of-distribution data arXiv:2511.15010v1 Announce Type: new Abstract: Understanding the relationships between data points in the latent decision space derived by the deep learning system is critical to evaluating and interpreting the performance of the system on real world data. Detecting out-of-distribution (OOD) data for deep learning systems continues to…

November 20, 2025
Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data

Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data arXiv:2511.10919v1 Announce Type: new Abstract: Positive-Unlabeled (PU) learning presents unique challenges due to the lack of explicitly labeled negative samples, particularly in high-stakes domains such as fraud detection and medical diagnosis. To address data scarcity and privacy constraints, we propose a novel transfer learning with…

November 17, 2025
Where to Go After Data Science: Unconventional / Weird Exits?

Where to Go After Data Science: Unconventional / Weird Exits? Data science careers often feel like they funnel into the same few paths—FAANG, ML/AI engineering, or analytics leadership—but people actually branch into wildly unexpected directions. I’m curious about those off-the-beaten-path exits: roles in unexpected industries, analytics-adjacent pivots, international moves, or entirely new ventures. Would love…

November 17, 2025
Masked Mineral Modeling: Continent-Scale Mineral Prospecting via Geospatial Infilling

Masked Mineral Modeling: Continent-Scale Mineral Prospecting via Geospatial Infilling arXiv:2511.09722v1 Announce Type: new Abstract: Minerals play a critical role in the advanced energy technologies necessary for decarbonization, but characterizing mineral deposits hidden underground remains costly and challenging. Inspired by recent progress in generative modeling, we develop a learning method which infers the locations of minerals…

November 14, 2025
The Three Ages of Data Science: When to Use Traditional Machine Learning, Deep Learning, or an LLM (Explained with One Example)

The Three Ages of Data Science: When to Use Traditional Machine Learning, Deep Learning, or an LLM (Explained with One Example) A practical use case to describe how the data scientist job changed across three generations of machine learning The post The Three Ages of Data Science: When to Use Traditional Machine Learning, Deep Learning,…

November 12, 2025
Why Storytelling With Data Matters for Business and Data Analysts

Why Storytelling With Data Matters for Business and Data Analysts Data is driving the future of business and here’s how you can be prepared for that future The post Why Storytelling With Data Matters for Business and Data Analysts appeared first on Towards Data Science. Rashi Desai

November 11, 2025
Does More Data Always Yield Better Performance?

Does More Data Always Yield Better Performance? Exploring and challenging the conventional wisdom of “more data → better performance” by experimenting with the interactions between sample size, attribute set, and model complexity. The post Does More Data Always Yield Better Performance? appeared first on Towards Data Science. Mohannad Elhamod

November 11, 2025
Data Culture Is the Symptom, Not the Solution

Data Culture Is the Symptom, Not the Solution The hidden reason your data investments fail The post Data Culture Is the Symptom, Not the Solution appeared first on Towards Data Science. Jens Linden

November 11, 2025
Prototype Selection Using Topological Data Analysis

Prototype Selection Using Topological Data Analysis arXiv:2511.04873v1 Announce Type: new Abstract: Recently, there has been an explosion in statistical learning literature to represent data using topological principles to capture structure and relationships. We propose a topological data analysis (TDA)-based framework, named Topological Prototype Selector (TPS), for selecting representative subsets (prototypes) from large datasets. We demonstrate…

November 10, 2025
A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights

A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights arXiv:2511.05159v1 Announce Type: new Abstract: Convex clustering is a well-regarded clustering method, resembling the similar centroid-based approach of Lloyd’s $k$-means, without requiring a predefined cluster count. It starts with each data point as its centroid and iteratively merges them.…

November 10, 2025