Category: data-engineering

The Data Team’s Survival Guide for the Next Era of Data

6 pillars to declutter your stack, escape the service trap, and build the missing foundations for the new primary data consumer: the AI agent. The post The Data Team’s Survival Guide for the Next Era of Data appeared first on Towards Data Science. Mahdi…

March 7, 2026
5 Ways to Implement Variable Discretization

An overview of powerful methods for transforming continuous variables into discrete ones. The post 5 Ways to Implement Variable Discretization appeared first on Towards Data Science. Rukshan Pramoditha

March 5, 2026
Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

A case study on techniques to maximize your clusters. The post Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not? appeared first on Towards Data Science. Hector Mejia

March 1, 2026
PySpark for Pandas Users

Common Pandas operations and their equivalents in PySpark. The post PySpark for Pandas Users appeared first on Towards Data Science. Thomas Reid

February 24, 2026
From Monolith to Contract-Driven Data Mesh

A pragmatic journey using website analytics as a real-world example. The post From Monolith to Contract-Driven Data Mesh appeared first on Towards Data Science. Corné POTGIETER

February 21, 2026
Why Every Analytics Engineer Needs to Understand Data Architecture

Get the data architecture right, and everything else becomes easier. I know it sounds simple, but in reality, little nuances in designing your data architecture may have costly implications. This article provides a crash course on the architectures that shape your daily decisions – from relational…

February 19, 2026
Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently

The real value lies in writing clearer code and using your tools right. The post Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently appeared first on Towards Data Science. Mike Huls

February 7, 2026
Creating a Data Pipeline to Monitor Local Crime Trends

A walkthrough of creating an ETL pipeline to extract local crime data and visualize it in Metabase. The post Creating a Data Pipeline to Monitor Local Crime Trends appeared first on Towards Data Science. Jimin Kang

February 4, 2026
Layered Architecture for Building Readable, Robust, and Extensible Apps

If adding a feature feels like open-heart surgery on your codebase, the problem isn’t bugs, it’s structure. This article shows how better architecture reduces risk, speeds up change, and keeps teams moving. The post Layered Architecture for Building Readable, Robust, and Extensible Apps appeared first on…

January 28, 2026
Optimizing Data Transfer in Distributed AI/ML Training Workloads

A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 3. The post Optimizing Data Transfer in Distributed AI/ML Training Workloads appeared first on Towards Data Science. Chaim Rand

January 24, 2026
Why Package Installs Are Slow (And How to Fix It)

How sharded indexing patterns solve a scaling problem in package management. The post Why Package Installs Are Slow (And How to Fix It) appeared first on Towards Data Science. Dan Yeaw

January 21, 2026
The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling

Acquisitions, venture, and an increasingly competitive landscape all point to a market ceiling. The post The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling appeared first on Towards Data Science. Hugo Lu

January 17, 2026
From ‘Dataslows’ to Dataflows: The Gen2 Performance Revolution in Microsoft Fabric

Dataflows were (rightly?) considered “the slowest and least performant option” for ingesting data into Power BI/Microsoft Fabric. However, things are changing rapidly, and the latest Dataflow enhancements change how we play the game. The post From ‘Dataslows’ to Dataflows: The Gen2 Performance Revolution in…

January 14, 2026
Optimizing Data Transfer in Batched AI/ML Inference Workloads

A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 2. The post Optimizing Data Transfer in Batched AI/ML Inference Workloads appeared first on Towards Data Science. Chaim Rand

January 13, 2026
Faster Is Not Always Better: Choosing the Right PostgreSQL Insert Strategy in Python (+Benchmarks)

PostgreSQL is fast. Whether your Python code can or should keep up depends on context. This article compares and benchmarks various insert strategies, focusing not on micro-benchmarks but on trade-offs between safety, abstraction, and throughput — and choosing the right tool…

January 9, 2026
How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard

A step-by-step guide from weather API ETL to dashboard on Databricks. The post How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard appeared first on Towards Data Science. Gustavo Santos

December 27, 2025
Geospatial exploratory data analysis with GeoPandas and DuckDB

In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through…

December 16, 2025
Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case

If you work in data science, data engineering, or as a frontend/backend developer, you deal with JSON. For professionals, it’s basically only death, taxes, and JSON parsing that are inevitable. The issue is that parsing JSON is often a serious pain. Whether you are…

December 15, 2025
Bootstrap a Data Lakehouse in an Afternoon

Using Apache Iceberg on AWS with Athena, Glue/Spark and DuckDB. The post Bootstrap a Data Lakehouse in an Afternoon appeared first on Towards Data Science. Thomas Reid

December 5, 2025
JSON Parsing for Large Payloads: Balancing Speed, Memory, and Scalability

Benchmarking JSON libraries for large payloads. The post JSON Parsing for Large Payloads: Balancing Speed, Memory, and Scalability appeared first on Towards Data Science. Subha Ganapathi

December 3, 2025
Critical Mistakes Companies Make When Integrating AI/ML into Their Processes

What I’ve learned leading AI teams across industries. The post Critical Mistakes Companies Make When Integrating AI/ML into Their Processes appeared first on Towards Data Science. Andrey Chubin

November 15, 2025
Building a Geospatial Lakehouse with Open Source and Databricks

An example workflow for vector geospatial data science. The post Building a Geospatial Lakehouse with Open Source and Databricks appeared first on Towards Data Science. Robert Constable

October 26, 2025
10 Data + AI Observations for Fall 2025

What’s happening, and what’s next, for data and AI at the close of 2025. The post 10 Data + AI Observations for Fall 2025 appeared first on Towards Data Science. Barr Moses

October 11, 2025
Past is Prologue: How Conversational Analytics Is Changing Data Work

The future of reporting will be about encoding the value proposition of a product into prompt design. The post Past is Prologue: How Conversational Analytics Is Changing Data Work appeared first on Towards Data Science. Whitney Marks

October 10, 2025
The Beauty of Space-Filling Curves: Understanding the Hilbert Curve

A quick journey from theory to implementation and application. The post The Beauty of Space-Filling Curves: Understanding the Hilbert Curve appeared first on Towards Data Science. Paul Fröhling

September 8, 2025
The Generalist: The New All-Around Type of Data Professional?

Is over-specialization ending and are data generalists on the rise? The post The Generalist: The New All-Around Type of Data Professional? appeared first on Towards Data Science. Loizos Loizou

September 2, 2025
Everything You Need to Know About the New Power BI Storage Mode

50 Shades of Direct Lake. The post Everything You Need to Know About the New Power BI Storage Mode appeared first on Towards Data Science. Nikola Ilic

August 21, 2025
Data Mesh Diaries: Realities from Early Adopters

Early-adopter realities gathered from real data mesh implementations. The post Data Mesh Diaries: Realities from Early Adopters appeared first on Towards Data Science. Corné POTGIETER

August 14, 2025
Change-Aware Data Validation with Column-Level Lineage

Data transformation tools like dbt make constructing SQL data pipelines easy and systematic. But even with the added structure and clearly defined data models, pipelines can still become complex, which makes debugging issues and validating changes to data models difficult. The post Change-Aware Data Validation with Column-Level Lineage appeared…

July 5, 2025
The Mythical Pivot Point from Buy to Build for Data Platforms

For companies with data-intensive architectures, there often comes a pivotal point where building in-house data platforms makes more sense than buying off-the-shelf solutions. The post The Mythical Pivot Point from Buy to Build for Data Platforms appeared first on Towards Data Science. Ming Gao…

June 27, 2025
From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

A step-by-step guide to leverage AWS services for efficient data pipeline automation. The post From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle appeared first on Towards Data Science. Jiayan Yin

June 20, 2025
How to Reduce Your Power BI Model Size by 90%

Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? Learn from a real-life example about data model optimization and general rules for reducing the data model size. The post How to Reduce Your Power BI Model Size by 90%…

May 27, 2025
The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated

The saying goes that 80% of data collected, stored and maintained by governments can be associated with geographical locations. Although never empirically proven, it illustrates the importance of location within data. Ever-growing data volumes put constraints on systems that handle geospatial data. Common Big…

May 15, 2025
Efficient Graph Storage for Entity Resolution Using Clique-Based Compression

In the world of entity resolution (ER), one of the central challenges is managing and maintaining the complex relationships between records. At its core, Tilores models entities as graphs: each node represents a record, and edges represent rule-based matches between those records. This approach gives us…

May 15, 2025
Parquet File Format – Everything You Need to Know!

With the amount of data growing exponentially in the last few years, one of the biggest challenges has become finding the optimal way to store various data flavors. Unlike in the (not so far) past, when relational databases were considered the only way to go,…

May 15, 2025
Running Python Programs in Your Browser

In recent years, WebAssembly (often abbreviated as WASM) has emerged as an interesting technology that extends web browsers’ capabilities far beyond the traditional realms of HTML, CSS, and JavaScript. As a Python developer, one particularly exciting application is the ability to run Python code directly in the browser. In this…

May 13, 2025
The Shape‑First Tune‑Up Provides Organizations with a Means to Reduce MongoDB Expenses by 79%

TL;DR: A fast‑growing SaaS woke up to a silent auto‑scale from M20 → M60, adding 20% to their cloud bill overnight. In a frantic 48‑hour sprint we: flattened N + 1 waterfalls with $lookup, tamed unbounded cursors with projection,…

May 3, 2025
Data Analyst or Data Engineer or Analytics Engineer or BI Engineer?

If you’ve followed me for a while, you probably know I started my career as a QA engineer before transitioning into the world of data analytics. I didn’t go to school for it, didn’t have a mentor, and didn’t land in a formal training…

April 30, 2025
NumExpr: The “Faster than Numpy” Library Most Data Scientists Have Never Used

Browsing GitHub the other day, I came across a library I’d never heard of before. It was called NumExpr. I was immediately interested because of some claims made about the library. In particular, it stated that for some complex numerical calculations, it was…

April 29, 2025
AWS: Deploying a FastAPI App on EC2 in Minutes

AWS is a popular cloud provider that enables the deployment and scaling of large applications. Mastering at least one cloud platform is an essential skill for software engineers and data scientists. Running an application locally is not enough to make it usable in production — it…

April 25, 2025
Exporting MLflow Experiments from Restricted HPC Systems

Many High-Performance Computing (HPC) environments, especially in research and educational institutions, restrict communications to outbound TCP connections. Running a simple command-line ping or curl with the MLflow tracking URL on the HPC bash shell to check packet transfer can be successful. However, communication fails and times out while…

April 24, 2025
MapReduce: How It Powers Scalable Data Processing

In this article, I’ll give a brief introduction to the MapReduce programming model. Hopefully after reading this, you leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational…

April 23, 2025
Beginner’s Guide to Creating a S3 Storage on AWS

AWS is a well-known cloud provider whose primary goal is to allocate server resources for software engineers to deploy their applications. AWS offers many services, one of which is EC2, providing virtual machines for running software applications in the cloud. However, for data-intensive applications, storing…

April 22, 2025
A Little More Conversation, A Little Less Action — A Case Against Premature Data Integration

When I talk to [large] organisations that have not yet properly started with Data Science (DS) and Machine Learning (ML), they often tell me that they have to run a data integration project first, because “…all the data is scattered…

March 29, 2025
Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and…

March 15, 2025
Forget About Cloud Computing. On-Premises Is All the Rage Again

Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins. The tides are turning though. As much…

March 15, 2025
Anatomy of a Parquet File

In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: faster query execution when only a subset of columns is being processed; quick calculation of statistics across all data; and reduced storage volume thanks to efficient compression. When combined…

March 14, 2025
Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components — HDFS for storage, MapReduce for processing,…

March 14, 2025
7 Powerful DBeaver Tips and Tricks to Improve Your SQL Workflow

DBeaver is the most powerful open-source SQL IDE, but there are several features people don’t know about. In this post, I will share with you several features to speed up your workflow, with zero fluff. I’ve learned these as I’m currently digging deeper into…

March 12, 2025
Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies

Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become…

March 12, 2025
Kubernetes — Understanding and Utilizing Probes Effectively

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits. Reducing deployment times, making your applications react better to scaling events, and managing the health of running pods all require fine-tuning your container…

March 6, 2025
Practical SQL Puzzles That Will Level Up Your Skill

There are some SQL patterns that, once you know them, you start seeing everywhere. The solutions to the puzzles that I will show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries…

March 5, 2025
Don’t Let Conda Eat Your Hard Drive

If you’re an Anaconda user, you know that conda environments help you manage package dependencies, avoid compatibility conflicts, and share your projects with others. Unfortunately, they can also take over your computer’s hard drive. I write lots of computer tutorials and to keep them organized, each has a dedicated folder…

February 21, 2025
Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge

“I train models, analyze data and create dashboards — why should I care about Containers?” Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on…

February 20, 2025
Building a Data Engineering Center of Excellence

As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice…

February 14, 2025
Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1GB if you have plenty of RAM. Within these size limits, it is also…

February 13, 2025
ML Feature Management: A Practical Evolution Guide

In the world of machine learning, we obsess over model architectures, training pipelines, and hyper-parameter tuning, yet often overlook a fundamental aspect: how our features live and breathe throughout their lifecycle. From in-memory calculations that vanish after each prediction to the challenge of reproducing exact feature values months…

February 5, 2025
Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code

Valuable tips to reduce your DAGs’ parse time and save resources. Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows…

January 31, 2025
Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

How much data does AI really need? TLDR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. What if I told you…

January 31, 2025
Optimize the dbt Doc Function with a CI

How to set up an automated check to improve your dbt documentation. In large dbt projects, maintaining consistent and up-to-date documentation can be a challenge. Although dbt’s {{ doc() }} function allows you to store and reuse descriptions for the columns of…

January 23, 2025
Modern Data And Application Engineering Breaks the Loss of Business Context

Here’s how your data retains its business relevance as it travels through your enterprise. Continue reading on Towards Data Science » Bernd Wessely

January 21, 2025
The Concepts Data Professionals Should Know in 2025: Part 1

From Data Lakehouses to Event-Driven Architecture — Master 12 data concepts and turn them into simple projects to stay ahead in IT. Continue reading on Towards Data Science » Sarah Lea

January 20, 2025
How to Pick Between Data Science, Data Analytics, Data Engineering, ML Engineering, and SW…

Make the right choice for YOU. Continue reading on Towards Data Science » Marina Wyss – Gratitude Driven

January 20, 2025
Top 3 Questions to Ask in Near Real-Time Data Solutions

Questions that guide architectural decisions to balance functional requirements with non-functional ones, like latency and scalability. Continue reading on Towards Data Science » Shawn Shi

January 17, 2025
The Data Analyst Every CEO Wants

Data Analyst is probably the most underrated job in the data industry. Continue reading on Towards Data Science » Benoit Pimpaud

January 17, 2025
Who Does What in Data? A Practical Introduction to the Role of a Data Engineer & Data Scientist

What does a data engineer do differently to a data scientist? Continue reading on Towards Data Science » Sarah Lea

December 6, 2024
Query Optimization for Mere Humans in PostgreSQL

Understanding a PostgreSQL execution plan with practical examples. Today, users have high expectations for the programs they use. Users expect programs to have amazing features, to be fast, and to consume a reasonable amount of resources. As developers,…

December 4, 2024