Category: data-engineering

  • The Data Team’s Survival Guide for the Next Era of Data

    6 pillars to declutter your stack, escape the service trap, and build the missing foundations for the new primary data consumer: the AI agent. The post The Data Team’s Survival Guide for the Next Era of Data appeared first on Towards Data Science. Mahdi…

  • 5 Ways to Implement Variable Discretization

    An overview of powerful methods for transforming continuous variables into discrete ones. By Rukshan Pramoditha.

  • Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

    A case study on techniques to maximize your clusters. By Hector Mejia.

  • PySpark for Pandas Users

    Common Pandas operations and their equivalents in PySpark. By Thomas Reid.

  • From Monolith to Contract-Driven Data Mesh

    A pragmatic journey using website analytics as a real-world example. By Corné POTGIETER.

  • Why Every Analytics Engineer Needs to Understand Data Architecture

    Get the data architecture right, and everything else becomes easier. I know it sounds simple, but in reality, little nuances in designing your data architecture may have costly implications. This article provides a crash course on the architectures that shape your daily decisions – from relational…

  • Pydantic Performance: 4 Tips on How to Validate Large Amounts of Data Efficiently

    The real value lies in writing clearer code and using your tools right. By Mike Huls.

  • Creating a Data Pipeline to Monitor Local Crime Trends

    A walkthrough of creating an ETL pipeline to extract local crime data and visualize it in Metabase. By Jimin Kang.

  • Layered Architecture for Building Readable, Robust, and Extensible Apps

    If adding a feature feels like open-heart surgery on your codebase, the problem isn’t bugs, it’s structure. This article shows how better architecture reduces risk, speeds up change, and keeps teams moving.

  • Optimizing Data Transfer in Distributed AI/ML Training Workloads

    A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 3. By Chaim Rand.

  • The Great Data Closure: Why Databricks and Snowflake Are Hitting Their Ceiling

    Acquisitions, venture, and an increasingly competitive landscape all point to a market ceiling. By Hugo Lu.

  • From ‘Dataslows’ to Dataflows: The Gen2 Performance Revolution in Microsoft Fabric

    Dataflows were (rightly?) considered “the slowest and least performant option” for ingesting data into Power BI/Microsoft Fabric. However, things are changing rapidly, and the latest Dataflow enhancements change how we play the game.

  • Optimizing Data Transfer in Batched AI/ML Inference Workloads

    A deep dive on data transfer bottlenecks, their identification, and their resolution with the help of NVIDIA Nsight™ Systems – part 2. By Chaim Rand.

  • Faster Is Not Always Better: Choosing the Right PostgreSQL Insert Strategy in Python (+Benchmarks)

    PostgreSQL is fast. Whether your Python code can or should keep up depends on context. This article compares and benchmarks various insert strategies, focusing not on micro-benchmarks but on trade-offs between safety, abstraction, and throughput, and choosing the right tool…

  • How to Build an AI-Powered Weather ETL Pipeline with Databricks and GPT-4o: From API To Dashboard

    A step-by-step guide from weather API ETL to dashboard on Databricks. By Gustavo Santos.

  • Geospatial exploratory data analysis with GeoPandas and DuckDB

    In this article, I’ll show you how to use two popular Python libraries to carry out some geospatial analysis of traffic accident data within the UK. I was a relatively early adopter of DuckDB, the fast OLAP database, after it became available, but only recently realised that, through…

  • Stop Writing Spaghetti if-else Chains: Parsing JSON with Python’s match-case

    If you work in data science, data engineering, or as a frontend/backend developer, you deal with JSON. For professionals, it’s basically only death, taxes, and JSON parsing that are inevitable. The issue is that parsing JSON is often a serious pain. Whether you are…

  • Bootstrap a Data Lakehouse in an Afternoon

    Using Apache Iceberg on AWS with Athena, Glue/Spark and DuckDB. By Thomas Reid.

  • JSON Parsing for Large Payloads: Balancing Speed, Memory, and Scalability

    Benchmarking JSON libraries for large payloads. By Subha Ganapathi.

  • Critical Mistakes Companies Make When Integrating AI/ML into Their Processes

    What I’ve learned leading AI teams across industries. By Andrey Chubin.

  • Building a Geospatial Lakehouse with Open Source and Databricks

    An example workflow for vector geospatial data science. By Robert Constable.

  • 10 Data + AI Observations for Fall 2025

    What’s happening, and what’s next, for data and AI at the close of 2025. By Barr Moses.

  • Past is Prologue: How Conversational Analytics Is Changing Data Work

    The future of reporting will be about encoding the value proposition of a product into prompt design. By Whitney Marks.

  • The Beauty of Space-Filling Curves: Understanding the Hilbert Curve

    A quick journey from theory to implementation and application. By Paul Fröhling.

  • The Generalist: The New All-Around Type of Data Professional?

    Is over-specialization ending and are data generalists on the rise? By Loizos Loizou.

  • Everything You Need to Know About the New Power BI Storage Mode

    50 Shades of Direct Lake. By Nikola Ilic.

  • Data Mesh Diaries: Realities from Early Adopters

    Early-adopter realities gathered from real data mesh implementations. By Corné POTGIETER.

  • Change-Aware Data Validation with Column-Level Lineage

    Data transformation tools like dbt make constructing SQL data pipelines easy and systematic. But even with the added structure and clearly defined data models, pipelines can still become complex, which makes debugging issues and validating changes to data models difficult.

  • The Mythical Pivot Point from Buy to Build for Data Platforms

    For companies with data-intensive architectures, there often comes a pivotal point where building in-house data platforms makes more sense than buying off-the-shelf solutions. By Ming Gao…

  • From Configuration to Orchestration: Building an ETL Workflow with AWS Is No Longer a Struggle

    A step-by-step guide to leveraging AWS services for efficient data pipeline automation. By Jiayan Yin.

  • How to Reduce Your Power BI Model Size by 90%

    Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? Learn from a real-life example about data model optimization and general rules for reducing data model size.

  • The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated

    The saying goes that 80% of data collected, stored and maintained by governments can be associated with geographical locations. Although never empirically proven, it illustrates the importance of location within data. Ever-growing data volumes put constraints on systems that handle geospatial data. Common Big…

  • Efficient Graph Storage for Entity Resolution Using Clique-Based Compression

    In the world of entity resolution (ER), one of the central challenges is managing and maintaining the complex relationships between records. At its core, Tilores models entities as graphs: each node represents a record, and edges represent rule-based matches between those records. This approach gives us…

  • Parquet File Format – Everything You Need to Know!

    With the amount of data growing exponentially in the last few years, one of the biggest challenges has become finding the optimal way to store various data flavors. Unlike in the (not so far) past, when relational databases were considered the only way to go,…

  • Running Python Programs in Your Browser

    In recent years, WebAssembly (often abbreviated as WASM) has emerged as an interesting technology that extends web browsers’ capabilities far beyond the traditional realms of HTML, CSS, and JavaScript. As a Python developer, one particularly exciting application is the ability to run Python code directly in the browser. In this…

  • The Shape‑First Tune‑Up Provides Organizations with a Means to Reduce MongoDB Expenses by 79%

    TL;DR: A fast‑growing SaaS woke up to a silent auto‑scale from M20 → M60, adding 20% to their cloud bill overnight. In a frantic 48‑hour sprint we: flattened N + 1 waterfalls with $lookup, tamed unbounded cursors with projection,…

  • Data Analyst or Data Engineer or Analytics Engineer or BI Engineer ?

    If you’ve followed me for a while, you probably know I started my career as a QA engineer before transitioning into the world of data analytics. I didn’t go to school for it, didn’t have a mentor, and didn’t land in a formal training…

  • NumExpr: The “Faster than Numpy” Library Most Data Scientists Have Never Used

    Browsing GitHub the other day, I came across a library I’d never heard of before. It was called NumExpr. I was immediately interested because of some claims made about the library. In particular, it stated that for some complex numerical calculations, it was…

  • AWS: Deploying a FastAPI App on EC2 in Minutes

    AWS is a popular cloud provider that enables the deployment and scaling of large applications. Mastering at least one cloud platform is an essential skill for software engineers and data scientists. Running an application locally is not enough to make it usable in production: it…

  • Exporting MLflow Experiments from Restricted HPC Systems

    Many High-Performance Computing (HPC) environments, especially in research and educational institutions, restrict communications to outbound TCP connections. Running a simple command-line ping or curl with the MLflow tracking URL on the HPC bash shell to check packet transfer can be successful. However, communication fails and times out while…

  • MapReduce: How It Powers Scalable Data Processing

    In this article, I’ll give a brief introduction to the MapReduce programming model. Hopefully after reading this, you leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational…

  • Beginner’s Guide to Creating a S3 Storage on AWS

    AWS is a well-known cloud provider whose primary goal is to allocate server resources for software engineers to deploy their applications. AWS offers many services, one of which is EC2, providing virtual machines for running software applications in the cloud. However, for data-intensive applications, storing…

  • A Little More Conversation, A Little Less Action — A Case Against Premature Data Integration

    When I talk to [large] organisations that have not yet properly started with Data Science (DS) and Machine Learning (ML), they often tell me that they have to run a data integration project first, because “…all the data is scattered…

  • Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

    As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and…

  • Forget About Cloud Computing. On-Premises Is All the Rage Again

    Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins. The tides are turning though. As much…

  • Anatomy of a Parquet File

    In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: faster query execution when only a subset of columns is being processed; quick calculation of statistics across all data; and reduced storage volume thanks to efficient compression. When combined…

  • Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

    Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components: HDFS for storage, MapReduce for processing,…

  • 7 Powerful DBeaver Tips and Tricks to Improve Your SQL Workflow

    DBeaver is the most powerful open-source SQL IDE, but there are several features people don’t know about. In this post, I will share with you several features to speed up your workflow, with zero fluff. I’ve learned these as I’m currently digging deeper into…

  • Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies

    Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become…

  • Kubernetes — Understanding and Utilizing Probes Effectively

    Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits. Reducing deployment times, making your applications react better to scaling events, and managing the health of running pods all require fine-tuning your container…

  • Practical SQL Puzzles That Will Level Up Your Skill

    There are some SQL patterns that, once you know them, you start seeing everywhere. The solutions to the puzzles that I will show you today are actually very simple SQL queries, but understanding the concept behind them will surely unlock new solutions to the queries…

  • Don’t Let Conda Eat Your Hard Drive

    If you’re an Anaconda user, you know that conda environments help you manage package dependencies, avoid compatibility conflicts, and share your projects with others. Unfortunately, they can also take over your computer’s hard drive. I write lots of computer tutorials and to keep them organized, each has a dedicated folder…

  • Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge

    “I train models, analyze data and create dashboards. Why should I care about Containers?” Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on…

  • Building a Data Engineering Center of Excellence

    As data continues to grow in importance and become more complex, the need for skilled data engineers has never been greater. But what is data engineering, and why is it so important? In this blog post, we will discuss the essential components of a functioning data engineering practice…

  • Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

    Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1 GB if you have a large amount of RAM. Within these size limits, it is also…

  • ML Feature Management: A Practical Evolution Guide

    In the world of machine learning, we obsess over model architectures, training pipelines, and hyper-parameter tuning, yet often overlook a fundamental aspect: how our features live and breathe throughout their lifecycle. From in-memory calculations that vanish after each prediction to the challenge of reproducing exact feature values months…

  • Stop Creating Bad DAGs — Optimize Your Airflow Environment By Improving Your Python Code

    Valuable tips to reduce your DAGs’ parse time and save resources. Apache Airflow is one of the most popular orchestration tools in the data field, powering workflows…

  • Data Pruning MNIST: How I Hit 99% Accuracy Using Half the Data

    How much data does AI really need? TLDR: Data-centric AI can create more efficient and accurate models. I experimented with data pruning on MNIST to classify handwritten digits. What if I told you…

  • Optimize the dbt Doc Function with a CI

    How to set up an automated check to improve your dbt documentation. In large dbt projects, maintaining consistent and up-to-date documentation can be a challenge. Although dbt’s {{ doc() }} function allows you to store and reuse descriptions for the columns of…

  • Modern Data And Application Engineering Breaks the Loss of Business Context

    Here’s how your data retains its business relevance as it travels through your enterprise. By Bernd Wessely.

  • The Concepts Data Professionals Should Know in 2025: Part 1

    From Data Lakehouses to Event-Driven Architecture: master 12 data concepts and turn them into simple projects to stay ahead in IT. By Sarah Lea.

  • How to Pick Between Data Science, Data Analytics, Data Engineering, ML Engineering, and SW…

    Make the right choice for YOU. By Marina Wyss – Gratitude Driven.

  • Top 3 Questions to Ask in Near Real-Time Data Solutions

    Questions that guide architectural decisions to balance functional requirements with non-functional ones, like latency and scalability. By Shawn Shi.

  • The Data Analyst Every CEO Wants

    Data Analyst is probably the most underrated job in the data industry. By Benoit Pimpaud.

  • Who Does What in Data? A Practical Introduction to the Role of a Data Engineer & Data Scientist

    What does a data engineer do differently to a data scientist? By Sarah Lea.

  • Query Optimization for Mere Humans in PostgreSQL

    Understanding a PostgreSQL execution plan with practical examples. Today, users have high expectations for the programs they use. Users expect programs to have amazing features, to be fast, and to consume a reasonable amount of resources. As developers,…