Category: big-data

Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

A case study on techniques to maximize your clusters. First appeared on Towards Data Science. By Hector Mejia.

March 1, 2026
Modern DataFrames in Python: A Hands-On Tutorial with Polars and DuckDB

How I learned to handle growing datasets without slowing down my entire workflow. First appeared on Towards Data Science. By Benjamin Nweke.

November 22, 2025
NumPy for Absolute Beginners: A Project-Based Approach to Data Analysis

Build a high-performance sensor data pipeline from scratch and unlock the true speed of Python’s scientific computing core. First appeared on Towards Data Science. By Ibrahim Salami.

November 5, 2025
The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix

Retraining is easy; knowing when not to is the real challenge. In machine learning, performance drops are rarely about stale weights; they’re about misunderstood signals. First appeared on Towards Data Science.…

July 31, 2025
The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated

The saying goes that 80% of data collected, stored and maintained by governments can be associated with geographical locations. Although never empirically proven, it illustrates the importance of location within data. Ever-growing data volumes put constraints on systems that handle geospatial data. Common Big…

May 15, 2025
MapReduce: How It Powers Scalable Data Processing

In this article, I’ll give a brief introduction to the MapReduce programming model. Hopefully, after reading this, you will leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational…

April 23, 2025
Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and…

March 15, 2025
Anatomy of a Parquet File

In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: faster query execution when only a subset of columns is being processed; quick calculation of statistics across all data; and reduced storage volume thanks to efficient compression. When combined…

March 14, 2025
Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components: HDFS for storage, MapReduce for processing,…

March 14, 2025
Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies

Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become…

March 12, 2025
Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1GB if you have plenty of RAM. Within these size limits, it is also…

February 13, 2025
The Concepts Data Professionals Should Know in 2025: Part 1

From data lakehouses to event-driven architecture: master 12 data concepts and turn them into simple projects to stay ahead in IT. Published on Towards Data Science. By Sarah Lea.

January 20, 2025
Scaling Statistics: Incremental Standard Deviation in SQL with dbt

Why scan yesterday’s data when you can increment today’s? SQL aggregation functions can be computationally expensive when applied to large datasets. As datasets grow, recalculating metrics over the entire dataset repeatedly becomes inefficient. To address this challenge, incremental aggregation is often employed, a method…

January 2, 2025