Category: big-data

  • Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

    A case study on techniques to maximize your clusters. Originally published on Towards Data Science by Hector Mejia.

  • Modern DataFrames in Python: A Hands-On Tutorial with Polars and DuckDB

    How I learned to handle growing datasets without slowing down my entire workflow. Originally published on Towards Data Science by Benjamin Nweke.

  • NumPy for Absolute Beginners: A Project-Based Approach to Data Analysis

    Build a high-performance sensor data pipeline from scratch and unlock the true speed of Python’s scientific computing core. Originally published on Towards Data Science by Ibrahim Salami.

  • The Misconception of Retraining: Why Model Refresh Isn’t Always the Fix

    Retraining is easy; knowing when not to is the real challenge. In machine learning, performance drops are rarely about stale weights; they’re about misunderstood signals. Originally published on Towards Data Science.…

  • The Geospatial Capabilities of Microsoft Fabric and ESRI GeoAnalytics, Demonstrated

    The saying goes that 80% of data collected, stored, and maintained by governments can be associated with geographical locations. Although never empirically proven, it illustrates the importance of location within data. Ever-growing data volumes put constraints on systems that handle geospatial data. Common Big…

  • MapReduce: How It Powers Scalable Data Processing

    In this article, I’ll give a brief introduction to the MapReduce programming model. Hopefully after reading this, you leave with a solid intuition of what MapReduce is, the role it plays in scalable data processing, and how to recognize when it can be applied to optimize a computational…

  • Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster

    As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and…

  • Anatomy of a Parquet File

    In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: faster query execution when only a subset of columns is being processed; quick calculation of statistics across all data; and reduced storage volume thanks to efficient compression. When combined…

  • Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

    Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components: HDFS for storage, MapReduce for processing,…

  • Mastering Hadoop, Part 1: Installation, Configuration, and Modern Big Data Strategies

    Nowadays, a large amount of data is collected on the internet, which is why companies are faced with the challenge of being able to store, process, and analyze these volumes efficiently. Hadoop is an open-source framework from the Apache Software Foundation and has become…

  • Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets

    Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and supports data files of up to 1 GB if you have enough RAM. Within these size limits, it is also…

  • The Concepts Data Professionals Should Know in 2025: Part 1

    From data lakehouses to event-driven architecture: master 12 data concepts and turn them into simple projects to stay ahead in IT. Published on Towards Data Science by Sarah Lea.

  • Scaling Statistics: Incremental Standard Deviation in SQL with dbt

    Why scan yesterday’s data when you can increment today’s? SQL aggregation functions can be computationally expensive when applied to large datasets. As datasets grow, recalculating metrics over the entire dataset repeatedly becomes inefficient. To address this challenge, incremental aggregation is often employed, a method…