{"id":3836,"date":"2025-05-15T07:03:18","date_gmt":"2025-05-15T07:03:18","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/15\/parquet-file-format-everything-you-need-to-know\/"},"modified":"2025-05-15T07:03:18","modified_gmt":"2025-05-15T07:03:18","slug":"parquet-file-format-everything-you-need-to-know","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/15\/parquet-file-format-everything-you-need-to-know\/","title":{"rendered":"Parquet File Format \u2013 Everything You Need to Know!"},"content":{"rendered":"<p>    Parquet File Format \u2013 Everything You Need to Know!<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<p class=\"wp-block-paragraph\"><mdspan datatext=\"el1747162664768\" class=\"mdspan-comment\">With<\/mdspan> the amount of <a href=\"https:\/\/towardsdatascience.com\/tag\/data\/\" title=\"Data\">Data<\/a> growing exponentially in the last few years, one of the biggest challenges has become finding the most optimal way to store various data flavors. Unlike in the (not so far) past, when relational databases were considered the only way to go, organizations now want to perform analysis over raw data \u2013 think of social media sentiment analysis, audio\/video files, and so on \u2013 which usually couldn\u2019t be stored in a traditional (relational) way, or storing them in a traditional way would require significant effort and time, which increase the overall time-for-analysis.<\/p>\n<p class=\"wp-block-paragraph\">Another challenge was to somehow stick with a traditional approach to have data stored in a structured way, but without the necessity to design complex and time-consuming ETL workloads to move this data into the enterprise data warehouse. 
Additionally, what if half of the data professionals in your organization are proficient with, let\u2019s say, Python (data scientists, data engineers), and the other half (data engineers, data analysts) with SQL? Would you insist that \u201cPythonists\u201d learn SQL? Or, vice-versa?<\/p>\n<p class=\"wp-block-paragraph\">Or, would you prefer a storage option that can play to the strengths of your entire data team? I have good news for you \u2013 something like this has already existed since 2013, and it\u2019s called Apache <a href=\"https:\/\/towardsdatascience.com\/tag\/parquet\/\" title=\"Parquet\">Parquet<\/a>!<\/p>\n<h3 class=\"wp-block-heading\">Parquet file format in a nutshell<\/h3>\n<p class=\"wp-block-paragraph\">Before I show you the ins and outs of the Parquet file format, there are (at least) five main reasons why Parquet is considered a de facto standard for storing data nowadays:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<em><strong>Data compression<\/strong><\/em>\u00a0\u2013 by applying various encoding and compression algorithms, Parquet file provides reduced memory consumption<\/li>\n<li class=\"wp-block-list-item\">\n<em><strong>Columnar storage<\/strong><\/em>\u00a0\u2013 this is of paramount importance in analytic workloads, where fast data read operation is the key requirement. 
But, more on that later in the article\u2026<\/li>\n<li class=\"wp-block-list-item\">\n<em><strong>Language agnostic<\/strong><\/em>\u00a0\u2013 as mentioned above, developers may use different programming languages to manipulate the data in the Parquet file<\/li>\n<li class=\"wp-block-list-item\">\n<strong><em>Open-source format<\/em><\/strong>\u00a0\u2013 meaning, you are not locked in to a specific vendor<\/li>\n<li class=\"wp-block-list-item\"><em><strong>Support for complex data types<\/strong><\/em><\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">Row-store vs Column-store<\/h3>\n<p class=\"wp-block-paragraph\">We\u2019ve already mentioned that Parquet is a column-based storage format. However, to understand the benefits of using the Parquet file format, we first need to draw the line between the row-based and column-based ways of storing the data.<\/p>\n<p class=\"wp-block-paragraph\">In traditional, row-based storage, the data is stored as a sequence of rows. Something like this:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-106.png?ssl=1\" alt=\"\" class=\"wp-image-603893\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now, when we are talking about\u00a0<a href=\"https:\/\/www.ibm.com\/cloud\/blog\/olap-vs-oltp\" target=\"_blank\" rel=\"noreferrer noopener\">OLAP<\/a>\u00a0scenarios, some of the common questions that your users may ask are:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">How many balls did we sell?<\/li>\n<li class=\"wp-block-list-item\">How many users from the USA bought a T-shirt?<\/li>\n<li class=\"wp-block-list-item\">What is the total amount spent by customer Maria Adams?<\/li>\n<li class=\"wp-block-list-item\">How many sales did we have on January 2nd?<\/li>\n<\/ul>\n<p 
class=\"wp-block-paragraph\">To be able to answer any of these questions, the engine must scan each and every row from the beginning to the very end! So, to answer the question: how many users from the USA bought T-shirt, the engine has to do something like this:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-107.png?ssl=1\" alt=\"\" class=\"wp-image-603894\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Essentially, we just need the information from two columns: Product (T-Shirts) and Country (USA), but the engine will scan all five columns! This is not the most efficient solution \u2013 I think we can agree on that\u2026<\/p>\n<h3 class=\"wp-block-heading\">Column store<\/h3>\n<p class=\"wp-block-paragraph\">Let\u2019s now examine how the column store works. As you may assume, the approach is 180 degrees different:<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-108.png?ssl=1\" alt=\"\" class=\"wp-image-603895\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In this case, each column is a separate entity \u2013 meaning, each column is physically separated from other columns! Going back to our previous business question: the engine can now scan only those columns that are needed by the query (Product and country), while\u00a0<em>skipping scanning<\/em>\u00a0the unnecessary columns. And, in most cases, this should improve the performance of the analytical queries.<\/p>\n<p class=\"wp-block-paragraph\">Ok, that\u2019s nice, but the column store existed before Parquet and it still exists outside of Parquet as well. 
So, what is so special about the Parquet format?<\/p>\n<h2 class=\"wp-block-heading\">Parquet is a columnar format that stores the data in row groups<\/h2>\n<p class=\"wp-block-paragraph\">Wait, what?! Wasn\u2019t it complicated enough even before this? Don\u2019t worry, it\u2019s much easier than it sounds <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f642.png?ssl=1\" alt=\"\ud83d\ude42\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><\/p>\n<p class=\"wp-block-paragraph\">Let\u2019s go back to our previous example and depict how Parquet will store this same chunk of data:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"355\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-109-1024x355.png?resize=1024%2C355&#038;ssl=1\" alt=\"\" class=\"wp-image-603896\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s stop for a moment and explain the illustration above, as this is exactly the structure of the Parquet file (some additional things were intentionally omitted, but we will come soon to explain that as well). Columns are still stored as separate units, but Parquet introduces additional structures, called Row group.<\/p>\n<p class=\"wp-block-paragraph\">Why is this additional structure super important?<\/p>\n<p class=\"wp-block-paragraph\">You\u2019ll need to wait for an answer for a bit :). In OLAP scenarios, we are mainly concerned with two concepts:\u00a0<strong><em>projection<\/em><\/strong>\u00a0and\u00a0<strong><em>predicate(s)<\/em><\/strong>. Projection refers to a\u00a0<strong>SELECT<\/strong>\u00a0statement in SQL language \u2013 which columns are needed by the query. 
Back to our previous example, we need only the Product and Country columns, so the engine can skip scanning the remaining ones.<\/p>\n<p class=\"wp-block-paragraph\">Predicate(s) refer to the\u00a0<strong>WHERE<\/strong>\u00a0clause in SQL language \u2013 which rows satisfy criteria defined in the query. In our case, we are interested in T-Shirts only, so the engine can completely skip scanning Row group 2, where all the values in the Product column equal socks!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"390\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-110-1024x390.png?resize=1024%2C390&#038;ssl=1\" alt=\"\" class=\"wp-image-603897\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s quickly stop here, as I want you to realize the difference between various types of storage in terms of the work that needs to be performed by the engine:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Row store \u2013 the engine needs to scan all 5 columns and all 6 rows<\/li>\n<li class=\"wp-block-list-item\">Column store \u2013 the engine needs to scan 2 columns and all 6 rows<\/li>\n<li class=\"wp-block-list-item\">Column store with row groups \u2013 the engine needs to scan 2 columns and 4 rows<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Obviously, this is an oversimplified example, with only 6 rows and 5 columns, where you will definitely not see any difference in performance between these three storage options. 
However, in real life, when you\u2019re dealing with much larger amounts of data, the difference becomes more evident.<\/p>\n<p class=\"wp-block-paragraph\">Now, a fair question would be: how does Parquet \u201cknow\u201d which row group to skip\/scan?<\/p>\n<h3 class=\"wp-block-heading\">Parquet file contains metadata<\/h3>\n<p class=\"wp-block-paragraph\">This means that every Parquet file contains \u201cdata about data\u201d \u2013 information such as minimum and maximum values in a specific column within a certain row group. Furthermore, every Parquet file contains a footer, which keeps the information about the format version, schema information, column metadata, and so on. You can find more details about Parquet metadata types\u00a0<a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/metadata\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Important:<\/strong>\u00a0In order to optimize the performance and skip unnecessary data structures (row groups and columns), the engine first needs to \u201cget familiar\u201d with the data, so it first reads the metadata. It\u2019s not a slow operation, but it still requires a certain amount of time. Therefore, if you\u2019re querying the data from many small Parquet files, query performance can degrade, because the engine will have to read metadata from each file. So, you are usually better off merging multiple smaller files into one bigger file (but still not too big :)\u2026<\/p>\n<p class=\"wp-block-paragraph\">I hear you, I hear you: Nikola, what is \u201csmall\u201d and what is \u201cbig\u201d? 
Unfortunately, there is no single \u201cgolden\u201d number here, but for example,\u00a0<a href=\"https:\/\/www.youtube.com\/watch?v=RxjMibOx__A\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft Azure Synapse Analytics recommends that the individual Parquet file should be at least a few hundred MBs in size<\/a>.<\/p>\n<h3 class=\"wp-block-heading\">What else is in there?<\/h3>\n<p class=\"wp-block-paragraph\">Here is a simplified, high-level illustration of the Parquet file format:<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/111.png?ssl=1\" alt=\"\" class=\"wp-image-603898\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Can it be better than this? Yes, with data compression<\/h2>\n<p class=\"wp-block-paragraph\">Ok, we\u2019ve explained how skipping the scan of the unnecessary data structures (row groups and columns) may benefit your queries and increase the overall performance. But, it\u2019s not only about that \u2013 remember when I told you at the very beginning that one of the main advantages of the Parquet format is the reduced memory footprint of the file? 
This is achieved by applying various compression algorithms.<\/p>\n<p class=\"wp-block-paragraph\">I\u2019ve already written about various data compression types in Power BI (and the Tabular model in general)\u00a0<a href=\"https:\/\/data-mozart.com\/inside-vertipaq-compress-for-success\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>, so maybe it\u2019s a good idea to start by reading this article.<\/p>\n<p class=\"wp-block-paragraph\">There are two main encoding types that enable Parquet to compress the data and achieve astonishing savings in space:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<em><a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/data-pages\/encodings\/#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8\" target=\"_blank\" rel=\"noreferrer noopener\">Dictionary encoding<\/a><\/em>\u00a0\u2013 Parquet creates a dictionary of the distinct values in the column, and afterward replaces \u201creal\u201d values with index values from the dictionary. Going back to our example, this process looks something like this:<\/li>\n<\/ul>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-111.png?ssl=1\" alt=\"\" class=\"wp-image-603899\"><figcaption class=\"wp-element-caption\">Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">You might think: why this overhead, when product names are quite short, right? Ok, but now imagine that you store the detailed description of the product, such as: \u201cLong arm T-Shirt with application on the neck\u201d. 
And now imagine that this product has been sold a million times\u2026 Instead of repeating the value \u201cLong arm\u2026bla bla\u201d a million times, Parquet will store only the index value (an integer instead of text).<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<em><a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/data-pages\/encodings\/#a-namerlearun-length-encoding--bit-packing-hybrid-rle--3\" target=\"_blank\" rel=\"noreferrer noopener\">Run-Length-Encoding with Bit-Packing<\/a><\/em>\u00a0\u2013 when your data contains many repeating values, the Run-Length-Encoding (RLE) algorithm may bring additional memory savings.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">Can it be better than\u00a0THIS?! Yes, with the Delta Lake file format<\/h2>\n<p class=\"wp-block-paragraph\">Ok, what the heck is the Delta Lake format now?! This is an article about Parquet, right?<\/p>\n<p class=\"wp-block-paragraph\"><strong><em>So, to put it in plain English: Delta Lake is nothing else but the Parquet format \u201con steroids\u201d.\u00a0<\/em><\/strong>When I say \u201csteroids\u201d, the main one is the versioning of Parquet files. It also stores a transaction log that tracks all changes applied to the Parquet files, which is what enables\u00a0<a href=\"https:\/\/www.ibm.com\/docs\/en\/cics-ts\/5.4?topic=processing-acid-properties-transactions\" target=\"_blank\" rel=\"noreferrer noopener\">ACID-compliant transactions<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">Since it supports not only ACID transactions, but also time travel (rollbacks, audit trails, etc.) 
and DML (Data Manipulation Language) statements, such as INSERT, UPDATE and DELETE, you won\u2019t be wrong if you think of the Delta Lake as a \u201cdata warehouse on the data lake\u201d (who said: Lakehouse <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f609.png?ssl=1\" alt=\"\ud83d\ude09\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\">). Examining the pros and cons of the Lakehouse concept is out of the scope of this article, but if you\u2019re curious to go deeper into this, I suggest you read\u00a0<a href=\"https:\/\/www.databricks.com\/glossary\/data-lakehouse\" target=\"_blank\" rel=\"noreferrer noopener\">this article<\/a>\u00a0from Databricks.<\/p>\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n<p class=\"wp-block-paragraph\">We evolve! Just like us, data is also evolving. So, new flavors of data require new ways of storing them. 
The Parquet file format is one of the most efficient storage options in the current data landscape, since it provides multiple benefits \u2013 both in terms of memory consumption, by leveraging various compression algorithms, and fast query processing by enabling the engine to skip scanning unnecessary data.<\/p>\n<p class=\"wp-block-paragraph\">Thanks for reading!<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/parquet-file-format-everything-you-need-to-know\/\">Parquet File Format \u2013 Everything You Need to Know!<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Nikola Ilic<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/parquet-file-format-everything-you-need-to-know\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Parquet File Format \u2013 Everything You Need to Know! With the amount of Data growing exponentially in the last few years, one of the biggest challenges has become finding the most optimal way to store various data flavors. 
Unlike in the (not so far) past, when relational databases were considered the only way to go, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,2680,692,2025,401,83,2026],"tags":[84,1729,2681],"class_list":["post-3836","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-columnar-databases","category-data","category-data-storage","category-data-engineering","category-data-science","category-parquet","tag-data","tag-file","tag-parquet"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3836"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3836"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3836\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}