{"id":2410,"date":"2025-03-14T07:02:23","date_gmt":"2025-03-14T07:02:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/14\/anatomy-of-a-parquet-file\/"},"modified":"2025-03-14T07:02:23","modified_gmt":"2025-03-14T07:02:23","slug":"anatomy-of-a-parquet-file","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/14\/anatomy-of-a-parquet-file\/","title":{"rendered":"Anatomy of a Parquet File"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">In recent years, Parquet has become a standard format for data storage in <a href=\"https:\/\/towardsdatascience.com\/tag\/big-data\/\" title=\"Big Data\">Big Data<\/a> ecosystems. Its column-oriented format offers several advantages:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Faster query execution when only a subset of columns is being processed<\/li>\n<li class=\"wp-block-list-item\">Quick calculation of statistics across all data<\/li>\n<li class=\"wp-block-list-item\">Reduced storage volume thanks to efficient compression<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">When combined with storage frameworks like Delta Lake or Apache Iceberg, it integrates seamlessly with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, we dissect the content of a Parquet file, using mainly standard Python tools, to better understand its structure and how it contributes to this performance.<\/p>\n<h2 class=\"wp-block-heading\">Writing Parquet file(s)<\/h2>\n<p class=\"wp-block-paragraph\">To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file. 
This makes PyArrow ideal for Parquet manipulation (one can also simply use <a href=\"https:\/\/pandas.pydata.org\/docs\/reference\/api\/pandas.DataFrame.to_parquet.html\">Pandas<\/a>).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># generator.py\n\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nfrom faker import Faker\n\nfake = Faker()\nFaker.seed(12345)\nnum_records = 100\n\n# Generate fake data\nnames = [fake.name() for _ in range(num_records)]\n# Faker addresses span several lines; join them with commas\naddresses = [fake.address().replace(\"\\n\", \", \") for _ in range(num_records)]\nbirth_dates = [\n    fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)\n]\ncities = [addr.split(\", \")[1] for addr in addresses]\nbirth_years = [date.year for date in birth_dates]\n\n# Cast the data to the Arrow format\nname_array = pa.array(names, type=pa.string())\naddress_array = pa.array(addresses, type=pa.string())\nbirth_date_array = pa.array(birth_dates, type=pa.date32())\ncity_array = pa.array(cities, type=pa.string())\nbirth_year_array = pa.array(birth_years, type=pa.int32())\n\n# Create schema with non-nullable fields\nschema = pa.schema(\n    [\n        pa.field(\"name\", pa.string(), nullable=False),\n        pa.field(\"address\", pa.string(), nullable=False),\n        pa.field(\"date_of_birth\", pa.date32(), nullable=False),\n        pa.field(\"city\", pa.string(), nullable=False),\n        pa.field(\"birth_year\", pa.int32(), nullable=False),\n    ]\n)\n\ntable = pa.Table.from_arrays(\n    [name_array, address_array, birth_date_array, city_array, birth_year_array],\n    schema=schema,\n)\n\nprint(table)<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">pyarrow.Table\nname: string not null\naddress: string not null\ndate_of_birth: date32[day] not null\ncity: string not null\nbirth_year: int32 not null\n----\nname: [[\"Adam Bryan\",\"Jacob Lee\",\"Candice Martinez\",\"Justin Thompson\",\"Heather Rubio\"]]\naddress: [[\"822 
Jennifer Field Suite 507, Anthonyhaven, UT 98088\",\"292 Garcia Mall, Lake Belindafurt, IN 69129\",\"31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323\",\"00716 Kristina Trail Suite 381, Howelltown, SC 64961\",\"351 Christopher Expressway Suite 332, West Edward, CO 68607\"]]\ndate_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]\ncity: [[\"Anthonyhaven\",\"Lake Belindafurt\",\"East Tammiestad\",\"Howelltown\",\"West Edward\"]]\nbirth_year: [[1955,1950,1955,1957,1956]]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a traditional \u201crow-wise\u201d table.<\/p>\n<h2 class=\"wp-block-heading\">How is a Parquet file stored?<\/h2>\n<p class=\"wp-block-paragraph\">Parquet files are generally stored in inexpensive object stores such as S3 (AWS) or GCS (GCP) so that data processing pipelines can access them easily. These files are usually organized according to a partitioning strategy that leverages the directory structure:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># generator.py\n\nnum_records = 100\n\n# ...\n\n# Writing the parquet files to disk\npq.write_to_dataset(\n    table,\n    root_path='dataset',\n    partition_cols=['birth_year', 'city']\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If the <code>birth_year<\/code> and <code>city<\/code> columns are defined as partitioning keys, PyArrow creates the following tree structure in the <code>dataset<\/code> directory:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">dataset\/\n\u251c\u2500 birth_year=1949\/\n\u251c\u2500 birth_year=1950\/\n\u2502 \u251c\u2500 city=Aaronbury\/\n\u2502 \u2502 \u251c\u2500 828d313a915a43559f3111ee8d8e6c1a-0.parquet\n\u2502 \u2502 \u251c\u2500 828d313a915a43559f3111ee8d8e6c1a-0.parquet\n\u2502 \u2502 \u251c\u2500 \u2026\n\u2502 \u251c\u2500 city=Alicialand\/\n\u2502 \u251c\u2500 \u2026\n\u251c\u2500 birth_year=1951\/\n\u251c\u2500 
...\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">This strategy enables partition pruning: when a query filters on these columns, the engine can use folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting latency, I\/O, and compute resources when handling large volumes of data (as has been the case for decades with traditional relational databases).<\/p>\n<p class=\"wp-block-paragraph\">The pruning effect can be easily verified by counting the files opened by a Python script that filters on the birth year:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># query.py\nimport duckdb\n\nduckdb.sql(\n    \"\"\"\n    SELECT *\n    FROM read_parquet('dataset\/*\/*\/*.parquet', hive_partitioning = true)\n    WHERE birth_year = 1949\n    \"\"\"\n).show()<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">&gt; strace -e trace=open,openat,read -f python query.py 2&gt;&amp;1 | grep \"dataset\/.*.parquet\"\n\n[pid    37] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Box%201306\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    37] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Box%201306\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Box%201306\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Box%203487\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Box%203487\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Clarkemouth\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Clarkemouth\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] 
openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=DPO%20AP%2020198\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=DPO%20AP%2020198\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=East%20Morgan\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=East%20Morgan\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=FPO%20AA%2006122\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=FPO%20AA%2006122\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=New%20Michelleport\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=New%20Michelleport\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=North%20Danielchester\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=North%20Danielchester\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Port%20Chase\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Port%20Chase\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Richardmouth\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Richardmouth\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 4\n[pid    39] openat(AT_FDCWD, 
\"dataset\/birth_year=1949\/city=Robbinsshire\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 5\n[pid    39] openat(AT_FDCWD, \"dataset\/birth_year=1949\/city=Robbinsshire\/e1ad1666a2144fbc94892d4ac1234c64-0.parquet\", O_RDONLY) = 3<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Only 23 files are read out of 100.<\/p>\n<h2 class=\"wp-block-heading\">Reading a raw Parquet file<\/h2>\n<p class=\"wp-block-paragraph\">Let\u2019s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># generator.py\n\n# ...\n\npq.write_table(\n    table,\n    \"dataset.parquet\",\n    use_dictionary=False,\n    compression=\"NONE\",\n    write_statistics=True,\n    column_encoding=None,\n)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is \u201cPAR1\u201d. 
The file is corrupted if this is not the case.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># reader.py\n\nwith open(\"dataset.parquet\", \"rb\") as file:\n    parquet_data = file.read()\n\nassert parquet_data[:4] == b\"PAR1\", \"Not a valid parquet file\"\nassert parquet_data[-4:] == b\"PAR1\", \"File footer is corrupted\"<\/code><\/pre>\n<p class=\"wp-block-paragraph\">As indicated in the <a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/\">documentation<\/a>, the file is divided into two parts: the \u201crow groups\u201d containing the actual data, and the footer containing the metadata (see the diagram below).<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/file.png?ssl=1\" alt=\"\" class=\"wp-image-599620\" style=\"width:327px;height:auto\"><\/figure>\n<h3 class=\"wp-block-heading\">The footer<\/h3>\n<p class=\"wp-block-paragraph\">The size of the footer is stored in the 4 bytes preceding the end marker, as an unsigned integer written in \u201clittle endian\u201d format (denoted \u201c&lt;I\u201d for the <code>unpack<\/code> function).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># reader.py\n\nimport struct\n\n# ...\n\nfooter_length = struct.unpack(\"&lt;I\", parquet_data[-8:-4])[0]\nprint(f\"Footer size in bytes: {footer_length}\")\n\nfooter_start = len(parquet_data) - footer_length - 8\nfooter_data = parquet_data[footer_start:-8]<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">Footer size in bytes: 1088<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The footer information is encoded in a cross-language serialization format called <a href=\"https:\/\/thrift.apache.org\/\">Apache Thrift<\/a>. 
Using a human-readable but verbose format like JSON and then translating it into binary would be less efficient in terms of memory usage. With Thrift, one can declare data structures as follows:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">struct Customer {\n\t1: required string name,\n\t2: optional i16 birthYear,\n\t3: optional list&lt;string&gt; interests\n}<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Based on this declaration, Thrift can generate Python code that decodes byte strings following this data structure (it also generates the code for the encoding part). The Thrift file describing all the data structures used in a Parquet file can be downloaded <a href=\"https:\/\/github.com\/apache\/parquet-format\/blob\/master\/src\/main\/thrift\/parquet.thrift\">here<\/a>. After installing the <code>thrift<\/code> binary, run:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">thrift -r --gen py parquet.thrift<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The generated Python code is placed in the \u201cgen-py\u201d folder. The footer\u2019s data structure is represented by the FileMetaData class \u2013 a Python class automatically generated from the Thrift schema. 
Using Thrift\u2019s Python utilities, the binary data is parsed into an instance of this FileMetaData class.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># reader.py\n\nimport sys\n\n# ...\n\n# Add the generated classes to the python path\nsys.path.append(\"gen-py\")\nfrom parquet.ttypes import FileMetaData, PageHeader\nfrom thrift.transport import TTransport\nfrom thrift.protocol import TCompactProtocol\n\ndef read_thrift(data, thrift_instance):\n    \"\"\"\n    Read a Thrift object from a binary buffer.\n    Returns the Thrift object and the number of bytes read.\n    \"\"\"\n    transport = TTransport.TMemoryBuffer(data)\n    protocol = TCompactProtocol.TCompactProtocol(transport)\n    thrift_instance.read(protocol)\n    # _buffer is the private in-memory stream of TMemoryBuffer; tell() gives the bytes consumed\n    return thrift_instance, transport._buffer.tell()\n\n# The number of bytes read is not used for now\nfile_metadata_thrift, _ = read_thrift(footer_data, FileMetaData())\n\nprint(f\"Number of rows in the whole file: {file_metadata_thrift.num_rows}\")\nprint(f\"Number of row groups: {len(file_metadata_thrift.row_groups)}\")<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">Number of rows in the whole file: 100\nNumber of row groups: 1<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The footer contains extensive information about the file\u2019s structure and content. For instance, it records the number of rows in the generated dataframe. These rows are all contained within a single \u201crow group\u201d. <em>But what is a \u201crow group\u201d?<\/em><\/p>\n<h3 class=\"wp-block-heading\"><strong>Row groups<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">Unlike purely column-oriented formats, Parquet employs a hybrid approach. 
Before the column blocks are written, the dataframe is first split horizontally into row groups, each holding a range of rows (the Parquet file we generated is too small to be split into multiple row groups).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"612\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/Parquet-2-1024x612.png?resize=1024%2C612&#038;ssl=1\" alt=\"\" class=\"wp-image-599594\"><\/figure>\n<p class=\"wp-block-paragraph\">This hybrid structure offers several advantages:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Parquet calculates statistics (such as min\/max values) for each column within each row group. These statistics are crucial for query optimization, allowing query engines to skip entire row groups that don\u2019t match filtering criteria. For example, if a query filters for <code>birth_year &gt; 1955<\/code> and a row group\u2019s maximum birth year is 1954, the engine can efficiently skip that entire data section. This optimization is called \u201cpredicate pushdown\u201d. Parquet also stores other useful statistics like distinct value counts and null counts.<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># reader.py\n# ...\n\nfirst_row_group = file_metadata_thrift.row_groups[0]\nbirth_year_column = first_row_group.columns[4]\n\nmin_stat_bytes = birth_year_column.meta_data.statistics.min\nmax_stat_bytes = birth_year_column.meta_data.statistics.max\n\n# birth_year is an int32, stored as a signed little-endian integer\nmin_year = struct.unpack(\"&lt;i\", min_stat_bytes)[0]\nmax_year = struct.unpack(\"&lt;i\", max_stat_bytes)[0]\n\nprint(f\"The birth year range is between {min_year} and {max_year}\")<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-bash\">The birth year range is between 1949 and 1958<\/code><\/pre>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Row groups enable parallel processing of data (particularly valuable for frameworks like Apache Spark). The size of these row groups can be configured based on the computing resources available (with the <code>row_group_size<\/code> parameter of the <code>write_table<\/code> function in PyArrow).\n<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># generator.py\n\n# ...\n\npq.write_table(\n    table,\n    \"dataset.parquet\",\n    row_group_size=100,\n)\n\n# \/!\\ Keep the default value of \"row_group_size\" for the next parts<\/code><\/pre>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Although this is not the primary goal of a columnar format, Parquet\u2019s hybrid structure maintains reasonable performance when reconstructing complete rows. Without row groups, rebuilding an entire row might require scanning each column in its entirety, which would be extremely inefficient for large files.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Data Pages<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">The smallest substructure of a Parquet file is the page. 
It contains a sequence of values from the same column and, therefore, of the same type. The choice of page size is the result of a trade-off:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Larger pages mean less metadata to store and read, which is optimal for queries with minimal filtering.\n<\/li>\n<li class=\"wp-block-list-item\">Smaller pages reduce the amount of unnecessary data read, which is better when queries target small, scattered data ranges.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/Parquet-3.png?ssl=1\" alt=\"\" class=\"wp-image-599595\"><\/figure>\n<p class=\"wp-block-paragraph\">Now let\u2019s decode the contents of the first page of the address column. Its location can be found in the footer (it is given by the <code>data_page_offset<\/code> attribute of the corresponding <code>ColumnMetaData<\/code>). Each page is preceded by a Thrift <code>PageHeader<\/code> object containing some metadata, so the offset actually points to the Thrift binary representation of this header rather than to the values themselves. 
The corresponding class, <code>PageHeader<\/code>, can also be found in the <code>gen-py<\/code> directory.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f4a1.png?ssl=1\" alt=\"\ud83d\udca1\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"><em> Between the PageHeader and the actual values contained within the page, there may be a few bytes (the repetition and definition levels) dedicated to implementing the <\/em><a href=\"https:\/\/static.googleusercontent.com\/media\/research.google.com\/fr\/\/pubs\/archive\/36632.pdf\"><em>Dremel<\/em><\/a><em> format, which allows encoding <\/em><a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/nestedencoding\/\"><em>nested data structures<\/em><\/a><em>. Since our data has a regular tabular format and the values are not nullable, these bytes are skipped when writing the file (<\/em><a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/data-pages\/\"><em>https:\/\/parquet.apache.org\/docs\/file-format\/data-pages\/<\/em><\/a><em>).<\/em><\/p>\n<\/blockquote>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># reader.py\n# ...\n\naddress_column = first_row_group.columns[1]\ncolumn_start = address_column.meta_data.data_page_offset\ncolumn_end = column_start + address_column.meta_data.total_compressed_size\ncolumn_content = parquet_data[column_start:column_end]\n\npage_thrift, page_header_size = read_thrift(column_content, PageHeader())\npage_content = column_content[\n    page_header_size : (page_header_size + page_thrift.compressed_page_size)\n]\n# Print the start of the page itself (the bytes right after the header)\nprint(page_content[:100])<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">b'6\\x00\\x00\\x00481 Mata Squares Suite 260, Lake Rachelville, KY 874642\\x00\\x00\\x00671 Barker Crossing Suite 390, Mooreto'<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The generated values finally appear, in plain text and not encoded (as specified when writing the Parquet file). However, to take full advantage of the columnar format, it is recommended to use one of the following encodings: dictionary encoding, run-length encoding (RLE), or delta encoding (whose binary-packed variant is reserved for int32 and int64 types), followed by compression with a codec such as gzip or snappy (available codecs are listed <a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/data-pages\/compression\/\">here<\/a>). Since each page contains values of the same kind (all addresses, all decimal numbers, etc.), compression ratios can be particularly high.<\/p>\n<p class=\"wp-block-paragraph\">As documented in the <a href=\"https:\/\/parquet.apache.org\/docs\/file-format\/data-pages\/encodings\/\">specification<\/a>, when character strings (BYTE_ARRAY) are stored with plain encoding, each value is preceded by its length, represented as a 4-byte little-endian integer. This can be observed in the previous output:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"395\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-13-at-5.52.46%25E2%2580%25AFPM-1024x395.png?resize=1024%2C395&#038;ssl=1\" alt=\"\" class=\"wp-image-599672\"><\/figure>\n<p class=\"wp-block-paragraph\">Reading the values (for example, the first 10) takes a rather simple loop:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">idx = 0\nfor _ in range(10):\n    str_size = struct.unpack(\"&lt;I\", page_content[idx : (idx + 4)])[0]\n    print(page_content[(idx + 4) : (idx + 4 + str_size)].decode())\n    idx += 4 + str_size<\/code><\/pre>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">481 Mata Squares Suite 260, Lake Rachelville, KY 87464\n671 Barker Crossing Suite 390, Mooretown, MI 21488\n62459 Jordan Knoll Apt. 
970, Emilyfort, DC 80068\n948 Victor Square Apt. 753, Braybury, RI 67113\n365 Edward Place Apt. 162, Calebborough, AL 13037\n894 Reed Lock, New Davidmouth, NV 84612\n24082 Allison Squares Suite 345, North Sharonberg, WY 97642\n00266 Johnson Drives, South Lori, MI 98513\n15255 Kelly Plains, Richardmouth, GA 33438\n260 Thomas Glens, Port Gabriela, OH 96758<\/code><\/pre>\n<p class=\"wp-block-paragraph\">And there we have it! We have successfully recreated, in a very simple way, how a specialized library would read a Parquet file. By understanding its building blocks, including headers, footers, row groups, and data pages, we can better appreciate how features like predicate pushdown and partition pruning deliver such impressive performance benefits in data-intensive environments. I am convinced that knowing how Parquet works under the hood helps in making better decisions about storage strategies, compression choices, and performance optimization.<\/p>\n<p class=\"wp-block-paragraph\">All the code used in this article is available on my GitHub repository at <a href=\"https:\/\/github.com\/kili-mandjaro\/anatomy-parquet\">https:\/\/github.com\/kili-mandjaro\/anatomy-parquet<\/a>, where you can explore more examples and experiment with different Parquet file configurations.<\/p>\n<p class=\"wp-block-paragraph\">Whether you are building data pipelines, optimizing query performance, or simply curious about data storage formats, I hope this deep dive into Parquet\u2019s inner structures has provided valuable insights for your <a href=\"https:\/\/towardsdatascience.com\/tag\/data-engineering\/\" title=\"Data Engineering\">Data Engineering<\/a> journey.<\/p>\n<p class=\"wp-block-paragraph\"><em>All images are by the author.<\/em><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/anatomy-of-a-parquet-file\/\">Anatomy of a Parquet File<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n 
<BR><\/BR><br \/>\n    Kilian Ollivier<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/anatomy-of-a-parquet-file\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Anatomy of a Parquet File In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages: Faster query execution when only a subset of columns is being processed Quick calculation of statistics across all data Reduced storage volume thanks to efficient compression When combined [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,1067,2025,401,83,166,2026],"tags":[2027,2028,1855],"class_list":["post-2410","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-big-data","category-data-storage","category-data-engineering","category-data-science","category-hands-on-tutorials","category-parquet","tag-array","tag-birth","tag-pa"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2410"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2410"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2410\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2410"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/cat
egories?post=2410"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}