Parquet structure
This guide provides an overview of handling Parquet files for data masking. It includes defining Parquet schemas, understanding nested structures, and using Parquet column paths to accurately target data fields for masking.
Understanding Parquet structure
Parquet is a columnar storage format: it stores data by columns rather than rows, optimizing compression and query performance. It is a binary format and can only be read using a Parquet parser or reader. A Parquet file typically has a file header and a footer; the footer contains the file metadata, such as the format version, schema, row-group sizes, encodings and compression codecs, statistics, dictionaries, etc. The file body contains the row groups, column chunks, and pages that store the actual data.
Definitions
Schema: A Parquet schema defines the structure of the stored data, including field types, repetition rules, and nesting levels. The schema can include primitive types (INT32, INT64, BOOLEAN, BINARY, FLOAT, DOUBLE) and complex types (LIST, MAP, STRUCT).
Groups and Fields: Groups represent structured objects (similar to JSON objects or XML elements), and fields define individual data points within a group (equivalent to JSON fields or XML nodes).
Repetition and Definition Levels: The repetition level tracks repeated (LIST) elements inside a nested structure, identifying whether a value starts a new list or continues an existing one. The definition level tracks nullability in a nested or optional column, i.e., it represents how many levels of nesting are defined (non-null) before a value appears.
Hierarchical structure
At a high level, the Parquet hierarchical structure consists of three primary components:
Row Groups: A row group is a horizontal partition of the data (a collection of rows), containing the column data for a subset of rows. Each row group can be processed independently, allowing parallel processing and reading parts of the data without loading the entire dataset into memory.
Columns: Each row group contains column chunks, which store all values for a single column. Each column is stored separately, allowing column-level compression and encoding. The query engine can selectively query required columns, reducing computational overhead and memory usage.
Pages: Each column chunk is divided into pages containing encoded and compressed data. There are three types of pages: data pages (the actual column values), dictionary pages (dictionaries for repeated values), and index pages (optional; they store row indexes for faster lookups).
Parquet example (Tabular format)
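A dataset with the fields used in the ColumnPath example below could be declared in Parquet's message text format roughly as follows. The field names come from this guide's ColumnPath example; the chosen types and repetition rules are assumptions for illustration only.

```
message person_record {
  optional binary First_Name (UTF8);
  optional binary Last_Name (UTF8);
  optional binary DOB (UTF8);
  optional binary SSN (UTF8);
  optional group Emails (LIST) {       # complex type: LIST
    repeated group list {
      optional binary element (UTF8);  # -> Emails/list/element
    }
  }
  optional group Location {            # complex type: STRUCT
    optional binary Address (UTF8);
    optional binary City (UTF8);
    optional int32 House_Number;
    optional binary Zip_Code (UTF8);
  }
}
```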
Understanding ColumnPath
In Parquet, a ColumnPath is the fully qualified name of a column, representing its location within a hierarchical (nested) schema. It is used to reference specific columns in a structured or nested dataset. In the context of the masking engine (specific to parquet-java libraries):
For flat schemas, column paths are simply the column names (e.g., First_Name)
For nested schemas, column paths represent the full path to a nested field, using a slash (/) separator to indicate hierarchy (e.g., Location/Address)
For list schema elements, column paths end in <path>/list/element (e.g., Emails/list/element)
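The three rules above can be sketched as a small path-derivation routine. This is not part of any Parquet library; it is a hypothetical helper that models a schema as nested Python dicts (a dict value for a STRUCT, the string "list" for a LIST field, a type name for a primitive) purely to show how the paths are formed.

```python
# Illustrative sketch (assumed helper, not a real library function):
# derive masking-engine-style column paths from a simplified schema.
def column_paths(schema, prefix=""):
    """Return fully qualified, slash-separated column paths."""
    paths = []
    for name, node in schema.items():
        full = f"{prefix}{name}"
        if node == "list":
            # List schema elements end in <path>/list/element
            paths.append(f"{full}/list/element")
        elif isinstance(node, dict):
            # STRUCT: recurse, extending the path with a slash
            paths.extend(column_paths(node, prefix=f"{full}/"))
        else:
            # Primitive leaf in a flat schema: just the column name
            paths.append(full)
    return paths


schema = {
    "First_Name": "BINARY",
    "Emails": "list",
    "Location": {"Address": "BINARY", "Zip_Code": "BINARY"},
}
print(column_paths(schema))
# ['First_Name', 'Emails/list/element',
#  'Location/Address', 'Location/Zip_Code']
```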
ColumnPath example
First_Name
Last_Name
DOB
SSN
Emails/list/element
Location/Address
Location/City
Location/House_Number
Location/Zip_Code
Memory considerations for Parquet data masking jobs:
A row group is the smallest partition of data loaded into memory when reading a Parquet file. The size of a row group can range from a few megabytes to several gigabytes.
Profiling job: The minimum memory allocation must be at least equal to the size of the largest row group in the file.
Masking job: Since masking involves reading and writing simultaneously, the minimum memory allocation must be twice the largest row group size. This ensures that heap memory can accommodate row groups from both the input and output files.
Note that memory estimates based on row groups provide a ballpark figure; the actual job may require additional memory for other processing operations.
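The sizing rules above amount to simple arithmetic, sketched below. The function name is invented for illustration; the figures are a ballpark starting point, not an exact requirement.

```python
# Back-of-the-envelope heap sizing for Parquet jobs (assumed helper):
# profiling needs at least the largest row group; masking needs
# roughly twice that, since input and output row groups are held
# in heap memory simultaneously.
def min_heap_bytes(row_group_sizes, job="masking"):
    largest = max(row_group_sizes)
    return largest if job == "profiling" else 2 * largest


# Example: a file with row groups of 128 MiB, 256 MiB, and 96 MiB
sizes = [128 * 2**20, 256 * 2**20, 96 * 2**20]
print(min_heap_bytes(sizes, job="profiling") // 2**20)  # 256 (MiB)
print(min_heap_bytes(sizes, job="masking") // 2**20)    # 512 (MiB)
```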
Parquet masking limitations:
In a Parquet file, column chunks in different row groups of the same column may use different compression codecs. If the masking engine encounters such a file, the masked output file is written without compression, using the UNCOMPRESSED codec.
Writing a new masked Parquet file uses the Parquet file writer's default properties. As a result, some properties and statistics from the original file may not be carried over to the new masked file.
For a Parquet file's execution-component section, Bytes Processed represents the total data written. However, because the size of the Parquet file metadata cannot currently be calculated, Bytes Remaining in the execution-component may show a non-zero value (i.e., the metadata size) even after masking is complete.