Parquet structure
This guide provides an overview of handling Parquet files for data masking. It covers defining Parquet schemas, understanding nested structures, and using Parquet column paths to accurately target data fields for masking.
Understanding Parquet structure
Parquet is a columnar storage format: it stores data by column rather than by row, which optimizes compression and query performance. It is a binary format and can only be read using Parquet parsers or readers. A Parquet file has a small header and a footer; the footer contains file metadata such as the format version, schema, row-group sizes, encoding and compression codecs, statistics, and dictionaries. The file body contains the row groups, column chunks, and pages that store the actual data.
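As a quick illustration of where this metadata lives, the following is a minimal sketch (assuming the parquet-java and Hadoop client libraries are on the classpath; the file path is hypothetical) that opens a file and prints the footer's file-level metadata.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical local path; any HDFS/S3 path supported by Hadoop also works.
        Path path = new Path("/tmp/customers.parquet");
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
            ParquetMetadata footer = reader.getFooter();
            System.out.println("Created by : " + footer.getFileMetaData().getCreatedBy());
            System.out.println("Schema     :\n" + footer.getFileMetaData().getSchema());
            System.out.println("Row groups : " + footer.getBlocks().size());
            System.out.println("Total rows : " + reader.getRecordCount());
        }
    }
}
```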
Definitions
Schema: A Parquet schema defines the structure of the stored data, including field types, repetition rules, and nesting levels. The schema can include primitive types (INT32, INT64, BOOLEAN, BINARY, FLOAT, DOUBLE) and complex types (LIST, MAP, STRUCT).
Groups and Fields: Groups represent structured objects (similar to JSON objects or XML elements), and fields define individual data points within a group (equivalent to JSON fields or XML nodes).
Repetition and Definition Levels: The repetition level tracks repeated (LIST) elements inside a nested structure, identifying whether a value starts a new list or continues an existing one. The definition level tracks nullability in a nested or optional column, i.e., it represents how many levels of nesting are defined (non-null) before the value itself.
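To make these definitions concrete, here is a minimal sketch, assuming the parquet-java library; the schema text and field names are illustrative and mirror the ColumnPath example later in this guide. It parses a nested schema and prints each leaf column's physical type together with its maximum definition and repetition levels.

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SchemaSketch {
    public static void main(String[] args) {
        // Illustrative schema; UTF8 is the older ConvertedType spelling of the STRING logical type.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message customer {\n"
            + "  required binary First_Name (UTF8);\n"
            + "  optional int32 DOB (DATE);\n"
            + "  optional group Emails (LIST) {\n"
            + "    repeated group list {\n"
            + "      optional binary element (UTF8);\n"
            + "    }\n"
            + "  }\n"
            + "  optional group Location {\n"
            + "    optional binary Address (UTF8);\n"
            + "    optional int32 House_Number;\n"
            + "  }\n"
            + "}");

        // Each leaf (primitive) column with its physical type and nesting levels.
        for (ColumnDescriptor column : schema.getColumns()) {
            System.out.printf("%-25s physical=%-8s maxDefinitionLevel=%d maxRepetitionLevel=%d%n",
                String.join(".", column.getPath()),
                column.getPrimitiveType().getPrimitiveTypeName(),
                column.getMaxDefinitionLevel(),
                column.getMaxRepetitionLevel());
        }
    }
}
```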
Hierarchical structure
At a high level, the Parquet hierarchical structure consists of three primary components:
Row Groups: A row group is a horizontal partition of data (a collection of rows), containing column data for a subset of rows. Each row group can be processed independently, allowing parallel processing and reading parts of the data without loading the entire dataset into memory.
Columns: Each row group contains column chunks, which store all values for a single column. Each column is stored separately, allowing column-level compression and encoding. The query engine can selectively query required columns, reducing computational overhead and memory usage.
Pages: Each column chunk is divided into Pages containing encoded and compressed data. There are three types of pages: data pages (actual column values), dictionary pages (dictionaries for repeated values), and index pages (optional, store row indexes for faster lookups).
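The sketch below (same assumptions as above: parquet-java and Hadoop libraries on the classpath, hypothetical file path) walks this hierarchy by iterating the row groups and column chunks recorded in the footer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupSketch {
    public static void main(String[] args) throws Exception {
        Path path = new Path("/tmp/customers.parquet"); // hypothetical path
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
            int rowGroup = 0;
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                System.out.printf("Row group %d: %d rows, %d bytes uncompressed%n",
                    rowGroup++, block.getRowCount(), block.getTotalByteSize());
                for (ColumnChunkMetaData chunk : block.getColumns()) {
                    System.out.printf("  column %-30s codec=%s compressed=%d bytes%n",
                        chunk.getPath().toDotString(), chunk.getCodec(), chunk.getTotalSize());
                }
            }
        }
    }
}
```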
Parquet example (Tabular format)
Understanding ColumnPath
In Parquet, a ColumnPath is the fully qualified name of a column, representing its location within a hierarchical (nested) schema. It is used to reference specific columns in structured or nested datasets.
In the context of the masking engine (specific to the parquet-java libraries):
- For flat schemas, column paths are simply the column names (e.g., First_Name).
- For nested schemas, column paths represent the full path to a nested field, using a slash (/) separator to indicate hierarchy (e.g., Location/Address).
- For list schema elements, column paths end in <path>/list/element (e.g., Emails/list/element).
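A hedged sketch of how these slash-separated paths can be derived with parquet-java (the schema is illustrative): joining each leaf column's path segments with / yields exactly the values listed in the example below.

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ColumnPathSketch {
    public static void main(String[] args) {
        // Illustrative nested schema with a LIST element, matching the example below.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message customer {\n"
            + "  required binary First_Name (UTF8);\n"
            + "  optional group Emails (LIST) {\n"
            + "    repeated group list { optional binary element (UTF8); }\n"
            + "  }\n"
            + "  optional group Location { optional binary Address (UTF8); }\n"
            + "}");

        // Slash-joined leaf paths in the form the masking engine expects:
        // First_Name, Emails/list/element, Location/Address
        for (ColumnDescriptor column : schema.getColumns()) {
            System.out.println(String.join("/", column.getPath()));
        }
    }
}
```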
ColumnPath example
First_Name
Last_Name
DOB
SSN
Emails/list/element
Location/Address
Location/City
Location/House_Number
Location/Zip_Code
Data types in Parquet
This section describes how Apache Parquet stores data on disk and how data types are defined within the format. Understanding Parquet’s data type layout is essential for applying appropriate masking algorithms to Parquet columns.
Parquet is a columnar storage format that uses two layers of data typing:
- Physical types (Primitive types)
- Logical types (Extended types)
Physical types
Parquet defines a limited set of physical (primitive) types that determine how data is stored on disk at the binary level. These types form the foundational structure for all Parquet data.
The table below outlines each physical type, the storage it occupies on disk, and how the Java-based Parquet reader interprets it within the context of the Delphix Continuous Compliance engine.
| Primitive Type | Storage Size | Description | Java Mapping |
| --- | --- | --- | --- |
| BOOLEAN | 1 bit (packed) | True or False | boolean |
| INT32 | 4 bytes | 32-bit signed integer | int |
| INT64 | 8 bytes | 64-bit signed integer | long |
| INT96 | 12 bytes | Legacy nanosecond timestamp (used by Impala or Hive engines) | Binary (decoded to LocalDateTime) |
| FLOAT | 4 bytes | IEEE 754 single-precision float | float |
| DOUBLE | 8 bytes | IEEE 754 double-precision float | double |
| BYTE_ARRAY / BINARY | Variable | Length-prefixed binary blob (0–2³¹-1 bytes) | Binary / byte[] |
| FIXED_LEN_BYTE_ARRAY | N bytes (fixed length) | Fixed-length binary blob | Binary / byte[] |
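As an illustration of the physical-type layer, the sketch below (illustrative schema; the helper method is hypothetical, not an engine API) resolves each leaf column's PrimitiveTypeName and maps it to the Java type shown in the table above.

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

public class PhysicalTypeSketch {
    // Hypothetical helper: maps a Parquet physical type to the Java type a Java-based reader works with.
    static String javaTypeFor(PrimitiveTypeName physical) {
        switch (physical) {
            case BOOLEAN:              return "boolean";
            case INT32:                return "int";
            case INT64:                return "long";
            case INT96:                return "Binary (decoded to LocalDateTime)";
            case FLOAT:                return "float";
            case DOUBLE:               return "double";
            case BINARY:               return "Binary / byte[]";
            case FIXED_LEN_BYTE_ARRAY: return "Binary / byte[]";
            default:                   return "unknown";
        }
    }

    public static void main(String[] args) {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example {\n"
            + "  required binary SSN (UTF8);\n"
            + "  optional int32 DOB (DATE);\n"
            + "  optional double Balance;\n"
            + "}");
        for (ColumnDescriptor column : schema.getColumns()) {
            PrimitiveTypeName physical = column.getPrimitiveType().getPrimitiveTypeName();
            System.out.println(String.join("/", column.getPath())
                + " -> " + physical + " -> " + javaTypeFor(physical));
        }
    }
}
```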
Logical types
Parquet logical types (represented in code via LogicalTypeAnnotation or the older ConvertedType) annotate the core set of physical (primitive) types with semantic meaning. These annotations do not alter the underlying storage format on disk; instead, they instruct Parquet readers to interpret the stored bytes in a specific way.
For example, a logical type might indicate that a raw byte sequence should be interpreted as a timestamp, a fixed-length byte array as a UUID, or an integer as a decimal with scale 2. This semantic layer ensures data is read and processed correctly by consuming applications.
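A minimal sketch, assuming parquet-java 1.11+ (where LogicalTypeAnnotation is available) and an illustrative schema, showing how the logical annotation is read alongside the physical type of each column:

```java
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class LogicalTypeSketch {
    public static void main(String[] args) {
        // DECIMAL(9,2) over INT32: the bytes on disk are a plain 32-bit integer;
        // the annotation tells readers to treat it as an unscaled value with scale 2.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example {\n"
            + "  required int32 Price (DECIMAL(9,2));\n"
            + "  required binary Email (UTF8);\n"
            + "  required int32 Created (DATE);\n"
            + "}");
        for (ColumnDescriptor column : schema.getColumns()) {
            LogicalTypeAnnotation logical = column.getPrimitiveType().getLogicalTypeAnnotation();
            System.out.println(String.join("/", column.getPath())
                + " physical=" + column.getPrimitiveType().getPrimitiveTypeName()
                + " logical=" + (logical == null ? "none" : logical));
        }
    }
}
```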
It’s important to note that some logical types support only specific masking algorithms, and a few are currently not supported for masking. Refer to the table below for masking support details:
| Logical Type | Physical Storage | Description | Supported Masking Algorithm |
| --- | --- | --- | --- |
| STRING | BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY | UTF-8 encoded text | Supports string-based algorithms |
| MAP | 3-level GROUP | Key→value pairs (repeated key_value groups) | Supported |
| LIST | 3-level GROUP | Ordered collection of elements | Supported |
| ENUM | BINARY | One of a predefined set of string values | Supports string-based algorithms |
| DECIMAL | INT32, INT64, or FIXED_LEN_BYTE_ARRAY | Fixed-precision number with (precision, scale) | Supports only numeric-based algorithms |
| DATE | INT32 | Days since the Unix epoch | Supports only numeric-based algorithms |
| TIME | INT32/INT64 + unit (MILLIS/MICROS/NANOS) | Time since midnight (with optional UTC adjustment flag) | Supports only numeric-based algorithms |
| TIMESTAMP | INT64 + unit (MILLIS/MICROS/NANOS) | Instant since the Unix epoch (with optional UTC adjustment flag) | Supports only numeric-based algorithms |
| INTEGER | INT32/INT64 | Integer with explicit bit-width and signedness | Supports only numeric-based algorithms |
| JSON | BYTE_ARRAY | JSON document as text | Supported |
| BSON | BYTE_ARRAY | BSON document | Unsupported |
| UUID | FIXED_LEN_BYTE_ARRAY(16) | 128-bit universally unique identifier | Supported. Since a UUID is a hexadecimal value, choose algorithms that produce a valid UUID after masking. |
| INTERVAL | FIXED_LEN_BYTE_ARRAY(12) | Hive-style interval: months (4 bytes), days (4 bytes), milliseconds (4 bytes), little-endian | Unsupported |
| FLOAT16 | FIXED_LEN_BYTE_ARRAY(2) | IEEE 754 half-precision (16-bit) floating point | Unsupported |
| TIMESTAMP | INT96 | Legacy nanosecond timestamp | Supports only date-based algorithms with only date format: |
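To make the semantic layer concrete, the plain-Java sketch below (illustrative values) decodes the same raw INT32 value two ways, following the DECIMAL and DATE rows above: as an unscaled decimal with scale 2, and as days since the Unix epoch.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

public class LogicalDecodeSketch {
    public static void main(String[] args) {
        int raw = 19845; // the same 32-bit value stored on disk

        // DECIMAL(precision=9, scale=2): raw is the unscaled value -> 198.45
        BigDecimal asDecimal = BigDecimal.valueOf(raw, 2);

        // DATE: raw is the number of days since 1970-01-01 -> 2024-05-02
        LocalDate asDate = LocalDate.ofEpochDay(raw);

        System.out.println("DECIMAL(9,2): " + asDecimal);
        System.out.println("DATE        : " + asDate);
    }
}
```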
Algorithms and masking Parquet
Many storage engines—including Apache Hive, Apache Drill, AWS Athena/Redshift Spectrum, Presto/Trino, DuckDB, Delta Lake, and others—use Parquet files as their underlying storage format. When implementing data masking at the storage layer, it is essential to follow a schema-aware and type-sensitive approach.
To apply masking effectively to Parquet data:
- Inspect the generated Parquet schema.
- Select masking algorithms based on the physical type of the column.
- Ensure that masked outputs conform to the original Parquet type definitions and any associated logical annotations (e.g., TIMESTAMP or DECIMAL).

Example scenario: If the Parquet schema defines a column with a logical type of timestamp, check the underlying physical type:
- If the type is INT96, use a date-based algorithm such as DateShift.
- If the type is INT32 or INT64, use a numeric masking algorithm.
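This decision can be expressed as a small, hypothetical helper (the return values are descriptive categories, not engine algorithm identifiers): it inspects a leaf column's logical annotation and physical type and suggests the category of masking algorithm that fits.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

public class AlgorithmChoiceSketch {

    // Hypothetical helper: suggests a masking-algorithm category for one leaf column.
    static String algorithmCategoryFor(PrimitiveType leaf) {
        LogicalTypeAnnotation logical = leaf.getLogicalTypeAnnotation();
        PrimitiveTypeName physical = leaf.getPrimitiveTypeName();

        if (physical == PrimitiveTypeName.INT96) {
            // Legacy nanosecond timestamps: only date-based algorithms apply.
            return "date-based (e.g. a date-shift style algorithm)";
        }
        if (logical instanceof LogicalTypeAnnotation.TimestampLogicalTypeAnnotation) {
            // TIMESTAMP over INT32/INT64: treat as numeric.
            return "numeric";
        }
        if (logical instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
            return "string-based";
        }
        switch (physical) {
            case INT32:
            case INT64:
            case FLOAT:
            case DOUBLE:
                return "numeric";
            default:
                return "see the logical-type table above";
        }
    }
}
```

The PrimitiveType argument can be obtained from a ColumnDescriptor via getPrimitiveType(), as in the earlier sketches.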
Memory considerations for Parquet data masking jobs
A row group is the smallest partition of data loaded into memory when reading a Parquet file. The size of a row group can range from a few megabytes to several gigabytes.
Profiling job: The minimum memory allocation must be at least equal to the size of the largest row group in the file.
Masking job: Since masking involves reading and writing simultaneously, the minimum memory allocation must be twice the largest row group size. This ensures that heap memory can accommodate row groups from both the input and output files.
Note that memory estimations based on row groups provide a ballpark range; the actual job may require additional memory for other processing operations.
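One rough way to estimate this floor (a hedged sketch; the engine's own sizing may differ) is to read the footer, take the largest row group's uncompressed size, and double it for a masking job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class MemoryEstimateSketch {
    public static void main(String[] args) throws Exception {
        Path path = new Path("/tmp/customers.parquet"); // hypothetical path
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
            long largestRowGroup = 0;
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                largestRowGroup = Math.max(largestRowGroup, block.getTotalByteSize());
            }
            System.out.println("Largest row group (uncompressed bytes): " + largestRowGroup);
            System.out.println("Profiling job floor : ~" + largestRowGroup + " bytes of heap");
            System.out.println("Masking job floor   : ~" + (2 * largestRowGroup) + " bytes of heap");
        }
    }
}
```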
Parquet masking limitations
In a Parquet file, each column chunk, and different row groups of the same column, may use a different compression codec. If the masking engine encounters such a file, the masked output file is written without compression, using the UNCOMPRESSED codec.
Writing a new masked Parquet file uses the Parquet file writer's default properties. As a result, some properties and statistics from the original file may not be carried over to the new masked file.
In the execution-component section for a Parquet file, Bytes Processed represents the total data written. However, since the Parquet file metadata size cannot currently be calculated, Bytes Remaining in the execution component may show a non-zero value (i.e., the metadata size) even after masking is complete.