Parquet structure

This guide provides an overview of handling Parquet files for data masking. It includes defining Parquet schemas, understanding nested structures, and using Parquet column paths to accurately target data fields for masking.

Understanding Parquet structure

Parquet is a columnar storage format: data is stored by column rather than by row, which improves compression and query performance. It is a binary format and can only be read with Parquet parsers or readers. A Parquet file has a file header and a footer; the footer contains file metadata such as the format version, schema, row-group sizes, encodings and compression codecs, statistics, and dictionaries. The file body contains the row groups, column chunks, and pages that store the actual data.

Definitions

Schema: A Parquet schema defines the structure of the stored data, including field types, repetition rules, and nesting levels. The schema can include primitive types (INT32, INT64, BOOLEAN, BINARY, FLOAT, DOUBLE) and complex types (LIST, MAP, STRUCT).
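For illustration, the sketch below uses the parquet-java MessageTypeParser to declare a small schema that combines primitive types (BINARY, INT32) with complex types (a nested group and a three-level LIST); the field names are hypothetical.

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class SchemaExample {
        // A hypothetical schema mixing primitive types (binary, int32) with
        // complex types (a nested group and a three-level LIST).
        static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
            "message Person {\n" +
            "  required binary First_Name (UTF8);\n" +
            "  optional int32 DOB (DATE);\n" +
            "  optional group Emails (LIST) {\n" +
            "    repeated group list {\n" +
            "      optional binary element (UTF8);\n" +
            "    }\n" +
            "  }\n" +
            "  optional group Location {\n" +
            "    optional binary Address (UTF8);\n" +
            "    optional binary Zip_Code (UTF8);\n" +
            "  }\n" +
            "}");
    }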

Groups and Fields: Groups represent structured objects (similar to JSON objects or XML elements), and fields define individual data points within a group (equivalent to JSON fields or XML nodes).

Repetition and Definition Levels: The repetition level tracks repeated (LIST) elements inside a nested structure, identifying whether a value starts a new list or continues an existing one. The definition level tracks nullability in a nested or optional column, i.e., it represents how many levels of nesting are defined (non-null) before the value itself.
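For illustration, in the three-level LIST schema sketched above, the Emails/list/element column has a maximum repetition level of 1 and a maximum definition level of 3 (optional Emails, repeated list, optional element). Sample values encode as follows:

    Emails = ["a@x.com", "b@x.com"]  ->  ("a@x.com", r=0, d=3), ("b@x.com", r=1, d=3)
    Emails = []                      ->  (null, r=0, d=1)   // Emails group present, but no entries
    Emails = null                    ->  (null, r=0, d=0)   // Emails group itself is null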

Hierarchical structure

At a high level, the Parquet hierarchical structure consists of three primary components:

Row Groups: A row group is a horizontal partition of data (a collection of rows) that contains column data for a subset of rows. Each row group can be processed independently, allowing parallel processing and reading parts of the data without loading the entire dataset into memory.

Columns: Each row group contains column chunks, which store all values for a single column. Each column is stored separately, allowing column-level compression and encoding. The query engine can selectively query required columns, reducing computational overhead and memory usage.

Pages: Each column chunk is divided into Pages containing encoded and compressed data. There are three types of pages: data pages (actual column values), dictionary pages (dictionaries for repeated values), and index pages (optional, store row indexes for faster lookups).
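As a sketch of how these components surface through the parquet-java API (assuming the parquet-hadoop and Hadoop client libraries are available), the snippet below reads only the file footer and prints each row group's column chunks; the class name and output format are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class InspectRowGroups {
        public static void main(String[] args) throws Exception {
            // Opens only the footer metadata; row-group data is not loaded here.
            try (ParquetFileReader reader = ParquetFileReader.open(
                    HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
                for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
                    System.out.printf("Row group: %d rows, %d bytes%n",
                            rowGroup.getRowCount(), rowGroup.getTotalByteSize());
                    for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
                        // Each column chunk reports its own codec and page encodings.
                        System.out.printf("  column %s  codec=%s  encodings=%s%n",
                                chunk.getPath().toDotString(), chunk.getCodec(), chunk.getEncodings());
                    }
                }
            }
        }
    }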

Parquet example (Tabular format)

[Image: Parquet structure example]

Understanding ColumnPath

In Parquet, a ColumnPath is the fully qualified name of a column, representing its location within a hierarchical (nested) schema. It is used to reference specific columns in structured or nested datasets.

In the context of the masking engine (specific to parquet-java libraries):

  • For flat schemas, column paths are simply the column names (e.g., First_Name).
  • For nested schemas, column paths represent the full path to a nested field, using a slash (/) separator to indicate hierarchy (e.g., Location/Address).
  • For list schema elements, column paths end in <path>/list/element (e.g., Emails/list/element).

ColumnPath example

  • First_Name
  • Last_Name
  • DOB
  • SSN
  • Emails/list/element
  • Location/Address
  • Location/City
  • Location/House_Number
  • Location/Zip_Code
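A minimal sketch of how such paths can be derived with parquet-java: ColumnDescriptor.getPath() returns the path segments of each leaf column, and joining them with a slash yields values like those above (the class and method names are illustrative).

    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.schema.MessageType;

    public class ColumnPaths {
        // Prints one slash-separated path per leaf column, for example
        // Location/Address or Emails/list/element.
        static void printColumnPaths(MessageType schema) {
            for (ColumnDescriptor column : schema.getColumns()) {
                System.out.println(String.join("/", column.getPath()));
            }
        }
    }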

Data types in Parquet

This section describes how Apache Parquet stores data on disk and how data types are defined within the format. Understanding Parquet’s data type layout is essential for applying appropriate masking algorithms to Parquet columns.

Parquet is a columnar storage format that utilizes two layers of data typing.

  1. Physical types (Primitive types)
  2. Logical types (Extended types)

Physical types

Parquet defines a limited set of physical (primitive) types that determine how data is stored on disk at the binary level. These types form the foundational structure for all Parquet data.

The table below outlines each physical type, the storage size it occupies, and how the Java-based Parquet reader interprets it within the context of the Delphix Continuous Compliance engine.

| Primitive Type | Storage Size | Description | Java Mapping |
|---|---|---|---|
| BOOLEAN | 1 bit (packed) | True or False | boolean |
| INT32 | 4 bytes | 32-bit signed integer | int |
| INT64 | 8 bytes | 64-bit signed integer | long |
| INT96 | 12 bytes | Legacy nanosecond timestamp (used by Impala or Hive engines) | Binary (decoded to LocalDateTime) |
| FLOAT | 4 bytes | IEEE 754 single-precision float | float |
| DOUBLE | 8 bytes | IEEE 754 double-precision float | double |
| BYTE_ARRAY / BINARY | Variable | Length-prefixed binary blob (0–2³¹-1 bytes) | Binary / byte[] |
| FIXED_LEN_BYTE_ARRAY | N bytes (fixed length) | Fixed-length binary blob | Binary / byte[] |
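As an illustration, the physical type of each leaf column can be read from the schema with parquet-java (assuming a recent parquet-java release; the class and method names are hypothetical):

    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

    public class PhysicalTypes {
        // Reports the physical (primitive) type of every leaf column, which
        // determines how the stored bytes are interpreted.
        static void printPhysicalTypes(MessageType schema) {
            for (ColumnDescriptor column : schema.getColumns()) {
                PrimitiveTypeName physical = column.getPrimitiveType().getPrimitiveTypeName();
                System.out.printf("%s -> %s%n", String.join("/", column.getPath()), physical);
            }
        }
    }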

Logical types

Parquet logical types (represented in code via LogicalTypeAnnotation or the older ConvertedType) annotate the core set of physical (primitive) types with semantic meaning. These annotations do not alter the underlying storage format on disk; instead, they instruct Parquet readers to interpret the stored bytes in a specific way.

For example, a logical type might indicate that a raw byte sequence should be interpreted as a timestamp, a fixed-length byte array as a UUID, or an integer as a decimal with scale 2. This semantic layer ensures data is read and processed correctly by consuming applications.
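A minimal sketch using the parquet-java Types builder: the three columns below share INT32 physical storage but carry different logical annotations, so a reader interprets the same four stored bytes as a plain integer, a date, or a scaled decimal (the field names are hypothetical).

    import org.apache.parquet.schema.LogicalTypeAnnotation;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class LogicalTypeExample {
        // All three columns use INT32 physical storage; the annotations change
        // only how readers interpret the stored 4-byte values.
        static final MessageType SCHEMA = Types.buildMessage()
            .required(PrimitiveTypeName.INT32).named("raw_int")              // plain 32-bit integer
            .required(PrimitiveTypeName.INT32)
                .as(LogicalTypeAnnotation.dateType()).named("dob")           // days since the Unix epoch
            .required(PrimitiveTypeName.INT32)
                .as(LogicalTypeAnnotation.decimalType(2, 9)).named("price")  // DECIMAL(precision=9, scale=2)
            .named("Example");
    }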

It’s important to note that some logical types support only specific masking algorithms, and a few are currently not supported for masking. Refer to the table below for masking support details:

| Logical Type | Physical Storage | Description | Supported Masking Algorithm |
|---|---|---|---|
| STRING | BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY | UTF-8 encoded text | Supports string-based algorithms |
| MAP | 3-level GROUP | Key→value pairs (repeated key_value groups) | Supported |
| LIST | 3-level GROUP | Ordered collection of elements | Supported |
| ENUM | BINARY | One of a predefined set of string values | Supports string-based algorithms |
| DECIMAL | INT32, INT64, or FIXED_LEN_BYTE_ARRAY | Fixed-precision number with (precision, scale) | Supports only numeric-based algorithms |
| DATE | INT32 | Days since the Unix epoch | Supports only numeric-based algorithms |
| TIME | INT32/INT64 + unit (MILLIS/MICROS/NANOS) | Time since midnight (with optional UTC adjustment flag) | Supports only numeric-based algorithms |
| TIMESTAMP | INT64 + unit (MILLIS/MICROS/NANOS) | Instant since the Unix epoch (with optional UTC adjustment flag) | Supports only numeric-based algorithms |
| INTEGER | INT32/INT64 | Integer with explicit bit width and signedness | Supports only numeric-based algorithms |
| JSON | BYTE_ARRAY | JSON document as text | Supported |
| BSON | BYTE_ARRAY | BSON document | Unsupported |
| UUID | FIXED_LEN_BYTE_ARRAY(16) | 128-bit universally unique identifier | Supported. Since a UUID is a hexadecimal value, choose algorithms that produce a valid UUID after masking. |
| INTERVAL | FIXED_LEN_BYTE_ARRAY(12) | Hive-style interval: months (4 bytes), days (4 bytes), milliseconds (4 bytes), little-endian | Unsupported |
| FLOAT16 | FIXED_LEN_BYTE_ARRAY(2) | IEEE 754 half-precision (16-bit) floating point | Unsupported |
| TIMESTAMP | INT96 | Legacy nanosecond timestamp | Supports only date-based algorithms with the date format yyyy-MM-dd HH:mm:ss (this causes loss of nanosecond precision) |

Algorithms and masking Parquet

Many storage engines—including Apache Hive, Apache Drill, AWS Athena/Redshift Spectrum, Presto/Trino, DuckDB, Delta Lake, and others—use Parquet files as their underlying storage format. When implementing data masking at the storage layer, it is essential to follow a schema-aware and type-sensitive approach.

To apply masking effectively to Parquet data:

  • Inspect the generated Parquet schema.
  • Select masking algorithms based on the physical type of the column.
  • Ensure that masked outputs conform to the original Parquet type definitions and any associated logical annotations.

Masking algorithms must target the physical storage type of each column, regardless of any higher-level logical annotations (such as TIMESTAMP or DECIMAL).

Example scenario: If the Parquet schema defines a column with a logical type of TIMESTAMP, check the underlying physical type (a minimal sketch of this decision follows the list below):

  • If the type is INT96, use a date-based algorithm such as DateShift.
  • If the type is INT32 or INT64, use a numeric masking algorithm.
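A sketch of that decision logic using parquet-java's PrimitiveTypeName is shown below; the helper and the algorithm-family labels are illustrative placeholders, not actual engine identifiers.

    import org.apache.parquet.column.ColumnDescriptor;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;

    public class AlgorithmSelection {
        // Maps a column's physical type to a broad masking-algorithm family.
        static String chooseAlgorithmFamily(ColumnDescriptor column) {
            PrimitiveTypeName physical = column.getPrimitiveType().getPrimitiveTypeName();
            switch (physical) {
                case INT96:
                    return "date-based (e.g. a DateShift-style algorithm)";
                case INT32:
                case INT64:
                case FLOAT:
                case DOUBLE:
                    return "numeric";
                case BINARY:
                case FIXED_LEN_BYTE_ARRAY:
                    return "string or binary";
                default:
                    return "boolean";
            }
        }
    }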

Memory considerations for Parquet data masking jobs

A row group is the smallest partition of data loaded into memory when reading a Parquet file. The size of a row group can range from a few megabytes to several gigabytes.

Profiling job: The minimum memory allocation must be at least equal to the size of the largest row group in the file.

Masking job: Since masking involves reading and writing simultaneously, the minimum memory allocation must be twice the largest row group size. This ensures that heap memory can accommodate row groups from both the input and output files.
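For example, if the largest row group in a file is 512 MB, a profiling job needs at least 512 MB of heap for row-group data, while a masking job needs at least 1 GB (2 × 512 MB) to hold the input and output row groups at the same time.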

Note that memory estimations based on row groups provide a ballpark range; however, the actual job may require additional memory for other processing operations.

Parquet masking limitations

In a Parquet file, different column chunks (including chunks of the same column in different row groups) may use different compression codecs. If the masking engine encounters such a file, the masked output file is written without compression, using the UNCOMPRESSED codec.

Writing a new masked Parquet file uses the Parquet file writer's default properties. As a result, some properties and statistics from the original file may not be carried over to the new masked file.

In the execution-component section for a Parquet file, Bytes Processed represents the total data written. However, since the Parquet file metadata size cannot currently be calculated, Bytes Remaining may show a non-zero value (i.e., the metadata size) even after masking is complete.