The MaskingAlgorithm Java interface
Any Java class that should be recognized as a masking algorithm (whether standalone or configurable) must implement the MaskingAlgorithm interface. This interface is parameterized with the data type that the algorithm masks, which defines the input and output data type of the mask method. The full details of this interface are described in the masking plugin API Java document.
Core data types
The Delphix Continuous Compliance Engine is designed to support a wide and extensible set of data sources, which naturally encode data in a variety of different formats. To simplify algorithm development while maintaining the ability to mask data from many sources, a core set of data types has been identified as likely to require different masking treatment, and the Extensible Algorithm framework ensures that all data is converted to and from these types as needed. These types define the allowed parameterization of the MaskingAlgorithm Java interface.
Each masking algorithm class is defined to mask exactly one of the following data types:
- Binary data - java.nio.ByteBuffer
- String data - java.lang.String
- Numeric data - java.math.BigDecimal
- Date time data - java.time.LocalDateTime
- Multi-column data - com.delphix.masking.api.plugin.utils.GenericDataRow (see the Multi-column masking section)
Each algorithm is expected to input, process, and emit objects of one of the above Java types, but is free to use any intermediate types as needed to access library methods. Data of one type is frequently stored in databases or documents in a type other than its most natural native type (for example, dates stored in VARCHAR fields, or numbers stored as text in a CSV file). For this reason, the masking framework that executes these algorithms can perform a number of automatic type conversions, detailed in the next section. This allows algorithms written to process one data type to handle data of other types, with no additional work required of the algorithm author.
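As a rough illustration of the parameterization, the sketch below masks date time data and uses java.time.LocalDate as an intermediate type. The MaskingAlgorithm interface declared here is a one-method stand-in so that the example compiles on its own; the real interface ships with the Algorithm SDK and declares more methods. The class name DateToMonthStart is hypothetical.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

// Stand-in for the SDK interface, reduced to the mask method (assumption).
interface MaskingAlgorithm<T> {
    T mask(T input);
}

// Hypothetical algorithm: T = LocalDateTime, so mask takes and returns LocalDateTime.
class DateToMonthStart implements MaskingAlgorithm<LocalDateTime> {
    @Override
    public LocalDateTime mask(LocalDateTime input) {
        if (input == null) {
            return null;
        }
        // LocalDate is used only as an intermediate type inside the method.
        LocalDate firstOfMonth = input.toLocalDate().withDayOfMonth(1);
        return firstOfMonth.atStartOfDay();
    }
}
```

Because the class is parameterized with LocalDateTime, the same code could also be applied to dates stored as strings or numbers, provided the framework performs the conversions described in the next section.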
Supported automatic type conversions
| Algorithm native type | Supported type | Notes |
|---|---|---|
| ByteBuffer | String | Algorithm receives the UTF-8 encoded value of the String and is expected to return a valid UTF-8 ByteBuffer. |
| LocalDateTime | String | The correct date format must be assigned to the field or column in the masking inventory. |
| LocalDateTime | Compatible numeric types | A compatible date format, such as yyyyMMdd, must be assigned to the column in the inventory. |
| BigDecimal | All numeric types | Converted to BigDecimal. Values out of range (after masking) are truncated to fit the range of the underlying type. |
| BigDecimal | String | String value is converted to a number. |
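To make two of these rows concrete, the following sketch shows what the conversions amount to using only JDK classes. It is an illustration of the idea, not the framework's own conversion code. The class and method names are hypothetical, and mapping a date-only value to midnight is an assumption of this sketch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ConversionSketch {
    // String -> ByteBuffer: a ByteBuffer algorithm sees the UTF-8 bytes of the String.
    static ByteBuffer toBytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // ByteBuffer -> String: the buffer returned by the algorithm is decoded as UTF-8.
    static String fromBytes(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // Numeric value such as 20240131 -> LocalDateTime, using the yyyyMMdd format.
    static LocalDateTime fromNumericDate(long value) {
        DateTimeFormatter format = DateTimeFormatter.ofPattern("yyyyMMdd");
        return LocalDate.parse(Long.toString(value), format).atStartOfDay();
    }
}
```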
Special case values
In order to allow algorithms to implement special handling for null, empty, and special case values, these values are presented to the masking algorithm unmodified. Algorithms should be prepared to process the full range of input values possible for the input type. In practice, this means that most mask method implementations will begin with a null check on the input value prior to attempting to use the input – for example, calling input.length() or similar. It is perfectly acceptable and common to return null in the case where the mask input is null.
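A minimal sketch of this pattern for a String algorithm follows. The one-method MaskingAlgorithm interface is a stand-in declared only so the example is self-contained, and MaskAllButLastFour is a hypothetical algorithm.

```java
// Stand-in for the SDK interface, reduced to the mask method (assumption).
interface MaskingAlgorithm<T> {
    T mask(T input);
}

// Hypothetical algorithm that replaces all but the last four characters with '*'.
class MaskAllButLastFour implements MaskingAlgorithm<String> {
    @Override
    public String mask(String input) {
        // Null arrives unmodified, so check before calling any method on input.
        if (input == null) {
            return null;
        }
        // Empty and short strings are valid inputs too; the loop simply does nothing.
        char[] out = input.toCharArray();
        int visibleFrom = Math.max(0, out.length - 4);
        for (int i = 0; i < visibleFrom; i++) {
            out[i] = '*';
        }
        return new String(out);
    }
}
```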
Method overview
This section provides a high-level overview of the methods in the MaskingAlgorithm interface. For complete details, consult the masking plugin API Java document included in the Algorithm SDK archive.
- getName and getDescription - These methods are used to determine the name and description of frameworks and algorithm instances included in the plugin. For user-created instances, these methods are never called.
- getDefaultInstances and getAllowFurtherInstances - These methods control the set of instances of the algorithm framework that are defined by the plugin, and whether the user should be allowed to create additional instances.
- validate - This method is called after configuration is applied to allow the algorithm class to check whether the injected configuration is valid.
- setup and tearDown - These methods are called before the algorithm object is used for masking, and after, respectively. Typically, any resources, such as input files, are acquired during setup and released during tearDown.
- mask - This is the method that does the actual data masking in the algorithm class. The input and output values are parameterized for type safety as described above.
- maskBatch - This method is called to perform masking in situations when it is possible for the caller to build a collection of input values to mask in a single method call. A default implementation is provided that simply calls the mask method on each value in the batch.
- listMultiColumnFields - This method needs to be implemented only for Multi-Column Algorithms. It returns a list of AlgorithmLogicalField objects that define the set of fields that the multi-column algorithm masks.
The following methods are available but deprecated:
- listMaskedFields - This method needs to be implemented for Multi-Column Algorithms. It returns a map of field names (String) to the Core Data Type. This method does not need to be implemented if not implementing a Multi-Column Algorithm. Implement listMultiColumnFields instead.
- listReadOnlyFields - Similar to listMaskedFields but optional for Multi-Column Algorithms. Fields returned by this method are read-only and cannot be changed. Implement listMultiColumnFields instead.
The life cycles of algorithm objects
The Extensibility framework uses objects of classes implementing the MaskingAlgorithm interface for several distinct purposes. The life cycles of these objects are as follows:
Plugin discovery
This occurs when the extensibility framework evaluates the capabilities present in a MaskingAlgorithm class.
- Java object creation - an object of the algorithm class is created
- getName - determines framework name
- getDescription - determines framework description
- getDefaultInstances - determines all plugin-provided algorithm instances. For each instance:
  - getName - determines instance name
  - getDescription - determines instance description
  - validate - ensures the object passes validation
  - Serialize configurable fields - these are saved as a JSON document defining the instance's configuration
  - Disposal - the Java object is discarded
- getAllowFurtherInstances - determines whether the framework is visible in the algorithm/framework API endpoint
- Disposal - the Java object is discarded
User algorithm creation
This life cycle occurs whenever a user attempts to create a new instance of a plugin algorithm framework. The algorithm definition is saved only if each step succeeds.
- Java object creation - an object of the algorithm class is created
- Configuration injection - the values in the user-provided JSON document are injected into the object
- validate - the object's validate method is called
- Disposal - the Java object is discarded
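The steps above can be sketched roughly as follows. This is a simplified model. Configuration injection is reduced to a direct field assignment instead of JSON processing. The class names and the replacementChar field are hypothetical, and IllegalArgumentException stands in for whatever exception type the SDK expects validate to throw (consult the API document for the actual type).

```java
// Hypothetical configurable algorithm class.
class SubstitutionAlgorithm {
    // Hypothetical configurable field, filled in by configuration injection.
    public String replacementChar;

    // Stand-in for validate: reject configurations the algorithm cannot use.
    public void validate() {
        if (replacementChar == null || replacementChar.length() != 1) {
            throw new IllegalArgumentException(
                    "replacementChar must be exactly one character");
        }
    }
}

class CreationLifecycleSketch {
    // Returns true only if every step succeeds, mirroring the rule that the
    // algorithm definition is saved only in that case.
    static boolean canSave(String configuredChar) {
        SubstitutionAlgorithm candidate = new SubstitutionAlgorithm(); // object creation
        candidate.replacementChar = configuredChar;                    // configuration injection
        try {
            candidate.validate();                                      // validate
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
        // In both cases the candidate goes out of scope here (disposal).
    }
}
```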
Algorithm use
This is the life cycle of an algorithm object when used to mask data.
- Java object creation - an object of the algorithm class is created
- Configuration injection - the saved JSON document defining this instance is injected into the object
- setup - the setup method is called once
- mask - the mask method is called on each value to be masked
- tearDown - the tearDown method is called once
- Disposal - the Java object is discarded
Note that a distinct Java object is created for each application of a masking algorithm during job execution. For algorithms that create or load a large amount of state, this can result in significant memory usage, because redundant copies of the data are stored for each object. This can be avoided by using a class-level static cache to store the data; the instance name, which can be retrieved during setup from the ComponentService interface object, can be used as the access key for data cached in this way.
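One possible shape for such a cache is sketched below. It is not SDK code. The setup method here takes the instance name as a plain String parameter, whereas in a real algorithm the name would be read from the ComponentService object passed to setup. LookupAlgorithm, loadReplacements, and the LOADS counter are hypothetical names used only for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class LookupAlgorithm {
    // Class-level cache shared by all objects of this class, keyed by instance name.
    private static final Map<String, List<String>> CACHE = new ConcurrentHashMap<>();
    // Counts how often the expensive load runs (illustration only).
    static final AtomicInteger LOADS = new AtomicInteger();

    private List<String> replacements;

    // Stand-in for setup: the instance name is passed directly (assumption).
    void setup(String instanceName) {
        // computeIfAbsent runs the loader at most once per key, even across threads.
        replacements = CACHE.computeIfAbsent(instanceName, LookupAlgorithm::loadReplacements);
    }

    // Placeholder for an expensive operation such as reading a large lookup file.
    private static List<String> loadReplacements(String instanceName) {
        LOADS.incrementAndGet();
        return Arrays.asList("Alice", "Bob", "Carol");
    }

    String mask(String input) {
        if (input == null) {
            return null;
        }
        int index = Math.floorMod(input.hashCode(), replacements.size());
        return replacements.get(index);
    }

    void tearDown() {
        // Drop only this object's reference; the shared cache entry stays available.
        replacements = null;
    }
}
```

In this sketch the cached entries live as long as the class is loaded, so a real implementation would also need to decide when a cached entry is stale, for example after the instance's configuration changes.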
Multi-column masking
It is possible to write an algorithm whose masking of one column depends on the values of other columns. To account for the different possible data types, such algorithms use an object called a GenericDataRow.
Generic data
A GenericDataRow is a map of field names (String) to GenericData objects. Each GenericData object contains the value, along with methods to return the respective typed object. When accessing the value from a GenericData object, it is necessary to read it into a Core Data Type. To do so, use one of the following methods:
- getStringValue()
- getBigDecimalValue()
- getLocalDateTimeValue()
- getByteBufferValue()
Once the value has been masked, it should be re-set by calling setValue and passing the value as a Core Data Type.
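The read, mask, and re-set pattern might look like the sketch below. The GenericData class here is a small stand-in written for this example, and the row is modeled as a plain Map rather than the SDK's GenericDataRow, so the real types and signatures may differ. The field names "state" and "city" and the replacement values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the SDK's GenericData: holds one value and returns it typed.
class GenericData {
    private Object value;

    GenericData(Object value) {
        this.value = value;
    }

    String getStringValue() {
        return (String) value;
    }

    void setValue(Object newValue) {
        this.value = newValue;
    }
}

class CityByStateMasker {
    // Hypothetical multi-column mask: the replacement city depends on the state column.
    static void mask(Map<String, GenericData> row) {
        String state = row.get("state").getStringValue(); // read into a core data type
        GenericData city = row.get("city");
        if (city.getStringValue() == null) {
            return; // leave null values as they are
        }
        String masked = "CA".equals(state) ? "Sacramento" : "Springfield";
        city.setValue(masked); // re-set the masked value as a core data type
    }

    // Small helper that builds a two-field row for experimentation.
    static Map<String, GenericData> row(String state, String city) {
        Map<String, GenericData> row = new HashMap<>();
        row.put("state", new GenericData(state));
        row.put("city", new GenericData(city));
        return row;
    }
}
```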
Batch masking
Batch masking is a feature that can improve algorithm performance significantly when high-latency operations are employed as part of the masking process. Accessing an external resource like a database or API introduces significant execution latency compared to executing Java code; batching incurs only a single round-trip latency while masking many values. Batching also allows the interchange of values between data rows during masking.
Batch masking support in jobs
Batching is currently supported for these job types:
- All Database masking jobs
- Delimited File jobs
- Fixed-Width File jobs
Batch size is equal to the job's Row Limit divided by 5, or equal to 2000 when the Row Limit is disabled; this is the guaranteed lower bound for batch size, assuming at least that number of inputs are available and no conditional record types are present. The final batch when processing a table or file may be up to twice the normal batch size.
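The arithmetic can be written out as a small helper. This is only a worked illustration of the rule stated above. The class and method names are hypothetical, an empty OptionalInt is used here to model a disabled Row Limit, and integer division is an assumption for Row Limits that are not multiples of 5.

```java
import java.util.OptionalInt;

class BatchSizeSketch {
    static final int BATCH_SIZE_WITHOUT_ROW_LIMIT = 2000;

    // Guaranteed lower bound for the batch size (when enough inputs are
    // available and no conditional record types are present).
    static int normalBatchSize(OptionalInt rowLimit) {
        return rowLimit.isPresent()
                ? rowLimit.getAsInt() / 5
                : BATCH_SIZE_WITHOUT_ROW_LIMIT;
    }

    // The final batch of a table or file may be up to twice the normal size.
    static int largestFinalBatch(OptionalInt rowLimit) {
        return 2 * normalBatchSize(rowLimit);
    }
}
```

For example, a Row Limit of 20000 gives a normal batch size of 4000 and a final batch of at most 8000 values, while a disabled Row Limit gives 2000 and 4000.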
For file jobs, the presence of conditional record types will cause batch sizes to be unpredictable, as the availability of records for batch execution will naturally vary based on how many records actually match the criteria for each record type. Algorithms that require a minimum batch size, such as Secure Shuffle, may fail in this case.
Using Batch Masking in an algorithm implementation
An algorithm implementation can customize how batches of values are masked by overriding the maskBatch method in the MaskingAlgorithm interface. There is no reason to implement this method unless there is a benefit to processing multiple values in a single operation. A common example of this is when the algorithm is accessing an external API to perform masking; in this case, masking multiple inputs per method call allows the access latency of the API to be incurred only once for the entire batch of inputs.
The maskBatch method is called with a MaskingBatch object parameterized by the same Java type used in the MaskingAlgorithm interface definition. The MaskingBatch object provides the following methods to facilitate masking:
- size - returns the size of the batch of values
- getValue - returns the value to be masked at a particular index in the batch
- setValue - sets the mask result at a particular index in the batch
- setError - indicates that an error occurred when masking the input value at a particular index in the batch
The default implementation of maskBatch in the MaskingAlgorithm interface provides a simple example of how to use these methods.
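For the external API case, a maskBatch override might be structured as in the sketch below. The MaskingBatch class here is a stand-in with the four methods listed above plus a getError helper added for the sketch. The real signatures may differ, in particular what setError accepts. The external service is simulated by the hypothetical tokenizeAll method, which rejects values longer than 8 characters.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the SDK's MaskingBatch (assumed signatures).
class MaskingBatch<T> {
    private final List<T> values;
    private final List<String> errors = new ArrayList<>();

    MaskingBatch(List<T> inputs) {
        values = new ArrayList<>(inputs);
        for (int i = 0; i < inputs.size(); i++) {
            errors.add(null);
        }
    }

    int size() { return values.size(); }
    T getValue(int index) { return values.get(index); }
    void setValue(int index, T masked) { values.set(index, masked); }
    void setError(int index, String message) { errors.set(index, message); }
    String getError(int index) { return errors.get(index); } // sketch-only helper
}

class RemoteTokenizingAlgorithm {
    int roundTrips = 0; // counts simulated service calls

    // Simulated external service: one call handles many values and returns
    // null for any value it rejects (here, values longer than 8 characters).
    private List<String> tokenizeAll(List<String> inputs) {
        roundTrips++;
        List<String> out = new ArrayList<>();
        for (String s : inputs) {
            out.add(s.length() > 8 ? null : "tok-" + Integer.toHexString(s.hashCode()));
        }
        return out;
    }

    void maskBatch(MaskingBatch<String> batch) {
        // Collect the non-null inputs and remember their positions in the batch.
        List<Integer> positions = new ArrayList<>();
        List<String> inputs = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            String value = batch.getValue(i);
            if (value != null) {
                positions.add(i);
                inputs.add(value);
            }
        }
        if (inputs.isEmpty()) {
            return;
        }
        List<String> results = tokenizeAll(inputs); // single round trip for the batch
        for (int k = 0; k < results.size(); k++) {
            int index = positions.get(k);
            if (results.get(k) == null) {
                batch.setError(index, "value rejected by service");
            } else {
                batch.setValue(index, results.get(k));
            }
        }
    }
}
```

The point of the structure is that the service latency is paid once per batch instead of once per value; null inputs are left untouched and never sent to the service.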