
Hash

Module for hashing data using the SHA-2 family of hash functions.

See the docstring of the `Sha2Hash` class for more information.

koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute #

```python
HASH_ALGORITHM = Literal[224, 256, 384, 512]
```

koheesio.spark.transformations.hash.STRING module-attribute #

```python
STRING = STRING
```

koheesio.spark.transformations.hash.Sha2Hash #

Hash the value of one or more columns using the SHA-2 family of hash functions.

A thin wrapper around `pyspark.sql.functions.sha2`.

Returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).

Note

This function allows concatenating the values of multiple columns together prior to hashing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `Union[str, List[str]]` | The column (or list of columns) to hash. Alias: `column` | required |
| `delimiter` | `Optional[str]` | Optional separator for the string that will eventually be hashed. Defaults to `'\|'` | `'\|'` |
| `num_bits` | `Optional[HASH_ALGORITHM]` | Algorithm to use for the SHA-2 hash. Should be one of 224, 256, 384, 512. Defaults to 256 | `256` |
| `target_column` | `str` | The generated hash will be written to the column name specified here | required |
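With the default settings, `Sha2Hash` casts each column value to a string, joins the values with the `'|'` delimiter, and applies SHA-256. The resulting digest can be reproduced off-Spark with Python's `hashlib`; the column names and row values below are purely illustrative:

```python
import hashlib

# Illustrative row values for two hypothetical columns, first_name and last_name
first_name, last_name = "Alice", "Smith"

# With default settings, the values are string-cast and joined with '|' ...
concatenated = "|".join([first_name, last_name])  # -> "Alice|Smith"

# ... and the result is hashed with SHA-256, yielding a lowercase hex string
digest = hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

print(len(digest))  # 64 hex characters for SHA-256
```

Spark's `sha2` hashes the UTF-8 bytes of the string, so this matches the value written to `target_column` for the same inputs.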

delimiter class-attribute instance-attribute #

```python
delimiter: Optional[str] = Field(
    default="|",
    description="Optional separator for the string that will eventually be hashed. Defaults to '|'",
)
```

num_bits class-attribute instance-attribute #

```python
num_bits: Optional[HASH_ALGORITHM] = Field(
    default=256,
    description="Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512",
)
```

target_column class-attribute instance-attribute #

```python
target_column: str = Field(
    default=...,
    description="The generated hash will be written to the column name specified here",
)
```

execute #

```python
execute()
```

Source code in `src/koheesio/spark/transformations/hash.py`:

```python
def execute(self):
    columns = list(self.get_columns())
    self.output.df = (
        self.df.withColumn(
            self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)
        )
        if columns
        else self.df
    )
```

koheesio.spark.transformations.hash.sha2_hash #

```python
sha2_hash(
    columns: List[str],
    delimiter: Optional[str] = "|",
    num_bits: Optional[HASH_ALGORITHM] = 256,
)
```

Hash the value of one or more columns using the SHA-2 family of hash functions.

A thin wrapper around `pyspark.sql.functions.sha2`.

Returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). This function allows concatenating the values of multiple columns together prior to hashing.

If a null is passed, the result will also be null.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `columns` | `List[str]` | The columns to hash | required |
| `delimiter` | `Optional[str]` | Optional separator for the string that will eventually be hashed. Defaults to `'\|'` | `'\|'` |
| `num_bits` | `Optional[HASH_ALGORITHM]` | Algorithm to use for the SHA-2 hash. Should be one of 224, 256, 384, 512. Defaults to 256 | `256` |
Source code in src/koheesio/spark/transformations/hash.py
```python
def sha2_hash(columns: List[str], delimiter: Optional[str] = "|", num_bits: Optional[HASH_ALGORITHM] = 256):
    """
    hash the value of 1 or more columns using SHA-2 family of hash functions

    Mild wrapper around pyspark.sql.functions.sha2

    - https://spark.apache.org/docs/3.3.2/api/python/reference/api/pyspark.sql.functions.sha2.html

    Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
    This function allows concatenating the values of multiple columns together prior to hashing.

    If a null is passed, the result will also be null.

    Parameters
    ----------
    columns : List[str]
        The columns to hash
    delimiter : Optional[str], optional, default=|
        Optional separator for the string that will eventually be hashed. Defaults to '|'
    num_bits : Optional[HASH_ALGORITHM], optional, default=256
        Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
    """
    # make sure all columns are of type pyspark.sql.Column and cast to string
    _columns = []
    for c in columns:
        if isinstance(c, str):
            c: Column = col(c)
        _columns.append(c.cast(STRING.spark_type()))

    # concatenate columns if more than 1 column is provided
    if len(_columns) > 1:
        column = concat_ws(delimiter, *_columns)
    else:
        column = _columns[0]

    return sha2(column, num_bits)
```