

Module for hashing data using the SHA-2 family of hash functions.

See the docstring of the Sha2Hash class for more information.

koheesio.spark.transformations.hash.HASH_ALGORITHM module-attribute #

HASH_ALGORITHM = Literal[224, 256, 384, 512]
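As a rough illustration of what the four allowed values mean, the sketch below maps each bit length to the matching algorithm in Python's standard hashlib module and reports the length of the resulting hex digest. `HASHLIB_NAMES` and `hex_digest` are hypothetical helpers, not part of koheesio, and the sketch assumes the string is hashed as UTF-8 bytes.

```python
import hashlib

# Hypothetical mapping from the allowed bit lengths to hashlib algorithm names.
HASHLIB_NAMES = {224: "sha224", 256: "sha256", 384: "sha384", 512: "sha512"}


def hex_digest(value: str, num_bits: int = 256) -> str:
    """Return the SHA-2 hex digest of a string for one of the allowed bit lengths."""
    if num_bits not in HASHLIB_NAMES:
        raise ValueError(f"num_bits must be one of {sorted(HASHLIB_NAMES)}, got {num_bits}")
    return hashlib.new(HASHLIB_NAMES[num_bits], value.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for bits in HASHLIB_NAMES:
        # Each hex character encodes 4 bits, so the digest has bits // 4 characters.
        print(bits, len(hex_digest("abc", bits)))
```

A larger bit length gives a longer hex string (56, 64, 96, or 128 characters), which matters when sizing the target column in a downstream schema.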

koheesio.spark.transformations.hash.STRING module-attribute #


koheesio.spark.transformations.hash.Sha2Hash #

Hashes the value of one or more columns using the SHA-2 family of hash functions.

Thin wrapper around pyspark.sql.functions.sha2.

Returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).


When more than one column is given, the column values are concatenated, separated by the delimiter, before hashing.
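To see why a delimiter is used when several columns are combined, consider the pure-Python sketch below. It is not koheesio code: `hash_values` is a hypothetical helper that joins string values with a delimiter and hashes the result with hashlib, which is assumed to mirror concatenating with a separator and then applying SHA-2.

```python
import hashlib
from typing import Sequence


def hash_values(values: Sequence[str], delimiter: str = "|", num_bits: int = 256) -> str:
    """Join the values with the delimiter, then return the SHA-2 hex digest."""
    joined = delimiter.join(values)
    return hashlib.new(f"sha{num_bits}", joined.encode("utf-8")).hexdigest()


# Without a separator, different rows can collapse to the same input string.
no_sep_1 = hash_values(["ab", "c"], delimiter="")
no_sep_2 = hash_values(["a", "bc"], delimiter="")

# With the default '|' separator the inputs are "ab|c" and "a|bc", which differ.
sep_1 = hash_values(["ab", "c"])
sep_2 = hash_values(["a", "bc"])

print(no_sep_1 == no_sep_2)  # True: both hash the string "abc"
print(sep_1 == sep_2)  # False
```

A delimiter does not remove the ambiguity entirely if the values themselves can contain the delimiter character, so it helps to pick one that does not occur in the data.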


Parameters:

columns : Union[str, List[str]]
    The column (or list of columns) to hash. Alias: column
delimiter : Optional[str]
    Optional separator for the string that will eventually be hashed. Defaults to '|'
num_bits : Optional[HASH_ALGORITHM]
    Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
target_column : str
    The generated hash will be written to the column name specified here


delimiter class-attribute instance-attribute #

delimiter: Optional[str] = Field(default='|', description="Optional separator for the string that will eventually be hashed. Defaults to '|'")

num_bits class-attribute instance-attribute #

num_bits: Optional[HASH_ALGORITHM] = Field(default=256, description='Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512')

target_column class-attribute instance-attribute #

target_column: str = Field(default=..., description='The generated hash will be written to the column name specified here')

execute #

Source code in src/koheesio/spark/transformations/
def execute(self):
    columns = list(self.get_columns())
    # add the hash column only when there is at least one column to hash
    self.output.df = (
        self.df.withColumn(
            self.target_column, sha2_hash(columns=columns, delimiter=self.delimiter, num_bits=self.num_bits)
        )
        if columns
        else self.df
    )
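
The branching in execute can be modeled without Spark. The sketch below is a plain-Python analogy, not the koheesio implementation: rows are dictionaries, `add_hash_column` is a hypothetical stand-in for the transformation, and hashlib replaces Spark's sha2. It shows that an empty column list leaves the data untouched, while a non-empty list adds the target column.

```python
import hashlib
from typing import Dict, List


def add_hash_column(
    rows: List[Dict[str, str]],
    columns: List[str],
    target_column: str,
    delimiter: str = "|",
    num_bits: int = 256,
) -> List[Dict[str, str]]:
    """Return rows with an extra hash column, or the rows unchanged if no columns are given."""
    if not columns:
        # Mirrors the `else self.df` branch: nothing to hash, so pass the data through.
        return rows
    result = []
    for row in rows:
        joined = delimiter.join(str(row[c]) for c in columns)
        digest = hashlib.new(f"sha{num_bits}", joined.encode("utf-8")).hexdigest()
        # Copy the row so the input is not mutated, then add the target column.
        result.append({**row, target_column: digest})
    return result


rows = [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
hashed = add_hash_column(rows, columns=["id", "name"], target_column="row_hash")
unchanged = add_hash_column(rows, columns=[], target_column="row_hash")

print(sorted(hashed[0]))  # ['id', 'name', 'row_hash']
print(unchanged is rows)  # True
```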

koheesio.spark.transformations.hash.sha2_hash #

sha2_hash(columns: List[str], delimiter: Optional[str] = '|', num_bits: Optional[HASH_ALGORITHM] = 256)

Hashes the value of one or more columns using the SHA-2 family of hash functions.

Thin wrapper around pyspark.sql.functions.sha2.

Returns the hex string result of the SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). When more than one column is given, the column values are concatenated, separated by the delimiter, before hashing.

If a null is passed, the result will also be null.
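
The null behavior can be modeled in plain Python as follows. This is a sketch under stated assumptions, not the library's code: `model_sha2_hash` is a hypothetical helper, None stands in for a Spark null, and the multi-column branch assumes the concatenation behaves like Spark's concat_ws, which skips null inputs. It is worth verifying against your Spark version.

```python
import hashlib
from typing import List, Optional


def model_sha2_hash(values: List[Optional[str]], delimiter: str = "|", num_bits: int = 256) -> Optional[str]:
    """Plain-Python model of hashing one or more nullable string values."""
    if len(values) == 1:
        # Single column: a null input yields a null result.
        if values[0] is None:
            return None
        joined = values[0]
    else:
        # Multiple columns: assumed to behave like concat_ws, which drops nulls before joining.
        joined = delimiter.join(v for v in values if v is not None)
    return hashlib.new(f"sha{num_bits}", joined.encode("utf-8")).hexdigest()


print(model_sha2_hash([None]))  # None
print(model_sha2_hash(["a", None, "b"]) == model_sha2_hash(["a", "b"]))  # True
```

If this model holds, rows such as ("a", null, "b") and ("a", "b") produce the same hash, so nulls may need to be replaced with a sentinel value before hashing when they must stay distinguishable.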


Parameters:

columns : List[str]
    The columns to hash
delimiter : Optional[str]
    Optional separator for the string that will eventually be hashed. Defaults to '|'
num_bits : Optional[HASH_ALGORITHM]
    Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512

Source code in src/koheesio/spark/transformations/
def sha2_hash(columns: List[str], delimiter: Optional[str] = "|", num_bits: Optional[HASH_ALGORITHM] = 256):
    """
    hash the value of 1 or more columns using SHA-2 family of hash functions

    Mild wrapper around pyspark.sql.functions.sha2

    Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
    This function allows concatenating the values of multiple columns together prior to hashing.

    If a null is passed, the result will also be null.

    Parameters
    ----------
    columns : List[str]
        The columns to hash
    delimiter : Optional[str], optional, default=|
        Optional separator for the string that will eventually be hashed. Defaults to '|'
    num_bits : Optional[HASH_ALGORITHM], optional, default=256
        Algorithm to use for sha2 hash. Defaults to 256. Should be one of 224, 256, 384, 512
    """
    # make sure all columns are of type pyspark.sql.Column and cast to string
    _columns = []
    for c in columns:
        if isinstance(c, str):
            c: Column = col(c)
        # STRING is the module-level attribute listed above
        _columns.append(c.cast(STRING))

    # concatenate columns if more than 1 column is provided
    if len(_columns) > 1:
        column = concat_ws(delimiter, *_columns)
    else:
        column = _columns[0]

    return sha2(column, num_bits)