Uuid5
Ability to generate UUID5 using native pyspark (no udf)
koheesio.spark.transformations.uuid5.HashUUID5
Generate a UUID with the UUID5 algorithm
Spark does not provide a built-in API to generate version 5 UUIDs, so a custom implementation is needed to provide this capability.
Prerequisites: this function has no side effects. However, in most cases your data is expected to be clean (e.g. trimmed of leading and trailing spaces).
Concept
UUID5 is based on the SHA-1 hash of a namespace identifier (which is a UUID) and a name (which is a string). https://docs.python.org/3/library/uuid.html#uuid.uuid5
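The construction described above can be sketched with the Python standard library alone. The helper name `uuid5_by_hand` and the sample name `"example.com"` are made up for illustration; the point is only that `uuid.uuid5` is the SHA-1 of the namespace bytes followed by the UTF-8 encoded name, truncated to 16 bytes with the version and variant bits set.

```python
import hashlib
import uuid


def uuid5_by_hand(namespace: uuid.UUID, name: str) -> uuid.UUID:
    """Build a version 5 UUID from SHA-1(namespace bytes + UTF-8 name)."""
    digest = hashlib.sha1(namespace.bytes + name.encode("utf-8")).digest()
    # Only the first 16 of the 20 SHA-1 bytes are kept; passing version=5
    # makes uuid.UUID overwrite the version and variant bits.
    return uuid.UUID(bytes=digest[:16], version=5)


manual = uuid5_by_hand(uuid.NAMESPACE_DNS, "example.com")
builtin = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
print(manual == builtin)
```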
Based on https://github.com/MrPowers/quinn/pull/96. The difference is that Spark 3.0.0 and later provide the OVERLAY function from ANSI SQL 2016, which replaces CONCAT + SUBSTRING and so needs less code and fewer string allocations.
For more info on OVERLAY, see: https://docs.databricks.com/sql/language-manual/functions/overlay.html
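To make the role of OVERLAY concrete, here is a pure-Python illustration of the same idea applied to the hex string of a SHA-1 digest. It is a sketch, not the Spark code used by HashUUID5: the `overlay` helper only mimics the SQL function with its default length, and `uuid5_from_hex` is a hypothetical name.

```python
import hashlib
import uuid


def overlay(text: str, replacement: str, pos: int) -> str:
    """Mimic SQL OVERLAY(text PLACING replacement FROM pos); pos is 1-based."""
    start = pos - 1
    return text[:start] + replacement + text[start + len(replacement):]


def uuid5_from_hex(namespace: uuid.UUID, name: str) -> str:
    # First 32 hex characters (16 bytes) of the SHA-1 digest.
    hex32 = hashlib.sha1(namespace.bytes + name.encode("utf-8")).hexdigest()[:32]
    # Version nibble: the 13th hex character becomes "5".
    hex32 = overlay(hex32, "5", 13)
    # Variant: the top two bits of the 17th hex character become binary 10.
    variant = format((int(hex32[16], 16) & 0x3) | 0x8, "x")
    hex32 = overlay(hex32, variant, 17)
    # Insert hyphens in the 8-4-4-4-12 layout.
    return "-".join([hex32[:8], hex32[8:12], hex32[12:16], hex32[16:20], hex32[20:]])


print(uuid5_from_hex(uuid.NAMESPACE_DNS, "example.com"))
```

Each overlay replaces characters in place, which is what would otherwise take a CONCAT of several SUBSTRING slices.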
Example
Input is a DataFrame with two columns:
id | string |
---|---|
1 | hello |
2 | world |
3 | |
Input parameters:
- source_columns = ["id", "string"]
- target_column = "uuid5"
Result:
id | string | uuid5 |
---|---|---|
1 | hello | f3e99bbd-85ae-5dc3-bf6e-cd0022a0ebe6 |
2 | world | b48e880f-c289-5c94-b51f-b9d21f9616c0 |
3 | | 2193a99d-222e-5a0c-a7d6-48fbe78d2708 |
In this example, the `id` and `string` columns are concatenated and hashed using the UUID5 algorithm. The result is stored in the `uuid5` column.
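The concatenate-then-hash step can be imitated in plain Python to see what gets hashed. This is a sketch under assumptions: the namespace is a stand-in (the namespace HashUUID5 applies is not shown in this section, so the values need not match the table above), only the two fully populated rows are used, and null handling is left out.

```python
import uuid

# Hypothetical namespace, for illustration only.
NAMESPACE = uuid.NAMESPACE_DNS
DELIMITER = "|"

rows = [{"id": 1, "string": "hello"}, {"id": 2, "string": "world"}]
source_columns = ["id", "string"]

for row in rows:
    # Join the source columns into one string, e.g. "1|hello".
    joined = DELIMITER.join(str(row[col]) for col in source_columns)
    row["uuid5"] = str(uuid.uuid5(NAMESPACE, joined))

# Untrimmed input hashes differently: "1|hello " is not "1|hello".
untrimmed = str(uuid.uuid5(NAMESPACE, "1|hello "))
print(rows[0]["uuid5"] != untrimmed)
```

The last lines show why clean, trimmed input matters: a single trailing space yields an unrelated UUID.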
delimiter (class-attribute, instance-attribute)
delimiter: Optional[str] = Field(
default="|",
description="Separator for the string that will eventually be hashed",
)
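A short standard-library sketch of why a separator is useful: without one, different column values can concatenate to the same string and therefore the same UUID. The `hash_columns` helper and the namespace are assumptions made for this illustration.

```python
import uuid

NAMESPACE = uuid.NAMESPACE_DNS  # assumed namespace, for illustration only


def hash_columns(values, delimiter):
    """Join column values with the delimiter and hash the result with UUID5."""
    return str(uuid.uuid5(NAMESPACE, delimiter.join(values)))


# Without a separator, ("ab", "c") and ("a", "bc") both become "abc".
no_delim_a = hash_columns(["ab", "c"], "")
no_delim_b = hash_columns(["a", "bc"], "")

# With "|" the joined strings are "ab|c" and "a|bc", so the UUIDs differ.
with_delim_a = hash_columns(["ab", "c"], "|")
with_delim_b = hash_columns(["a", "bc"], "|")

print(no_delim_a == no_delim_b, with_delim_a == with_delim_b)
```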
description (class-attribute, instance-attribute)
description: str = (
"Generate a UUID with the UUID5 algorithm"
)
extra_string (class-attribute, instance-attribute)
extra_string: Optional[str] = Field(
default="",
description="In case of collisions, one can pass an extra string to hash on.",
)
namespace (class-attribute, instance-attribute)
source_columns (class-attribute, instance-attribute)
source_columns: ListOfColumns = Field(
default=...,
description="List of columns that should be hashed. Should contain the name of at least 1 column. A list of columns or a single column can be specified. For example: `['column1', 'column2']` or `'column1'`",
)
target_column (class-attribute, instance-attribute)
target_column: str = Field(
default=...,
description="The generated UUID will be written to the column name specified here",
)
execute
Source code in src/koheesio/spark/transformations/uuid5.py
koheesio.spark.transformations.uuid5.hash_uuid5
hash_uuid5(
input_value: str,
namespace: Optional[Union[str, UUID]] = "",
extra_string: Optional[str] = "",
)
Pure Python implementation of HashUUID5.
See: https://docs.python.org/3/library/uuid.html#uuid.uuid5
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_value | str | value that will be hashed | required |
namespace | Optional[Union[str, UUID]] | namespace DNS | '' |
extra_string | Optional[str] | optional extra string that will be prepended to the input_value | '' |
Returns:
Type | Description |
---|---|
str | uuid.UUID (uuid5) cast to string |
Source code in src/koheesio/spark/transformations/uuid5.py
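A rough standard-library stand-in for the function described above, following its signature and parameter descriptions. It is not the library's source: the name `hash_uuid5_sketch` is hypothetical, and the treatment of string namespaces (hashing them under `uuid.NAMESPACE_DNS`) is an assumption drawn from the "namespace DNS" description.

```python
import uuid
from typing import Optional, Union


def hash_uuid5_sketch(
    input_value: str,
    namespace: Optional[Union[str, uuid.UUID]] = "",
    extra_string: Optional[str] = "",
) -> str:
    """Illustrative stand-in for hash_uuid5; not the library's source."""
    if isinstance(namespace, uuid.UUID):
        ns = namespace
    else:
        # Assumption: string (or None) namespaces are hashed under NAMESPACE_DNS.
        ns = uuid.uuid5(uuid.NAMESPACE_DNS, namespace or "")
    # The extra string is prepended to the value before hashing.
    return str(uuid.uuid5(ns, (extra_string or "") + input_value))


plain = hash_uuid5_sketch("1|hello")
salted = hash_uuid5_sketch("1|hello", extra_string="v2")
print(plain != salted)
```

Passing a different `extra_string` changes every resulting UUID, which is how it can be used to step around a collision.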
koheesio.spark.transformations.uuid5.uuid5_namespace
Helper function that provides a UUID5-hashed namespace based on the passed string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ns | Optional[Union[str, UUID]] | A str, an empty string (or None), or an existing UUID can be passed | required |
Returns:
Type | Description |
---|---|
UUID | UUID5 hashed namespace |
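The three accepted inputs can be pictured with a small standard-library sketch. The function name `uuid5_namespace_sketch` is hypothetical, and both the pass-through of an existing UUID and the use of `uuid.NAMESPACE_DNS` as the parent namespace for strings are assumptions, not confirmed library behaviour.

```python
import uuid
from typing import Optional, Union


def uuid5_namespace_sketch(ns: Optional[Union[str, uuid.UUID]]) -> uuid.UUID:
    """Illustrative stand-in for uuid5_namespace; not the library's source."""
    if isinstance(ns, uuid.UUID):
        # An existing UUID is used as the namespace unchanged (assumed).
        return ns
    # Assumption: None is treated like "", and strings are hashed under
    # uuid.NAMESPACE_DNS to obtain a namespace UUID.
    return uuid.uuid5(uuid.NAMESPACE_DNS, ns or "")


from_string = uuid5_namespace_sketch("my-project")
from_none = uuid5_namespace_sketch(None)
from_uuid = uuid5_namespace_sketch(uuid.NAMESPACE_URL)
print(from_string.version)
```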