Concat
Concatenates multiple input columns together into a single column, optionally using a given separator.
koheesio.spark.transformations.strings.concat.Concat
This is a wrapper around the PySpark concat() and concat_ws() functions.
It works with strings, date/timestamps, binary, and compatible array columns.
Concept
When working with arrays, the function will return the result of the concatenation of the elements in the array.
- If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
- If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.
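The two array cases above can be modeled with a small pure-Python sketch. This is not Koheesio's implementation: `concat_arrays` is a hypothetical helper that only imitates the described behavior on plain lists, assuming string elements. Null elements are skipped in the spacer case, as PySpark's concat_ws() does.

```python
from typing import List, Optional, Union


def concat_arrays(
    arrays: List[List[Optional[str]]], spacer: Optional[str] = None
) -> Union[str, List[Optional[str]]]:
    """Hypothetical model of array concatenation with and without a spacer."""
    # Flatten the input arrays, preserving their order
    flat = [item for arr in arrays for item in arr]
    if spacer:
        # With a spacer: join the non-null elements into one string
        return spacer.join(item for item in flat if item is not None)
    # Without a spacer: the result stays an array
    return flat


with_spacer = concat_arrays([["text1", "text2"], ["text3", "text4"]], spacer="--")
without_spacer = concat_arrays([["text1", "text2"], ["text3", "text4"]])
print(with_spacer)     # text1--text2--text3--text4
print(without_spacer)  # ['text1', 'text2', 'text3', 'text4']
```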
When working with date/timestamps, the value is converted to a string before it is concatenated, using the default format of yyyy-MM-dd HH:mm:ss.
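As a rough illustration of the timestamp conversion, the sketch below formats a Python datetime with the strftime pattern that corresponds to yyyy-MM-dd HH:mm:ss and then concatenates it with a string. The helper name `concat_text_and_timestamp` is an assumption made for this example; the real transformation performs the conversion inside Spark.

```python
from datetime import datetime
from typing import Optional

# Python strftime equivalent of the pattern yyyy-MM-dd HH:mm:ss
DEFAULT_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def concat_text_and_timestamp(
    text: str, ts: datetime, spacer: Optional[str] = None
) -> str:
    """Hypothetical model: stringify the timestamp, then concatenate."""
    ts_text = ts.strftime(DEFAULT_TIMESTAMP_FORMAT)
    return f"{text}{spacer or ''}{ts_text}"


ts = datetime(1997, 2, 28, 10, 30, 0)
print(concat_text_and_timestamp("text", ts, spacer="--"))  # text--1997-02-28 10:30:00
print(concat_text_and_timestamp("text", ts))               # text1997-02-28 10:30:00
```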
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Union[str, List[str]] | The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except when using arrays). Columns can be of any type, but should ideally be of the same type. Different types can be used, but the function will convert them to string values first. | required |
target_column | Optional[str] | Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'. | None |
spacer | Optional[str] | Optional spacer / separator symbol. When left blank, no spacer is used. | None |
Example
Example using a string column and a timestamp column
input_df:
column_a | column_b |
---|---|
text | 1997-02-28 10:30:00 |
output_df = Concat(
    columns=["column_a", "column_b"],
    target_column="concatenated_column",
    spacer="--",
).transform(input_df)
output_df:
column_a | column_b | concatenated_column |
---|---|---|
text | 1997-02-28 10:30:00 | text--1997-02-28 10:30:00 |
In the example above, the resulting column is a string column.
If we had left out the spacer, the resulting column would have had the value text1997-02-28 10:30:00
(a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.
Example using two array columns
input_df:
array_col_1 | array_col_2 |
---|---|
[text1, text2] | [text3, null] |
output_df = Concat(
    columns=["array_col_1", "array_col_2"],
    target_column="concatenated_column",
    spacer="--",
).transform(input_df)
output_df:
array_col_1 | array_col_2 | concatenated_column |
---|---|---|
[text1, text2] | [text3, null] | "text1--text2--text3" |
Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would
have been an array with the values of ["text1", "text2", "text3"].
Array columns can only be concatenated with other array columns. If you want to concatenate an array column with a non-array value, you will have to convert that value to an array first.
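The idea of converting a non-array value to an array before concatenating can be sketched in plain Python as follows. `as_array` is a hypothetical helper used only for illustration; in PySpark, the wrapping step is typically done with the array() function before the columns are passed on.

```python
from typing import Any, List


def as_array(value: Any) -> List[Any]:
    """Hypothetical helper: wrap a non-array value in a single-element list."""
    return value if isinstance(value, list) else [value]


array_value = ["text1", "text2"]
scalar_value = "text3"

# Both operands are arrays now, so they can be concatenated
combined = as_array(array_value) + as_array(scalar_value)
print(combined)  # ['text1', 'text2', 'text3']
```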
spacer class-attribute instance-attribute
spacer: Optional[str] = Field(
default=None,
description="Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used",
alias="sep",
)
target_column class-attribute instance-attribute
target_column: Optional[str] = Field(
default=None,
description="Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.",
)
get_target_column
Get the target column name.
If no target column was provided, a name is generated by concatenating the names of the source columns with an '_'.
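A minimal sketch of the naming rule described above, assuming the generated name is simply the source column names joined with an underscore. `default_target_column` is a hypothetical stand-in written for this example, not the library's method.

```python
from typing import List, Optional, Union


def default_target_column(
    columns: Union[str, List[str]], target_column: Optional[str] = None
) -> str:
    """Hypothetical model of the target column naming rule."""
    if target_column:
        # An explicitly provided name always wins
        return target_column
    if isinstance(columns, str):
        # A single column name is treated as a one-element list
        columns = [columns]
    return "_".join(columns)


print(default_target_column(["column_a", "column_b"]))  # column_a_column_b
print(default_target_column(["column_a", "column_b"], "concatenated_column"))  # concatenated_column
```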