Concat

Concatenates multiple input columns together into a single column, optionally using a given separator.

koheesio.spark.transformations.strings.concat.Concat #

This is a wrapper around PySpark concat() and concat_ws() functions

Concatenates multiple input columns together into a single column, optionally using a given separator. The function works with strings, date/timestamps, binary, and compatible array columns.

Concept

When working with arrays, the function will return the result of the concatenation of the elements in the array.

If a spacer is used, the resulting output will be a string with the elements of the array separated by the spacer.
If no spacer is used, the resulting output will be an array with the elements of the array concatenated together.

When working with date/timestamps, the function will return the result of the concatenation of the elements in the array. The timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.

Parameters:

Name	Type	Description	Default
`columns`	`Union[str, List[str]]`	The column (or list of columns) to concatenate. Alias: column. If at least one of the values is None or null, the resulting string will also be None/null (except for when using arrays). Columns can be of any type, but must ideally be of the same type. Different types can be used, but the function will convert them to string values first.	required
`target_column`	`Optional[str]`	Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.	`None`
`spacer`	`Optional[str]`	Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used	`None`

Example

Example using a string column and a timestamp column#

input_df:

column_a	column_b
text	1997-02-28 10:30:00

output_df = Concat(
    columns=["column_a", "column_b"],
    target_column="concatenated_column",
    spacer="--",
).transform(input_df)

output_df:

column_a	column_b	concatenated_column
text	1997-02-28 10:30:00	text--1997-02-28 10:30:00

In the example above, the resulting column is a string column.

If we had left out the spacer, the resulting column would have had the value of text1997-02-28 10:30:00 (a string). Note that the timestamp is converted to a string using the default format of yyyy-MM-dd HH:mm:ss.

Example using two array columns#

input_df:

array_col_1	array_col_2
[text1, text2]	[text3, text4]

output_df = Concat(
    columns=["array_col_1", "array_col_2"],
    target_column="concatenated_column",
    spacer="--",
).transform(input_df)

output_df:

array_col_1	array_col_2	concatenated_column
[text1, text2]	[text3, text4]	"text1--text2--text3"

Note that the null value in the second array is ignored. If we had left out the spacer, the resulting column would have been an array with the values of ["text1", "text2", "text3"].

Array columns can only be concatenated with another array column. If you want to concatenate an array column with a none-array value, you will have to convert said column to an array first.

spacer `class-attribute` `instance-attribute` #

spacer: Optional[str] = Field(
    default=None,
    description="Optional spacer / separator symbol. Defaults to None. When left blank, no spacer is used",
    alias="sep",
)

target_column `class-attribute` `instance-attribute` #

target_column: Optional[str] = Field(
    default=None,
    description="Target column name. When this is left blank, a name will be generated by concatenating the names of the source columns with an '_'.",
)

execute #

execute() -> DataFrame

Source code in src/koheesio/spark/transformations/strings/concat.py

def execute(self) -> DataFrame:
    columns = [col(s) for s in self.get_columns()]
    self.output.df = self.df.withColumn(
        self.target_column, concat_ws(self.spacer, *columns) if self.spacer else concat(*columns)
    )

get_target_column #

get_target_column(target_column_value, values)

Get the target column name if it is not provided.

If not provided, a name will be generated by concatenating the names of the source columns with an '_'.