Split

Splits the contents of a column on the basis of a split_pattern.

Classes:

Name	Description
`SplitAll`	Splits the contents of a column on the basis of a split_pattern.
`SplitAtFirstMatch`	Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

koheesio.spark.transformations.strings.split.SplitAll

This transformation splits the contents of a column on the basis of a split_pattern.

It splits at all the locations where the pattern is found. The new column will be of ArrayType.

Wraps the pyspark.sql.functions.split function.

Parameters:

Name	Type	Description	Default
`columns`	`Union[str, List[str]]`	The column (or list of columns) to split. Alias: column	required
`target_column`	`Optional[str]`	The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.	`None`
`split_pattern`	`str`	This is the pattern that will be used to split the column contents.	required

Example

Splitting with a space as a pattern:

input_df:

product	amount	country
Banana lemon orange	1000	USA
Carrots Blueberries	1500	USA
Beans	1600	USA

output_df = SplitAll(
    column="product", target_column="split", split_pattern=" "
).transform(input_df)

output_df:

product	amount	country	split
Banana lemon orange	1000	USA	["Banana", "lemon", "orange"]
Carrots Blueberries	1500	USA	["Carrots", "Blueberries"]
Beans	1600	USA	["Beans"]
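
The `split` column in the table above can be reproduced without a Spark session. The following is a minimal plain-Python sketch, not the library's implementation: `split_all` is a hypothetical helper, and Python's `re.split` stands in for `pyspark.sql.functions.split`. The two agree for simple patterns like the one used here, but their regex dialects are not identical.

```python
import re
from typing import List


def split_all(value: str, split_pattern: str) -> List[str]:
    """Hypothetical stand-in for SplitAll applied to a single cell."""
    # Split at every location where the pattern matches.
    return re.split(split_pattern, value)


products = ["Banana lemon orange", "Carrots Blueberries", "Beans"]

# One list per row, mirroring the ArrayType target column.
split_column = [split_all(product, " ") for product in products]

for product, parts in zip(products, split_column):
    print(f"{product!r} -> {parts}")
```

A value that contains no match comes back as a single-element list, which is why `Beans` yields `["Beans"]`.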

split_pattern `class-attribute` `instance-attribute`

split_pattern: str = Field(
    default=...,
    description="The pattern to split the column contents.",
)

func

func(column: Column) -> Column

Source code in src/koheesio/spark/transformations/strings/split.py

def func(self, column: Column) -> Column:
    return split(column, pattern=self.split_pattern)
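
Because `func` hands `split_pattern` straight to `pyspark.sql.functions.split`, the pattern is interpreted as a regular expression rather than a literal string. The sketch below illustrates the consequence with Python's `re` module as an approximation of Spark's Java regex engine; the `version` value is made up for illustration.

```python
import re

version = "1.2.3"

# Unescaped, "." is a wildcard: every character is treated as a separator,
# so only empty strings remain between the matches.
unescaped = re.split(".", version)

# Escaped, only the literal dot is treated as a separator.
escaped = re.split(re.escape("."), version)

print(unescaped)
print(escaped)
```

When splitting on characters such as `.`, `|`, or `+`, it is therefore safer to pass an escaped pattern, for example `split_pattern=r"\."`.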

koheesio.spark.transformations.strings.split.SplitAtFirstMatch

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

Note

SplitAtFirstMatch returns a single part of the split string. Set retrieve_first_part to True to get the text before the first match, or to False to get the second part (the text between the first and second match, or an empty string if there is no second part).
The new column will be of StringType.
If you want to split a column more than once, you should apply this transformation multiple times.

Parameters:

Name	Type	Description	Default
`columns`	`Union[str, List[str]]`	The column (or list of columns) to split. Alias: column	required
`target_column`	`Optional[str]`	The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.	`None`
`split_pattern`	`str`	This is the pattern that will be used to split the column contents.	required
`retrieve_first_part`	`Optional[bool]`	Takes the first part of the split when True, the second part when False. Other parts are ignored.	`True`

Example

Splitting with "an" as a pattern:

input_df:

product	amount	country
Banana lemon orange	1000	USA
Carrots Blueberries	1500	USA
Beans	1600	USA

output_df = SplitAtFirstMatch(
    column="product", target_column="split_first", split_pattern="an"
).transform(input_df)

output_df:

product	amount	country	split_first
Banana lemon orange	1000	USA	B
Carrots Blueberries	1500	USA	Carrots Blueberries
Beans	1600	USA	Be
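
To see where the `split_first` values come from, the example can be traced in plain Python. This is only a sketch under the assumption that `re.split` behaves like Spark's `split` for this simple pattern; `first_part` is a hypothetical helper, not part of koheesio.

```python
import re


def first_part(value: str, split_pattern: str) -> str:
    """Hypothetical stand-in for SplitAtFirstMatch with retrieve_first_part=True."""
    # Element 0 of the split is the text before the first match.
    return re.split(split_pattern, value)[0]


products = ["Banana lemon orange", "Carrots Blueberries", "Beans"]
split_first = [first_part(product, "an") for product in products]

for product, part in zip(products, split_first):
    print(f"{product!r} -> {part!r}")
```

`Carrots Blueberries` contains no `an`, so the whole string is returned unchanged.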

retrieve_first_part `class-attribute` `instance-attribute`

retrieve_first_part: Optional[bool] = Field(
    default=True,
    description="Takes the first part of the split when true, the second part when False. Other parts are ignored.",
)

func

func(column: Column) -> Column

Source code in src/koheesio/spark/transformations/strings/split.py

def func(self, column: Column) -> Column:
    split_func = split(column, pattern=self.split_pattern)

    # first part
    if self.retrieve_first_part:
        return split_func.getItem(0)

    # or, second part
    return coalesce(split_func.getItem(1), lit(""))
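
The two branches of `func` can be mirrored in plain Python to show what `retrieve_first_part=False` returns. This is a hedged sketch, not the library code: `get_item` and `split_at_first_match` are hypothetical helpers, `re.split` approximates Spark's `split`, and an out-of-range `getItem` is assumed to yield null, which is what the `coalesce` in the source suggests.

```python
import re
from typing import List, Optional


def get_item(parts: List[str], index: int) -> Optional[str]:
    """Mimic getItem on an array column: None when the index is out of range (assumption)."""
    return parts[index] if index < len(parts) else None


def split_at_first_match(
    value: str, split_pattern: str, retrieve_first_part: bool = True
) -> str:
    """Hypothetical plain-Python mirror of SplitAtFirstMatch.func for one cell."""
    parts = re.split(split_pattern, value)

    # first part
    if retrieve_first_part:
        return parts[0]

    # or, second part, falling back to "" like coalesce(..., lit(""))
    second = get_item(parts, 1)
    return second if second is not None else ""


products = ["Banana lemon orange", "Carrots Blueberries", "Beans"]
second_parts = [
    split_at_first_match(product, " ", retrieve_first_part=False)
    for product in products
]
print(second_parts)
```

Note that the second part of `Banana lemon orange` is `lemon`, not `lemon orange`: the string is split at every match and only the element at index 1 is kept.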
