Split

Splits the contents of a column based on a split_pattern.

Classes:

Name Description
SplitAll

Splits the contents of a column based on a split_pattern.

SplitAtFirstMatch

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

koheesio.spark.transformations.strings.split.SplitAll

This transformation splits the contents of a column based on a split_pattern.

It splits at all locations where the pattern is found. The new column will be of ArrayType.

Wraps the pyspark.sql.functions.split function.
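Because pyspark.sql.functions.split treats its pattern as a regular expression, split_pattern is presumably interpreted the same way. The sketch below mimics the split-at-every-match behavior on single Python strings using re.split. Note that split_all is a hypothetical helper written for illustration, not part of koheesio, and that Python and Java regex dialects differ in some details (for example, how capturing groups are handled).

```python
import re
from typing import List, Optional


def split_all(value: Optional[str], split_pattern: str) -> Optional[List[str]]:
    """Pure-Python stand-in for splitting one column value at every match.

    Hypothetical helper for illustration only; the real transformation
    operates on Spark columns, not on Python strings.
    """
    if value is None:
        # A null input stays null, as it would in a Spark column.
        return None
    # The pattern is a regular expression, not a literal string.
    return re.split(split_pattern, value)


rows = ["Banana lemon orange", "Carrots Blueberries", "Beans"]
result = [split_all(row, " ") for row in rows]

# Regex metacharacters must be escaped to be matched literally.
dotted = split_all("a.b.c", r"\.")
```

With a space as the pattern, the three sample rows become lists of three, two, and one element respectively. An unescaped "." would match any character rather than a literal dot.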

Parameters:

Name Type Description Default
columns Union[str, List[str]]

The column (or list of columns) to split. Alias: column

required
target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None
split_pattern str

This is the pattern that will be used to split the column contents.

required
Example
Splitting with a space as a pattern:

input_df:

product              amount  country
Banana lemon orange  1000    USA
Carrots Blueberries  1500    USA
Beans                1600    USA

output_df = SplitAll(
    column="product", target_column="split", split_pattern=" "
).transform(input_df)

output_df:

product              amount  country  split
Banana lemon orange  1000    USA      ["Banana", "lemon", "orange"]
Carrots Blueberries  1500    USA      ["Carrots", "Blueberries"]
Beans                1600    USA      ["Beans"]

split_pattern class-attribute instance-attribute

split_pattern: str = Field(
    default=...,
    description="The pattern to split the column contents.",
)

func

func(column: Column)
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):
    return split(column, pattern=self.split_pattern)

koheesio.spark.transformations.strings.split.SplitAtFirstMatch

Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.

Note
  • SplitAtFirstMatch splits the string only once. Use the retrieve_first_part parameter to choose whether the first or the second part is returned.
  • The new column will be of StringType.
  • If you want to split a column more than once, you should call this function multiple times.
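The retrieve_first_part toggle can be illustrated on plain Python strings. This is a minimal sketch that assumes the behavior of the func source shown on this page (element 0 of the split, or element 1 with an empty-string fallback); split_at_first_match is a hypothetical helper, and re.split stands in for Spark's split.

```python
import re
from typing import Optional


def split_at_first_match(
    value: Optional[str],
    split_pattern: str,
    retrieve_first_part: bool = True,
) -> Optional[str]:
    """Pure-Python stand-in for one column value (illustration only)."""
    if value is None:
        # Null input: the first part stays null, the second part falls
        # back to an empty string (mirroring the coalesce in func).
        return None if retrieve_first_part else ""
    parts = re.split(split_pattern, value)
    if retrieve_first_part:
        return parts[0]
    # No match means there is no second element; fall back to "".
    return parts[1] if len(parts) > 1 else ""


first = split_at_first_match("Beans", "an")
second = split_at_first_match("Beans", "an", retrieve_first_part=False)
no_match = split_at_first_match(
    "Carrots Blueberries", "an", retrieve_first_part=False
)
```

When the pattern does not occur, the first part is the whole string and the second part is an empty string.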

Parameters:

Name Type Description Default
columns Union[str, List[str]]

The column (or list of columns) to split. Alias: column

required
target_column Optional[str]

The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix.

None
split_pattern str

This is the pattern that will be used to split the column contents.

required
retrieve_first_part Optional[bool]

Takes the first part of the split when True, the second part when False. Other parts are ignored. Defaults to True.

True
Example
Splitting with "an" as a pattern:

input_df:

product              amount  country
Banana lemon orange  1000    USA
Carrots Blueberries  1500    USA
Beans                1600    USA

output_df = SplitAtFirstMatch(
    column="product", target_column="split_first", split_pattern="an"
).transform(input_df)

output_df:

product              amount  country  split_first
Banana lemon orange  1000    USA      B
Carrots Blueberries  1500    USA      Carrots Blueberries
Beans                1600    USA      Be

retrieve_first_part class-attribute instance-attribute

retrieve_first_part: Optional[bool] = Field(
    default=True,
    description="Takes the first part of the split when true, the second part when False. Other parts are ignored.",
)

func

func(column: Column)
Source code in src/koheesio/spark/transformations/strings/split.py
def func(self, column: Column):
    split_func = split(column, pattern=self.split_pattern)

    # first part
    if self.retrieve_first_part:
        return split_func.getItem(0)

    # or, second part
    return coalesce(split_func.getItem(1), lit(""))
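One consequence of the func shown above: it indexes into a full split, so with retrieve_first_part=False the result appears to be the text between the first and second match, not everything after the first match. The pure-Python sketch below illustrates the difference. The variable names and the sample value are illustrative, and re.split stands in for Spark's split.

```python
import re

value = "2024-01-15"
pattern = "-"

# Full split, then take element 1: this mirrors getItem(1) in func.
full_split = re.split(pattern, value)
second_element = full_split[1]

# Splitting only once keeps the remainder of the string intact.
split_once = re.split(pattern, value, maxsplit=1)
remainder = split_once[1]
```

Here second_element is "01" while remainder is "01-15". In plain PySpark, pyspark.sql.functions.split accepts a limit argument (Spark 3.0 and later) that caps the number of resulting elements; whether SplitAtFirstMatch exposes it is not stated on this page.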