Split
Splits the contents of a column on the basis of a split_pattern.
Classes:
Name | Description |
---|---|
SplitAll | Splits the contents of a column on basis of a split_pattern. |
SplitAtFirstMatch | Like SplitAll, but only splits the string once. You can specify whether you want the first or second part. |
koheesio.spark.transformations.strings.split.SplitAll
This transformation splits the contents of a column on the basis of a split_pattern.
It splits at all the locations where the pattern is found. The new column will be of ArrayType.
Wraps the pyspark.sql.functions.split function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Union[str, List[str]] | The column (or list of columns) to split. Alias: column | required |
target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None |
split_pattern | str | This is the pattern that will be used to split the column contents. | required |
Example
Splitting with a space as a pattern:
input_df:
product | amount | country |
---|---|---|
Banana lemon orange | 1000 | USA |
Carrots Blueberries | 1500 | USA |
Beans | 1600 | USA |
from koheesio.spark.transformations.strings.split import SplitAll

output_df = SplitAll(
    column="product", target_column="split", split_pattern=" "
).transform(input_df)
output_df:
product | amount | country | split |
---|---|---|---|
Banana lemon orange | 1000 | USA | ["Banana", "lemon", "orange"] |
Carrots Blueberries | 1500 | USA | ["Carrots", "Blueberries"] |
Beans | 1600 | USA | ["Beans"] |
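The split column in the example above can be approximated without Spark. The following is a minimal sketch in plain Python, assuming the pattern is treated as a regular expression (as pyspark.sql.functions.split does); the helper name `split_all` is hypothetical and is not part of Koheesio.

```python
import re
from typing import List


def split_all(value: str, split_pattern: str) -> List[str]:
    """Split `value` at every match of `split_pattern` (a regular expression).

    Hypothetical stand-in for what SplitAll does per row; not Koheesio code.
    """
    return re.split(split_pattern, value)


# The three product values from the example input
products = ["Banana lemon orange", "Carrots Blueberries", "Beans"]
split_values = [split_all(product, " ") for product in products]

for product, parts in zip(products, split_values):
    print(f"{product!r} -> {parts}")
```

A value that contains no match, such as "Beans", comes back as a single-element list, which mirrors the last row of the output table.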
koheesio.spark.transformations.strings.split.SplitAtFirstMatch
Like SplitAll, but only splits the string once. You can specify whether you want the first or second part.
Note
- SplitAtFirstMatch splits the string only once. Use the retrieve_first_part parameter to choose whether the first or the second part is returned.
- The new column will be of StringType.
- If you want to split a column more than once, apply this transformation multiple times.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | Union[str, List[str]] | The column (or list of columns) to split. Alias: column | required |
target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None |
split_pattern | str | This is the pattern that will be used to split the column contents. | required |
retrieve_first_part | Optional[bool] | Takes the first part of the split when True, the second part when False. Other parts are ignored. Defaults to True. | True |
Example
Splitting with "an" as a pattern:
input_df:
product | amount | country |
---|---|---|
Banana lemon orange | 1000 | USA |
Carrots Blueberries | 1500 | USA |
Beans | 1600 | USA |
from koheesio.spark.transformations.strings.split import SplitAtFirstMatch

output_df = SplitAtFirstMatch(
    column="product", target_column="split_first", split_pattern="an"
).transform(input_df)
output_df:
product | amount | country | split_first |
---|---|---|---|
Banana lemon orange | 1000 | USA | B |
Carrots Blueberries | 1500 | USA | Carrots Blueberries |
Beans | 1600 | USA | Be |
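The effect of retrieve_first_part can be illustrated without Spark. This is a minimal sketch in plain Python, not the Koheesio implementation: the helper name `split_at_first_match` is hypothetical, the "second part" is assumed to be the second element of the split, and returning `None` when there is no second part is an assumption of this sketch only.

```python
import re
from typing import Optional


def split_at_first_match(
    value: str, split_pattern: str, retrieve_first_part: bool = True
) -> Optional[str]:
    """Return the first or the second part of `value` split on `split_pattern`.

    Hypothetical stand-in for the per-row behaviour; not Koheesio code.
    """
    parts = re.split(split_pattern, value)
    if retrieve_first_part:
        # Without a match, the whole string is the first part
        return parts[0]
    # Assumption of this sketch: None signals that there is no second part
    return parts[1] if len(parts) > 1 else None


products = ["Banana lemon orange", "Carrots Blueberries", "Beans"]

# First part, split on "an" (same settings as the example above)
first_parts = [split_at_first_match(p, "an") for p in products]

# Second part, split on a space
second_part = split_at_first_match(
    "Carrots Blueberries", " ", retrieve_first_part=False
)

print(first_parts)
print(second_part)
```

The first call reproduces the split_first column from the output table. The second call shows the toggle on a value with a single match, where the choice of "second part" is unambiguous.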
retrieve_first_part (class-attribute, instance-attribute)
retrieve_first_part: Optional[bool] = Field(
    default=True,
    description="Takes the first part of the split when true, the second part when False. Other parts are ignored.",
)