Regexp
String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:

Name | Description |
---|---|
RegexpExtract | Extract a specific group matched by a Java regexp from the specified string column. |
RegexpReplace | Searches for the given regexp and replaces all instances with what is in 'replacement'. |
koheesio.spark.transformations.strings.regexp.RegexpExtract #
Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
A wrapper around the pyspark `regexp_extract` function.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | The column (or list of columns) to extract from. Alias: column | required |
target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None |
regexp | str | The Java regular expression to extract | required |
index | Optional[int] | When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match. | 0 |
Example
Extracting the year and week number from a string#
Let's say we have a column containing the year and week in a format like `Y## W#`, and we would like to extract the week numbers.
input_df:
YWK |
---|
2020 W1 |
2021 WK2 |
```python
output_df = RegexpExtract(
    column="YWK",
    target_column="week_number",
    regexp="Y([0-9]+) ?WK?([0-9]+)",
    index=2,  # remember that groups are 1-indexed! So 2 will get the week number in this example.
).transform(input_df)
```
output_df:
YWK | week_number |
---|---|
2020 W1 | 1 |
2021 WK2 | 2 |
Using the same example, but extracting the year instead#
If you want to extract the year, you can use index=1.
```python
output_df = RegexpExtract(
    column="YWK",
    target_column="year",
    regexp="Y([0-9]+) ?WK?([0-9]+)",
    index=1,  # remember that groups are 1-indexed! So 1 will get the year in this example.
).transform(input_df)
```
output_df:
YWK | year |
---|---|
2020 W1 | 2020 |
2021 WK2 | 2021 |
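The group-index semantics used above (0 = whole match, 1 and above = capture groups, empty string on no match) can be illustrated with Python's built-in `re` module. This is an analogy only, not the pyspark or koheesio implementation; note the leading `Y` is dropped from the pattern here as an assumption, so it matches the sample rows shown above:

```python
import re


def regexp_extract(value: str, pattern: str, index: int) -> str:
    # Hypothetical helper emulating the documented behavior:
    # return an empty string when the regexp (or the requested
    # group) does not match.
    m = re.search(pattern, value)
    if m is None or m.group(index) is None:
        return ""
    return m.group(index)


pattern = r"([0-9]+) ?WK?([0-9]+)"  # leading "Y" dropped so the sample rows match

print(regexp_extract("2020 W1", pattern, 2))       # "1"    (week number, group 2)
print(regexp_extract("2021 WK2", pattern, 1))      # "2021" (year, group 1)
print(regexp_extract("no digits here", pattern, 0))  # ""   (no match at all)
```

Index 0 returns the entire matched substring (e.g. `2020 W1`), which is why the examples above use 1 and 2 to pick out the individual groups.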
index class-attribute instance-attribute #

```python
index: Optional[int] = Field(
    default=0,
    description="When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.",
)
```
koheesio.spark.transformations.strings.regexp.RegexpReplace #
Searches for the given regexp and replaces all instances with what is in 'replacement'.
A wrapper around the pyspark `regexp_replace` function.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | The column (or list of columns) to replace in. Alias: column | required |
target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None |
regexp | str | The regular expression to replace | required |
replacement | str | String to replace matched pattern with. | required |
Examples:

input_df:

content |
---|
hello world |

Let's say you want to replace 'hello'.
```python
output_df = RegexpReplace(
    column="content",
    target_column="replaced",
    regexp="hello",
    replacement="gutentag",
).transform(input_df)
```
output_df:

content | replaced |
---|---|
hello world | gutentag world |
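The "replaces all instances" behavior can be illustrated with Python's built-in `re.sub`, which follows the same replace-all rule as pyspark's `regexp_replace`. This is an illustrative analogy under that assumption, not the koheesio implementation:

```python
import re

# re.sub replaces ALL non-overlapping occurrences of the pattern,
# not just the first one -- mirroring the documented behavior.
text = "hello world, hello again"
result = re.sub("hello", "gutentag", text)
print(result)  # "gutentag world, gutentag again"
```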