Regexp
String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:
Name | Description |
---|---|
RegexpExtract | Extract a specific group matched by a Java regexp from the specified string column. |
RegexpReplace | Searches for the given regexp and replaces all instances with what is in 'replacement'. |
koheesio.spark.transformations.strings.regexp.RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
A wrapper around the pyspark regexp_extract function.
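The empty-string rule above can be illustrated without Spark. The following is a minimal sketch using Python's standard re module, not the koheesio or pyspark implementation; the helper name regexp_extract_like and the sample pattern are hypothetical and exist only for this illustration. Java and Python regex syntax differ in places, but not for a pattern this simple.

```python
import re


def regexp_extract_like(text: str, pattern: str, index: int = 0) -> str:
    """Approximate the described behaviour: return "" instead of failing.

    Hypothetical helper for illustration only; it is not part of koheesio.
    """
    match = re.search(pattern, text)
    if match is None:
        # The regexp did not match anywhere in the string.
        return ""
    group = match.group(index)
    # An optional group that took no part in the match is None in Python.
    return "" if group is None else group


# Second group is optional, so it may be absent even when the regexp matches.
pattern = r"(\d+)-(\d+)?"

whole = regexp_extract_like("order 12-", pattern, 0)      # whole match
first = regexp_extract_like("order 12-", pattern, 1)      # first group
optional = regexp_extract_like("order 12-", pattern, 2)   # group did not match
no_match = regexp_extract_like("no digits", pattern, 1)   # regexp did not match

print(repr(whole), repr(first), repr(optional), repr(no_match))
```

Both failure cases collapse to the same empty string, so downstream code cannot tell "no match" from "group absent" by looking at the result alone.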
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | The column (or list of columns) to extract from. Alias: column | required |
target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None |
regexp | str | The Java regular expression to extract | required |
index | Optional[int] | When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match. | 0 |
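The target_column rule in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the helper resolve_targets is hypothetical, and the underscore separator between the source column name and the suffix is an assumption that the description above does not confirm.

```python
from typing import List, Optional, Tuple


def resolve_targets(
    columns: List[str], target_column: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return (source, target) column pairs.

    Hypothetical helper; the "_" separator is an assumption of this sketch.
    """
    pairs = []
    for column in columns:
        # Without a target, the result is written back to the source column.
        target = target_column or column
        if len(columns) > 1 and target != column:
            # With several source columns, the target acts as a suffix.
            target = f"{column}_{target}"
        pairs.append((column, target))
    return pairs


single = resolve_targets(["YWK"], "week_number")
in_place = resolve_targets(["YWK"])
multiple = resolve_targets(["first", "second"], "clean")

print(single)
print(in_place)
print(multiple)
```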
Example
Extracting the year and week number from a string
Let's say we have a column containing the year and week in a format like YYYY W# or YYYY WK#, and we would like to extract the week numbers.
input_df:
YWK |
---|
2020 W1 |
2021 WK2 |
output_df = RegexpExtract(
    column="YWK",
    target_column="week_number",
    # No literal "Y" prefix: the sample values start directly with the year digits.
    regexp="([0-9]+) ?WK?([0-9]+)",
    index=2,  # groups are 1-indexed (0 is the whole match), so 2 selects the week number
).transform(input_df)
output_df:
YWK | week_number |
---|---|
2020 W1 | 1 |
2021 WK2 | 2 |
Using the same example, but extracting the year instead
If you want to extract the year, you can use index=1.
output_df = RegexpExtract(
    column="YWK",
    target_column="year",
    # No literal "Y" prefix: the sample values start directly with the year digits.
    regexp="([0-9]+) ?WK?([0-9]+)",
    index=1,  # groups are 1-indexed (0 is the whole match), so 1 selects the year
).transform(input_df)
output_df:
YWK | year |
---|---|
2020 W1 | 2020 |
2021 WK2 | 2021 |
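To see how index selects a group, the sample values can be checked with Python's standard re module. This is only a sketch: it does not run Spark, and the pattern below is the sketch's own. It has no literal prefix before the year digits, because the sample values begin with the digits themselves.

```python
import re

# Sketch's own pattern: year digits, optional space, "W" or "WK", week digits.
PATTERN = r"([0-9]+) ?WK?([0-9]+)"

samples = ["2020 W1", "2021 WK2"]


def groups_for(value: str) -> dict:
    """Map each index (0 = whole match, 1 = year, 2 = week) to its text."""
    match = re.search(PATTERN, value)
    if match is None:
        return {}
    return {index: match.group(index) for index in range(3)}


results = {value: groups_for(value) for value in samples}

for value, groups in results.items():
    print(value, groups)
```

Index 0 returns the full matched text, which is why it is the default: it works even for patterns that define no groups at all.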
index (class-attribute, instance-attribute)
index: int = Field(
    default=0,
    description="When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.",
)
koheesio.spark.transformations.strings.regexp.RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
A wrapper around the pyspark regexp_replace function.
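The phrase "replaces all instances" means every match in the string is substituted, not just the first. A minimal plain-Python sketch of that behaviour follows, using re.sub from the standard library rather than Spark; the helper name regexp_replace_like is hypothetical.

```python
import re


def regexp_replace_like(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of pattern in text (hypothetical helper)."""
    # re.sub substitutes all non-overlapping matches by default.
    return re.sub(pattern, replacement, text)


replaced = regexp_replace_like("hello world, hello again", "hello", "gutentag")
unchanged = regexp_replace_like("goodbye", "hello", "gutentag")

print(replaced)
print(unchanged)
```

A string with no match is returned unchanged, so the transformation is safe to apply to columns where only some rows contain the pattern.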
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | The column (or list of columns) to replace in. Alias: column | required |
target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None |
regexp | str | The regular expression to replace | required |
replacement | str | String to replace matched pattern with. | required |
Examples:
input_df:
content |
---|
hello world |
Let's say you want to replace 'hello'.
output_df = RegexpReplace(
    column="content",
    target_column="replaced",
    regexp="hello",
    replacement="gutentag",
).transform(input_df)
output_df:
content | replaced |
---|---|
hello world | gutentag world |