Regexp
String transformations using regular expressions.
This module contains transformations that use regular expressions to transform strings.
Classes:
| Name | Description | 
|---|---|
| RegexpExtract | Extract a specific group matched by a Java regexp from the specified string column. | 
| RegexpReplace | Searches for the given regexp and replaces all instances with what is in 'replacement'. | 
koheesio.spark.transformations.strings.regexp.RegexpExtract
Extract a specific group matched by a Java regexp from the specified string column. If the regexp did not match, or the specified group did not match, an empty string is returned.
This class is a wrapper around the pyspark regexp_extract function.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| columns | ListOfColumns | The column (or list of columns) to extract from. Alias: column | required | 
| target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None | 
| regexp | str | The Java regular expression to extract | required | 
| index | Optional[int] | When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match. | 0 | 
Example
Extracting the year and week number from a string
Let's say we have a column containing the year and week in a format like 2020 W1 or 2021 WK2, and we would like to extract the week numbers.
input_df:
| YWK | 
|---|
| 2020 W1 | 
| 2021 WK2 | 
output_df = RegexpExtract(
    column="YWK",
    target_column="week_number",
    regexp="([0-9]+) ?WK?([0-9]+)",  # no literal "Y": the sample values start with the digits of the year
    index=2,  # groups are 1-indexed (0 is the whole match), so 2 selects the week number here
).transform(input_df)
output_df:
| YWK | week_number | 
|---|---|
| 2020 W1 | 1 | 
| 2021 WK2 | 2 | 
Using the same example, but extracting the year instead
If you want to extract the year, you can use index=1.
output_df = RegexpExtract(
    column="YWK",
    target_column="year",
    regexp="([0-9]+) ?WK?([0-9]+)",  # no literal "Y": the sample values start with the digits of the year
    index=1,  # groups are 1-indexed (0 is the whole match), so 1 selects the year here
).transform(input_df)
output_df:
| YWK | year | 
|---|---|
| 2020 W1 | 2020 | 
| 2021 WK2 | 2021 | 
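The group-index behaviour described above (0 is the whole match, 1 and above are groups, and an empty string is returned when nothing matches) can be sketched without a Spark session. This is only an illustration: `extract_group` is a hypothetical helper, not part of koheesio or pyspark, and Python's `re` module stands in for the Java regex engine, which is assumed to behave the same way for this simple pattern.

```python
import re

# Pattern assumed for the sample values "2020 W1" and "2021 WK2":
# group 1 is the year, group 2 is the week number.
PATTERN = r"([0-9]+) ?WK?([0-9]+)"


def extract_group(value: str, pattern: str, index: int = 0) -> str:
    """Hypothetical stand-in for the documented extract semantics.

    Returns the requested group, or an empty string when the pattern
    or the requested group did not match.
    """
    match = re.search(pattern, value)
    if match is None:
        return ""
    group = match.group(index)
    return group if group is not None else ""


for value in ["2020 W1", "2021 WK2", "no week here"]:
    # index 0 -> whole match, 1 -> year, 2 -> week number
    print(value, [extract_group(value, PATTERN, i) for i in range(3)])
```

The last value contains no digits, so every index yields an empty string, mirroring the "did not match" case in the class description.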
index (class-attribute, instance-attribute)
index: Optional[int] = Field(
    default=0,
    description="When there are more groups in the match, you can indicate which one you want. 0 means the whole match. 1 and above are groups within that match.",
)
koheesio.spark.transformations.strings.regexp.RegexpReplace
Searches for the given regexp and replaces all instances with what is in 'replacement'.
This class is a wrapper around the pyspark regexp_replace function.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| columns | ListOfColumns | The column (or list of columns) to replace in. Alias: column | required | 
| target_column | Optional[str] | The column to store the result in. If not provided, the result will be stored in the source column. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix. | None | 
| regexp | str | The Java regular expression to search for; every match is replaced. | required | 
| replacement | str | String to replace matched pattern with. | required | 
Examples:
input_df:
| content | 
|---|
| hello world | 
Let's say you want to replace 'hello'.
output_df = RegexpReplace(
    column="content",
    target_column="replaced",
    regexp="hello",
    replacement="gutentag",
).transform(input_df)
output_df:
| content | replaced | 
|---|---|
| hello world | gutentag world | 
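To make "replaces all instances" concrete, here is a minimal sketch that needs no Spark session. `replace_all` is a hypothetical helper, not part of koheesio or pyspark, and Python's `re.sub` stands in for the Java regex engine, which is assumed to agree for a plain literal pattern and replacement like this one.

```python
import re


def replace_all(value: str, pattern: str, replacement: str) -> str:
    """Hypothetical stand-in: replace every match of pattern in value."""
    return re.sub(pattern, replacement, value)


# Single occurrence, as in the example above
print(replace_all("hello world", "hello", "gutentag"))

# Every occurrence is replaced, not just the first one
print(replace_all("hello hello world", "hello", "gutentag"))

# A value without a match is returned unchanged
print(replace_all("goodbye world", "hello", "gutentag"))
```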