Substring

Extracts a substring from a string column starting at the given position.

koheesio.spark.transformations.strings.substring.Substring

This is a wrapper around the PySpark substring() function.

Notes
  • Numeric columns will be cast to string
  • start is 1-indexed, not 0-indexed!
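
The two notes above can be modelled in plain Python. This is only a sketch under stated assumptions: `substring_1_indexed` is a hypothetical helper, not part of Koheesio or PySpark, and it treats `length=-1` as "until the end of the string", which is how the `length` parameter is described on this page.

```python
from typing import Union


def substring_1_indexed(value: Union[str, int], start: int, length: int = -1) -> str:
    """Hypothetical pure-Python model of the documented semantics (not Koheesio code)."""
    if start < 1:
        raise ValueError("start must be a positive, 1-indexed position")
    text = str(value)   # numeric input is cast to string first
    begin = start - 1   # convert the 1-indexed start to Python's 0-indexed slicing
    if length == -1:    # assumption: -1 means "run to the end of the string"
        return text[begin:]
    return text[begin:begin + length]


print(substring_1_indexed("skyscraper", start=3, length=4))  # yscr
print(substring_1_indexed("skyscraper", start=3))            # yscraper
print(substring_1_indexed(20240131, start=5, length=2))      # 01
```

Subtracting 1 from `start` is the step that is easy to forget when moving between Python slicing and Spark's 1-indexed positions.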

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| columns | Union[str, List[str]] | The column (or list of columns) to substring. Alias: column | required |
| target_column | Optional[str] | The column to store the result in. If not provided, the result is stored in the source column. Alias: target_suffix. If multiple source columns are given, this value is used as a suffix. | None |
| start | PositiveInt | Positive int. Defines where the substring begins. The first character of the field has index 1! | required |
| length | Optional[int] | Optional. If not provided, the substring runs to the end of the string. | -1 |

Example
Extract a substring from a string column starting at the given position.

input_df:

| column |
|---|
| skyscraper |

output_df = Substring(
    column="column",
    target_column="substring_column",
    start=3,  # 1-indexed! So this will start at the 3rd character
    length=4,
).transform(input_df)

output_df:

| column | substring_column |
|---|---|
| skyscraper | yscr |

length class-attribute instance-attribute

length: Optional[int] = Field(
    default=-1,
    description="The target length for the string. use -1 to perform until end",
)

start class-attribute instance-attribute

start: PositiveInt = Field(
    default=..., description="The starting position"
)

func

func(column: Column)
Source code in src/koheesio/spark/transformations/strings/substring.py
def func(self, column: Column):
    return when(column.isNull(), None).otherwise(substring(column, self.start, self.length)).cast(StringType())
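
The `when(...).otherwise(...)` expression in `func` makes null inputs yield null explicitly, and the final cast guarantees a string result. A rough row-wise model of that branch, assuming a plain Python list stands in for the Spark column (`substring_or_null` is a hypothetical name, not a Koheesio API):

```python
from typing import List, Optional


def substring_or_null(values: List[Optional[object]], start: int, length: int) -> List[Optional[str]]:
    """Hypothetical stand-in for func: None stays None, everything else is sliced as a string."""
    begin = start - 1  # Spark positions are 1-indexed, Python slices are 0-indexed
    return [
        None if value is None else str(value)[begin:begin + length]
        for value in values
    ]


print(substring_or_null(["skyscraper", None, "sky"], start=3, length=4))
# ['yscr', None, 'y']
```

Rows shorter than the requested window simply return whatever characters remain, as the last value shows.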