Arrays
A collection of classes for performing various transformations on arrays in PySpark.
These transformations include operations such as removing duplicates, exploding arrays into separate rows, reversing the order of elements, sorting elements, removing certain values, and calculating aggregate statistics like minimum, maximum, sum, mean, and median.
Concept
- Every transformation in this module is implemented as a class that inherits from the
ArrayTransformation
class. - The
ArrayTransformation
class is a subclass ofColumnsTransformationWithTarget
- The
ArrayTransformation
class implements thefunc
method, which is used to define the transformation logic. - The
func
method takes acolumn
as input and returns aColumn
object. - The
Column
object is a PySpark column that can be used to perform transformations on a DataFrame column. - The
ArrayTransformation
limits the data type of the transformation to array by setting theColumnConfig
class torun_for_all_data_type = [SparkDatatype.ARRAY]
andlimit_data_type = [SparkDatatype.ARRAY]
.
See Also
- koheesio.spark.transformations Module containing all transformation classes.
- koheesio.spark.transformations.ColumnsTransformationWithTarget Base class for all transformations that operate on columns and have a target column.
koheesio.spark.transformations.arrays.ArrayDistinct #
Remove duplicates from array
filter_empty
class-attribute
instance-attribute
#
filter_empty: bool = Field(
default=True,
description="Remove null, nan, and empty values from array. Default is True.",
)
func #
Source code in src/koheesio/spark/transformations/arrays.py
koheesio.spark.transformations.arrays.ArrayMax #
Return the maximum value in the array
koheesio.spark.transformations.arrays.ArrayMean #
Return the mean of the values in the array.
Note: Only numeric values are supported for calculating the mean.
func #
Calculate the mean of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
koheesio.spark.transformations.arrays.ArrayMedian #
Return the median of the values in the array.
The median is the middle value in a sorted, ascending or descending, list of numbers.
- If the size of the array is even, the median is the average of the two middle numbers.
- If the size of the array is odd, the median is the middle number.
Note: Only numeric values are supported for calculating the median.
func #
Calculate the median of the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
koheesio.spark.transformations.arrays.ArrayMin #
koheesio.spark.transformations.arrays.ArrayNullNanProcess #
Process an array by removing NaN and/or NULL values from elements.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keep_nan |
bool
|
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array. |
False
|
keep_null |
bool
|
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array. |
False
|
Returns:
Name | Type | Description |
---|---|---|
column |
Column
|
The processed column with NaN and/or NULL values removed from elements. |
Examples:
>>> input_data = [(1, [1.1, 2.1, 4.1, float("nan")])]
>>> input_schema = StructType([StructField("id", IntegerType(), True),
StructField("array_float", ArrayType(FloatType()), True),
])
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(input_data, schema=input_schema)
>>> transformer = ArrayNumericNanProcess(column="array_float", keep_nan=False)
>>> transformer.transform(df)
>>> print(transformer.output.df.collect()[0].asDict()["array_float"])
[1.1, 2.1, 4.1]
>>> input_data = [(1, [1.1, 2.2, 4.1, float("nan")])]
>>> input_schema = StructType([StructField("id", IntegerType(), True),
StructField("array_float", ArrayType(FloatType()), True),
])
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(input_data, schema=input_schema)
>>> transformer = ArrayNumericNanProcess(column="array_float", keep_nan=True)
>>> transformer.transform(df)
>>> print(transformer.output.df.collect()[0].asDict()["array_float"])
[1.1, 2.1, 4.1, nan]
keep_nan
class-attribute
instance-attribute
#
keep_nan: bool = Field(
False,
description="Whether to keep nan values in the array. Default is False. If set to True, the nan values will be kept in the array.",
)
keep_null
class-attribute
instance-attribute
#
keep_null: bool = Field(
False,
description="Whether to keep null values in the array. Default is False. If set to True, the null values will be kept in the array.",
)
func #
Process the given column by removing NaN and/or NULL values from elements.
Parameters:
column : Column The column to be processed.
Returns:
column : Column The processed column with NaN and/or NULL values removed from elements.
Source code in src/koheesio/spark/transformations/arrays.py
koheesio.spark.transformations.arrays.ArrayRemove #
Remove a certain value from the array
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keep_nan |
bool
|
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array. |
False
|
keep_null |
bool
|
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array. |
False
|
make_distinct
class-attribute
instance-attribute
#
make_distinct: bool = Field(
default=False,
description="Whether to remove duplicates from the array.",
)
value
class-attribute
instance-attribute
#
value: Any = Field(
default=None,
description="The value to remove from the array.",
)
func #
Source code in src/koheesio/spark/transformations/arrays.py
koheesio.spark.transformations.arrays.ArrayReverse #
koheesio.spark.transformations.arrays.ArraySort #
Sort the elements in the array
By default, the elements are sorted in ascending order. To sort the elements in descending order, set the reverse
parameter to True.
koheesio.spark.transformations.arrays.ArraySortDesc #
koheesio.spark.transformations.arrays.ArraySum #
Return the sum of the values in the array
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keep_nan |
bool
|
Whether to keep NaN values in the array. If set to True, the NaN values will be kept in the array. |
False
|
keep_null |
bool
|
Whether to keep NULL values in the array. If set to True, the NULL values will be kept in the array. |
False
|
func #
Using the aggregate
function to sum the values in the array
Source code in src/koheesio/spark/transformations/arrays.py
koheesio.spark.transformations.arrays.ArrayTransformation #
Base class for array transformations
ColumnConfig #
koheesio.spark.transformations.arrays.Explode #
Explode the array into separate rows
distinct
class-attribute
instance-attribute
#
distinct: bool = Field(
False,
description="Remove duplicates from the exploded array. Default is False.",
)
preserve_nulls
class-attribute
instance-attribute
#
preserve_nulls: bool = Field(
True,
description="Preserve rows with null values in the exploded array by using explode_outer instead of explode.Default is True.",
)