Row number dedup
This module contains the RowNumberDedup class, which performs a row_number deduplication operation on a DataFrame.
See the docstring of the RowNumberDedup class for more information.
koheesio.spark.transformations.row_number_dedup.RowNumberDedup #
A class used to perform a row_number deduplication operation on a DataFrame.
This class is a specialized transformation that extends the ColumnsTransformation class. It sorts the DataFrame based on the provided sort columns and assigns a row_number to each row. It then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row can be stored in a specified target column or a default column named "meta_row_number_column". The class also provides an option to preserve meta columns (like the row_numberk column) in the output DataFrame.
Attributes:
| Name | Type | Description | 
|---|---|---|
| columns | list | List of columns to apply the transformation to. If a single '*' is passed as a column name or if the columns
parameter is omitted, the transformation will be applied to all columns of the data types specified in
 | 
| sort_columns | list | List of columns that the DataFrame will be sorted by. | 
| target_column | (str, optional) | Column where the row_number of each row will be stored. | 
| preserve_meta | (bool, optional) | Flag that determines whether the meta columns should be kept in the output DataFrame. | 
            preserve_meta
  
      class-attribute
      instance-attribute
  
#
preserve_meta: bool = Field(
    default=False,
    description="If true, meta columns are kept in output dataframe. Defaults to 'False'",
)
            sort_columns
  
      class-attribute
      instance-attribute
  
#
sort_columns: conlist(Union[str, Column], min_length=0) = (
    Field(
        default_factory=list,
        alias="sort_column",
        description="List of orderBy columns. If only one column is passed, it can be passed as a single object.",
    )
)
            target_column
  
      class-attribute
      instance-attribute
  
#
target_column: Optional[Union[str, Column]] = Field(
    default="meta_row_number_column",
    alias="target_suffix",
    description="The column to store the result in. If not provided, the result will be stored in the sourcecolumn. Alias: target_suffix - if multiple columns are given as source, this will be used as a suffix",
)
            window_spec
  
      property
  
#
    Builds a WindowSpec object based on the columns defined in the configuration.
The WindowSpec object is used to define a window frame over which functions are applied in Spark.
This method partitions the data by the columns returned by the get_columns method and then orders the
partitions by the columns specified in sort_columns.
Notes
The order of the columns in the WindowSpec object is preserved. If a column is passed as a string, it is converted to a Column object with DESC ordering.
Returns:
| Type | Description | 
|---|---|
| WindowSpec | A WindowSpec object that can be used to define a window frame in Spark. | 
execute #
execute() -> Output
Performs the row_number deduplication operation on the DataFrame.
This method sorts the DataFrame based on the provided sort columns, assigns a row_number to each row, and then filters the DataFrame to keep only the top-row_number row for each group of duplicates. The row_number of each row is stored in the target column. If preserve_meta is False, the method also drops the target column from the DataFrame.
Source code in src/koheesio/spark/transformations/row_number_dedup.py
              set_sort_columns #
Validates and optimizes the sort_columns parameter.
This method ensures that sort_columns is a list (or single object) of unique strings or Column objects. It removes any empty strings or None values from the list and deduplicates the columns.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| columns_value | Union[str, Column, List[Union[str, Column]]] | The value of the sort_columns parameter. | required | 
Returns:
| Type | Description | 
|---|---|
| List[Union[str, Column]] | The optimized and deduplicated list of sort columns. |