Drop column

This module defines the DropColumn class, a subclass of ColumnsTransformation.

koheesio.spark.transformations.drop_column.DropColumn

Drop one or more columns

The DropColumn class is used to drop one or more columns from a PySpark DataFrame. It wraps the pyspark.DataFrame.drop function and can handle either a single string or a list of strings as input.

If a specified column does not exist in the DataFrame, it is ignored: no error or warning is raised, and all existing columns remain.

Expected behavior
  • When a specified column does not exist, all columns remain (no error or warning is raised)
  • Either a single string, or a list of strings can be specified
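
The two behaviors listed above can be modelled without Spark. The snippet below is only a minimal pure-Python sketch: `drop_columns` and the `Row` alias are hypothetical names invented for this illustration, rows are plain dictionaries rather than a PySpark DataFrame, and none of it is Koheesio's actual implementation.

```python
from typing import Dict, List, Union

# Assumption: a "row" is modelled as a plain dict, not a PySpark Row.
Row = Dict[str, object]


def drop_columns(rows: List[Row], column: Union[str, List[str]]) -> List[Row]:
    """Hypothetical stand-in for DropColumn: remove the named keys from every row."""
    # Normalise a single string into a one-element set of names.
    to_drop = {column} if isinstance(column, str) else set(column)
    # Names that are not present in a row are simply skipped, so nothing is raised.
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


rows = [{"product": "Beans", "amount": 1600, "country": "USA"}]

single = drop_columns(rows, "product")              # a single string
several = drop_columns(rows, ["product", "amount"])  # a list of strings
missing = drop_columns(rows, "does_not_exist")       # unknown name: rows unchanged
```

Here `single` keeps `amount` and `country`, `several` keeps only `country`, and `missing` equals the input, which mirrors the behavior described for a column that does not exist.
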
Example

df:

product              amount  country
Banana lemon orange  1000    USA
Carrots Blueberries  1500    USA
Beans                1600    USA

output_df = DropColumn(column="product").transform(df)

output_df:

amount  country
1000    USA
1500    USA
1600    USA

In this example, the product column is dropped from the DataFrame df.

execute

execute()
Source code in src/koheesio/spark/transformations/drop_column.py
def execute(self):
    self.log.info(f"{self.column=}")
    self.output.df = self.df.drop(*self.columns)