Skip to content

Writers

The Writer class is used to write the DataFrame to a target.

koheesio.spark.writers.BatchOutputMode #

For Batch:

  • append: Append the contents of the DataFrame to the output table, default option in Koheesio.
  • overwrite: overwrite the existing data.
  • ignore: ignore the operation (i.e. no-op).
  • error or errorifexists: throw an exception at runtime.
  • merge: update matching data in the table and insert rows that do not exist.
  • merge_all: update matching data in the table and insert rows that do not exist.

APPEND class-attribute instance-attribute #

APPEND = 'append'

ERROR class-attribute instance-attribute #

ERROR = 'error'

ERRORIFEXISTS class-attribute instance-attribute #

ERRORIFEXISTS = 'error'

IGNORE class-attribute instance-attribute #

IGNORE = 'ignore'

MERGE class-attribute instance-attribute #

MERGE = 'merge'

MERGEALL class-attribute instance-attribute #

MERGEALL = 'merge_all'

MERGE_ALL class-attribute instance-attribute #

MERGE_ALL = 'merge_all'

OVERWRITE class-attribute instance-attribute #

OVERWRITE = 'overwrite'

koheesio.spark.writers.StreamingOutputMode #

For Streaming:

  • append: only the new rows in the streaming DataFrame will be written to the sink.
  • complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
  • update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.

APPEND class-attribute instance-attribute #

APPEND = 'append'

COMPLETE class-attribute instance-attribute #

COMPLETE = 'complete'

UPDATE class-attribute instance-attribute #

UPDATE = 'update'

koheesio.spark.writers.Writer #

The Writer class is used to write the DataFrame to a target.

df class-attribute instance-attribute #

df: Optional[DataFrame] = Field(
    default=None,
    description="The Spark DataFrame",
    exclude=True,
)

format class-attribute instance-attribute #

format: str = Field(
    default="delta", description="The format of the output"
)

streaming property #

streaming: bool

Check if the DataFrame is a streaming DataFrame or not.

execute abstractmethod #

execute() -> Output

Execute on a Writer should handle writing of the self.df (input) as a minimum

Source code in src/koheesio/spark/writers/__init__.py
@abstractmethod
def execute(self) -> SparkStep.Output:
    """Execute on a Writer should handle writing of the self.df (input) as a minimum"""
    # self.df  # input dataframe
    ...

write #

write(df: Optional[DataFrame] = None) -> Output

Write the DataFrame to the output using execute() and return the output.

If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.

Source code in src/koheesio/spark/writers/__init__.py
def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:
    """Write the DataFrame to the output using execute() and return the output.

    If no DataFrame is passed, the self.df will be used.
    If no self.df is set, a RuntimeError will be thrown.
    """
    self.df = df or self.df
    if not self.df:
        raise RuntimeError("No valid Dataframe was passed")
    self.execute()
    return self.output