Writers

The Writer class is used to write the DataFrame to a target.

koheesio.spark.writers.BatchOutputMode #

For Batch:

append: Append the contents of the DataFrame to the output table, default option in Koheesio.
overwrite: overwrite the existing data.
ignore: ignore the operation (i.e. no-op).
error or errorifexists: throw an exception at runtime.
merge: update matching data in the table and insert rows that do not exist.
merge_all: update matching data in the table and insert rows that do not exist.

APPEND `class-attribute` `instance-attribute` #

APPEND = 'append'

ERROR `class-attribute` `instance-attribute` #

ERROR = 'error'

ERRORIFEXISTS `class-attribute` `instance-attribute` #

ERRORIFEXISTS = 'error'

IGNORE `class-attribute` `instance-attribute` #

IGNORE = 'ignore'

MERGE `class-attribute` `instance-attribute` #

MERGE = 'merge'

MERGEALL `class-attribute` `instance-attribute` #

MERGEALL = 'merge_all'

MERGE_ALL `class-attribute` `instance-attribute` #

MERGE_ALL = 'merge_all'

OVERWRITE `class-attribute` `instance-attribute` #

OVERWRITE = 'overwrite'

koheesio.spark.writers.StreamingOutputMode #

For Streaming:

append: only the new rows in the streaming DataFrame will be written to the sink.
complete: all the rows in the streaming DataFrame/Dataset will be written to the sink every time there are some updates.
update: only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are some updates. If the query doesn't contain aggregations, it will be equivalent to append mode.

APPEND `class-attribute` `instance-attribute` #

APPEND = 'append'

COMPLETE `class-attribute` `instance-attribute` #

COMPLETE = 'complete'

UPDATE `class-attribute` `instance-attribute` #

UPDATE = 'update'

koheesio.spark.writers.Writer #

The Writer class is used to write the DataFrame to a target.

df `class-attribute` `instance-attribute` #

df: Optional[DataFrame] = Field(
    default=None,
    description="The Spark DataFrame",
    exclude=True,
)

format `class-attribute` `instance-attribute` #

format: str = Field(
    default="delta", description="The format of the output"
)

streaming `property` #

streaming: bool

Check if the DataFrame is a streaming DataFrame or not.

execute `abstractmethod` #

execute() -> Output

Execute on a Writer should handle writing of the self.df (input) as a minimum

Source code in src/koheesio/spark/writers/__init__.py

@abstractmethod
def execute(self) -> SparkStep.Output:
    """Execute on a Writer should handle writing of the self.df (input) as a minimum"""
    # self.df  # input dataframe
    ...

write #

write(df: Optional[DataFrame] = None) -> Output

Write the DataFrame to the output using execute() and return the output.

If no DataFrame is passed, the self.df will be used. If no self.df is set, a RuntimeError will be thrown.

Source code in src/koheesio/spark/writers/__init__.py

def write(self, df: Optional[DataFrame] = None) -> SparkStep.Output:
    """Write the DataFrame to the output using execute() and return the output.

    If no DataFrame is passed, the self.df will be used.
    If no self.df is set, a RuntimeError will be thrown.
    """
    self.df = df or self.df
    if not self.df:
        raise RuntimeError("No valid Dataframe was passed")
    self.execute()
    return self.output

Writers

koheesio.spark.writers.BatchOutputMode #

APPEND class-attribute instance-attribute #

ERROR class-attribute instance-attribute #

ERRORIFEXISTS class-attribute instance-attribute #

IGNORE class-attribute instance-attribute #

MERGE class-attribute instance-attribute #

MERGEALL class-attribute instance-attribute #

MERGE_ALL class-attribute instance-attribute #

OVERWRITE class-attribute instance-attribute #

koheesio.spark.writers.StreamingOutputMode #

APPEND class-attribute instance-attribute #

COMPLETE class-attribute instance-attribute #

UPDATE class-attribute instance-attribute #

koheesio.spark.writers.Writer #

df class-attribute instance-attribute #

format class-attribute instance-attribute #

streaming property #

execute abstractmethod #

write #

APPEND `class-attribute` `instance-attribute` #

ERROR `class-attribute` `instance-attribute` #

ERRORIFEXISTS `class-attribute` `instance-attribute` #

IGNORE `class-attribute` `instance-attribute` #

MERGE `class-attribute` `instance-attribute` #

MERGEALL `class-attribute` `instance-attribute` #

MERGE_ALL `class-attribute` `instance-attribute` #

OVERWRITE `class-attribute` `instance-attribute` #

APPEND `class-attribute` `instance-attribute` #

COMPLETE `class-attribute` `instance-attribute` #

UPDATE `class-attribute` `instance-attribute` #

df `class-attribute` `instance-attribute` #

format `class-attribute` `instance-attribute` #

streaming `property` #

execute `abstractmethod` #