# Writer Module
The Writer module in Koheesio provides a set of classes for writing data to various destinations. A `Writer` is a
type of `SparkStep` that takes the data from `self.input.df` and writes it to a destination based on the writer's configured parameters.
## What is a Writer?
A `Writer` is a subclass of `SparkStep` that writes data to a destination. The data to be written is taken from a
DataFrame, which is accessible through the `df` property of the `Writer`.
## How to Define a Writer?
To define a `Writer`, you create a subclass of the `Writer` class and implement the `execute` method. The `execute`
method should take the data from `self.input.df` and write it to the destination.
Here's an example of a `Writer`:

```python
# import path assumed; it may differ between Koheesio versions
from koheesio.spark.writers import Writer


class MyWriter(Writer):
    def execute(self):
        # get data from self.input.df
        data = self.input.df

        # write data to the destination
        # (write_to_destination is a placeholder for your own write logic)
        write_to_destination(data)
```
## Key Features of a Writer
- Write Method: The `Writer` class provides a `write` method that calls the `execute` method and writes the data to the destination. Essentially, calling `.write()` is a shorthand for calling `.execute().output.df`. This allows you to write data with a `Writer` without having to call the `execute` method directly; it is a convenience method that simplifies the usage of a `Writer`.

  Here's an example of how to use the `.write()` method:

  ```python
  # Instantiate the Writer
  my_writer = MyWriter()

  # Use the .write() method to write the data
  my_writer.write()

  # The data from MyWriter's DataFrame is now written to the destination
  ```

  In this example, `MyWriter` is a subclass of `Writer` that you've defined. After creating an instance of `MyWriter`, you call the `.write()` method to write the data to the destination (a complete round trip, including supplying the input DataFrame, is sketched after this list).

- DataFrame Property: The `Writer` class provides a `df` property as a shorthand for accessing `self.input.df`. This property ensures that the data is ready to be written, even if the `execute` method hasn't been explicitly called.

  Here's an example of how to use the `df` property:

  ```python
  # Instantiate the Writer
  my_writer = MyWriter()

  # Use the df property to get the data as a DataFrame
  df = my_writer.df

  # Now df is a DataFrame with the data that will be written by MyWriter
  ```

  In this example, after creating an instance of `MyWriter`, you access the `df` property to get the data as a DataFrame. The DataFrame `df` now contains the data that will be written by `MyWriter`.

- SparkSession: Every `Writer` has a `SparkSession` available as `self.spark`. This is the currently active `SparkSession`, which can be used to perform distributed data processing tasks.

  Here's an example of how to use the `spark` property:

  ```python
  # Instantiate the Writer
  my_writer = MyWriter()

  # Use the spark property to get the SparkSession
  spark = my_writer.spark

  # Now spark is the SparkSession associated with MyWriter
  ```

  In this example, after creating an instance of `MyWriter`, you access the `spark` property to get the `SparkSession`, which can then be used to perform distributed data processing tasks.
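Putting these pieces together, a typical round trip might look like the following minimal sketch. It reuses `MyWriter` from the earlier example and assumes the input DataFrame is supplied as a `df` argument at construction time; the exact input field name may differ between Koheesio versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Supply the input DataFrame at construction time
# (assumption: the input field is named `df`)
my_writer = MyWriter(df=df)

# Shorthand for running execute() and writing to the destination
my_writer.write()
```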
## Understanding Inheritance in Writers

Just like a `Step`, a `Writer` is defined as a subclass that inherits from the base `Writer` class. This means it
inherits all the properties and methods from the `Writer` class and can add or override them as needed. The main method
that needs to be overridden is the `execute` method, which should implement the logic for writing data from
`self.input.df` to the destination.
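As a minimal sketch of such a subclass: `CsvFileWriter` and its `path` field are hypothetical, and the sketch assumes input fields are declared directly on the class as Pydantic fields.

```python
class CsvFileWriter(Writer):
    """Hypothetical writer that persists the input DataFrame as CSV files."""

    # `path` is an illustrative input field; Pydantic validates it at construction
    path: str

    def execute(self):
        # self.df is shorthand for self.input.df (see "Key Features" above)
        self.df.write.mode("overwrite").csv(self.path)
```

Everything else, including `.write()` and the `spark` property, is inherited from the base `Writer` class unchanged.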
## Examples of Writer Classes in Koheesio
Koheesio provides a variety of `Writer` subclasses for writing data to different destinations. Here are just a few
examples:

- `BoxFileWriter`
- `DeltaTableStreamWriter`
- `DeltaTableWriter`
- `DummyWriter`
- `ForEachBatchStreamWriter`
- `KafkaWriter`
- `SnowflakeWriter`
- `StreamWriter`
Please note that this is not an exhaustive list; Koheesio provides many more `Writer` subclasses for a wide range of
data destinations. For a complete list, refer to the Koheesio documentation or the source code.
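All of these share the `Writer` interface, so changing where data lands is typically a matter of changing which class you instantiate. The sketch below is illustrative only: the import paths and constructor parameters are assumptions and may differ between Koheesio versions, so check the API reference.

```python
# Illustrative only; import paths and parameter names are assumptions
from koheesio.spark.writers.dummy import DummyWriter
from koheesio.spark.writers.delta import DeltaTableWriter

# During development: a writer that merely displays the DataFrame
writer = DummyWriter(df=df)

# In production: swap in a Delta table destination without touching
# the rest of the pipeline (`table` assumed to name the target table)
writer = DeltaTableWriter(df=df, table="catalog.schema.my_table")

writer.write()
```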
## Benefits of Using Writers in Data Pipelines
Using Writer classes in your data pipelines has several benefits:
- Simplicity: Writers abstract away the details of writing data to various destinations, allowing you to focus on the logic of your pipeline.
- Consistency: By using Writers, you ensure that data is written in a consistent manner across different parts of your pipeline.
- Flexibility: Writers can be easily swapped out for different data destinations without changing the rest of your pipeline.
- Efficiency: Writers automatically manage resources like connections and file handles, ensuring efficient use of resources.
- Early Input Validation: As a `Writer` is a type of `SparkStep`, which in turn is a `Step` and a type of Pydantic `BaseModel`, all inputs are validated when an instance of a `Writer` class is created. This early validation helps catch errors such as an invalid database URL before the PySpark pipeline starts executing, which avoids unnecessary computation and makes your data pipelines more robust and reliable (see the sketch after this list).
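To illustrate the early-validation benefit, here is a minimal sketch: `JdbcWriter` and its `url` field are hypothetical, but the failure mode is the standard Pydantic one; constructing the writer with a missing or invalid input raises a `ValidationError` before any Spark job runs.

```python
from pydantic import ValidationError


class JdbcWriter(Writer):
    """Hypothetical writer with a required, validated input field."""

    url: str  # e.g. a JDBC connection URL

    def execute(self):
        ...  # write self.df to the database at self.url


try:
    JdbcWriter()  # missing required `url`: fails here, not mid-pipeline
except ValidationError as err:
    print(err)
```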
By using the concept of a `Writer`, you can create data pipelines that are simple, consistent, flexible, efficient, and
reliable.