Reader Module#
The Reader module in Koheesio provides a set of classes for reading data from various sources. A Reader is a type
of SparkStep that reads data from a source based on the input parameters and stores the result in self.output.df
for subsequent steps.
What is a Reader?#
A Reader is a subclass of SparkStep that reads data from a source and stores the result. The source could be a
file, a database, a web API, or any other data source. The result is stored in a DataFrame, which is accessible through
the df property of the Reader.
API Reference#
See API Reference for a detailed description of the Reader class and its methods.
Key Features of a Reader#
- Read Method: The
Readerclass provides areadmethod that calls theexecutemethod and returns the result. Essentially, calling.read()is a shorthand for calling.execute().output.df. This allows you to read data from aReaderwithout having to call theexecutemethod directly. This is a convenience method that simplifies the usage of aReader.
Here's an example of how to use the .read() method:
# Instantiate the Reader
my_reader = MyReader()
# Use the .read() method to get the data as a DataFrame
df = my_reader.read()
# Now df is a DataFrame with the data read by MyReader
In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader,
you call the .read() method to read the data and get it back as a DataFrame. The DataFrame df now contains the
data read by MyReader.
- DataFrame Property: The
Readerclass provides adfproperty as a shorthand for accessingself.output.df. Ifself.output.dfisNone, theexecutemethod is run first. This property ensures that the data is loaded and ready to be used, even if theexecutemethod hasn't been explicitly called.
Here's an example of how to use the df property:
# Instantiate the Reader
my_reader = MyReader()
# Use the df property to get the data as a DataFrame
df = my_reader.df
# Now df is a DataFrame with the data read by MyReader
In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader,
you access the df property to get the data as a DataFrame. The DataFrame df now contains the data read by
MyReader.
- SparkSession: Every
Readerhas aSparkSessionavailable asself.spark. This is the currently activeSparkSession, which can be used to perform distributed data processing tasks.
Here's an example of how to use the spark property:
# Instantiate the Reader
my_reader = MyReader()
# Use the spark property to get the SparkSession
spark = my_reader.spark
# Now spark is the SparkSession associated with MyReader
In this example, MyReader is a subclass of Reader that you've defined. After creating an instance of MyReader,
you access the spark property to get the SparkSession. The SparkSession spark can now be used to perform
distributed data processing tasks.
How to Define a Reader?#
To define a Reader, you create a subclass of the Reader class and implement the execute method. The execute
method should read from the source and store the result in self.output.df. This is an abstract method, which means it
must be implemented in any subclass of Reader.
Here's an example of a Reader:
class MyReader(Reader):
def execute(self):
# read data from source
data = read_from_source()
# store result in self.output.df
self.output.df = data
Understanding Inheritance in Readers#
Just like a Step, a Reader is defined as a subclass that inherits from the base Reader class. This means it
inherits all the properties and methods from the Reader class and can add or override them as needed. The main method
that needs to be overridden is the execute method, which should implement the logic for reading data from the source
and storing it in self.output.df.
Benefits of Using Readers in Data Pipelines#
Using Reader classes in your data pipelines has several benefits:
-
Simplicity: Readers abstract away the details of reading data from various sources, allowing you to focus on the logic of your pipeline.
-
Consistency: By using Readers, you ensure that data is read in a consistent manner across different parts of your pipeline.
-
Flexibility: Readers can be easily swapped out for different data sources without changing the rest of your pipeline.
-
Efficiency: Readers automatically manage resources like connections and file handles, ensuring efficient use of resources.
By using the concept of a Reader, you can create data pipelines that are simple, consistent, flexible, and efficient.
Examples of Reader Classes in Koheesio#
Koheesio provides a variety of Reader subclasses for reading data from different sources. Here are just a few
examples:
-
Teradata Reader: A
Readersubclass for reading data from Teradata databases. It's defined in thekoheesio/steps/readers/teradata.pyfile. -
Snowflake Reader: A
Readersubclass for reading data from Snowflake databases. It's defined in thekoheesio/steps/readers/snowflake.pyfile. -
Box Reader: A
Readersubclass for reading data from Box. It's defined in thekoheesio/steps/integrations/box.pyfile.
These are just a few examples of the many Reader subclasses available in Koheesio. Each Reader subclass is designed
to read data from a specific source. They all inherit from the base Reader class and implement the execute method to
read data from their respective sources and store it in self.output.df.
Please note that this is not an exhaustive list. Koheesio provides many more Reader subclasses for a wide range of
data sources. For a complete list, please refer to the Koheesio documentation or the source code.
More readers can be found in the koheesio/steps/readers module.