
Spark SQL reader

This module contains the SparkSqlReader class, which reads a Spark SQL compliant query and returns the result as a DataFrame.

koheesio.spark.readers.spark_sql_reader.SparkSqlReader

SparkSqlReader reads a Spark SQL compliant query and returns the result as a DataFrame.

The SQL can originate from a string or a file and may contain placeholders (parameters) for templating (see the sketch below).

- Placeholders are identified with ${placeholder}.
- Placeholders can be passed as explicit params (params) or as implicit params (kwargs).
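The ${placeholder} syntax matches Python's built-in string.Template, so the substitution semantics can be sketched as follows. This is illustrative only and not necessarily Koheesio's internal implementation:

from string import Template

# Illustrative sketch: ${placeholder} substitution behaves roughly like
# string.Template; Koheesio's actual templating code may differ.
sql = "SELECT id, ${dynamic_column} AS extra_column FROM ${table_name}"
params = {"dynamic_column": "name", "table_name": "my_table"}

rendered = Template(sql).substitute(params)
print(rendered)  # SELECT id, name AS extra_column FROM my_table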

Example

SQL script (example.sql):

SELECT id, id + 1 AS incremented_id, ${dynamic_column} AS extra_column
FROM ${table_name}

Python code:

from koheesio.spark.readers import SparkSqlReader

reader = SparkSqlReader(
    sql_path="example.sql",
    # params can also be passed as kwargs
    dynamic_column="name",
    table_name="my_table",
)
)
reader.execute()

In this example, the SQL script is read from a file and the placeholders are replaced with the given params. The resulting SQL query is:

SELECT id, id + 1 AS incremented_id, name AS extra_column
FROM my_table

The query is then executed and the resulting DataFrame is stored in the output.df attribute.
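For instance, the DataFrame can be retrieved from the step's output after execution. A minimal sketch, assuming a Spark session is available to the step; show() is just an illustrative action:

from koheesio.spark.readers import SparkSqlReader

reader = SparkSqlReader(
    sql_path="example.sql",
    dynamic_column="name",
    table_name="my_table",
)
reader.execute()

df = reader.output.df  # the resulting Spark DataFrame
df.show()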

Parameters:

sql_path : str or Path (required)
    Path to a SQL file.

sql : str (required)
    SQL query to execute.

params : dict (required)
    Placeholders (parameters) for templating. These are identified with ${placeholder} in the SQL script.
Notes

Any arbitrary kwargs passed to the class will be added to params.
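For example, assuming kwargs are merged into params as described, the following two invocations should be equivalent:

from koheesio.spark.readers import SparkSqlReader

# Explicit params dict
reader_a = SparkSqlReader(
    sql="SELECT id FROM ${table_name}",
    params={"table_name": "my_table"},
)

# Implicit params via kwargs, merged into params per the note above
reader_b = SparkSqlReader(
    sql="SELECT id FROM ${table_name}",
    table_name="my_table",
)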

execute

execute() -> Output
Source code in src/koheesio/spark/readers/spark_sql_reader.py
def execute(self) -> Reader.Output:
    # Run the templated SQL query and store the resulting DataFrame on the step's output
    self.output.df = self.spark.sql(self.query)