Jdbc
Module for reading data from JDBC sources.
Classes:
Name | Description |
---|---|
JdbcReader | Reader for JDBC tables. |
koheesio.spark.readers.jdbc.JdbcReader #
Reader for JDBC tables.
Wrapper around Spark's jdbc read format
Notes
- Query has precedence over dbtable. If both query and dbtable are filled in, dbtable will be ignored!
- Extra options for the Spark reader can be passed through the `options` input. Refer to the Spark documentation for details: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
- Consider using `fetchsize` as one of the options, as it greatly increases the performance of the reader.
- Consider using `numPartitions`, `partitionColumn`, `lowerBound`, and `upperBound` together with a real or synthetic partitioning column, as this will improve reader performance (see the sketch after this list).
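As a minimal sketch of a partitioned read, assuming a Postgres source: the URL, table name, partitioning column, and bounds below are placeholders, not values from this module.

```python
from koheesio.spark.readers.jdbc import JdbcReader

# Hypothetical partitioned read; URL, table, column, and bounds are placeholders.
# Spark issues 8 parallel queries, splitting the numeric `id` column into ranges
# between lowerBound and upperBound.
jdbc_partitioned = JdbcReader(
    driver="org.postgresql.Driver",
    url="jdbc:postgresql://10.xxx.xxx.xxx:5432/YOUR_DATABASE",
    user="YOUR_USERNAME",
    password="***",
    dbtable="schemaname.tablename",
    options={
        "fetchsize": 2000,
        "numPartitions": 8,
        "partitionColumn": "id",
        "lowerBound": 0,
        "upperBound": 1_000_000,
    },
)
df = jdbc_partitioned.read()
```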
When implementing a JDBC reader, the `get_options()` method should be implemented. The method should return a dict of the options required for the specific JDBC driver, and it can be overridden in the child class (see the sketch under `get_options` at the end of this page). Additionally, the `driver` parameter should be set to the name of the JDBC driver. Be aware that the driver jar needs to be included in the Spark session; this class does not (and cannot) take care of that!
Example
Note: jars should be added to the Spark session manually. This class does not take care of that.
This example depends on the jar for MS SQL:
https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/9.2.1.jre8/mssql-jdbc-9.2.1.jre8.jar
```python
from koheesio.spark.readers.jdbc import JdbcReader

jdbc_mssql = JdbcReader(
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    url="jdbc:sqlserver://10.xxx.xxx.xxx:1433;databaseName=YOUR_DATABASE",
    user="YOUR_USERNAME",
    password="***",
    dbtable="schemaname.tablename",
    options={"fetchsize": 100},
)
df = jdbc_mssql.read()
```
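One way to include the jar, sketched under the assumption that it has been downloaded locally; the path and app name below are placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical setup: point spark.jars at the downloaded driver jar.
# Alternatively, spark.jars.packages can resolve it from Maven coordinates,
# e.g. "com.microsoft.sqlserver:mssql-jdbc:9.2.1.jre8".
spark = (
    SparkSession.builder.appName("jdbc-example")
    .config("spark.jars", "/path/to/mssql-jdbc-9.2.1.jre8.jar")
    .getOrCreate()
)
```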
dbtable class-attribute instance-attribute #

```python
dbtable: Optional[str] = Field(
    default=None,
    description="Database table name, also include schema name",
)
```
driver class-attribute instance-attribute #

```python
driver: str = Field(
    default=...,
    description="Driver name. Be aware that the driver jar needs to be passed to the task",
)
```
format class-attribute instance-attribute #

```python
format: str = Field(
    default="jdbc",
    description="The type of format to load. Defaults to 'jdbc'.",
)
```
options class-attribute instance-attribute #

```python
options: Optional[Dict[str, Any]] = Field(
    default_factory=dict,
    description="Extra options to pass to spark reader",
)
```
password class-attribute instance-attribute #

query class-attribute instance-attribute #
url class-attribute instance-attribute #

```python
url: str = Field(
    default=...,
    description="URL for the JDBC driver. Note, in some environments you need to use the IP Address instead of the hostname of the server.",
)
```
user class-attribute instance-attribute #

```python
user: str = Field(
    default=...,
    description="User to authenticate to the server",
)
```
execute #
Wrapper around Spark's jdbc read format
get_options #
Dictionary of options required for the specific JDBC driver.
Note: override this method if the driver requires custom option names, e.g. for Snowflake: `sfUrl`, `sfUser`, etc.
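As an illustrative sketch of such an override; the class name and exact option mapping are assumptions for illustration, not part of this module:

```python
from typing import Any, Dict

from koheesio.spark.readers.jdbc import JdbcReader


class SnowflakeStyleReader(JdbcReader):
    """Hypothetical reader for a driver that expects sf-prefixed option names."""

    def get_options(self) -> Dict[str, Any]:
        # Remap the generic JDBC fields to driver-specific names, then merge
        # any extra options passed by the caller on top.
        return {
            "sfUrl": self.url,
            "sfUser": self.user,
            **(self.options or {}),
        }
```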