Utils
Spark Utility functions
koheesio.spark.utils.spark_minor_version (module attribute)
spark_minor_version: float = get_spark_minor_version()
koheesio.spark.utils.SparkDatatype
Allowed Spark datatypes
The following table lists the data types that are supported by Spark SQL.
Data type | SQL name |
---|---|
ByteType | BYTE, TINYINT |
ShortType | SHORT, SMALLINT |
IntegerType | INT, INTEGER |
LongType | LONG, BIGINT |
FloatType | FLOAT, REAL |
DoubleType | DOUBLE |
DecimalType | DECIMAL, DEC, NUMERIC |
StringType | STRING |
BinaryType | BINARY |
BooleanType | BOOLEAN |
TimestampType | TIMESTAMP, TIMESTAMP_LTZ |
DateType | DATE |
ArrayType | ARRAY |
MapType | MAP |
NullType | VOID |
Not supported yet
- TimestampNTZType: TIMESTAMP_NTZ
- YearMonthIntervalType: INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
- DayTimeIntervalType: INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
from_string (classmethod)
from_string(value: str) -> SparkDatatype
Returns the matching Enum value for a given string. This method is not case-sensitive.
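A case-insensitive lookup of this kind can be sketched with a plain Python `Enum`. The class `DatatypeSketch` and its members below are hypothetical stand-ins chosen for illustration; they are not the actual definition of `SparkDatatype`, and the real method may normalise its input differently.

```python
from enum import Enum


class DatatypeSketch(Enum):
    """Hypothetical stand-in for SparkDatatype with only a few members."""

    STRING = "string"
    INTEGER = "integer"
    BOOLEAN = "boolean"

    @classmethod
    def from_string(cls, value: str) -> "DatatypeSketch":
        # Upper-case the input so "String", "STRING" and "string" all
        # resolve to the same member name.
        return cls[value.upper()]


# All three spellings resolve to the same member.
for spelling in ("string", "String", "STRING"):
    assert DatatypeSketch.from_string(spelling) is DatatypeSketch.STRING
```

In this sketch an unknown name raises `KeyError`; how the real method reports unknown values is not stated here.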
koheesio.spark.utils.get_spark_minor_version
get_spark_minor_version() -> float
Returns the minor version of the Spark instance as a float.
For example, if the Spark version is 3.3.2, this function returns 3.3.
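The reduction from a full version string to a `major.minor` float can be illustrated as follows. The helper `minor_version_from_string` is a hypothetical name used only for this sketch; it takes the version as an argument rather than reading it from a Spark session, and it is not necessarily how `get_spark_minor_version` is implemented.

```python
def minor_version_from_string(version: str) -> float:
    """Hypothetical helper: reduce 'major.minor.patch' to a major.minor float."""
    # Keep only the first two dot-separated components.
    major, minor = version.split(".")[:2]
    return float(f"{major}.{minor}")


assert minor_version_from_string("3.3.2") == 3.3
assert minor_version_from_string("3.4.0") == 3.4
```

One consequence of representing the version as a float is that a two-digit minor version such as 3.10 would compare equal to 3.1, so comparisons of this kind are only reliable for single-digit minor versions.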
koheesio.spark.utils.import_pandas_based_on_pyspark_version
This function checks the installed version of PySpark and then tries to import the appropriate version of pandas. If the correct version of pandas is not installed, it raises an ImportError with a message indicating which version of pandas should be installed.
Source code in src/koheesio/spark/utils.py
koheesio.spark.utils.on_databricks
on_databricks() -> bool
Checks whether the code is running on Databricks or elsewhere.
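One common way to make such a check is to look for the `DATABRICKS_RUNTIME_VERSION` environment variable, which Databricks clusters set. The sketch below assumes that approach; the function name `running_on_databricks` is hypothetical, and the detection logic actually used by `on_databricks` is not described here.

```python
import os


def running_on_databricks() -> bool:
    """Hypothetical check: a non-empty DATABRICKS_RUNTIME_VERSION means Databricks."""
    # An unset or empty variable is treated as "not on Databricks".
    return bool(os.environ.get("DATABRICKS_RUNTIME_VERSION"))


if running_on_databricks():
    print("Running on Databricks")
else:
    print("Running elsewhere")
```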
koheesio.spark.utils.schema_struct_to_schema_str
schema_struct_to_schema_str(schema: StructType) -> str
Converts a StructType to a schema string.
koheesio.spark.utils.spark_data_type_is_array
spark_data_type_is_array(data_type: DataType) -> bool