Common
Spark utility functions
koheesio.spark.utils.common.AnalysisException (module-attribute)

AnalysisException = AnalysisException
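Koheesio re-exports PySpark's AnalysisException, likely so downstream code gets one stable import path regardless of where the installed PySpark version defines it. A minimal sketch of catching it, assuming a Spark session is already active and `nonexistent_table` is a placeholder:

```python
from koheesio.spark.utils.common import AnalysisException, get_active_session

spark = get_active_session()  # assumes a Spark session is already running

# Querying a table that does not exist raises AnalysisException during
# query analysis, before any execution happens.
try:
    spark.sql("SELECT * FROM nonexistent_table").collect()
except AnalysisException as err:
    print(f"Query analysis failed: {err}")
```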
koheesio.spark.utils.common.SPARK_MINOR_VERSION (module-attribute)

SPARK_MINOR_VERSION: float = get_spark_minor_version()
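Since the constant is a plain float (e.g. 3.5 for Spark 3.5.1), it can be compared directly to gate version-specific behavior. A small sketch; the 3.4 cutoff is only an illustrative example:

```python
from koheesio.spark.utils.common import SPARK_MINOR_VERSION

# Gate version-specific behavior on the minor version, e.g. features
# introduced in Spark 3.4 such as the built-in connect module.
if SPARK_MINOR_VERSION >= 3.4:
    print(f"Spark {SPARK_MINOR_VERSION}: connect module may be available")
else:
    print(f"Spark {SPARK_MINOR_VERSION}: pre-3.4 behavior")
```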
koheesio.spark.utils.common.SparkDatatype

Allowed Spark data types.

The following table lists the data types supported by Spark SQL.
| Data type | SQL name |
|---|---|
| ByteType | BYTE, TINYINT |
| ShortType | SHORT, SMALLINT |
| IntegerType | INT, INTEGER |
| LongType | LONG, BIGINT |
| FloatType | FLOAT, REAL |
| DoubleType | DOUBLE |
| DecimalType | DECIMAL, DEC, NUMERIC |
| StringType | STRING |
| BinaryType | BINARY |
| BooleanType | BOOLEAN |
| TimestampType | TIMESTAMP, TIMESTAMP_LTZ |
| DateType | DATE |
| ArrayType | ARRAY |
| MapType | MAP |
| NullType | VOID |
Not supported yet:

- TimestampNTZType: TIMESTAMP_NTZ
- YearMonthIntervalType: INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH
- DayTimeIntervalType: INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND
from_string (classmethod)

from_string(value: str) -> SparkDatatype

Returns the matching Enum value for a given string. This method is case-insensitive.
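A minimal sketch of the case-insensitive lookup, assuming the Enum exposes a STRING member matching the table above:

```python
from koheesio.spark.utils.common import SparkDatatype

# Lookup is case-insensitive: both spellings resolve to the same member
dt_upper = SparkDatatype.from_string("STRING")
dt_lower = SparkDatatype.from_string("string")
assert dt_upper is dt_lower
print(dt_upper)
```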
koheesio.spark.utils.common.check_if_pyspark_connect_is_supported

check_if_pyspark_connect_is_supported() -> bool

Check if the current version of PySpark supports the connect module.
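A sketch of guarding a connect-only import behind this check, assuming the check passing means the connect imports will succeed:

```python
from koheesio.spark.utils.common import check_if_pyspark_connect_is_supported

if check_if_pyspark_connect_is_supported():
    # Only import connect-specific classes when the module is available
    from pyspark.sql.connect.session import SparkSession as ConnectSparkSession
    print("Spark Connect is supported:", ConnectSparkSession)
else:
    print("Spark Connect is not supported by this PySpark installation")
```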
koheesio.spark.utils.common.get_active_session

get_active_session() -> SparkSession

Get the active Spark session.
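A minimal sketch, assuming a Spark session has already been started elsewhere (the function retrieves it rather than creating one):

```python
from koheesio.spark.utils.common import get_active_session

# Retrieves the already-running session; it does not build a new one
spark = get_active_session()
print(spark.version)
```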
koheesio.spark.utils.common.get_column_name

get_column_name(col: Column) -> str

Get the column name from a Column object.

Normally, the name of a Column object is not directly accessible through the regular PySpark API. This function extracts the name of the given Column object without needing to provide it in the context of a DataFrame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col | Column | The Column object | required |

Returns:

| Type | Description |
|---|---|
| str | The name of the given column |
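A sketch of extracting a name without a DataFrame; the session is only needed because pyspark's col() requires one:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from koheesio.spark.utils.common import get_column_name

spark = SparkSession.builder.getOrCreate()  # col() needs an active session

# No DataFrame needed: the name is read from the Column expression itself
name = get_column_name(col("amount"))
print(name)  # amount
```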
koheesio.spark.utils.common.get_spark_minor_version

get_spark_minor_version() -> float

Returns the minor version of the Spark instance.

For example, if the Spark version is 3.3.2, this function returns 3.3.
koheesio.spark.utils.common.import_pandas_based_on_pyspark_version
import_pandas_based_on_pyspark_version() -> ModuleType
This function checks the installed version of PySpark and then tries to import the appropriate version of pandas. If the correct version of pandas is not installed, it raises an ImportError with a message indicating which version of pandas should be installed.
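A sketch of using the import guard; the returned module is whatever pandas installation the check approves:

```python
from koheesio.spark.utils.common import import_pandas_based_on_pyspark_version

# Raises ImportError with install instructions if the installed pandas
# version does not match what the installed PySpark requires
pandas = import_pandas_based_on_pyspark_version()
print(pandas.__version__)
```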
koheesio.spark.utils.common.on_databricks

on_databricks() -> bool

Returns whether we are running on Databricks.
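A sketch of branching on the runtime environment (that detection happens via Databricks runtime environment variables is an assumption here):

```python
from koheesio.spark.utils.common import on_databricks

if on_databricks():
    print("Running on Databricks: platform-specific features are available")
else:
    print("Running outside Databricks, e.g. locally or on another cluster")
```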
koheesio.spark.utils.common.schema_struct_to_schema_str

schema_struct_to_schema_str(schema: StructType) -> str

Converts a StructType to a schema string.
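A minimal sketch; the exact formatting of the returned string is determined by the implementation, but it is a textual rendering of the fields:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
from koheesio.spark.utils.common import schema_struct_to_schema_str

schema = StructType(
    [
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ]
)

# Produces a textual schema, e.g. one "name TYPE" entry per field
print(schema_struct_to_schema_str(schema))
```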
koheesio.spark.utils.common.show_string

show_string(
    df: DataFrame,
    n: int = 20,
    truncate: Union[bool, int] = True,
    vertical: bool = False,
) -> str
Returns a string representation of the DataFrame.

The default implementation of DataFrame.show() hardcodes a print statement, which is not always desirable. With this function, you can get the string representation of the DataFrame instead and choose how to display it.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to display | required |
| n | int | The number of rows to display, by default 20 | 20 |
| truncate | Union[bool, int] | If set to True, truncate the displayed columns, by default True | True |
| vertical | bool | If set to True, display the DataFrame vertically, by default False | False |
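A sketch of routing the rendered table to a logger instead of stdout; the DataFrame here is a throwaway example:

```python
import logging

from pyspark.sql import SparkSession
from koheesio.spark.utils.common import show_string

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Unlike df.show(), nothing is printed here; we decide where the text goes
table = show_string(df, n=10, truncate=False)
logger.info("\n%s", table)
```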
koheesio.spark.utils.common.spark_data_type_is_array
spark_data_type_is_array(data_type: DataType) -> bool
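A minimal sketch of the check, assuming it simply tests whether the given DataType is an ArrayType:

```python
from pyspark.sql.types import ArrayType, StringType
from koheesio.spark.utils.common import spark_data_type_is_array

assert spark_data_type_is_array(ArrayType(StringType()))
assert not spark_data_type_is_array(StringType())
```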