Cast to datatype
Transformations to cast a column or set of columns to a given datatype.
Each of these has been vetted to raise a warning, rather than an error, when an incompatible datatype is passed, so that a cast does not fail the job or pipeline.
Furthermore, detailed tests have been added to ensure that types are actually compatible as prescribed.
Concept
- One can use the CastToDatatype class directly, or use one of the more specific subclasses.
- Each subclass is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
- Compatible Data types are specified in the ColumnConfig class of each subclass, and are documented in the docstring of each subclass.
See class docstrings for more information
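The 'run_for_all' behaviour described above can be pictured with a small plain-Python sketch. This is not Koheesio's implementation: the helper name `select_columns`, the schema dictionary, and the use of plain strings for type names are assumptions made purely for illustration.

```python
from typing import Dict, Iterable, List, Optional


def select_columns(
    schema: Dict[str, str],
    compatible: Iterable[str],
    columns: Optional[List[str]] = None,
) -> List[str]:
    """Pick the columns a cast would be applied to (hypothetical helper).

    schema:     column name -> data type name, e.g. {"a": "integer"}
    compatible: data type names that are safe to cast when running for all
    columns:    explicit column selection; when given, it always wins
    """
    if columns:
        # Explicitly requested columns are used as-is, even when their
        # type is not in the compatible list.
        return list(columns)
    allowed = set(compatible)
    # No columns given: fall back to every column with a compatible type.
    return [name for name, dtype in schema.items() if dtype in allowed]


schema = {"a": "integer", "b": "string", "c": "long"}
print(select_columns(schema, ["integer", "long"]))         # ['a', 'c']
print(select_columns(schema, ["integer", "long"], ["b"]))  # ['b']
```

The second call mirrors the rule that skipped types are still cast when they are named explicitly.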
Note
Dates, Arrays and Maps are not supported by this module.
- for dates, use the koheesio.spark.transformations.date_time module
- for arrays, use the koheesio.spark.transformations.arrays module
Classes:
Name | Description |
---|---|
CastToDatatype | Cast a column or set of columns to a given datatype |
CastToByte | Cast to Byte (a.k.a. tinyint) |
CastToShort | Cast to Short (a.k.a. smallint) |
CastToInteger | Cast to Integer (a.k.a. int) |
CastToLong | Cast to Long (a.k.a. bigint) |
CastToFloat | Cast to Float (a.k.a. real) |
CastToDouble | Cast to Double |
CastToDecimal | Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal) |
CastToString | Cast to String |
CastToBinary | Cast to Binary (a.k.a. byte array) |
CastToBoolean | Cast to Boolean |
CastToTimestamp | Cast to Timestamp |
Note
The following parameters are common to all classes in this module:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | Name of the source column(s). Alias: column | required |
target_column | str | Name of the target column or alias if more than one column is specified. Alias: target_alias | required |
datatype | str or SparkDatatype | Datatype to cast to. Choose from SparkDatatype Enum (only needs to be specified in CastToDatatype, all other classes have a fixed datatype) | required |
koheesio.spark.transformations.cast_to_datatype.CastToBinary #
Cast to Binary (a.k.a. byte array)
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- float
- double
- decimal
- boolean
- timestamp
- date
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- long
- string
Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- void: skipped by default
ColumnConfig #
koheesio.spark.transformations.cast_to_datatype.CastToBoolean #
Cast to Boolean
Unsupported datatypes:
Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- date: converts to null
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToBoolean class.
koheesio.spark.transformations.cast_to_datatype.CastToByte #
Cast to Byte (a.k.a. tinyint)
Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
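The range quoted above follows directly from the width of a signed integer. A minimal sketch in plain Python (the helper names `signed_range` and `fits_signed` are hypothetical and not part of Koheesio or Spark) reproduces the limits for 1, 2, 4 and 8 bytes, and can be used to check values before choosing a target type:

```python
from typing import Tuple


def signed_range(num_bytes: int) -> Tuple[int, int]:
    """Return (min, max) of a two's-complement signed integer of this width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits_signed(value: int, num_bytes: int) -> bool:
    """True when value can be stored in a signed integer of this width."""
    lowest, highest = signed_range(num_bytes)
    return lowest <= value <= highest


print(signed_range(1))       # (-128, 127)      byte / tinyint
print(signed_range(2))       # (-32768, 32767)  short / smallint
print(fits_signed(200, 1))   # False: 200 does not fit in a byte
print(fits_signed(200, 2))   # True: it does fit in a short
```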
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- boolean
- timestamp
- decimal
- double
- float
- long
- integer
- short
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- timestamp: range of values too small for timestamp to have any meaning
- date: converts to null
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToByte class.
koheesio.spark.transformations.cast_to_datatype.CastToDatatype #
Cast a column or set of columns to a given datatype
Wrapper around pyspark.sql.Column.cast
Concept
This class acts as the base class for all the other CastTo* classes. It is compatible with 'run_for_all' - meaning that if no column or columns are specified, it will be run for all compatible data types.
Example
input_df:
c1 | c2 |
---|---|
1 | 2 |
3 | 4 |
output_df = CastToDatatype(
column="c1",
datatype="string",
target_alias="c1",
).transform(input_df)
output_df:
c1 | c2 |
---|---|
"1" | 2 |
"3" | 4 |
In the example above, the column c1 is cast to a string datatype. The column c2 is not affected.
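Since the class is described as a wrapper around pyspark.sql.Column.cast, the same result can be sketched in plain PySpark. The helper below is an illustration under stated assumptions, not Koheesio code: `cast_column` is a hypothetical name, and `df` is assumed to be a pyspark.sql.DataFrame (only its `withColumn` method and `df[column]` indexing are used).

```python
from typing import Optional


def cast_column(df, column: str, datatype: str, target_alias: Optional[str] = None):
    """Cast one column with Column.cast and write it back (hypothetical helper).

    df is assumed to be a pyspark.sql.DataFrame. When target_alias is omitted,
    the source column is overwritten in place.
    """
    return df.withColumn(target_alias or column, df[column].cast(datatype))


# Usage, assuming input_df is the DataFrame from the example above:
# output_df = cast_column(input_df, "c1", "string")
```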
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | Name of the source column(s). Alias: column | required |
datatype | str or SparkDatatype | Datatype to cast to. Choose from SparkDatatype Enum | required |
target_column | str | Name of the target column or alias if more than one column is specified. Alias: target_alias | required |
datatype
class-attribute
instance-attribute
#
datatype: Union[str, SparkDatatype] = Field(
default=...,
description="Datatype. Choose from SparkDatatype Enum",
)
func #
Source code in src/koheesio/spark/transformations/cast_to_datatype.py
validate_datatype #
validate_datatype(datatype_value) -> SparkDatatype
Validate the datatype.
koheesio.spark.transformations.cast_to_datatype.CastToDecimal #
Cast to Decimal (a.k.a. decimal, numeric, dec, BigDecimal)
Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
The DecimalType must have fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). For example, (5, 2) can support values from -999.99 to 999.99.
The precision can be up to 38, and the scale must be less than or equal to the precision.
Spark sets the default precision and scale to (10, 0) when creating a DecimalType. However, when inferring schema from decimal.Decimal objects, it will be DecimalType(38, 18).
For compatibility reasons, Koheesio sets the default precision and scale to (38, 18) when creating a DecimalType.
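To make the relationship between precision, scale and the representable range concrete, here is a minimal sketch using Python's standard decimal module. The helper names `decimal_bounds` and `fits_decimal` are hypothetical; the sketch only checks the magnitude of a value and does not model how Spark rounds extra fractional digits.

```python
from decimal import Decimal
from typing import Tuple


def decimal_bounds(precision: int, scale: int) -> Tuple[Decimal, Decimal]:
    """Smallest and largest value a decimal(precision, scale) can hold."""
    if not 0 < precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    if not 0 <= scale <= precision:
        raise ValueError("scale must be between 0 and precision")
    # All digits set to 9, with `scale` of them after the decimal point.
    # Built from tuples so the value is exact regardless of context precision.
    nines = (9,) * precision
    return Decimal((1, nines, -scale)), Decimal((0, nines, -scale))


def fits_decimal(value: Decimal, precision: int, scale: int) -> bool:
    """True when value lies within the decimal(precision, scale) range."""
    lowest, highest = decimal_bounds(precision, scale)
    return lowest <= value <= highest


print(decimal_bounds(5, 2))                     # (Decimal('-999.99'), Decimal('999.99'))
print(fits_decimal(Decimal("1000"), 5, 2))      # False
print(fits_decimal(Decimal("1000"), 38, 18))    # True
```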
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- boolean
- timestamp
- date
- string
- void
- decimal: Spark will convert existing decimals to null if the precision and scale don't fit the data
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- date: converts to null
- void: skipped by default
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns | ListOfColumns | Name of the source column(s). Alias: column | * |
target_column | str | Name of the target column or alias if more than one column is specified. Alias: target_alias | required |
precision | conint(gt=0, le=38) | The maximum (i.e. total) number of digits (default: 38). Must be > 0. | 38 |
scale | conint(ge=0, le=18) | The number of digits to the right of the dot (default: 18). Must be >= 0. | 18 |
precision
class-attribute
instance-attribute
#
precision: conint(gt=0, le=38) = Field(
default=38,
description="The maximum total number of digits (precision) of the decimal. Must be > 0. Default is 38",
)
scale
class-attribute
instance-attribute
#
scale: conint(ge=0, le=18) = Field(
default=18,
description="The number of digits to the right of the decimal point (scale). Must be >= 0. Default is 18",
)
ColumnConfig #
Set the data types that are compatible with the CastToDecimal class.
func #
validate_scale_and_precisions #
Validate the precision and scale values.
koheesio.spark.transformations.cast_to_datatype.CastToDouble #
Cast to Double
Represents 8-byte double-precision floating point numbers. The range of numbers is from -1.7976931348623157E308 to 1.7976931348623157E308.
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- long
- float
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- date: converts to null
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToDouble class.
koheesio.spark.transformations.cast_to_datatype.CastToFloat #
Cast to Float (a.k.a. real)
Represents 4-byte single-precision floating point numbers. The range of numbers is from -3.402823E38 to 3.402823E38.
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- long
- double
- decimal
- boolean
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- timestamp: precision is lost (use CastToDouble instead)
- string: converts to null
- date: converts to null
- void: skipped by default
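The precision loss noted for timestamps can be reproduced without Spark. The sketch below assumes the timestamp is represented as seconds since the epoch and round-trips that number through a 4-byte and an 8-byte IEEE 754 value using the standard struct module; the helper names are hypothetical.

```python
import struct


def through_float32(value: float) -> float:
    """Store value as a 4-byte float and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def through_float64(value: float) -> float:
    """Store value as an 8-byte double and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


epoch_seconds = 1700000100.0  # an instant in November 2023

print(through_float32(epoch_seconds))  # 1700000128.0, off by 28 seconds
print(through_float64(epoch_seconds))  # 1700000100.0, preserved
```

At this magnitude, adjacent 4-byte float values are 128 apart, so second-level detail cannot survive, while a double keeps it.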
ColumnConfig #
Set the data types that are compatible with the CastToFloat class.
koheesio.spark.transformations.cast_to_datatype.CastToInteger #
Cast to Integer (a.k.a. int)
Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- long
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- date: converts to null
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToInteger class.
koheesio.spark.transformations.cast_to_datatype.CastToLong #
Cast to Long (a.k.a. bigint)
Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- float
- double
- decimal
- boolean
- timestamp
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- date: converts to null
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToLong class.
koheesio.spark.transformations.cast_to_datatype.CastToShort #
Cast to Short (a.k.a. smallint)
Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
Unsupported datatypes:
The following casts are not supported and will raise an error in Spark:
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- byte
- integer
- long
- float
- double
- decimal
- string
- boolean
- timestamp
- date
- void
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- string: converts to null
- timestamp: range of values too small for timestamp to have any meaning
- date: converts to null
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToShort class.
koheesio.spark.transformations.cast_to_datatype.CastToString #
Cast to String
Supported datatypes:
Following casts are supported:
- byte
- short
- integer
- long
- float
- double
- decimal
- binary
- boolean
- timestamp
- date
- array
- map
ColumnConfig #
Set the data types that are compatible with the CastToString class.
koheesio.spark.transformations.cast_to_datatype.CastToTimestamp #
Cast to Timestamp
A numeric timestamp is the number of seconds since January 1, 1970 00:00:00.000000000 UTC. Casting small integers to timestamp is not advised, as their range of values is too small for the resulting timestamp to have any meaning.
For more fine-grained control over the timestamp format, use the date_time module. This allows for parsing strings to timestamps and vice versa.
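Why small integers make poor timestamps is easy to see with the standard library. The sketch below interprets a number as seconds since the epoch, as described above; the helper name `epoch_seconds_to_utc` is hypothetical and the code does not involve Spark.

```python
from datetime import datetime, timezone


def epoch_seconds_to_utc(seconds: float) -> datetime:
    """Interpret a number as seconds since 1970-01-01 00:00:00 UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The largest byte and short values land within hours of the epoch itself.
print(epoch_seconds_to_utc(127))         # 1970-01-01 00:02:07+00:00
print(epoch_seconds_to_utc(32767))       # 1970-01-01 09:06:07+00:00
# A long value can address a meaningful modern instant.
print(epoch_seconds_to_utc(1700000000))  # 2023-11-14 22:13:20+00:00
```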
See Also
Unsupported datatypes:
Following casts are not supported
- binary
- array<...>
- map<...,...>
Supported datatypes:
Following casts are supported:
- integer
- long
- float
- double
- decimal
- date
Spark doesn't error on these, but will cast to null or otherwise give mangled data. Hence, Koheesio will skip the transformation for these unless specific columns of these types are given as input:
- boolean: range of values too small for timestamp to have any meaning
- byte: range of values too small for timestamp to have any meaning
- string: converts to null in most cases, use date_time module instead
- short: range of values too small for timestamp to have any meaning
- void: skipped by default
ColumnConfig #
Set the data types that are compatible with the CastToTimestamp class.