Koheesio 0.10#
Version 0.10 is Koheesio's fourth release since being open-sourced.
This version brings several important features, security improvements, and bug fixes across different modules of Koheesio. The overall API remains unchanged.
New Contributors#
Big shout out to all contributors and a heartfelt welcome to our new contributors:
- @nogitting made their first contribution in https://github.com/Nike-Inc/koheesio/pull/154
- @zarembat made their first contribution in https://github.com/Nike-Inc/koheesio/pull/163
Migrating from v0.9#
For users currently using v0.9, consider the following:
- Consider switching to the new `SecretStr` and `SecretBytes` classes for handling secret strings and bytes with enhanced security. These classes are compatible with Pydantic's SecretStr and SecretBytes and allow seamless integration with existing code. To use them, replace `from pydantic import SecretStr, SecretBytes` with `from koheesio.models import SecretStr, SecretBytes`. This is especially useful if you find yourself needing to format secret strings or bytes (e.g. concatenation, f-strings) while maintaining security.
- The `partial` classmethod was added to `BaseModel` for enhanced customization and flexibility. This method allows creating a new instance of a model with only the specified fields updated, such as overwriting or setting a field's default values. This feature provides more control over the model's behavior and allows for more dynamic model creation.
- The `HttpStep` class now supports authorization headers with proper masking for improved security. This change addresses potential data leaks in authorization headers, ensuring secure handling of sensitive information. Comprehensive unit tests have been added to prevent regressions and ensure expected behavior. Consider updating your `HttpStep` implementations to take advantage of this enhanced security feature.
- `JDBCReader`, `HanaReader`, and `TeradataReader` classes have been updated to use `params` over `options` for improved consistency and maintainability. The `options` field has been renamed to `params`, and an alias `options` has been added for backwards compatibility. These changes provide a more consistent API across different reader classes and improve code readability. Note that `dbtable` and `query` validation now occurs upon initialization rather than at runtime, requiring either `dbtable` or `query` to be provided to use JDBC-based classes.
Release 0.10.0#
v0.10.0 - 2025-03-..
Features and Enhancements#
The following new features are included with 0.10:
feature - PR #164
Core: Added Koheesio-specific `SecretStr` and `SecretBytes` classes#
Allows formatting secret strings and bytes with enhanced security.
To use: replace `from pydantic import SecretStr, SecretBytes` with `from koheesio.models import SecretStr, SecretBytes` in your import statement.
- New Secret classes are compatible with Pydantic's `SecretStr` and `SecretBytes` and allow seamless integration with existing code.
- These classes expand support to allow usage with an f-string (or `.format`) and `"string" + "other_string"` concatenation while remaining secure. Concatenations are not supported by Pydantic's `SecretStr` and `SecretBytes` implementations.
- A `SecretMixin` class was also introduced to reduce code duplication.
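To illustrate what "concatenation while remaining secure" can mean, here is a minimal standard-library sketch. The `MaskedStr` class and the `_raw` helper are hypothetical names made up for this example; this is not Koheesio's or Pydantic's implementation, only one way such behavior could be modeled.

```python
class MaskedStr:
    """Hypothetical stand-in for a secret string type (not Koheesio's class)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        # Revealing the raw value is an explicit, deliberate call
        return self._value

    def __str__(self) -> str:
        return "**********"

    def __repr__(self) -> str:
        return "MaskedStr('**********')"

    def __add__(self, other):
        # secret + "text" yields another masked object, not a plain str
        return MaskedStr(self._value + _raw(other))

    def __radd__(self, other):
        # "text" + secret also stays masked
        return MaskedStr(_raw(other) + self._value)


def _raw(value) -> str:
    """Unwrap a MaskedStr, or convert any other value to str."""
    return value.get_secret_value() if isinstance(value, MaskedStr) else str(value)


token = MaskedStr("abc123")
header = "Bearer " + token          # falls back to MaskedStr.__radd__
print(header)                       # prints the mask
print(header.get_secret_value())    # prints the combined raw value
```

The design point is that every operation returns the wrapper type, so the raw value only surfaces through an explicit call.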
by @dannymeijer
feature - PR #150
Core > BaseModel: Added `partial` classmethod to `BaseModel`#
Partial allows for creating a new instance of a model with only the specified fields updated (such as overwriting or setting a field's default values).
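As a rough illustration of the idea, the sketch below pre-sets a field's default through a `partial` classmethod on a plain dataclass. The `Greeting` class is hypothetical, and the exact signature and return type of Koheesio's `BaseModel.partial` are not confirmed by these notes; this is only one plausible reading of "setting a field's default values".

```python
import functools
from dataclasses import dataclass


@dataclass
class Greeting:
    """Hypothetical model used only for illustration."""

    name: str
    punctuation: str = "!"

    @classmethod
    def partial(cls, **overrides):
        # Assumption: return a constructor with the given fields pre-filled
        return functools.partial(cls, **overrides)


# A variant of the model whose default punctuation is overridden
QuestionGreeting = Greeting.partial(punctuation="?")
greeting = QuestionGreeting(name="Ada")
print(greeting)
```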
by @dannymeijer
feature - PR #161, related issues: #87, #148
Box: Added a buffered version of `BoxFileWriter` and improved logging for `BoxCsvFileReader`#
Added the `BoxBufferFileWriter` class for writing files to Box when physical storage isn't available. Data is instead buffered in memory before being written to Box. Also improves `BoxCsvFileReader` logging output by providing the file name in addition to the file ID.
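The in-memory buffering idea can be sketched with the standard library alone. The function `write_rows_buffered`, the `upload` callable, and the file name `report.csv` are all hypothetical; they stand in for whatever client call actually sends the bytes to Box and do not reflect the `BoxBufferFileWriter` API.

```python
import csv
import io


def write_rows_buffered(rows, upload):
    """Serialize rows to CSV in memory, then hand the byte stream to `upload`."""
    text_buffer = io.StringIO()
    csv.writer(text_buffer).writerows(rows)
    # No file is created on disk; the payload lives in memory only
    payload = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
    upload("report.csv", payload)


# Stand-in for a real upload: collect the bytes in a dict
uploaded = {}
write_rows_buffered(
    [["id", "name"], ["1", "Ada"]],
    upload=lambda name, stream: uploaded.update({name: stream.read()}),
)
```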
by @riccamini
feature - PR #168
Dev Experience: Easier debugging and dev improvements#
To make debugging easier, `pyproject.toml` was updated to make it easier to run Spark Connect in your local dev environment:
- Added extra dependencies for `pyspark[connect]==3.5.4`.
- Added environment variables for Spark Connect in the development environment.
- Changed to verbose mode logging in the pytest output (also visible through GitHub Actions test run output).
by @dannymeijer
feature - PR #163
Delta: Support for Delta table history and data staleness checks#
Enables fetching Delta table history and checking data staleness based on defined intervals and refresh days.
Changes to `DeltaTableStep` class:
- Added `describe_history()` method to `DeltaTableStep` for fetching Delta table history as a Spark DataFrame.
- Added `is_date_stale()` method to `DeltaTableStep` to check data staleness based on time intervals or specific refresh days.
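The interval-based part of a staleness check boils down to comparing the age of the last update with an allowed maximum. The sketch below shows only that comparison; the function name `is_stale` and its parameters are assumptions for illustration, not the signature of `is_date_stale()`, and the refresh-day logic is left out.

```python
from datetime import datetime, timedelta


def is_stale(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True when the last update is older than the allowed interval."""
    return now - last_updated > max_age


now = datetime(2025, 3, 10, 12, 0)
# Updated six hours ago, one day allowed: still fresh
fresh_result = is_stale(datetime(2025, 3, 10, 6, 0), timedelta(days=1), now)
# Updated more than three days ago, one day allowed: stale
stale_result = is_stale(datetime(2025, 3, 7, 6, 0), timedelta(days=1), now)
```

In practice the `last_updated` value would come from the most recent entry in the table history.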
by @zarembat
feature - PR #158 & #170, related issue: #157
Http: Added support for authorization headers with proper masking for improved security#
Addresses potential data leaks in authorization headers, ensuring secure handling of sensitive information. Comprehensive unit tests added to prevent regressions and ensure expected behavior.
Changes to `HttpStep` class:
- Added `decode_sensitive_headers` method to decode `SecretStr` values in headers.
- Modified `get_headers` method to dump headers into JSON without `SecretStr` masking.
- Added `auth_header` field to handle authorization headers.
- Implemented masking for bearer tokens to maintain their 'secret' status.
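The masking idea itself can be shown without any HTTP library: before headers are logged or printed, the credential part of an authorization header is replaced by a mask while the scheme stays visible. The helper `mask_headers` below is a hypothetical sketch and does not reproduce how `HttpStep` implements it.

```python
def mask_headers(headers: dict) -> dict:
    """Return a copy of the headers that is safe to log."""
    sensitive = {"authorization", "proxy-authorization"}
    masked = {}
    for name, value in headers.items():
        if name.lower() in sensitive:
            # Keep the scheme (e.g. "Bearer") but hide the credential
            scheme, _, credential = value.partition(" ")
            masked[name] = f"{scheme} **********" if credential else "**********"
        else:
            masked[name] = value
    return masked


safe = mask_headers({"Authorization": "Bearer abc123", "Accept": "application/json"})
print(safe)
```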
by @dannymeijer
feature - PR #143, related issue: #75
Step & Spark: Added Step for downloading files from a URL along with matching Spark transformation#
Allow downloading files from a given URL
- Added `DownloadFileStep` class in a new module `koheesio.steps.download_file`
- Added `FileWriteMode` enum with supported write modes: `OVERWRITE`, `APPEND`, `IGNORE`, `EXCLUSIVE`, `BACKUP`
Also made available as a Spark `Transformation` in `DownloadFileFromUrlTransformation` in a new module `koheesio.spark.transformations.download_files`
- The spark implementation allows passing URLs through a column in a given DataFrame
- All URLs are then downloaded by the Spark Driver to a given location
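These notes list the write modes by name only. The sketch below shows one way such modes could behave when a target file already exists; the semantics attached to `IGNORE`, `EXCLUSIVE`, and `BACKUP` here are assumptions inferred from the names, and `WriteMode` and `write_file` are hypothetical stand-ins, not the `FileWriteMode` or `DownloadFileStep` implementation.

```python
import enum
import shutil
from pathlib import Path


class WriteMode(enum.Enum):
    OVERWRITE = "overwrite"
    APPEND = "append"
    IGNORE = "ignore"
    EXCLUSIVE = "exclusive"
    BACKUP = "backup"


def write_file(path: Path, data: bytes, mode: WriteMode) -> bool:
    """Write data according to mode; return True if anything was written."""
    if path.exists():
        if mode is WriteMode.IGNORE:
            return False  # assumed: leave the existing file untouched
        if mode is WriteMode.EXCLUSIVE:
            raise FileExistsError(path)  # assumed: refuse to replace
        if mode is WriteMode.BACKUP:
            # assumed: keep a copy of the old file before replacing it
            shutil.copy2(path, path.with_name(path.name + ".bak"))
    with path.open("ab" if mode is WriteMode.APPEND else "wb") as handle:
        handle.write(data)
    return True
```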
by @mikita-sakalouski and @dannymeijer
refactor - PR #168
Snowflake: classes now use `params` over `options`#
- Snowflake classes now also inherit from `ExtraParamsMixin`.
- Renamed `options` field to `params` and added alias `options` for backwards compatibility.
- Introduced `SF_DEFAULT_PARAMS`.
by @dannymeijer
refactor - PR #168
Spark > Reader > JDBC, HanaReader, TeradataReader: Updated JDBC classes to use `params` over `options`#
- `JDBCReader` class now also inherits from `ExtraParamsMixin`.
- Renamed `options` field to `params` and added alias `options` for backwards compatibility.
- `dbtable` and `query` validation is now handled upon initialization rather than at runtime.
- Behavior now requires either `dbtable` or `query` to be provided in order to use JDBC.
`HanaReader` and `TeradataReader` classes were also updated to use `params` over `options` for improved consistency and maintainability.
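The two behavioral changes (a legacy `options` alias for `params`, and `dbtable`/`query` validation at construction time) can be sketched with a plain class. `JdbcConfig` is a hypothetical example and its error message is invented; the real readers are Pydantic-based and their details may differ.

```python
class JdbcConfig:
    """Hypothetical config object; not Koheesio's JDBCReader."""

    def __init__(self, dbtable=None, query=None, params=None, options=None):
        # Validation happens here, at initialization, not when reading
        if not dbtable and not query:
            raise ValueError("Provide either 'dbtable' or 'query'")
        self.dbtable = dbtable
        self.query = query
        # 'options' is still accepted as a legacy alias for 'params'
        self.params = dict(params if params is not None else (options or {}))

    @property
    def options(self):
        # Old attribute name keeps working and points at the same dict
        return self.params


legacy = JdbcConfig(dbtable="sales", options={"fetchsize": "100"})
current = JdbcConfig(query="SELECT 1", params={"fetchsize": "100"})
```

Code written against the old `options` name keeps working, while a misconfigured reader fails immediately instead of at read time.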
by @dannymeijer
refactor - PR #142
Spark > Transformation > CamelToSnake: Added more efficient Spark 3.4+ supported operation#
`CamelToSnakeTransformation` now uses `toDF` for more efficient transformation in Spark 3.4+.
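The renaming step can be illustrated without Spark: compute every snake_case name first, then apply them all at once. The `camel_to_snake` helper below is a simple regex-based sketch, not the conversion function Koheesio ships, and it does not treat acronyms specially.

```python
import re


def camel_to_snake(name: str) -> str:
    """Insert an underscore before each inner capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


columns = ["customerId", "orderDate", "total"]
renamed = [camel_to_snake(column) for column in columns]
print(renamed)

# With a Spark DataFrame `df`, all columns could then be renamed in one call:
# df = df.toDF(*renamed)
```

Renaming all columns in a single `toDF` call avoids chaining one rename operation per column.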
by @dannymeijer
Bugfixes#
The following bugfixes are included with 0.10:
bugfix - PR #160
Core > Context: Fix Context initialization with another Context object and dotted notation#
- The `__init__` method of the `Context` class incorrectly updated the `kwargs`, making it return `None`. Calls to `Context` containing another `Context` object would previously fail.
- Also fixed an issue with how `Context` handled get operations for nested keys when using dotted notation.
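For readers unfamiliar with dotted notation, the lookup amounts to walking a nested mapping one key segment at a time. The `get_nested` function below is a hypothetical sketch on plain dicts and does not mirror the `Context` API.

```python
def get_nested(data: dict, dotted_key: str, default=None):
    """Resolve 'a.b.c' against nested dicts, returning default if any part is missing."""
    current = data
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


settings = {"spark": {"session": {"app_name": "demo"}}}
app_name = get_nested(settings, "spark.session.app_name")
missing = get_nested(settings, "spark.missing.key", default="n/a")
```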
by @dannymeijer
bugfix - PR #168, related issue: #167
Core > Step: Fixed duplicate logging issues in nested Step classes#
We observed log duplication when using specific super call sequences in nested Step classes.
- Several changes were made to the `StepMetaClass` to address duplicate logs when using `super()` in the execute method of a Step class under specific circumstances.
- Updated `_is_called_through_super` method to traverse the entire method resolution order (MRO) and correctly identify `super()` calls.
- Ensured `_execute_wrapper` method triggers logging only once per execute call.
This change prevents duplicate logs and ensures accurate log entries. The `_is_called_through_super` method was also used for `Output` validation; now we can ensure it is called only once.
by @dannymeijer
bugfix - PR #155, related issue: #149
Delta: Improve merge clause handling in `DeltaTableWriter`#
When a Delta merge configuration (as a dict) is used to provide the merge condition to the merge builder and the merge operation is called multiple times (e.g. for each batch processing in streaming), the original implementation broke because of a `pop` call on that dictionary.
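The failure mode is generic Python behavior: `dict.pop` mutates the caller's dictionary, so the key is gone on the second call. The sketch below reproduces that pattern with hypothetical function names and a hypothetical `condition` key; it is not the `DeltaTableWriter` code.

```python
import copy


def build_merge_broken(config: dict) -> str:
    # pop() removes the key from the caller's dict, so a second call fails
    return config.pop("condition")


def build_merge_fixed(config: dict) -> str:
    # Work on a copy so the caller's dict survives repeated calls
    local_config = copy.deepcopy(config)
    return local_config.pop("condition")


merge_config = {"condition": "target.id = source.id"}

# Safe to call once per batch: the shared config is never modified
first = build_merge_fixed(merge_config)
second = build_merge_fixed(merge_config)
```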
bugfix - PR #154, related issue: #153
Spark: Pyspark Connect support fixes#
- The Connect support check previously excluded Spark 3.4 incorrectly.
- The fix gets rid of false positives in our Spark Connect check utility.
by @nogitting and @dannymeijer
bugfix - PR #142
Spark > ColumnsTransformation: `ColumnConfig` defaults in `ColumnsTransformation` not working correctly#
`run_for_all_data_type` and `limit_data_type` were previously not working correctly.
by @dannymeijer