Koheesio 0.10#
Version 0.10 is Koheesio's fourth release since being open-sourced.
This version brings several important features, security improvements, and bug fixes across different modules of Koheesio. The overall API remains unchanged.
New Contributors#
Big shout out to all contributors and a heartfelt welcome to our new contributors:
- @nogitting made their first contribution in https://github.com/Nike-Inc/koheesio/pull/154
- @zarembat made their first contribution in https://github.com/Nike-Inc/koheesio/pull/163
Migrating from v0.9#
For users currently using v0.9, consider the following:
- Consider switching to the new SecretStr and SecretBytes classes for handling secret strings and bytes with enhanced security. These classes are compatible with Pydantic's SecretStr and SecretBytes and integrate with existing code. To use them, replace from pydantic import SecretStr, SecretBytes with from koheesio.models import SecretStr, SecretBytes. This is especially useful if you need to format secret strings or bytes (e.g. concatenation, f-strings) while maintaining security.
- The partial classmethod was added to BaseModel. It allows creating a new instance of a model with only the specified fields updated, such as overwriting or setting a field's default values, which gives more control over dynamic model creation.
- The HttpStep class now supports authorization headers with proper masking. This change addresses potential data leaks in authorization headers, and unit tests have been added to prevent regressions. Consider updating your HttpStep implementations to take advantage of this.
- JDBCReader, HanaReader, and TeradataReader classes have been updated to use params over options for consistency across reader classes. The options field has been renamed to params, and an alias options has been added for backwards compatibility. Note that dbtable and query validation now occurs upon initialization rather than at runtime, so either dbtable or query must be provided to use JDBC-based classes.
Release 0.10.0#
v0.10.0 - 2025-03-..
Features and Enhancements#
The following new features are included with 0.10:
feature - PR #164
Core: Added Koheesio specific SecretStr and SecretBytes classes#
Allows formatting secret strings and bytes with enhanced security.
To use: replace from pydantic import SecretStr, SecretBytes with from koheesio.models import SecretStr, SecretBytes in your import statement.
- New Secret classes are compatible with Pydantic's SecretStr and SecretBytes and allow seamless integration with existing code.
- These classes expand support to allow usage with an f-string (or .format) and "string" + "other_string" concatenation while remaining secure. Concatenations are not supported by Pydantic's SecretStr and SecretBytes implementations.
- A SecretMixin class was also introduced to reduce code duplication.
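To make the idea concrete, here is a minimal standard-library sketch of a secret wrapper that stays masked through concatenation and f-strings. MaskedStr and _raw are hypothetical names written for this illustration; this is not Koheesio's SecretStr and does not mirror its implementation.

```python
class MaskedStr:
    """Hypothetical stand-in for a secret string (not Koheesio's SecretStr)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        """Return the raw secret; call this only where the value is needed."""
        return self._value

    def __str__(self) -> str:
        # str(), print() and f-strings all end up here, so the raw value
        # is never rendered by accident.
        return "**********" if self._value else ""

    def __repr__(self) -> str:
        return f"MaskedStr('{self}')"

    def __add__(self, other):
        # secret + text yields a new secret instead of a plain str
        return MaskedStr(self._value + _raw(other))

    def __radd__(self, other):
        # text + secret also yields a new secret
        return MaskedStr(_raw(other) + self._value)


def _raw(value) -> str:
    """Unwrap a MaskedStr, or convert any other value to str."""
    return value.get_secret_value() if isinstance(value, MaskedStr) else str(value)


token = MaskedStr("s3cr3t")
header = "Bearer " + token      # still a MaskedStr
print(f"header={header}")       # prints: header=**********
```

The raw value is only reachable through get_secret_value(), so logging or formatting the combined value does not leak it.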
by @dannymeijer
feature - PR #150
Core > BaseModel: Added partial classmethod to BaseModel#
Partial allows for creating a new instance of a model with only the specified fields updated (such as overwriting or setting a field's default values).
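One way to picture this is a classmethod that pre-fills some fields and leaves the rest to the caller. The sketch below uses a standard-library dataclass and functools.partial; ExampleModel and its fields are hypothetical, and the actual signature and return type of Koheesio's BaseModel.partial may differ.

```python
from dataclasses import dataclass
from functools import partial as _partial


@dataclass
class ExampleModel:
    """Hypothetical model used only for this illustration."""

    host: str
    port: int = 5432

    @classmethod
    def partial(cls, **overrides):
        # Assumption: behaves like functools.partial applied to the class,
        # so the overrides act as new defaults for later instantiation.
        return _partial(cls, **overrides)


LocalModel = ExampleModel.partial(host="localhost")
model = LocalModel(port=6543)
print(model)  # prints: ExampleModel(host='localhost', port=6543)
```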
by @dannymeijer
feature - PR #161, related issues: #87, #148
Box: Added a buffered version of BoxFileWriter and improved logging for BoxCsvFileReader#
Added the BoxBufferFileWriter class for writing files to Box when physical storage isn't available. Data is instead buffered in memory before being written to Box. Also improves BoxCsvFileReader logging output by providing the file name in addition to the file ID.
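The buffering idea can be sketched with the standard library alone: build the file content in memory and hand the resulting stream to an uploader. The function write_rows_buffered and the fake_upload callable are hypothetical; they stand in for whatever the Box client does and are not part of Koheesio's API.

```python
import csv
import io
from typing import Callable, Iterable, Sequence


def write_rows_buffered(
    rows: Iterable[Sequence],
    file_name: str,
    upload: Callable[[str, io.BytesIO], None],
) -> None:
    """Serialize rows as CSV in memory and pass the buffer to `upload`."""
    text_buffer = io.StringIO()
    csv.writer(text_buffer).writerows(rows)
    # No file on disk is involved: the payload lives only in memory.
    payload = io.BytesIO(text_buffer.getvalue().encode("utf-8"))
    upload(file_name, payload)


# Stand-in uploader that records what it would have sent.
uploaded = {}


def fake_upload(name: str, stream: io.BytesIO) -> None:
    uploaded[name] = stream.read()


write_rows_buffered([("id", "name"), (1, "a")], "example.csv", fake_upload)
```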
by @riccamini
feature - PR #168
Dev Experience: Easier debugging and dev improvements#
To make debugging easier, pyproject.toml was updated to simplify running Spark Connect in your local dev environment:
- Added extra dependencies for pyspark[connect]==3.5.4.
- Added environment variables for Spark Connect in the development environment.
- Changed to verbose mode logging in the pytest output (also visible in the GitHub Actions test run output).
by @dannymeijer
feature - PR #163
Delta: Support for Delta table history and data staleness checks#
Enables fetching Delta table history and checking data staleness based on defined intervals and refresh days.
Changes to DeltaTableStep class:
- Added describe_history() method to DeltaTableStep for fetching Delta table history as a Spark DataFrame.
- Added is_date_stale() method to DeltaTableStep to check data staleness based on time intervals or specific refresh days.
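As a rough illustration of what a staleness check involves, the sketch below compares a last-update timestamp against a maximum age and, optionally, against the most recent occurrence of a refresh weekday. The function is_stale, its parameters, and the exact refresh-day semantics are assumptions made for this example; they are not the signature or logic of is_date_stale().

```python
from datetime import datetime, timedelta
from typing import Optional


def is_stale(
    last_update: datetime,
    now: datetime,
    max_age: Optional[timedelta] = None,
    refresh_weekday: Optional[int] = None,  # 0 = Monday ... 6 = Sunday
) -> bool:
    """Return True when the data counts as stale (hypothetical rules)."""
    # Rule 1 (assumed): older than the allowed interval means stale.
    if max_age is not None and now - last_update > max_age:
        return True
    # Rule 2 (assumed): data must have been updated on or after the most
    # recent occurrence of the refresh weekday.
    if refresh_weekday is not None:
        days_since_refresh = (now.weekday() - refresh_weekday) % 7
        last_refresh_day = (now - timedelta(days=days_since_refresh)).date()
        return last_update.date() < last_refresh_day
    return False


now = datetime(2025, 3, 12, 12, 0)  # a Wednesday
fresh = is_stale(datetime(2025, 3, 11, 12, 0), now, max_age=timedelta(days=2))
old = is_stale(datetime(2025, 3, 9, 11, 0), now, max_age=timedelta(days=2))
```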
by @zarembat
feature - PR #158 & #170, related issue: #157
Http: Added support for authorization headers with proper masking for improved security#
Addresses potential data leaks in authorization headers, ensuring secure handling of sensitive information. Comprehensive unit tests added to prevent regressions and ensure expected behavior.
Changes to HttpStep class:
- Added decode_sensitive_headers method to decode SecretStr values in headers.
- Modified get_headers method to dump headers into JSON without SecretStr masking.
- Added auth_header field to handle authorization headers.
- Implemented masking for bearer tokens to maintain their 'secret' status.
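The masking concern can be shown in isolation: before headers are logged or printed, the credential part of the authorization header is replaced while the scheme stays readable. The helper mask_headers below is hypothetical and standard-library only; it is not HttpStep's implementation.

```python
MASK = "**********"


def mask_headers(headers: dict) -> dict:
    """Return a copy of `headers` that is safe to log."""
    masked = {}
    for name, value in headers.items():
        if name.lower() == "authorization":
            scheme, separator, _credential = str(value).partition(" ")
            # Keep the scheme (e.g. "Bearer") only when one is present;
            # a bare token is masked entirely.
            masked[name] = f"{scheme} {MASK}" if separator else MASK
        else:
            masked[name] = value
    return masked


request_headers = {
    "Authorization": "Bearer abc123",
    "Accept": "application/json",
}
print(mask_headers(request_headers))
# prints: {'Authorization': 'Bearer **********', 'Accept': 'application/json'}
```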
by @dannymeijer
feature - PR #143, related issue: #75
Step & Spark: Added Step for downloading files from a URL along with matching Spark transformation#
Allows downloading files from a given URL.
- Added DownloadFileStep class in a new module koheesio.steps.download_file
- Added FileWriteMode enum with supported write modes: OVERWRITE, APPEND, IGNORE, EXCLUSIVE, BACKUP
Also made available as a Spark transformation, DownloadFileFromUrlTransformation, in a new module koheesio.spark.transformations.download_files
- The Spark implementation allows passing URLs through a column in a given DataFrame
- All URLs are then downloaded by the Spark Driver to a given location
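The write modes can be illustrated with a small local-file sketch. The enum member names come from the list above, but WriteModeSketch, write_bytes, the ".bak" suffix, and the behavior attached to each mode are assumptions for this example (inferred from the mode names), not DownloadFileStep's code.

```python
import shutil
import tempfile
from enum import Enum
from pathlib import Path


class WriteModeSketch(str, Enum):
    """Local re-declaration for illustration only."""

    OVERWRITE = "overwrite"
    APPEND = "append"
    IGNORE = "ignore"
    EXCLUSIVE = "exclusive"
    BACKUP = "backup"


def write_bytes(path: Path, data: bytes, mode: WriteModeSketch) -> bool:
    """Write `data` to `path`; return False when the write was skipped."""
    if path.exists():
        if mode is WriteModeSketch.IGNORE:
            return False  # keep the existing file untouched
        if mode is WriteModeSketch.EXCLUSIVE:
            raise FileExistsError(path)
        if mode is WriteModeSketch.BACKUP:
            # Assumed convention: copy the old file next to the target.
            shutil.copy2(path, path.with_name(path.name + ".bak"))
    open_mode = "ab" if mode is WriteModeSketch.APPEND else "wb"
    with path.open(open_mode) as handle:
        handle.write(data)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.txt"
    write_bytes(target, b"one", WriteModeSketch.OVERWRITE)
    write_bytes(target, b"two", WriteModeSketch.APPEND)
    appended = target.read_bytes()
    skipped = not write_bytes(target, b"zzz", WriteModeSketch.IGNORE)
    write_bytes(target, b"three", WriteModeSketch.BACKUP)
    backup = (Path(tmp) / "data.txt.bak").read_bytes()
    final = target.read_bytes()
```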
by @mikita-sakalouski and @dannymeijer
refactor - PR #168
Snowflake: classes now use params over options#
- Snowflake classes now also inherit from ExtraParamsMixin.
- Renamed options field to params and added alias options for backwards compatibility.
- Introduced SF_DEFAULT_PARAMS.
by @dannymeijer
refactor - PR #168
Spark > Reader > JDBC, HanaReader, TeradataReader: Updated JDBC classes to use params over options#
- JDBCReader class now also inherits from ExtraParamsMixin.
- Renamed options field to params and added alias options for backwards compatibility.
- dbtable and query validation is now handled upon initialization rather than at runtime.
- Either dbtable or query must now be provided to be able to use JDBC.
HanaReader and TeradataReader classes were also updated to use params over options for improved consistency and maintainability.
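The two behavioral changes, a legacy alias and validation at construction time, can be sketched with a plain dataclass. ReaderSketch and its fields are hypothetical; Koheesio's readers are Pydantic-based, so the real alias mechanism will differ from this merge-in-__post_init__ approach.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReaderSketch:
    """Hypothetical reader config used only for this illustration."""

    url: str
    dbtable: Optional[str] = None
    query: Optional[str] = None
    params: dict = field(default_factory=dict)
    options: Optional[dict] = None  # legacy name, kept for old call sites

    def __post_init__(self) -> None:
        # Fold the legacy `options` into `params`; explicit params win.
        if self.options is not None:
            self.params = {**self.options, **self.params}
        # Fail at construction time instead of when the read is executed.
        if not (self.dbtable or self.query):
            raise ValueError("Provide either 'dbtable' or 'query'")


reader = ReaderSketch(
    url="jdbc:example://host/db",
    dbtable="schema.table",
    options={"fetchsize": "100"},
)
print(reader.params)  # prints: {'fetchsize': '100'}
```

With this shape, a misconfigured reader raises as soon as it is created rather than partway through a pipeline run.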
by @dannymeijer
refactor - PR #142
Spark > Transformation > CamelToSnake: Added more efficient Spark 3.4+ supported operation#
CamelToSnakeTransformation now uses toDF for more efficient transformation in Spark 3.4+.
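The renaming itself is a pure string operation, which is what makes a single bulk rename possible. The sketch below shows one common regex for the conversion; the function name and the regex are assumptions and may not match what CamelToSnakeTransformation uses internally.

```python
import re


def camel_to_snake(name: str) -> str:
    """Insert an underscore before each inner capital, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


columns = ["customerId", "orderTotal", "already_snake"]
renamed = [camel_to_snake(column) for column in columns]
print(renamed)  # prints: ['customer_id', 'order_total', 'already_snake']

# With a PySpark DataFrame `df` whose columns are `columns`, all names
# can then be applied in one call: df.toDF(*renamed)
```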
by @dannymeijer
Bugfixes#
The following bugfixes are included with 0.10:
bugfix - PR #160
Core > Context: Fix Context initialization with another Context object and dotted notation#
- The __init__ method of the Context class incorrectly updated the kwargs, making it return None. Calls to Context containing another Context object would previously fail.
- Also fixed an issue with how Context handled get operations for nested keys when using dotted notation.
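For readers unfamiliar with dotted-notation lookups, the following sketch resolves a key such as "spark.app.name" against nested mappings. The helper get_dotted is hypothetical and works on plain dicts; it is not the Context class's code.

```python
from collections.abc import Mapping
from typing import Any


def get_dotted(data: Mapping, key: str, default: Any = None) -> Any:
    """Walk `data` one key segment at a time; return `default` if missing."""
    current: Any = data
    for part in key.split("."):
        if isinstance(current, Mapping) and part in current:
            current = current[part]
        else:
            return default
    return current


config = {"spark": {"app": {"name": "demo"}}}
print(get_dotted(config, "spark.app.name"))       # prints: demo
print(get_dotted(config, "spark.missing", "n/a"))  # prints: n/a
```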
by @dannymeijer
bugfix - PR #168, related issue: #167
Core > Step: Fixed duplicate logging issues in nested Step classes#
We observed log duplication when using specific super call sequences in nested Step classes.
- Several changes were made to the StepMetaClass to address duplicate logs when using super() in the execute method of a Step class under specific circumstances.
- Updated _is_called_through_super method to traverse the entire method resolution order (MRO) and correctly identify super() calls.
- Ensured _execute_wrapper method triggers logging only once per execute call.
This change prevents duplicate logs and ensures accurate log entries. The _is_called_through_super method was also used for Output validation; now we can ensure it is called only once.
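The problem can be reproduced and avoided in a few lines. Note that this sketch uses a different and simpler technique than the MRO traversal described above: a per-instance depth counter, so that only the outermost execute call logs. StepSketch, Parent, Child, and the wrapper are all hypothetical.

```python
import functools


def _log_outermost_only(func):
    """Wrap `execute` so start/end are recorded once per top-level call."""

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        depth = getattr(self, "_execute_depth", 0)
        self._execute_depth = depth + 1
        if depth == 0:
            self.events.append("start")
        try:
            return func(self, *args, **kwargs)
        finally:
            self._execute_depth = depth
            if depth == 0:
                self.events.append("end")

    return wrapper


class StepSketch:
    def __init__(self) -> None:
        self.events = []  # stands in for log output

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Wrap every execute defined directly on a subclass.
        if "execute" in cls.__dict__:
            cls.execute = _log_outermost_only(cls.__dict__["execute"])


class Parent(StepSketch):
    def execute(self) -> None:
        self.events.append("parent work")


class Child(Parent):
    def execute(self) -> None:
        super().execute()  # nested call: wrapped too, but must not log again
        self.events.append("child work")


child = Child()
child.execute()
print(child.events)  # prints: ['start', 'parent work', 'child work', 'end']
```

Without the depth check, both wrapped execute methods would record their own start and end entries, which is the kind of duplication this bugfix addresses.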
by @dannymeijer
bugfix - PR #155, related issue: #149
Delta: Improve merge clause handling in DeltaTableWriter#
When a Delta merge configuration was passed as a dict to supply the merge condition to the merge builder, and the merge operation was called multiple times (e.g. for each batch in streaming), the original implementation broke because of a pop call on that dictionary.
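The failure mode is a general Python pitfall: pop mutates the caller's dictionary, so a second call no longer finds the key. The sketch below reproduces it with plain dicts; the function names and the "merge_cond" key are hypothetical and do not reflect DeltaTableWriter's actual configuration keys.

```python
def build_merge_buggy(config: dict) -> str:
    # Mutates the shared config: the key is gone after the first call.
    return config.pop("merge_cond")


def build_merge_fixed(config: dict) -> str:
    # Work on a shallow copy so repeated calls see the same config.
    local_config = dict(config)
    return local_config.pop("merge_cond")


merge_config = {"merge_cond": "target.id = source.id"}

# Simulate two micro-batches with the fixed version.
first = build_merge_fixed(merge_config)
second = build_merge_fixed(merge_config)

# The buggy version succeeds once, then fails on the next batch.
buggy_config = dict(merge_config)
build_merge_buggy(buggy_config)
try:
    build_merge_buggy(buggy_config)
    second_call_failed = False
except KeyError:
    second_call_failed = True
```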
bugfix - PR #154, related issue: #153
Spark: Pyspark Connect support fixes#
- The Connect support check previously excluded Spark 3.4 by mistake.
- The fix removes false positives in our Spark Connect check utility.
by @nogitting and @dannymeijer
bugfix - PR #142
Spark > ColumnsTransformation: ColumnConfig defaults in ColumnsTransformation not working correctly#
run_for_all_data_type and limit_data_type were previously not working correctly.
by @dannymeijer