Koheesio 0.10#

Version 0.10 is Koheesio's fourth release since being open-sourced.

This version brings several important features, security improvements, and bug fixes across different modules of Koheesio. The overall API remains unchanged.

New Contributors#

Big shout out to all contributors and a heartfelt welcome to our new contributors:

Migrating from v0.9#

For users currently using v0.9, consider the following:

  • Consider switching to the new SecretStr and SecretBytes classes for handling secret strings and bytes. They are compatible with Pydantic's SecretStr and SecretBytes and integrate with existing code: replace from pydantic import SecretStr, SecretBytes with from koheesio.models import SecretStr, SecretBytes. This is especially useful if you need to format secret strings or bytes (e.g. concatenation, f-strings) while keeping them secure.

  • The partial classmethod was added to BaseModel for enhanced customization and flexibility. This method allows creating a new instance of a model with only the specified fields updated, such as overwriting or setting a field's default values. This feature provides more control over the model's behavior and allows for more dynamic model creation.

  • The HttpStep class now supports authorization headers with proper masking for improved security. This change addresses potential data leaks in authorization headers, ensuring secure handling of sensitive information. Comprehensive unit tests have been added to prevent regressions and ensure expected behavior. Consider updating your HttpStep implementations to take advantage of this enhanced security feature.

  • JDBCReader, HanaReader, and TeradataReader classes have been updated to use params over options for improved consistency and maintainability. The options field has been renamed to params, and an alias options has been added for backwards compatibility. These changes provide a more consistent API across different reader classes and improve code readability. Note that dbtable and query validation now occurs upon initialization rather than at runtime, requiring either dbtable or query to be submitted to use JDBC based classes.

Release 0.10.0#

v0.10.0 - 2025-03-..

Features and Enhancements#

The following new features are included with 0.10:

feature - PR #164

Core: Added Koheesio specific SecretStr and SecretBytes classes#

Allows formatting secret strings and bytes with enhanced security.
To use: replace from pydantic import SecretStr, SecretBytes with from koheesio.models import SecretStr, SecretBytes in your import statement.

  • New Secret classes are compatible with Pydantic's SecretStr and SecretBytes and allow seamless integration with existing code.
  • These classes expand support to allow usage with an f-string (or .format) and "string" + "other_string" concatenation while remaining secure. Concatenations are not supported by Pydantic's SecretStr and SecretBytes implementations.
  • A SecretMixin class was also introduced to reduce code duplication.
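
To illustrate why concatenation support matters, here is a minimal standard-library sketch of a secret wrapper whose concatenation result is itself a masked secret. MaskedSecret and _reveal are hypothetical names made up for this illustration; this is not Koheesio's implementation and it does not claim to mirror how Koheesio's classes behave internally.

```python
class MaskedSecret:
    """Hypothetical stand-in for a secret string; NOT Koheesio's implementation."""

    def __init__(self, value: str) -> None:
        self._value = value

    def get_secret_value(self) -> str:
        # Explicit opt-in to reveal the secret (method name borrowed from Pydantic).
        return self._value

    def __str__(self) -> str:
        # str() and f-strings only ever see the mask.
        return "**********" if self._value else ""

    def __repr__(self) -> str:
        return f"MaskedSecret('{self}')"

    def __add__(self, other):
        # secret + "text" produces a new secret, so the result stays masked.
        return MaskedSecret(self._value + _reveal(other))

    def __radd__(self, other):
        # "text" + secret also produces a new secret.
        return MaskedSecret(_reveal(other) + self._value)


def _reveal(item) -> str:
    """Return the raw text of a MaskedSecret, or str(item) for anything else."""
    return item.get_secret_value() if isinstance(item, MaskedSecret) else str(item)


token = MaskedSecret("abc123")
header = "Bearer " + token  # still a MaskedSecret, not a plain str
```

Printing header shows only the mask; the combined value is available through header.get_secret_value().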

by @dannymeijer

feature - PR #150

Core > BaseModel: Added partial classmethod to BaseModel#

The partial classmethod allows creating a new instance of a model with only the specified fields updated (such as overwriting or setting a field's default values).
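
The idea resembles functools.partial from the standard library applied to a class. The sketch below uses a plain dataclass to show the concept only; JobConfig and its fields are hypothetical, and the exact signature of Koheesio's partial classmethod is not shown here.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class JobConfig:
    """Hypothetical model used only for this illustration."""

    name: str
    retries: int = 3
    timeout_s: int = 30


# Pre-fill some fields; the result is a callable that still accepts the rest.
PatientJobConfig = partial(JobConfig, retries=10, timeout_s=120)

nightly = PatientJobConfig(name="nightly")
# Pre-filled values act like new defaults and can still be overridden per call.
quick = PatientJobConfig(name="quick", timeout_s=5)
```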

by @dannymeijer

feature - PR #161, related issues: #87, #148

Box: Added a buffered version of BoxFileWriter and improved logging for BoxCsvFileReader#

Added the BoxBufferFileWriter class for writing files to Box when physical storage isn't available. Data is instead buffered in memory before being written to Box. Also improves BoxCsvFileReader logging output by providing the file name in addition to the file ID.
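
The buffering approach can be sketched with the standard library alone. The function name write_via_buffer, the upload callable, and the dict standing in for remote storage are all assumptions for illustration; none of this is the Box SDK or Koheesio's API.

```python
import io
from typing import Callable


def write_via_buffer(rows: list[str], upload: Callable[[str, bytes], None], file_name: str) -> int:
    """Collect rows in an in-memory buffer, then hand the bytes to `upload` in one call."""
    buffer = io.BytesIO()
    for row in rows:
        buffer.write(row.encode("utf-8") + b"\n")
    payload = buffer.getvalue()
    upload(file_name, payload)  # no temporary file on disk is involved
    return len(payload)


# Hypothetical upload target: a dict standing in for remote storage.
fake_remote: dict[str, bytes] = {}
size = write_via_buffer(["id,name", "1,ada"], fake_remote.__setitem__, "people.csv")
```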

by @riccamini

feature - PR #168

Dev Experience: Easier debugging and dev improvements#

To make debugging easier, pyproject.toml was updated to make it easier to run Spark Connect in your local dev environment:

  • Added extra dependencies for pyspark[connect]==3.5.4.
  • Added environment variables for Spark Connect in the development environment.
  • Changed to verbose mode logging in the pytest output (also visible in the GitHub Actions test run output).

by @dannymeijer

feature - PR #163

Delta: Support for Delta table history and data staleness checks#

Enables fetching Delta table history and checking data staleness based on defined intervals and refresh days.

Changes to DeltaTableStep class:

  • Added describe_history() method to DeltaTableStep for fetching Delta table history as a Spark DataFrame.
  • Added is_date_stale() method to DeltaTableStep to check data staleness based on time intervals or specific refresh days.
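
A staleness check based on a time interval boils down to comparing the age of the last modification against a maximum allowed age. The sketch below shows only that comparison; the function name is_stale and its parameters are assumptions, not the signature of is_date_stale(), and the refresh-day logic is left out. In practice the last modification time would come from the table history.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_modified: datetime, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """Return True when the data is older than the allowed maximum age."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > max_age


now = datetime(2025, 3, 10, 12, 0, tzinfo=timezone.utc)
# Modified 6 hours ago with a 1-day allowance: not stale.
fresh = is_stale(datetime(2025, 3, 10, 6, 0, tzinfo=timezone.utc), timedelta(days=1), now=now)
# Modified more than 3 days ago with a 1-day allowance: stale.
old = is_stale(datetime(2025, 3, 7, 6, 0, tzinfo=timezone.utc), timedelta(days=1), now=now)
```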

by @zarembat

feature - PR #158 & #170, related issue: #157

Http: Added support for authorization headers with proper masking for improved security#

Addresses potential data leaks in authorization headers, ensuring secure handling of sensitive information. Comprehensive unit tests added to prevent regressions and ensure expected behavior.

Changes to HttpStep class:

  • Added decode_sensitive_headers method to decode SecretStr values in headers.
  • Modified get_headers method to dump headers into JSON without SecretStr masking.
  • Added auth_header field to handle authorization headers.
  • Implemented masking for bearer tokens to maintain their 'secret' status.
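
To show what masking an authorization header means in practice, here is a small standalone sketch that hides the credential before headers are logged. mask_headers is a hypothetical helper written for this example; it is not a method of HttpStep.

```python
def mask_headers(headers: dict[str, str], sensitive: tuple[str, ...] = ("authorization",)) -> dict[str, str]:
    """Return a copy of `headers` that is safe to log."""
    masked = {}
    for key, value in headers.items():
        if key.lower() in sensitive:
            scheme, _, token = value.partition(" ")
            # Keep the scheme (e.g. "Bearer") readable, hide the credential itself.
            masked[key] = f"{scheme} **********" if token else "**********"
        else:
            masked[key] = value
    return masked


headers = {"Authorization": "Bearer abc.def.ghi", "Accept": "application/json"}
safe_to_log = mask_headers(headers)
```

The original headers dict is left unchanged, so the real token is still available for the request itself.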

by @dannymeijer

feature - PR #143, related issue: #75

Step & Spark: Added Step for downloading files from a URL along with matching Spark transformation#

Allows downloading files from a given URL.

  • Added DownloadFileStep class in a new module koheesio.steps.download_file
  • Added FileWriteMode enum with supported write modes: OVERWRITE, APPEND, IGNORE, EXCLUSIVE, BACKUP

Also available as a Spark transformation, DownloadFileFromUrlTransformation, in a new module koheesio.spark.transformations.download_files

  • The Spark implementation allows passing URLs through a column in a given DataFrame.
  • All URLs are then downloaded by the Spark Driver to a given location.

by @mikita-sakalouski and @dannymeijer

refactor - PR #168

Snowflake: classes now use params over options#

  • Snowflake classes now also inherit from ExtraParamsMixin.
  • Renamed options field to params and added alias options for backwards compatibility.
  • Introduced SF_DEFAULT_PARAMS.

by @dannymeijer

refactor - PR #168

Spark > Reader > JDBC, HanaReader, TeradataReader: Updated JDBC classes to use params over options#

  • JDBCReader class now also inherits from ExtraParamsMixin.
  • Renamed options field to params and added alias options for backwards compatibility.
  • dbtable and query validation now handled upon initialization rather than at runtime.
  • Either dbtable or query must now be provided to use JDBC.
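
The two behavioral changes above (a backwards-compatible options alias and validation at initialization) can be sketched in plain Python. ReaderConfig is a hypothetical class written for this illustration; Koheesio's readers are Pydantic models and their actual implementation will differ.

```python
from typing import Any, Optional


class ReaderConfig:
    """Hypothetical config class; NOT one of Koheesio's readers."""

    def __init__(
        self,
        dbtable: Optional[str] = None,
        query: Optional[str] = None,
        params: Optional[dict[str, Any]] = None,
        options: Optional[dict[str, Any]] = None,  # legacy alias for `params`
    ) -> None:
        if params is not None and options is not None:
            raise ValueError("Pass either 'params' or its alias 'options', not both.")
        if not dbtable and not query:
            # Fail at construction time instead of when the read is executed.
            raise ValueError("Either 'dbtable' or 'query' must be provided.")
        self.dbtable = dbtable
        self.query = query
        self.params = dict(params if params is not None else options or {})

    @property
    def options(self) -> dict[str, Any]:
        # Old attribute name keeps working and points at the same data.
        return self.params


new_style = ReaderConfig(dbtable="schema.table", params={"fetchsize": "1000"})
old_style = ReaderConfig(query="SELECT 1", options={"fetchsize": "1000"})
```

Calling ReaderConfig() without dbtable or query raises a ValueError immediately, which is the kind of early failure the init-time validation is meant to provide.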

HanaReader and TeradataReader classes were also updated to use params over options for improved consistency and maintainability.

by @dannymeijer

refactor - PR #142

Spark > Transformation > CamelToSnake: Added more efficient Spark 3.4+ supported operation#

CamelToSnakeTransformation now uses toDF for more efficient transformation in Spark 3.4+.
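
As a rough illustration of the approach, the sketch below converts a list of camelCase column names in plain Python; the new names could then be applied to a DataFrame in a single toDF call rather than renaming columns one at a time. The regular expression and the helper camel_to_snake are assumptions for this example and are not taken from Koheesio's source.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def camel_to_snake(name: str) -> str:
    """Convert a camelCase name to snake_case."""
    return _BOUNDARY.sub("_", name).lower()


columns = ["customerId", "orderDate", "total"]
renamed = [camel_to_snake(c) for c in columns]

# With a Spark DataFrame `df`, all columns could then be renamed in one call:
#     df = df.toDF(*renamed)
```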

by @dannymeijer

Bugfixes#

The following bugfixes are included with 0.10:

bugfix - PR #160

Core > Context: Fix Context initialization with another Context object and dotted notation#

  • The __init__ method of the Context class incorrectly updated the kwargs, making it return None. As a result, calls to Context containing another Context object would previously fail.
  • Also fixed an issue with how Context handled get operations for nested keys when using dotted notation.
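
For readers unfamiliar with dotted notation, the sketch below shows the general technique of resolving a key such as "spark.app.name" against nested dictionaries. get_nested is a hypothetical helper operating on plain dicts; it is not the Context class or its API.

```python
from typing import Any


def get_nested(data: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk `data` one key segment at a time; return `default` if any segment is missing."""
    current: Any = data
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


context = {"spark": {"app": {"name": "demo"}}, "retries": 3}
app_name = get_nested(context, "spark.app.name")
missing = get_nested(context, "spark.missing", default="n/a")
```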

by @dannymeijer

bugfix - PR #168, related issue: #167

Core > Step: Fixed duplicate logging issues in nested Step classes#

We observed log duplication when using specific super call sequences in nested Step classes.

  • Several changes were made to the StepMetaClass to address duplicate logs when using super() in the execute method of a Step class under specific circumstances.
  • Updated _is_called_through_super method to traverse the entire method resolution order (MRO) and correctly identify super() calls.
  • Ensured _execute_wrapper method triggers logging only once per execute call.

This change prevents duplicate logs and ensures accurate log entries. The _is_called_through_super method was also used for Output validation; now we can ensure it is called only once.

by @dannymeijer

bugfix - PR #155, related issue: #149

Delta: Improve merge clause handling in DeltaTableWriter#

When a Delta merge configuration (as a dict) was used to provide the merge condition to the merge builder and the merge operation was called multiple times (e.g. for each batch in streaming), the original implementation broke because of a pop call on the dictionary in use.
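
The failure mode is easy to reproduce outside of Delta: popping a key from a dictionary that the caller reuses makes the second call fail. The function names and the merge_cond key below are hypothetical and only illustrate the pattern and one way to avoid it (working on a copy); they are not taken from DeltaTableWriter.

```python
def build_merge_destructive(config: dict) -> str:
    # Removes the key from the caller's dict: a second call raises KeyError.
    return config.pop("merge_cond")


def build_merge_safe(config: dict) -> str:
    # Works on a shallow copy, so the caller's dict is left untouched.
    local = dict(config)
    return local.pop("merge_cond")


merge_config = {"merge_cond": "target.id = source.id"}
first = build_merge_safe(merge_config)   # e.g. first streaming batch
second = build_merge_safe(merge_config)  # e.g. second batch, same config object
```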

by @mikita-sakalouski

bugfix - PR #154, related issue: #153

Spark: PySpark Connect support fixes#

  • The Connect support check previously wrongly excluded Spark 3.4.
  • The fix gets rid of false positives in our Spark Connect check utility.

by @nogitting and @dannymeijer

bugfix - PR #142

Spark > ColumnsTransformation: ColumnConfig defaults in ColumnsTransformation not working correctly#

run_for_all_data_type and limit_data_type were previously not working correctly.

by @dannymeijer

bugfix - PR #168

Spark > Transformation > Hash: Fix error handling for missing columns in Spark Connect#

Updated sha2 function call to use named parameters.

Changes to Sha2Hash class:

  • Added check for missing columns.
  • Improved handling when no columns are provided.