File writer

File writers for different formats:

- CSV
- Parquet
- Avro
- JSON
- ORC
- Text

The FileWriter class is a configurable Writer that can write to any of these file formats and accepts any extra option the format needs. CsvFileWriter, ParquetFileWriter, AvroFileWriter, JsonFileWriter, OrcFileWriter, and TextFileWriter are convenience classes that only set the format field to the corresponding file format.
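
The convenience classes follow a common pattern: a subclass that only pins one field to a fixed default. The sketch below illustrates that pattern with plain dataclasses. SketchFileWriter and SketchParquetFileWriter are hypothetical stand-ins, not the Koheesio classes, and the enum is trimmed to two members.

```python
from dataclasses import dataclass
from enum import Enum


class FileFormat(str, Enum):
    """Trimmed stand-in for the FileFormat enum (assumption: string-valued)."""

    csv = "csv"
    parquet = "parquet"


@dataclass
class SketchFileWriter:
    """Hypothetical generic writer: the caller must choose the format."""

    path: str
    format: FileFormat


@dataclass
class SketchParquetFileWriter(SketchFileWriter):
    """Hypothetical convenience subclass: only pins the format field."""

    format: FileFormat = FileFormat.parquet


# Both objects describe the same write; the subclass just saves one argument.
generic = SketchFileWriter(path="path/to/file.parquet", format=FileFormat.parquet)
convenient = SketchParquetFileWriter(path="path/to/file.parquet")
```

With this shape, every option the generic class accepts remains available on the convenience class, which is why the two can be used interchangeably.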

koheesio.spark.writers.file_writer.AvroFileWriter

Writes a DataFrame to an Avro file.

This class is a convenience class that sets the format field to FileFormat.avro.

Extra parameters can be passed to the writer as keyword arguments.

Examples:

writer = AvroFileWriter(df=df, path="path/to/file.avro", output_mode=BatchOutputMode.APPEND)

format class-attribute instance-attribute

format: FileFormat = avro

koheesio.spark.writers.file_writer.CsvFileWriter

Writes a DataFrame to a CSV file.

This class is a convenience class that sets the format field to FileFormat.csv.

Extra parameters can be passed to the writer as keyword arguments.

Examples:

writer = CsvFileWriter(
    df=df,
    path="path/to/file.csv",
    output_mode=BatchOutputMode.APPEND,
    header=True,
)

format class-attribute instance-attribute

format: FileFormat = csv

koheesio.spark.writers.file_writer.FileFormat

Supported file formats for the FileWriter class.

avro class-attribute instance-attribute

avro = 'avro'

csv class-attribute instance-attribute

csv = 'csv'

json class-attribute instance-attribute

json = 'json'

orc class-attribute instance-attribute

orc = 'orc'

parquet class-attribute instance-attribute

parquet = 'parquet'

text class-attribute instance-attribute

text = 'text'
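
Because each member's value is the format name itself, a member can be looked up from a plain string. The sketch below rebuilds the enum as a stand-in and assumes it mixes in str, which this page does not confirm; parse_format is a hypothetical helper, not part of Koheesio.

```python
from enum import Enum


class FileFormat(str, Enum):
    """Stand-in mirroring the members listed above (str mixin is an assumption)."""

    avro = "avro"
    csv = "csv"
    json = "json"
    orc = "orc"
    parquet = "parquet"
    text = "text"


def parse_format(name: str) -> FileFormat:
    """Hypothetical helper: map a user-supplied name to an enum member.

    Raises ValueError for names that are not supported formats.
    """
    return FileFormat(name.lower())


chosen = parse_format("Parquet")

# Unsupported names are rejected instead of being passed on silently.
try:
    parse_format("xlsx")
    rejected = False
except ValueError:
    rejected = True
```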

koheesio.spark.writers.file_writer.FileWriter

A configurable Writer that can write to different file formats and accepts any extra option the format needs.

Extra parameters can be passed to the writer as keyword arguments; execute forwards them to the Spark writer as options.

Examples:

writer = FileWriter(
    df=df,
    path="path/to/file.parquet",
    output_mode=BatchOutputMode.APPEND,
    format=FileFormat.parquet,
    compression="snappy",
)

format class-attribute instance-attribute

format: FileFormat = Field(
    ...,
    description="The file format to use when writing the data.",
)

output_mode class-attribute instance-attribute

output_mode: BatchOutputMode = Field(
    default=APPEND, description="The output mode to use"
)

path class-attribute instance-attribute

path: Union[Path, str] = Field(
    default=..., description="The path to write the file to"
)

ensure_path_is_str

ensure_path_is_str(v: Union[Path, str]) -> str

Ensure that the path is a string as required by Spark.

Source code in src/koheesio/spark/writers/file_writer.py
@field_validator("path")
def ensure_path_is_str(cls, v: Union[Path, str]) -> str:
    """Ensure that the path is a string as required by Spark."""
    if isinstance(v, Path):
        return str(v.absolute().as_posix())
    return v
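
To see what this validator does to each input type, the sketch below repeats its body as a plain function outside of any Pydantic model. The sample paths are made up for illustration.

```python
from pathlib import Path
from typing import Union


def ensure_path_is_str(v: Union[Path, str]) -> str:
    """Same logic as the validator above, as a standalone function."""
    if isinstance(v, Path):
        # Path objects become absolute, forward-slash strings.
        return str(v.absolute().as_posix())
    # Strings are returned untouched, so URIs keep their scheme.
    return v


from_path = ensure_path_is_str(Path("data") / "out.parquet")
from_str = ensure_path_is_str("s3://bucket/data/out.parquet")
```

Note that only Path inputs are made absolute. A relative path given as a string is passed on as-is.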

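execute

execute() -> Output

Writes the DataFrame to path in the configured format and output mode, applying any extra parameters as writer options first, and then sets output.df to the input DataFrame.

Source code in src/koheesio/spark/writers/file_writer.py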
def execute(self) -> FileWriter.Output:
    writer = self.df.write

    if self.extra_params:
        self.log.info(f"Setting extra parameters for the writer: {self.extra_params}")
        writer = writer.options(**self.extra_params)

    writer.save(path=self.path, format=self.format, mode=self.output_mode)  # type: ignore

    self.output.df = self.df
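
The call sequence in execute can be traced without a Spark session. The sketch below is an illustration only: RecordingWriter is a hypothetical stand-in that records calls instead of writing, and write is a free-function copy of the logic above.

```python
from typing import Any, Dict, List, Tuple


class RecordingWriter:
    """Hypothetical stand-in for Spark's DataFrameWriter; records calls only."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def options(self, **opts: Any) -> "RecordingWriter":
        self.calls.append(("options", opts))
        return self

    def save(self, path: str, format: str, mode: str) -> None:
        self.calls.append(("save", {"path": path, "format": format, "mode": mode}))


def write(writer: RecordingWriter, path: str, fmt: str, mode: str,
          extra_params: Dict[str, Any]) -> None:
    """Mirror of execute: options are applied only when extras are present."""
    if extra_params:
        writer = writer.options(**extra_params)
    writer.save(path=path, format=fmt, mode=mode)


with_extras = RecordingWriter()
write(with_extras, "path/to/file.parquet", "parquet", "append", {"compression": "snappy"})

without_extras = RecordingWriter()
write(without_extras, "path/to/file.txt", "text", "append", {})
```

In this trace, extra keyword arguments such as compression reach the writer through options before save is called, and nothing is set when no extras are given.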

koheesio.spark.writers.file_writer.JsonFileWriter

Writes a DataFrame to a JSON file.

This class is a convenience class that sets the format field to FileFormat.json.

Extra parameters can be passed to the writer as keyword arguments.

Examples:

writer = JsonFileWriter(df=df, path="path/to/file.json", output_mode=BatchOutputMode.APPEND)

format class-attribute instance-attribute

format: FileFormat = json

koheesio.spark.writers.file_writer.OrcFileWriter

Writes a DataFrame to an ORC file.

This class is a convenience class that sets the format field to FileFormat.orc.

Extra parameters can be passed to the writer as keyword arguments.

Examples:

writer = OrcFileWriter(df=df, path="path/to/file.orc", output_mode=BatchOutputMode.APPEND)

format class-attribute instance-attribute

format: FileFormat = orc

koheesio.spark.writers.file_writer.ParquetFileWriter

Writes a DataFrame to a Parquet file.

This class is a convenience class that sets the format field to FileFormat.parquet.

Extra parameters can be passed to the writer as keyword arguments.

Examples:

writer = ParquetFileWriter(
    df=df,
    path="path/to/file.parquet",
    output_mode=BatchOutputMode.APPEND,
    compression="snappy",
)

format class-attribute instance-attribute

format: FileFormat = parquet

koheesio.spark.writers.file_writer.TextFileWriter

Writes a DataFrame to a text file.

This class is a convenience class that sets the format field to FileFormat.text.

Extra parameters can be passed to the writer as keyword arguments.

Examples:

writer = TextFileWriter(df=df, path="path/to/file.txt", output_mode=BatchOutputMode.APPEND)

format class-attribute instance-attribute

format: FileFormat = text