Buffer
This module contains classes for writing data to a buffer before writing to the final destination.
The BufferWriter class is a base class for writers that write to a buffer first. It provides methods for writing, reading, and resetting the buffer, as well as for checking whether the buffer is compressed and for compressing it.
The PandasCsvBufferWriter class is a subclass of BufferWriter that writes a Spark DataFrame to CSV file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
The PandasJsonBufferWriter class is a subclass of BufferWriter that writes a Spark DataFrame to JSON file(s) using Pandas. It is not meant to be used for writing huge amounts of data, but rather for writing smaller amounts of data to more arbitrary file systems (e.g., SFTP).
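As a hedged usage sketch of the JSON variant (small_df is assumed to be an existing, small Spark DataFrame, and the exact constructor options should be checked against the PandasJsonBufferWriter documentation):

```python
from koheesio.spark.writers.buffer import PandasJsonBufferWriter

# small_df is assumed to be an existing, small Spark DataFrame
json_writer = PandasJsonBufferWriter(df=small_df)
json_output = json_writer.execute()
json_bytes = json_output.read()  # rendered JSON, ready for e.g. an SFTP upload
```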
koheesio.spark.writers.buffer.BufferWriter #
Base class for writers that write to a buffer first, before writing to the final destination.
The execute() method should implement how the incoming DataFrame is written to the buffer object (e.g., BytesIO) in the output.
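To illustrate, a minimal sketch of a subclass (MyBufferWriter is a hypothetical name, and the df and output attributes on the base class are assumed from the surrounding docs rather than verified):

```python
from koheesio.spark.writers.buffer import BufferWriter


class MyBufferWriter(BufferWriter):
    """Hypothetical example: write each row to the buffer as a line of text."""

    def execute(self):
        buffer = self.output.buffer
        # Iterating locally is acceptable here, since buffer writers are
        # intended for small DataFrames by design
        for row in self.df.toLocalIterator():
            buffer.write((str(row.asDict()) + "\n").encode("utf-8"))
```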
The default implementation uses a SpooledTemporaryFile as the buffer. This is a file-like object that starts off stored in memory and automatically rolls over to a temporary file on disk if it exceeds a certain size. A SpooledTemporaryFile behaves similarly to BytesIO, but with the added benefit of being able to handle larger amounts of data.
This approach provides a balance between speed and memory usage, allowing for fast in-memory operations for smaller amounts of data while still being able to handle larger amounts of data that would not otherwise fit in memory.
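A standalone illustration of the rollover behavior, using Python's tempfile module directly (the max_size here is arbitrary and is not the koheesio default):

```python
from tempfile import SpooledTemporaryFile

with SpooledTemporaryFile(mode="w+b", max_size=1024) as buf:
    buf.write(b"x" * 512)   # under max_size: data is held in memory
    print(buf._rolled)      # False (private attribute, inspected for demo only)
    buf.write(b"x" * 1024)  # crosses max_size: contents spill to a disk file
    print(buf._rolled)      # True
    buf.seek(0)
    data = buf.read()       # reading is transparent either way
```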
Output #
Output class for BufferWriter
buffer class-attribute instance-attribute #
```python
buffer: InstanceOf[SpooledTemporaryFile] = Field(
    default_factory=partial(
        # max_size=0 disables automatic size-based rollover, so the buffer
        # stays in memory unless rolled over explicitly
        SpooledTemporaryFile, mode="w+b", max_size=0
    ),
    exclude=True,
)
```
compress #
Compress the file_buffer in place using GZIP
is_compressed #
Returns whether the buffer has been compressed.
read #
Reads and returns the contents of the buffer.
reset_buffer #
Resets the buffer so it can be read from the beginning again.
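A hedged sketch of how these helpers fit together, assuming writer is an instance of a concrete BufferWriter subclass and that read() returns the buffer's raw bytes (verify the exact signatures against the source):

```python
import gzip

output = writer.execute()   # fills output.buffer, as described under execute()
output.compress()           # gzip the buffer contents in place
if output.is_compressed():  # guard against compressing twice
    raw = output.read()     # the gzipped bytes
    text = gzip.decompress(raw).decode("utf-8")  # round-trip sanity check
output.reset_buffer()       # rewind so the buffer can be read again
```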
koheesio.spark.writers.buffer.PandasCsvBufferWriter #
Write a Spark DataFrame to CSV file(s) using Pandas.
Takes inspiration from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
See also: https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
Note
This class is not meant to be used for writing huge amounts of data. It is meant to be used for writing smaller amounts of data to more arbitrary file systems (e.g. SFTP).
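A hedged usage sketch (small_df is assumed to be an existing, small Spark DataFrame; the option names follow the Koheesio column of the table below):

```python
from koheesio.spark.writers.buffer import PandasCsvBufferWriter

writer = PandasCsvBufferWriter(df=small_df, header=True, sep=";")
output = writer.execute()

csv_bytes = output.read()  # the rendered CSV, ready to push to e.g. an SFTP server
```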
PySpark vs Pandas
The following table shows the mapping between PySpark, Pandas, and Koheesio properties. Note that the default values are mostly the same as PySpark's DataFrameWriter implementation, with some exceptions (see below).
This class implements the most commonly used properties. If a property is not explicitly implemented, it can be accessed through params (see the sketch after the table).
PySpark Property | Default PySpark | Pandas Property | Default Pandas | Koheesio Property | Default Koheesio | Notes |
---|---|---|---|---|---|---|
maxRecordsPerFile | ... | chunksize | None | max_records_per_file | ... | Spark property name: spark.sql.files.maxRecordsPerFile |
sep | , | sep | , | sep | , | |
lineSep | \n | line_terminator | os.linesep | lineSep (alias=line_terminator) | \n | |
N/A | ... | index | True | index | False | Determines whether row labels (index) are included in the output |
header | False | header | True | header | True | |
quote | " | quotechar | " | quote (alias=quotechar) | " | |
quoteAll | False | doublequote | True | quoteAll (alias=doublequote) | False | |
escape | \ | escapechar | None | escapechar (alias=escape) | \ | |
escapeQuotes | True | N/A | N/A | N/A | ... | Not available in Pandas |
ignoreLeadingWhiteSpace | True | N/A | N/A | N/A | ... | Not available in Pandas |
ignoreTrailingWhiteSpace | True | N/A | N/A | N/A | ... | Not available in Pandas |
charToEscapeQuoteEscaping | escape or \0 | N/A | N/A | N/A | ... | Not available in Pandas |
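As a hedged sketch of the params passthrough mentioned above, an option the class does not implement explicitly, such as pandas' date_format, would be passed as a regular keyword argument (the passthrough mechanics are assumed from the text, not verified here):

```python
from koheesio.spark.writers.buffer import PandasCsvBufferWriter

writer = PandasCsvBufferWriter(
    df=small_df,             # assumed: an existing, small Spark DataFrame
    sep=";",
    date_format="%Y-%m-%d",  # a pandas to_csv option not in the table above
)
output = writer.execute()
```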