Download files
This module provides functionality to download files from URLs specified in a Spark DataFrame column and store the
downloaded file paths in a new column. It leverages the DownloadFileStep
class to handle the file download process
and supports various write modes to manage existing files.
Classes:
Name | Description |
---|---|
DownloadFileFromUrlTransformation |
A transformation class that downloads content from URLs in the specified column and stores the downloaded file paths in a new column. |
Write Modes
DownloadFileFromUrlTransformation
supports the following write modes:
- If the file exists, it will be overwritten.
- If it does not exist, a new file will be created.
(this is the default mode)
- If the file exists, the new data will be appended to it.
- If it does not exist, a new file will be created.
- If the file exists, the method will return without writing anything.
- If it does not exist, a new file will be created.
- If the file exists, an error will be raised.
- If it does not exist, a new file will be created.
koheesio.spark.transformations.download_files.DownloadFileFromUrlTransformation #
Downloads content from URLs in the specified column and stores the downloaded file paths in a new column.
Example
Example usage of the DownloadFileFromUrlTransformation
class:
from pyspark.sql import SparkSession
from koheesio.spark.transformations.download_files import (
DownloadFileFromUrlTransformation,
)
from koheesio.steps.download_file import FileWriteMode
spark = SparkSession.builder.appName(
"DownloadFilesExample"
).getOrCreate()
df = spark.createDataFrame(
[
("http://example.com/file1.txt",),
("http://example.com/file2.txt",),
],
["url"],
)
transformation = DownloadFileFromUrlTransformation(
column="url",
target_column="downloaded_file_path",
mode=FileWriteMode.OVERWRITE,
download_path="/path/to/download",
)
transformed_df = transformation.transform(df)
transformed_df.show()
In this example, the DownloadFileFromUrlTransformation
class is used to download files from the URLs specified in
the url
column of the DataFrame df
. The downloaded file paths are stored in a new column named
downloaded_file_path
. The downloaded files are saved to the /path/to/download
directory with the OVERWRITE
write mode. (The OVERWRITE
mode is the default mode.)
Input DataFrame:#
url |
---|
http://example.com/file1.txt |
http://example.com/file2.txt |
Output DataFrame:#
url | downloaded_file_path |
---|---|
http://example.com/file1.txt | download/file1.txt |
http://example.com/file2.txt | download/file2.txt |
Since the DownloadFileFromUrlTransformation
class is a ColumnsTransformationWithTarget
, the transform
method
is used to apply the transformation to the input DataFrame df
. Alternatively, the execute
method can be used
to apply the transformation in place, or the class can be used on a df.transform()
call.
I.e.:
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column
|
Union[Column, str]
|
The column that holds the URLs to download. |
required |
download_path
|
str
|
The local directory path where the file will be downloaded to. |
required |
chunk_size
|
int
|
The size (in bytes) of the chunks to download the file in, must be greater than 16. |
8192
|
mode
|
FileWriteMode
|
Write mode: overwrite, append, ignore, exclusive, or backup. |
FileWriteMode.OVERWRITE
|
Write Modes
The DownloadFileFromUrlTransformation
supports the following write modes:
- OVERWRITE
- APPEND
- IGNORE
- EXCLUSIVE
- BACKUP
chunk_size
class-attribute
instance-attribute
#
chunk_size: int = Field(
8192,
ge=16,
description="The size (in bytes) of the chunks to download the file in, must be greater than or equal to 16.",
)
column
class-attribute
instance-attribute
#
column: Union[Column, str] = Field(
default="",
description="The column that holds the URLs to download.",
)
download_path
class-attribute
instance-attribute
#
download_path: DirectoryPath = Field(
...,
description="The local directory path where the file will be downloaded to.",
)
mode
class-attribute
instance-attribute
#
mode: FileWriteMode = Field(
default=OVERWRITE,
description="Write mode: overwrite, append, ignore, exclusive, or backup.",
)
Output #
execute #
execute() -> Output
Download files from URLs in the specified column.
Source code in src/koheesio/spark/transformations/download_files.py
func #
Takes a set of urls and downloads the files from the URLs in the specified column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
partition
|
set[str]
|
A set of URLs to download the files from. |
required |