S3Fetch is a simple and fast multi-threaded S3 download tool built to make downloading objects from a large S3 bucket as fast and painless as possible, especially if you need to download objects under a specific prefix.
While tools like s4cmd work well for general usage (and have a lot more options!), they are not performant when downloading objects under a prefix from a bucket containing a large number of objects. They retrieve a listing of all objects in the bucket first, then filter the object keys, which results in an extremely long time to first byte. S3Fetch instead requests that S3 return only the objects under the prefix you've requested. And because the object listing runs in a separate thread from the downloads, S3Fetch starts downloading the first object as soon as its key is returned by S3, even while S3 continues to send the rest of the object listing!
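To make the idea concrete, here is a minimal sketch of the pattern described above: one thread produces object keys as the listing pages arrive, and a pool of worker threads starts downloading immediately. This is not S3Fetch's actual code. The names `list_keys`, `fetch_prefix`, and `s3_pages` are made up for illustration, and the sketch assumes boto3's `list_objects_v2` paginator is the source of the listing pages.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel that marks the end of the listing


def list_keys(pages, key_queue):
    """Producer: push each key onto the queue as soon as its page arrives."""
    try:
        for page in pages:
            # S3 omits "Contents" entirely when a page has no keys.
            for obj in page.get("Contents", []):
                key_queue.put(obj["Key"])
    finally:
        key_queue.put(_DONE)


def fetch_prefix(pages, download, threads=8):
    """Download keys while the listing is still in progress.

    `pages` is any iterable of list_objects_v2-style page dicts and
    `download` is any callable taking a key (both hypothetical here).
    """
    key_queue = queue.Queue()
    lister = threading.Thread(
        target=list_keys, args=(pages, key_queue), daemon=True
    )
    lister.start()

    futures = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while True:
            key = key_queue.get()
            if key is _DONE:
                break
            # The first download is submitted as soon as the first key arrives.
            futures.append(pool.submit(download, key))
    lister.join()
    return [f.result() for f in futures]


def s3_pages(bucket, prefix):
    """Assumed wiring to real S3: ask S3 to filter by prefix server-side."""
    import boto3  # third-party; only needed when talking to real S3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return paginator.paginate(Bucket=bucket, Prefix=prefix)
```

With real credentials you could call something like `fetch_prefix(s3_pages("fake-test-bucket", "fake-prod-data/2020-10-17"), my_download, threads=100)`, where `my_download` is your own function that saves a key to disk. The important part is that `Prefix` is sent to S3, so keys outside the prefix are never listed at all.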
Testing shows (see the benchmarks below) that s4cmd had not started downloading a single file after 60 minutes, while S3Fetch finished downloading all 428 objects in approx. 8 seconds when using 100 threads.
S3Fetch is open source and written in Python. You can read more detailed instructions and information, raise issues, or contribute if that takes your fancy via the GitHub repo.
You can also use the comment section below if you just want to chat about S3Fetch or ask questions.
Installation
Quick and easy install from PyPI. I recommend installation using pipx:
pipx install s3fetch
but you can also install using pip:
pip install s3fetch
Usage
s3fetch s3://bucket/prefix
Example:
s3fetch s3://fake-test-bucket/fake-prod-data/2020-10-17
The above would download any object that has a prefix of fake-prod-data/2020-10-17.
You can find more detailed instructions in the README.
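One thing worth knowing about prefixes: S3 treats a prefix as a plain string match against the start of the object key, not as a directory. The small sketch below mimics that matching locally. The key names are made up for illustration, and it assumes the prefix is passed to S3 unchanged.

```python
def matches_prefix(key: str, prefix: str) -> bool:
    """Mimic S3 prefix filtering: a plain 'starts with' string comparison."""
    return key.startswith(prefix)


# Hypothetical keys, for illustration only.
keys = [
    "fake-prod-data/2020-10-17/part-0001.csv",
    "fake-prod-data/2020-10-17-backfill/part-0001.csv",
    "fake-prod-data/2020-10-18/part-0001.csv",
]

selected = [k for k in keys if matches_prefix(k, "fake-prod-data/2020-10-17")]
for key in selected:
    print(key)
```

Note that the `2020-10-17-backfill` key matches too. If you only want what looks like the contents of a "directory", a trailing `/` on the prefix narrows the match.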
Benchmarks
Downloading 428 objects under the fake-prod-data/2020-10-17 prefix from a bucket containing a total of 12,204,097 objects.
With 100 threads
s3fetch s3://fake-test-bucket/fake-prod-data/2020-10-17 --threads 100
8.259 seconds
s4cmd get s3://fake-test-bucket/fake-prod-data/2020-10-17* --num-threads 100
Timed out while listing objects after 60min.
With 8 threads
s3fetch s3://fake-test-bucket/fake-prod-data/2020-10-17 --threads 8
29.140 seconds
s4cmd get s3://fake-test-bucket/fake-prod-data/2020-10-17* --num-threads 8
Timed out while listing objects after 60min.
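If you want to turn the S3Fetch timings above into throughput, the arithmetic is simple enough to sketch. This uses only the numbers reported in the benchmarks (428 objects, 8.259 s and 29.140 s). They come from single runs and object sizes are not part of the calculation, so treat the results as rough indications.

```python
OBJECTS = 428

# Wall-clock seconds reported above, keyed by thread count.
timings = {100: 8.259, 8: 29.140}


def throughput(objects: int, seconds: float) -> float:
    """Objects downloaded per second."""
    return objects / seconds


rates = {threads: throughput(OBJECTS, secs) for threads, secs in timings.items()}
speedup = timings[8] / timings[100]

for threads, rate in rates.items():
    print(f"{threads:>3} threads: {rate:.1f} objects/s")
print(f"100 threads vs 8 threads: {speedup:.2f}x faster")
```

The speedup is well short of the 12.5x increase in thread count, which suggests that going from 8 to 100 threads gives diminishing returns for this particular workload.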