Why S3Fetch?
If you've ever tried to download a large number of objects from S3 using the AWS CLI, you've probably noticed it can be painfully slow. The AWS CLI's `aws s3 sync` and `aws s3 cp --recursive` commands process objects sequentially and don't start downloading until the complete object listing is finished.
S3Fetch solves this by:
- Concurrent operations: Lists objects and downloads files simultaneously in separate threads
- Immediate downloads: Starts downloading the first object as soon as it's discovered, while still listing remaining objects
- Optimized for prefixes: Only requests objects under your specified prefix, reducing API calls
- Configurable threading: Tune performance with adjustable thread counts
Note
I wrote the first version of S3Fetch when I wasn't a very good programmer (arguably I'm still not), but I have been working on a version 2 that gets rid of the most egregious mistakes, such as the God Class. That code has been pushed to `main` on the GitHub repo and it works, but I haven't gotten around to releasing a new package yet.
How It Works
The basic idea is pretty simple: why wait around?
Most tools (like AWS CLI) do this:
- Ask S3 for a list of ALL your files
- Wait... and wait... for the complete list
- Finally start downloading files one by one
S3Fetch does this instead:
- Start asking S3 for your file list in one thread
- As soon as S3 mentions the first file, start downloading it in another thread
- Keep downloading files in multiple threads while the first thread is still getting the rest of the list
So if you're downloading 1,000 files, S3Fetch is already downloading file #1 (and #2, #3, etc.) in separate threads while the main thread is still figuring out what files #500-1000 even are.
That's basically it - just use separate threads so you don't wait around when you don't have to.
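The steps above boil down to a producer/consumer pattern, which can be sketched with Python's standard `threading` and `queue` modules. This is an illustrative sketch, not S3Fetch's actual code: the lister is simulated with an in-memory key list, and "downloading" just records the key.

```python
import queue
import threading

SENTINEL = object()  # marker telling workers that the listing is finished


def list_keys(work_queue, keys):
    # Producer: enqueue each key as soon as it is "discovered".
    # In the real tool this role is played by paginated S3 listing calls.
    for key in keys:
        work_queue.put(key)
    work_queue.put(SENTINEL)


def download_worker(work_queue, downloaded, lock):
    # Consumer: pull keys off the queue and "download" them,
    # without waiting for the full listing to finish.
    while True:
        key = work_queue.get()
        if key is SENTINEL:
            work_queue.put(SENTINEL)  # re-queue so sibling workers also stop
            return
        with lock:
            downloaded.append(key)  # a real worker would fetch the object here


def fetch_all(keys, threads=4):
    work_queue = queue.Queue()
    downloaded, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=download_worker, args=(work_queue, downloaded, lock))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    list_keys(work_queue, keys)  # listing proceeds while workers download
    for w in workers:
        w.join()
    return downloaded


keys = [f"prefix/file-{i:03d}" for i in range(10)]
print(sorted(fetch_all(keys)) == keys)  # every listed key got downloaded
```

The sentinel re-queue trick is one common way to shut down a pool of workers cleanly; each worker that sees the sentinel puts it back so the remaining workers see it too.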
Performance Comparison
For downloading 428 objects from a bucket with 12+ million objects:
- AWS CLI: ~2-3 minutes (sequential listing + downloading)
- S3Fetch (8 threads): 29 seconds
- S3Fetch (100 threads): 8 seconds
S3Fetch is open source and written in Python. You can find more detailed instructions and information, raise issues, or contribute via the GitHub repo if that takes your fancy.
You can also use the comment section below if you just want to chat about S3Fetch or ask questions.
Installation
Quick and easy install from PyPI. I recommend installing with pipx:
pipx install s3fetch
You can also install using `pip`:
pip install s3fetch
Usage
Basic usage is simple:
s3fetch s3://bucket/prefix
Common Examples
Download logs from a specific date:
s3fetch s3://my-logs-bucket/application-logs/2021-01-28
Download all files in a subdirectory:
s3fetch s3://my-data-bucket/user-uploads/images/
Download with custom thread count for better performance:
s3fetch s3://large-bucket/big-dataset/ --threads 50
Download to a specific local directory:
s3fetch s3://my-bucket/data/ --output-dir ./downloads/
When to Use S3Fetch
- Large file counts: Downloading hundreds or thousands of files
- Deep prefixes: When your files are buried deep in a bucket with millions of objects
- Bandwidth optimization: Making full use of your connection with parallel downloads
Command Options
- `--threads N`: Number of download threads (default: number of CPU cores)
- `--output-dir PATH`: Download to a specific directory (default: current directory)
- `--dry-run`: Show what would be downloaded without actually downloading
You can find more detailed instructions in the README.
Benchmarks
Downloading 428 objects under the `fake-prod-data/2020-10-17` prefix from a bucket containing a total of 12,204,097 objects.
With 100 threads
s3fetch s3://fake-test-bucket/fake-prod-data/2020-10-17 --threads 100
8.259 seconds
With 8 threads
s3fetch s3://fake-test-bucket/fake-prod-data/2020-10-17 --threads 8
29.140 seconds
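Going from 8 to 100 threads in these runs works out to roughly a 3.5x speedup, as a quick check confirms:

```python
# Speedup between the two benchmark runs above.
t_8_threads = 29.140
t_100_threads = 8.259
print(f"{t_8_threads / t_100_threads:.1f}x faster")  # 3.5x faster
```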