tensorflow/io

Environment variable for `kExecutorPoolSize` in S3 Filesystem

jeongukjae opened this issue · 0 comments

It would be better if I could adjust the value of `kExecutorPoolSize` with an environment variable, as is already possible for `kS3MultiPartDownloadChunkSize`.


For more background:

I recently found that my program (written in C++) consumes noticeably more memory (about 1.3–1.5 GB) when it uses the S3 filesystem from tensorflow/io instead of a local filesystem. I'm fairly sure the transfer manager is responsible: kS3MultiPartDownloadChunkSize (50 MB) × (kExecutorPoolSize + 1 = 26) ≈ 1.27 GB, plus possibly some additional memory for the threads themselves.

```cpp
// Implementation of a filesystem for S3 environments.
// This filesystem will support `s3://` URI schemes.
constexpr char kS3FileSystemAllocationTag[] = "S3FileSystemAllocation";
constexpr char kS3ClientAllocationTag[] = "S3ClientAllocation";
constexpr int64_t kS3TimeoutMsec = 300000;  // 5 min
constexpr int kS3GetChildrenMaxKeys = 100;
constexpr char kExecutorTag[] = "TransferManagerExecutorAllocation";
constexpr int kExecutorPoolSize = 25;
constexpr uint64_t kS3MultiPartUploadChunkSize = 50 * 1024 * 1024;    // 50 MB
constexpr uint64_t kS3MultiPartDownloadChunkSize = 50 * 1024 * 1024;  // 50 MB
constexpr size_t kDownloadRetries = 3;
constexpr size_t kUploadRetries = 3;
constexpr size_t kS3ReadAppendableFileBufferSize = 1024 * 1024;  // 1 MB
```
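The back-of-the-envelope estimate above can be written out as a small sketch (the `pool_size + 1` term is my reading of where the extra buffer comes from, not something the code above states explicitly):

```cpp
#include <cstdint>

// Estimate the transfer manager's worst-case download-buffer memory in GB:
// one chunk-sized buffer per executor thread, plus one extra.
double EstimateTransferManagerGB(int pool_size, uint64_t chunk_bytes) {
  return static_cast<double>(chunk_bytes) * (pool_size + 1) /
         (1024.0 * 1024.0 * 1024.0);
}
```

With the defaults (`pool_size = 25`, `chunk_bytes = 50 * 1024 * 1024`), this gives roughly 1.27 GB, which is consistent with the 1.3–1.5 GB I observed.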

To test this hypothesis, I set the environment variable `S3_MULTI_PART_DOWNLOAD_CHUNK_SIZE=1024`, and memory utilization dropped as expected.

Since in my use case I can tolerate slower model downloads in exchange for a smaller memory footprint, I would like to set both `kS3MultiPartDownloadChunkSize` and `kExecutorPoolSize` to smaller values.
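For illustration, a minimal sketch of what I'm asking for, modeled on how the chunk-size override works. The variable name `S3_EXECUTOR_POOL_SIZE` is my suggestion, not an existing knob:

```cpp
#include <cstdlib>

constexpr int kExecutorPoolSize = 25;  // compiled-in default

// Resolve the executor pool size from a (hypothetical) environment
// variable, falling back to the default when it is unset or invalid.
int GetExecutorPoolSize() {
  const char* value = std::getenv("S3_EXECUTOR_POOL_SIZE");
  if (value == nullptr) return kExecutorPoolSize;
  int parsed = std::atoi(value);
  // Ignore unparsable or non-positive values and keep the default.
  return parsed > 0 ? parsed : kExecutorPoolSize;
}
```

The transfer manager's executor would then be constructed with `GetExecutorPoolSize()` instead of the constant.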