quiltdata/quilt

Package.list_object_versions can return keys ending in /, breaking root.set

gdesmarais-ctx opened this issue · 2 comments

In packages.py, set_dir (around line 621) calls list_object_versions to get all the objects under the directory being added to the package. Some of the returned keys can end in /. For example, we have sequencer data that is copied into S3 through a storage gateway. The following script calls list_object_versions on the root directory of the S3 contents:

from datetime import date, datetime
from quilt3.packages import list_object_versions
import json

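def json_serial(obj):
    """Serialize datetime/date values, which json.dumps cannot handle itself."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f'Type {type(obj)} is not JSON serializable')

objects, _ = list_object_versions('celsius-sequencing', '190828_NB552139_0023_AHKCYJBGXB/')
# Print the first three entries of the listing.
for obj in objects[:3]:
    print(json.dumps(obj, default=json_serial, indent=2))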

This prints:

{
  "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
  "Size": 0,
  "StorageClass": "STANDARD",
  "Key": "190828_NB552139_0023_AHKCYJBGXB/",
  "VersionId": "L7UoB6tk.T5bH8XHzNWx63ZjgG_KvCBW",
  "IsLatest": true,
  "LastModified": "2019-08-28T20:12:19+00:00",
  "Owner": {
    "DisplayName": "aws",
    "ID": "5f378d7af9023313f9eb8f0ea138443d2d7629af0efa4c66572dfdb5360dd5c1"
  }
}
{
  "ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
  "Size": 0,
  "StorageClass": "STANDARD",
  "Key": "190828_NB552139_0023_AHKCYJBGXB/Config/",
  "VersionId": "7TLUsbBstpl.ei8TeSbe1GZY_mudfTXI",
  "IsLatest": true,
  "LastModified": "2019-08-28T20:12:35+00:00",
  "Owner": {
    "DisplayName": "aws",
    "ID": "5f378d7af9023313f9eb8f0ea138443d2d7629af0efa4c66572dfdb5360dd5c1"
  }
}
{
  "ETag": "\"8048e95a2c72097c274ccbdce9115ebb\"",
  "Size": 264379,
  "StorageClass": "STANDARD",
  "Key": "190828_NB552139_0023_AHKCYJBGXB/Config/Effective.cfg",
  "VersionId": "3vuUg9PqKDWlCyWfk8jbKyEqw2iF4Qk5",
  "IsLatest": true,
  "LastModified": "2019-08-28T20:13:07+00:00",
  "Owner": {
    "DisplayName": "aws",
    "ID": "5f378d7af9023313f9eb8f0ea138443d2d7629af0efa4c66572dfdb5360dd5c1"
  }
}

When root.set is called with the second item, it raises an exception at this check:

        if not logical_key or logical_key.endswith('/'):
            raise QuiltException(
                f"Invalid logical key {logical_key!r}. "
                f"A package entry logical key cannot be a directory."
            )

We need to be able to add these files. Currently, I have a patch in place that just ignores the QuiltException, which is obviously not ideal.
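One alternative to swallowing the exception would be to drop the directory-marker entries before they reach root.set. The following is only a minimal sketch, assuming the zero-byte keys ending in / are placeholders created by the storage gateway and carry no data. `is_directory_marker` and `drop_directory_markers` are hypothetical helpers, not part of quilt3.

```python
def is_directory_marker(obj):
    """True for a zero-byte object whose key ends in '/' (assumed placeholder)."""
    return obj['Key'].endswith('/') and obj.get('Size', 0) == 0


def drop_directory_markers(objects):
    """Return only the entries that do not look like directory markers."""
    return [obj for obj in objects if not is_directory_marker(obj)]


# Trimmed copies of the three entries from the listing above.
sample = [
    {'Key': '190828_NB552139_0023_AHKCYJBGXB/', 'Size': 0},
    {'Key': '190828_NB552139_0023_AHKCYJBGXB/Config/', 'Size': 0},
    {'Key': '190828_NB552139_0023_AHKCYJBGXB/Config/Effective.cfg', 'Size': 264379},
]

files = drop_directory_markers(sample)
print([obj['Key'] for obj in files])
```

The list returned by list_object_versions could be passed through such a filter before the entries are added. A non-empty object whose key ends in / is deliberately kept by this sketch, so it would still hit the check above.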

Thanks for the detailed bug report. We'll circle back with a fix.

Adding in a dump of the offending bucket/key structure. I used the following little script to generate the file:

import boto3
import json
from datetime import datetime, date
from quilt3.packages import list_object_versions

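def json_serial(obj):
    """Serialize datetime/date values, which json.dumps cannot handle itself."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError(f'Type {type(obj)} is not JSON serializable')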

s3_client = boto3.client('s3')
src_bucket = 'celsius-sequencing'
src_key = '190828_NB552139_0023_AHKCYJBGXB'
obj_report_v2 = s3_client.list_objects_v2(Bucket=src_bucket, Prefix=src_key)
obj_report_json_v2 = json.dumps(obj_report_v2, default=json_serial, indent=2)
with open('obj_report_v2.json', 'w') as f:
    f.write(obj_report_json_v2)

# Can't do this - throws exception
# obj_report_q = list_object_versions(src_bucket, src_key)
# obj_report_json_q = json.dumps(obj_report_q, default=json_serial, indent=2)
# print(f'{obj_report_json_q}')
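
One caveat about the script above: a single list_objects_v2 call returns at most 1,000 keys, so the dump may be truncated for a large run directory. Below is a hedged sketch that merges every page of a paginated listing. `collect_contents` is a hypothetical helper; the boto3 lines are left commented out because they need credentials, and stand-in pages are used instead.

```python
def collect_contents(pages):
    """Merge the 'Contents' lists from an iterable of list_objects_v2 pages."""
    contents = []
    for page in pages:
        # 'Contents' is absent from a page that has no matching keys.
        contents.extend(page.get('Contents', []))
    return contents


# With real credentials (not run here):
# import boto3
# paginator = boto3.client('s3').get_paginator('list_objects_v2')
# pages = paginator.paginate(Bucket=src_bucket, Prefix=src_key)
# all_objects = collect_contents(pages)

# Stand-in pages so the helper can be exercised offline.
fake_pages = [
    {'Contents': [{'Key': 'run/'}, {'Key': 'run/Config/'}]},
    {'Contents': [{'Key': 'run/Config/Effective.cfg'}]},
    {},
]
all_objects = collect_contents(fake_pages)
print(len(all_objects))
```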

obj_report_v2.json.gz