awslabs/ml-io

[Feature Req] Read raw RecordIO contents for AugmentedManifestFile input

athewsey opened this issue · 3 comments

Per the docs, augmented manifest file inputs in SageMaker (when RecordIO-wrapped) seem to produce plain RecordIO streams, rather than RecordIO-protobuf.

In the general case (e.g. object detection on a SageMaker Ground Truth labelled dataset), the records might alternate between complex data types such as JPEGs and JSON objects.

Is it already possible in mlio to yield this raw data, similar to how sagemaker_tensorflow's PipeModeDataset just produces string Tensors?
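(For reference, this is roughly what that looks like with sagemaker_tensorflow today; a minimal sketch, assuming the PipeModeDataset API with a RecordIO record format and a channel named 'train'.)

from sagemaker_tensorflow import PipeModeDataset

# Yields each RecordIO record of the 'train' channel as a raw,
# scalar tf.string Tensor, with no protobuf decoding applied.
ds = PipeModeDataset(channel='train', record_format='RecordIO')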

Maybe this runs against the philosophy of mlio (since pushing the binary data up to Python for processing would not be performant), but from what I understand it could fill some niches such as:

  • Reading arbitrary (RecordIO-wrapped) augmented manifest channels in frameworks like SKLearn or PyTorch, where RecordIO is not as natively supported as in MXNet
  • Supporting model fitting via tensorflow.keras.Model.fit_generator() as well as via PipeModeDataset and fit(), making it easier to add complex transforms in Python (like json.loads()) that don't have nice within-TensorFlow equivalents.

Would something similar to the code snippet below work for your case?

import json
import mlio

pipe = mlio.SageMakerPipe('/opt/ml/train')

strm = pipe.open_read()

# A record reader is a lower level abstraction compared to a
# data reader. It returns raw records in byte form.
rec_reader = mlio.RecordIORecordReader(strm)

for rec in rec_reader:
    # Each record has a payload property that contains the raw
    # record data as a Python buffer; convert it to bytes before
    # handing it to json.loads().
    data = json.loads(bytes(rec.payload))
    ...

ML-IO internally has record_readers which provide raw access to the underlying records of a data stream. Our data readers simply use those record readers to decode and convert data into tensors. We have considered record readers an implementation detail, as they were "too low level". However, if you think there is a use case for accessing raw/encoded record data, we can consider exposing record readers in our Python API.
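For contrast, a data reader goes one step further and yields decoded tensors rather than raw bytes; a minimal sketch, assuming the CsvReader / DataReaderParams names from the Python API and an example CSV path:

import mlio

dataset = mlio.list_files('/path/to/data', pattern='*.csv')

# A data reader wraps a record reader and decodes each record into
# an Example of named tensors, batched to the requested size.
reader = mlio.CsvReader(mlio.DataReaderParams(dataset=dataset, batch_size=200))

for example in reader:
    # `example` holds decoded feature tensors rather than raw bytes.
    ...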

Yes, it looks like that's exactly what I was after. Thanks @cbalioglu!

I think exposing the RecordIORecordReader in particular would make it easier to feed SageMaker Ground Truth annotated data into non-MXNet frameworks (where RecordIO isn't natively supported) via Pipe Mode & Augmented Manifest.

By definition, these records could contain anything (whatever data is in the S3 files that the ref attributes point to), and often JSON (for the non-ref attributes). Some SMGT workflows generate quite rich JSON annotations, including extra metadata.

Of course for TF there's PipeModeDataset and tf.io.decode_image(), but I haven't found a nice, performant way to extract the bits I need from my other field (the complex JSON label that comes out of PipeModeDataset as a string Tensor) into a numeric Tensor... I'm currently using a tf.py_func!
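For the record, this is roughly what that workaround looks like; a sketch written in TF 2.x terms with tf.py_function, assuming the Ground Truth bounding-box output layout (an 'annotations' list with class_id/left/top/width/height per object):

import json
import tensorflow as tf

def parse_label(raw_json):
    # Runs eagerly inside tf.py_function, so normal Python JSON parsing works.
    ann = json.loads(raw_json.numpy())['annotations']
    return tf.constant(
        [[a['class_id'], a['left'], a['top'], a['width'], a['height']]
         for a in ann],
        dtype=tf.float32)

def label_to_boxes(label_str):
    boxes = tf.py_function(parse_label, [label_str], tf.float32)
    boxes.set_shape([None, 5])  # X-by-5 matrix of class + box coordinates
    return boxes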

Even this "standard" SageMaker object detection use case is pretty complex:

  • Process the records in pairs
  • (Either peek the records to see which is which, or just assert the first one is the image)
  • Parse the raw image file data from record A into a pixel tensor (maybe also stretch/letterbox/etc. it into the right shape?)
  • Parse the JSON object contents of record B, and map the list of objects on the annotations key to an X-by-5 matrix of bounding box coordinates/classes (maybe also normalizing by image height/width?)

...So although it would be nice to have high-level data reader classes for standard SageMaker Ground Truth workflows, I think it'd get complicated fast with the possible permutations.

For example: the other SageMaker Ground Truth job types all have different annotation formats; some use cases might use 3 or more manifest fields instead of 2 (if they need to access the metadata SMGT pulls out into a separate field); and some might use multiple file refs per "sample" (e.g. an image segmentation mask saved as a single-channel image).

I guess given this use case it might be nice if ML-IO record readers could batch the records for us? And maybe be able to apply data reader parsers to every Nth record?
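For reference, the manual pairing described above might look something like this using the record reader from earlier in the thread; a rough sketch that assumes a strict image-then-label record order and the Ground Truth bounding-box 'annotations' layout:

import io
import json

import mlio
import numpy as np
from PIL import Image

pipe = mlio.SageMakerPipe('/opt/ml/train')
strm = pipe.open_read()
rec_reader = mlio.RecordIORecordReader(strm)

records = iter(rec_reader)
for image_rec in records:
    label_rec = next(records)  # assumes records strictly alternate image/label

    # Record A: decode the raw JPEG/PNG bytes into a pixel array.
    image = np.asarray(Image.open(io.BytesIO(bytes(image_rec.payload))))

    # Record B: map the JSON annotations to an X-by-5 matrix.
    ann = json.loads(bytes(label_rec.payload))['annotations']
    boxes = np.array(
        [[a['class_id'], a['left'], a['top'], a['width'], a['height']]
         for a in ann],
        dtype=np.float32)
    # ... hand (image, boxes) to the training framework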

Hi @athewsey, did your use case work with mlio? We are interested in the usefulness of Pipe Mode for a PyTorch model as well.