marl/openl3

Implement image embedding API

auroracramer opened this issue · 26 comments

Add the image embedding API to the library. This should be fairly similar to the existing audio API. I'll add a candidate interface once I've given it more thought.

Perhaps it’s time to revisit this (post-ICASSP)?

Yes.

Here are my thoughts for how we could address this without significant change to the API. Basically, for all of the functions we'd add a modality kwarg.

openl3.core.get_embedding(audio, sr, modality='audio', input_repr='mel256', content_type='music', embedding_size=6144, center=True, hop_size=0.1, verbose=1)
openl3.core.process_file(filepath, output_dir=None, suffix=None, modality='audio', input_repr='mel256', content_type='music', embedding_size=6144, center=True, hop_size=0.1, verbose=True)
openl3.models.get_embedding_model(input_repr, content_type, embedding_size, modality='audio')
openl3.models.get_embedding_model_path(input_repr, content_type, modality='audio')

This makes it backwards compatible while allowing image embeddings to be extracted. My fear is that adding *_image_* versions of functions would crowd the API and also make the naming convention inconsistent, since the audio functions wouldn't be named *_audio_*. We'd have to specify that parameters like hop size are ignored for images. And we'd have to be clear that input_repr (for the spectrogram) still affects the image embedding, since it determines which audio-visual model the image subnetwork comes from. Additionally, we'd have to support an embedding size of 8192 for images (that's what's used in the original L3 paper). We'll have to add checks in various places to make sure that embedding sizes > 512 correspond to the correct modality (i.e. 8192 -> image, 6144 -> audio).
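
For illustration, a minimal sketch of the kind of check this would imply (helper name hypothetical):

def _validate_embedding_size(modality, embedding_size):
    # Sizes above 512 are modality-specific: 6144 is audio-only,
    # 8192 is image-only (per the original L3 paper); 512 is shared.
    valid_sizes = {'audio': (512, 6144), 'image': (512, 8192)}
    if modality not in valid_sizes:
        raise ValueError('Invalid modality: {}'.format(modality))
    if embedding_size not in valid_sizes[modality]:
        raise ValueError('Invalid embedding size {} for modality {}'.format(
            embedding_size, modality))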

Thoughts?

I agree this would be the most backwards compatible option, but I'm not convinced it's better than breaking the current API in favor of transitioning to separate functions for audio/image. My concern with maintaining the current API is that it can create a lot of ambiguity as to which embedding is being extracted, and having shared parameters but with different allowed values only makes things worse.

The reality is that many users don't read the docs and just explore the API via trial/error, and my intuition is that having separate get_audio_embedding and get_image_embedding functions (etc.) will lead to an API that's easier to understand, safer in terms of avoiding unexpected behavior, and cleaner to e.g. document and unit test.

The major downside to this is that we break backwards compatibility, but that's what version control is for :) We can just make a new release (we can even bump to 1.0.0 if we want to strictly follow the rules of semantic versioning), and there shouldn't be too much confusion. Our user base is still relatively small, so now's a good time to make breaking changes. I'm not too worried about the API being cluttered, it's 8 function calls instead of 4 which is still tiny, and if every pair of functions is distinguished by having audio or image in the filename I think it's pretty self-explanatory.

...unless we want to have built-in support for processing video directly (and returning both the audio and image embeddings from a video file). This could be supported e.g. by a 3rd family of function calls for video, e.g. get_video_embedding, which takes in a video file and returns both audio and image embeddings. My inclination is to leave that outside the scope of OpenL3, at least for now.

Thoughts?

You raise a good point. Maybe backwards compatibility isn't a very important consideration at this point. Sure, let's refactor under a new v1.0.0 release with the separate *_audio_* and *_video_* functions.

I guess with a new v1.0.0 release in mind, we should consider any other possible API changes we might want to include. Maybe we should include the more flexible pooling size interface in @bensteers's PR. I've also received a few questions about including the whole AVC model, though I'm not convinced that's something we should worry about right now.

Regarding video, this would be a nice feature to have I think. If we're going to have a 1.0.0 release I think this would be a good thing to include. We can reuse some of the l3embedding code using skvideo for this.

OK, sounds like we're in agreement about revamping the API for a 1.0.0 release.

I think we should spend a moment on nomenclature, especially if we want to support audio, video, and joint audio-visual processing.

For audio-specific functions, I think it's a no-brainer, and _audio_ in the name is a good option:

  • get_audio_embedding
  • process_audio_file
  • get_audio_embedding_model (should we consider renaming to load_audio_embedding_model?)

It becomes trickier for the visual frame embeddings: we could use either _image_ or _video_. On the one hand, I think _video_ would be preferable because the literature shows that image embeddings (trained on images) don't generalize well to video frames. On the other hand, it peeves me that the vision community uses the term "video" to mean "sequence of visual frames", ignoring the fact that there's (usually) an audio track as well. But maybe it's a lost battle...?

So if we ignore my beef, we could use _audio_ and _video_. If we want to make a point, we could use _audio_ and _visual_.

The final question is how to name methods that extract both from a video file. If we use _visual_, we could have _audio_, _visual_ and _video_ (the latter being for joint processing). If, on the other hand, we use _audio_ and _video_ (for audio and frames), then we need a term for the joint processing... for example, _audiovideo_, _audiovisual_, or something along these lines.

Thoughts?

I'm on board with changing get_audio_embedding_model to load_audio_embedding_model.

I'm more on board with using _image_ since it's an image embedding, not a visual sequence (or "video" minus the audio) embedding. And that allows us to reserve _video_ for audio/visual. I guess a question that comes to mind is: do we only need a process_video method? For processing videos loaded in memory, it may be sufficient to have the user just make a call to get_audio_embedding and get_image_embedding.

Thoughts?

I'm on board with changing get_audio_embedding_model to load_audio_embedding_model.

OK cool, I think it'll be clearer and also cleaner to have only one method start with get, especially once we have three copies of each method.

I'm more on board with using image since it's an image embedding, not a visual sequence (or "video" minus the audio) embedding. And that allows us to reserve video for audio/visual.

So, in principle I agree, but... there's a difference between an image embedding (learned on images, e.g. ImageNet) and a video-frame embedding (learned from video frames). The literature suggests these aren't the same and don't necessarily generalize well from one to the other. So the problem with using _image_ is that it could imply that this is an image embedding, when it's in fact a video-frame embedding. I'm not sure what a good solution to this issue is... I'll ask around here and see if people have any thoughts.

I guess a question that comes to mind is: do we only need a process_video method? For processing videos loaded in memory, it may be sufficient to have the user just make a call to get_audio_embedding and get_image_embedding.

For API consistency I'd be in favor of implementing all 4 methods for video, even if some of them just end up being wrappers around get_audio and get_image under the hood. I think it's cleaner and more consistent that way (and probably more streamlined for those interested in extracting both embeddings).

Okay, one more question. For processing videos, do we want an option to have a different hop size for the audio than the "hop size" of the images (effectively the framerate)?

Hmm, without actually asking people, a priori my inclination is not to support different hop sizes for processing videos. I.e., if you're processing a video, you'll get time-aligned audio/video embeddings controlled via a single hop size parameter. If you want separate hop sizes, then it makes sense to require the user to use the separate get_audio and get_image (<--- or whatever we end up calling it).

Also, if we're doing a major revision, would it be a good moment to consider #22?

I think we can still forgo an OOP interface until we get more feedback from users.

I think we can still forgo an OOP interface until we get more feedback from users.

Sure, sounds reasonable to me.

I've asked around about naming. People responded that get_image_embedding would be OK, though they expressed preference for get_visual_frame_embedding or get_vframe_embedding (the term "vframes" is used in ffmpeg for video frames).

Another comment was that we might want to avoid get_video_embedding altogether, because the embedding is only spatial (i.e. it processes a single frame), rather than spatio-temporal (processing a tensor of video frames), which is what they'd expect from a video embedding.

Something else to consider is a batch processing mode. i.e. making more efficient use of the GPU by predicting multiple files at once. I'm putting together something like this for BirdVoxClassify, so I can adapt it for openl3.
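
As a rough sketch of the idea (names here are hypothetical, not from BirdVoxClassify or any existing code; `filepaths` and a loaded Keras `model` are assumed to be in scope):

import numpy as np

# Stack the analysis windows from several files into one array so the GPU
# sees a single large batch, then split the predictions back out per file.
window_lists = [compute_windows(f) for f in filepaths]  # hypothetical helper
batch = np.vstack(window_lists)
embeddings = model.predict(batch, batch_size=32)
splits = np.cumsum([len(w) for w in window_lists])[:-1]
per_file_embeddings = np.split(embeddings, splits)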

Something else to consider is a batch processing mode.

That seems like a different issue though, right? Worth including for a major revision, but perhaps we should open a new issue for this.

Deferred to #26

Proposed updated API:

def get_audio_embedding(audio, sr, model=None, input_repr="mel256",
                        content_type="music", embedding_size=6144,
                        center=True, hop_size=0.1, verbose=1):
    """
    Computes and returns L3 embedding for given audio data

    Parameters
    ----------
    audio : np.ndarray [shape=(N,) or (N,C)]
        1D or 2D numpy array of audio data; if 2D, the second axis is channels.
    sr : int
        Sampling rate. If not 48kHz, the audio will be resampled.
    model : keras.models.Model or None
        Loaded model object. If a model is provided, then `input_repr`,
        `content_type`, and `embedding_size` will be ignored.
        If None is provided, the model will be loaded using
        the provided values of `input_repr`, `content_type` and
        `embedding_size`.
    input_repr : "linear", "mel128", or "mel256"
        Spectrogram representation used for model. Ignored if `model` is
        a valid Keras model.
    content_type : "music" or "env"
        Type of content used to train embedding. Ignored if `model` is
        a valid Keras model.
    embedding_size : 6144 or 512
        Embedding dimensionality. Ignored if `model` is a valid
        Keras model.
    center : boolean
        If True, pads beginning of signal so timestamps correspond
        to center of window.
    hop_size : float
        Hop size in seconds.
    verbose : 0 or 1
        Keras verbosity.

    Returns
    -------
        embedding : np.ndarray [shape=(T, D)]
            Array of embeddings for each window.
        timestamps : np.ndarray [shape=(T,)]
            Array of timestamps corresponding to each embedding in the output.

    """

def process_audio_file(filepath, output_dir=None, suffix=None, model=None,
                       input_repr="mel256", content_type="music",
                       embedding_size=6144, center=True, hop_size=0.1, verbose=1):
    """
    Computes and saves L3 embedding for given audio file

    Parameters
    ----------
    filepath : str
        Path to WAV file to be processed.
    output_dir : str or None
        Path to directory for saving output files. If None, output files will
        be saved to the directory containing the input file.
    suffix : str or None
        String to be appended to the output filename, i.e. <base filename>_<suffix>.npz.
        If None, then no suffix will be added, i.e. <base filename>.npz.
    model : keras.models.Model or None
        Loaded model object. If a model is provided, then `input_repr`,
        `content_type`, and `embedding_size` will be ignored.
        If None is provided, the model will be loaded using
        the provided values of `input_repr`, `content_type` and
        `embedding_size`.
    input_repr : "linear", "mel128", or "mel256"
        Spectrogram representation used for model. Ignored if `model` is
        a valid Keras model.
    content_type : "music" or "env"
        Type of content used to train embedding. Ignored if `model` is
        a valid Keras model.
    embedding_size : 6144 or 512
        Embedding dimensionality. Ignored if `model` is a valid
        Keras model.
    center : boolean
        If True, pads beginning of signal so timestamps correspond
        to center of window.
    hop_size : float
        Hop size in seconds.
    verbose : 0 or 1
        Keras verbosity.

    Returns
    -------

    """

def get_vframe_embedding(image_arr, frame_rate=None, model=None,
                         input_repr="mel256", content_type="music",
                         embedding_size=8192, verbose=1):
    """
    Computes and returns L3 embedding for given video frame (image) data

    Parameters
    ----------
    image_arr : np.ndarray [shape=(H, W, C) or (N, H, W, C)]
        3D or 4D numpy array of image data. If the images are not 224x224,
        they are resized so that the smallest side is 256 and the center
        224x224 patch is then extracted.
    frame_rate : int or None
        Video frame rate (if applicable), which if provided results in
        a timestamp array being returned. If None, no timestamp array is
        returned.
    model : keras.models.Model or None
        Loaded model object. If a model is provided, then `input_repr`,
        `content_type`, and `embedding_size` will be ignored.
        If None is provided, the model will be loaded using
        the provided values of `input_repr`, `content_type` and
        `embedding_size`.
    input_repr : "linear", "mel128", or "mel256"
        Spectrogram representation used to train the audio subnetwork of the
        audio-visual correspondence model. Ignored if `model` is a valid
        Keras model.
    content_type : "music" or "env"
        Type of content used to train embedding. Ignored if `model` is
        a valid Keras model.
    embedding_size : 8192 or 512
        Embedding dimensionality. Ignored if `model` is a valid
        Keras model.
    verbose : 0 or 1
        Keras verbosity.

    Returns
    -------
        embedding : np.ndarray [shape=(N, D)]
            Array of embeddings for each frame.
        timestamps : np.ndarray [shape=(N,)]
            Array of timestamps for each frame. If `frame_rate` is None,
            this is not returned.
    """

def process_vframe_file(filepath, output_dir=None, suffix=None, model=None,
                        input_repr="mel256", content_type="music",
                        embedding_size=8192, verbose=1):
    """
    Computes and saves L3 embedding for given image file

    Parameters
    ----------
    filepath : str
        Path to image file to be processed.
    output_dir : str or None
        Path to directory for saving output files. If None, output files will
        be saved to the directory containing the input file.
    suffix : str or None
        String to be appended to the output filename, i.e. <base filename>_<suffix>.npz.
        If None, then no suffix will be added, i.e. <base filename>.npz.
    model : keras.models.Model or None
        Loaded model object. If a model is provided, then `input_repr`,
        `content_type`, and `embedding_size` will be ignored.
        If None is provided, the model will be loaded using
        the provided values of `input_repr`, `content_type` and
        `embedding_size`.
    input_repr : "linear", "mel128", or "mel256"
        Spectrogram representation used for model. Ignored if `model` is
        a valid Keras model.
    content_type : "music" or "env"
        Type of content used to train embedding. Ignored if `model` is
        a valid Keras model.
    embedding_size : 8192 or 512
        Embedding dimensionality. Ignored if `model` is a valid
        Keras model.
    verbose : 0 or 1
        Keras verbosity.

    Returns
    -------

    """

def process_video_file(filepath, output_dir=None, suffix=None, model=None,
                       input_repr="mel256", content_type="music",
                       audio_embedding_size=6144, audio_center=True,
                       audio_hop_size=0.1, vframe_embedding_size=8192,
                       verbose=1):
    """
    Computes and saves L3 audio and video frame embeddings for given video file

    Parameters
    ----------
    filepath : str
        Path to video file to be processed.
    output_dir : str or None
        Path to directory for saving output files. If None, output files will
        be saved to the directory containing the input file.
    suffix : str or None
        String to be appended to the output filename,
        i.e. <base filename>_<modality>_<suffix>.npz.
        If None, then no suffix will be added,
        i.e. <base filename>_<modality>.npz.
    model : keras.models.Model or None
        Loaded model object. If a model is provided, then `input_repr`,
        `content_type`, `audio_embedding_size`, and `vframe_embedding_size`
        will be ignored. If None is provided, the model will be loaded
        using the provided values of `input_repr`, `content_type`, and the
        embedding sizes.
    input_repr : "linear", "mel128", or "mel256"
        Spectrogram representation used for audio model. Ignored if `model` is
        a valid Keras model.
    content_type : "music" or "env"
        Type of content used to train embedding. Ignored if `model` is
        a valid Keras model.
    audio_embedding_size : 6144 or 512
        Audio embedding dimensionality. Ignored if `model` is a valid Keras model.
    audio_center : boolean
        If True, pads beginning of audio signal so timestamps correspond
        to center of window.
    audio_hop_size : float
        Hop size in seconds.
    vframe_embedding_size : 8192 or 512
        Video frame embedding dimensionality. Ignored if `model` is a valid Keras model.
    verbose : 0 or 1
        Keras verbosity.

    Returns
    -------

    """

Thoughts?

Generally I think this looks great. A couple of comments:

  1. You propose get_vframe_embedding and process_vframe_file, which I imagine is in order to be consistent with get_audio_embedding and process_audio_file. On the one hand, I think it makes sense to be consistent. On the other hand, what's a "vframe file"? Basically it's just an image file. Now, I realize it was I who suggested the term "vframe" rather than "image" to distinguish between video frames and images. But do you think process_vframe_file could be confusing? The two alternatives would be to have diverging nomenclature (get_vframe_embedding + process_image_file), or to revert to image for both (get_image_embedding + process_image_file), but that takes us back to the term "image" for vframes. Maybe I'm overthinking it and people will understand process_vframe_file without too much trouble. If we do stick to this nomenclature, though, I would be over-explicit in the docstring (à la "Computes and saves L3 embedding for given image file in a standard image format such as JPG, PNG, etc.").

  2. We have process_video_file but no get_video_embedding. Do we not expect that to also be a useful function call?

1. You propose `get_vframe_embedding` and `process_vframe_file`, which I imagine is in order to be consistent with `get_audio_embedding` and `process_audio_file`. On the one hand, I think it makes sense to be consistent. On the other hand, what's a "vframe file"? Basically it's just an image file. Now, I realize it was I who suggested the term "vframe" rather than "image" to distinguish between video frames and images. But do you think `process_vframe_file` could be confusing? The two alternatives would be to have diverging nomenclature (`get_vframe_embedding` + `process_image_file`), or to revert to image for both (`get_image_embedding` + `process_image_file`), but that takes us back to the term "image" for vframes. Maybe I'm overthinking it and people will understand `process_vframe_file` without too much trouble. If we do stick to this nomenclature, though, I would be over-explicit in the docstring (à la "Computes and saves L3 embedding for given image file in a standard image format such as JPG, PNG, etc.").

Yeah, the issue of diverging nomenclature was why I opted for vframe for both. However, after reviewing the downstream evaluation of the image embeddings in the L3 paper, I don't think "image" is that much of an issue, considering the downstream classification was done on ImageNet. But if calling it "image" would upset a lot of people, then maybe diverging nomenclature wouldn't be so bad.

2. We have `process_video_file` but no `get_video_embedding`. Do we not expect that to also be a useful function call?

What would we expect get_video_embedding to return? We don't have proper video embeddings. If you mean just getting the embeddings for the audio and video frames in a loaded video, I don't think having two separate function calls for the audio and video frames is a big deal.

Re: 1, I'm not a big fan of process_vframe_file because there's no such thing. So I'd either diverge or stick to image for both. Which do you prefer?

Re: 2, ok, fair enough.

I'd prefer image for both.

Regarding videos, I've been thinking about whether we should use opencv, which is maybe overkill but is the most popular library for video processing. However, in order to load audio from the videos we'll probably need to use ffmpeg, which is already used as a frontend for skvideo (which we used in the l3embedding experiments). Though skvideo isn't as widely used as opencv, it would allow us to have fewer dependencies, since we need ffmpeg anyway.

moviepy also looks like an option we could use for loading both video frames and audio, without requiring users to install ffmpeg themselves.
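
For example, a minimal sketch with moviepy (to the best of my understanding of its API, worth double-checking):

from moviepy.editor import VideoFileClip

clip = VideoFileClip('example.mp4')
audio = clip.audio.to_soundarray(fps=48000)  # stereo float array at 48 kHz
frames = list(clip.iter_frames(fps=1))       # sample one frame per second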

I'd prefer image for both.

OK, let's do that then.

Regarding video loading, I think the lighter we can keep the repo, the better. Whatever our dependencies are, people will install them (or rather, pip will), so I don't think it matters too much whether a dependency is super popular. What matters most is that it's (1) light and (2) well supported (with the expectation that it'll continue to be supported long term).

FYI: being addressed in #29

Closed by #37.