jatinchowdhury18/RTNeural

Layers support & Input Data issues

vackva opened this issue · 10 comments

First of all, thanks for providing RTNeural. It is a really elegant way to get a model running in the audio C++ world!

I got two questions:

  1. Is there any chance of seeing Conv1D Transpose and Transformer layer support in the future? It's quite a complex architecture...

  2. The other one is probably a beginner question. I have a model with an input size of 64 samples. Is there a way to put in 64 samples at a time and get 64 samples back? The RTNeural examples and NeuralPi all use an input size of 1.

Hello! Glad you're enjoying the library.

For question 1, having more advanced layers would definitely be cool. I don't know much about the two layer types you mentioned here, except as high-level concepts. Would it be possible to share some articles about them, or (even better) some PyTorch or TensorFlow code that uses them in a neural net (preferably audio-related)?

For question 2, you should be able to do that sort of thing like this:

float testInput[64] = { 0.0f, 1.0f, 2.0f, ... };  // create the input data
model->forward (testInput);                       // pass a pointer to the input data to the `forward()` method
float* testOutput = model->getOutputs();          // get a pointer to the output data
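If the model's output size is also 64, you can copy the outputs back out the same way. Here's a rough sketch (the surrounding function and the `destBuffer` argument are just hypothetical, for illustration):

// rough sketch: run 64 samples through the model and copy the 64 outputs back out
void process64 (RTNeural::Model<float>& model, const float* input, float* destBuffer)
{
    model.forward (input);                       // input points to 64 samples
    const float* modelOut = model.getOutputs();  // pointer to the model's outputs
    for (int i = 0; i < 64; ++i)
        destBuffer[i] = modelOut[i];
}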

Hey, thanks for your quick & helpful response! Your work is really impressive :)

Regarding 1: I will go through some papers I've read and send them to you in the coming days!

About question 2: I followed your example code, but I get strange results (some float values > 700), and the output is completely different from the Python model's.

For test purposes, I am loading the model at run time.

I declared the model like this in my PluginProcessor.h:

std::unique_ptr<RTNeural::Model<float>> neuralNet[1];

I used this to load the JSON file:

void PluginProcessor::loadModel(std::string pathToJsonFile)
{
    this->suspendProcessing(true);

    std::ifstream jsonStream (pathToJsonFile, std::ifstream::binary);

    auto jsonInput = nlohmann::json::parse (jsonStream);

    neuralNet[0] = RTNeural::json_parser::parseJson<float> (jsonInput, true);

    neuralNet[0]->reset();

    this->suspendProcessing(false);
}

And this is how I am feeding the data into the model (I am using a buffer size of 64 in my DAW):

void PluginProcessor::processBlock (....)
{

     float testInput[64] = { }; 

     //filling in the audio data to testInput...
     ....

     neuralNet[0]->forward (testInput);
     const float* testOutput = neuralNet[0]->getOutputs();

     //writing the model output to audio buffer...
}

Is there something I'm doing wrong? Does it make a difference (other than performance) if I declare the model architecture at compile time?

Hmm, the code in your post seems like it should work... Would it be possible to provide an example JSON model file, as well as an example input and output? That would be useful for debugging.

The run-time and compile-time models should give identical results; the only difference besides performance is in how the models are loaded. For the run-time models, the model architecture is determined from the contents of the JSON file. With a compile-time model, the user defines the model architecture in the code, and if it does not match the model size defined in the JSON file, then the output will be incorrect. In both cases, any errors in the model loading process should be printed to the console. Would it also be possible to share the console output coming from the call to RTNeural::json_parser::parseJson<float> (jsonInput, true)?
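For reference, a compile-time model looks roughly like this. This is just a minimal sketch with a toy Dense network, not the architecture from your plugin, and you should double-check the exact parseJson overloads against the RTNeural headers; the important point is that the template arguments have to match the layer sizes in the JSON file:

#include <RTNeural/RTNeural.h>
#include <fstream>
#include <string>

// minimal sketch: a toy Dense -> Tanh -> Dense network defined at compile time
// (these layer types and sizes are just an example, not the model from this issue)
RTNeural::ModelT<float, 1, 1,
    RTNeural::DenseT<float, 1, 8>,
    RTNeural::TanhActivationT<float, 8>,
    RTNeural::DenseT<float, 8, 1>> modelT;

void loadCompileTimeModel (const std::string& pathToJsonFile)
{
    std::ifstream jsonStream (pathToJsonFile, std::ifstream::binary);
    auto jsonInput = nlohmann::json::parse (jsonStream);
    modelT.parseJson (jsonInput, true); // the JSON layer sizes must match the template arguments
    modelT.reset();
}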

One other thing to check is that the input size of the model is the same as the size of the input array, and that the input array is initialized either with values or zeros. If the input array is too small, or if some values are uninitialized, then the model output will be unpredictable. I know it seems simple, but I figured I'd mention it since I've run into this problem before.
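As a quick sanity check, something like this could catch a size mismatch early. Just a sketch; I believe the run-time Model exposes getInSize(), but treat that as an assumption and verify it against your version of the headers:

// sketch: zero-initialize the input and check that it matches the model's input size
std::array<float, 64> testInput {}; // value-initialized to all zeros
jassert (neuralNet[0]->getInSize() == (int) testInput.size()); // assumes Model::getInSize() is available
neuralNet[0]->forward (testInput.data());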

Thanks again for your help! I really appreciate you taking the time to help me with this problem.

I just invited you to a private Git repository where I uploaded the JSON file, example audio input/output files (generated by the Python model), and an example input consisting of an array of ones. Note that this model now uses an input size of 144 (of course, I updated the JUCE plugin to this input size).

Note: don't be confused by the speaker in the audio files; I just had them lying around and used them as training data ;)

The RTNeural debug printout while loading the model looks like this (it should be fine):

# dimensions: 1
Layer: conv1d
  Dims: 18
Layer: conv1d
  Dims: 18
Layer: conv1d
  Dims: 36
Layer: conv1d
  Dims: 36
Layer: lstm
  Dims: 32
Layer: dense
  Dims: 1

I know that the model is quite large; I will definitely reduce the layer sizes at some point, but it's just a first prototype.

Regarding your last hint about passing in the right input size, that should be correct:

void PluginProcessor::processBlock (....)
{
        //monoBuffer.numofSamples --> 144 
        float testInput[144] = { };

        for (int n = 0; n < 144; ++n)
        {
            testInput[n] = monoBuffer[n];
        }
}

Ah, the debug output from loading the model gives some idea of what's going on. The # dimensions line signifies that the model is expecting an input block of 1 sample rather than 144. Similarly, the final dense layer also has 1 dimension, so the model is only supplying one sample of output.

Exporting convolutional layers can be a little tricky, since the way TensorFlow/PyTorch handle the layer dimensions can be a little funky. The RTNeural unit tests use the convolutional network defined in this script, which results in this model file. Hopefully that example will help us get a better idea of where the dimension mismatch is coming from.

Thanks for that hint! From what I've experienced, it isn't practical to change the last index of the data tensor to 144. Instead, our data tensor has the shape (batch_size = 4096, timesteps = 144, features = 1). Do you have any suggestions on how to proceed from here?

I uploaded the Python training script to the private GitHub repository as model.py. But here is also an overview of the model:

model = Sequential()
model.add(keras.layers.InputLayer(batch_input_shape=(4096, 144, 1)))
model.add(Conv1D(18, 12, strides=1, activation=None, padding='same', name='Convolution1'))
model.add(Conv1D(18, 12, strides=1, activation=None, padding='same', name='Convolution2'))
model.add(Conv1D(36, 12, strides=2, activation=None, padding='same', name='Convolution3'))
model.add(Conv1D(36, 24, strides=2, activation=None, padding='same', name='Convolution4'))
model.add(LSTM(32, name='LSTM_Layer', stateful=True))
model.add(Dense(1, activation=None, name='Dense_Layer'))
model.compile(optimizer=Adam(learning_rate=learning_rate), loss='mse', metrics=[error_to_signal])
model.summary()

Ah, that's interesting. The batch size shouldn't matter, since I'm assuming you only want to run one batch at a time when doing inference. The "timesteps" dimension is a little more interesting... I'm guessing those are consecutive samples in time? It looks like when you run the model in TensorFlow, the network output is an array of 144 values.

If I'm understanding everything correctly, then I think the correct implementation with RTNeural would be something like this:

void inference(float* output, const float* input) // both input and output are arrays of length 144
{
    for(int i = 0; i < 144; ++i)
    {
        output[i] = model.forward(&input[i]);
    }
}
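In a JUCE plugin, that would look roughly like this (a sketch only, assuming a single mono channel and that the host block size matches what the network was trained on):

void PluginProcessor::processBlock (juce::AudioBuffer<float>& buffer, juce::MidiBuffer&)
{
    auto* data = buffer.getWritePointer (0); // mono channel
    for (int i = 0; i < buffer.getNumSamples(); ++i)
        data[i] = neuralNet[0]->forward (&data[i]); // one sample in, one sample out, in place
}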

Shouldn't the inference method you proposed do sample-by-sample forward propagation? I already tried it this way, but I'm still not getting the expected result (rather, the output sounds quite similar to the input).

About the timestep dimension, to answer your question: yes, the timesteps consist of consecutive samples. Is this the kind of data structure RTNeural supports? If not, should we structure our data another way? How would you structure and preprocess audio data to meet RTNeural's requirements?

In the example you provided, the layers have smaller dimensions than ours. For instance, we have outputs with shape (4098, 8, 18), whereas the layers in the example have shapes like (None, None, 8).

Furthermore, another thing to note is that the LSTM layer outputs a different shape, a scenario I didn't come across in the examples.

Do you think this is causing the problem in any sort of way?

Hmm yeah, I'm a little bit stumped by this one. I don't think I've tried training a network before where the input goes directly into a convolutional layer, without going through something else first. (Frankly I'm using recurrent networks much more than convolutional these days anyway.)

The fact that the LSTM layer has a 2D output shape rather than a 3D one is a little bit off-putting. I wonder if there's some internal "flattening" happening there.

One other thing I noticed is that the Conv1D layer in your script uses the argument padding='same', whereas the layer in the example script uses padding='causal'. I'd have to review the RTNeural implementation, but I believe I intentionally set up the layer to expect causal padding, given the expectation that it would be used in a real-time context. If that is the case, it's definitely my fault for not documenting it properly. I'll take a look and see if I can confirm whether that is the problem.
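To illustrate why causal padding maps more naturally onto real-time processing, here's a rough sketch of a single-channel causal 1D convolution. This is just an illustration, not RTNeural's actual implementation: a causal layer only looks at the current and past samples, so it can be computed with a small history buffer, whereas 'same' padding centers the kernel and effectively looks ahead in time.

#include <array>

// illustrative sketch of a single-channel causal 1D convolution
template <int KernelSize>
struct CausalConv1DSketch
{
    std::array<float, KernelSize> history {}; // past inputs, zero-initialized
    std::array<float, KernelSize> weights {}; // kernel weights (newest sample first)
    float bias = 0.0f;

    float process (float x)
    {
        // shift the history and push in the newest sample
        for (int i = KernelSize - 1; i > 0; --i)
            history[(size_t) i] = history[(size_t) (i - 1)];
        history[0] = x;

        // weighted sum over the current sample and the previous KernelSize - 1 samples
        float y = bias;
        for (int i = 0; i < KernelSize; ++i)
            y += weights[(size_t) i] * history[(size_t) i];
        return y;
    }
};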