githubharald/CTCWordBeamSearch

optimizations for ARM

iit2014128 opened this issue · 22 comments

Has anyone tried this algorithm on an ARM architecture? It takes very long (around 7 secs) for an input of dimension 700*80 with beam width 100 on an ARM processor, which is around 5 times slower than on x86 (1.4 secs) with the same hyperparameters. Are there any optimizations we can apply to bring the execution time on ARM down to at least the x86 level?

Which mode do you use? I would suggest only using "Words" or "NGrams" mode, as they are much faster than the forecast modes while still achieving good accuracy. Then, limit the beam width (see README): at some point, increasing it only yields a small accuracy improvement while slowing down the algorithm quite a lot. Something around 30 should give a reasonable trade-off.
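A minimal sketch of how one could measure that runtime trade-off directly on the ARM device, assuming a decode call is already wired up. decodeSample below is a hypothetical stand-in, not the repo's actual API; only the timing loop is the point.

#include <chrono>
#include <iostream>
#include <vector>

// Hypothetical stand-in for the actual decoding call of your integration;
// the real function and its signature may differ.
std::vector<int> decodeSample(const std::vector<std::vector<float>>& logits, int beamWidth)
{
    (void)logits;
    (void)beamWidth;
    return {}; // replace with the real word beam search call
}

int main()
{
    // 700 time steps x 80 characters, as in the issue; fill with real logits in practice.
    std::vector<std::vector<float>> logits(700, std::vector<float>(80, 1.0f / 80.0f));

    // Time one decode per candidate beam width on the target device,
    // then check CER on a validation set for the widths that are fast enough.
    for (int beamWidth : {10, 20, 30, 50, 100})
    {
        const auto t0 = std::chrono::steady_clock::now();
        decodeSample(logits, beamWidth);
        const auto t1 = std::chrono::steady_clock::now();
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::cout << "beam width " << beamWidth << ": " << ms << " ms\n";
    }
    return 0;
}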

We are using Words mode.

and beam width?

100

try to go down to 30.

and about which hardware are we talking? there is a wide range of ARM processors. can you give more details?

For beam width 30 it runs in 2 secs, but the CER gets worse by 5%.

  1. did you compile with parallel mode? https://github.com/githubharald/CTCWordBeamSearch#1-compile
    if you have a batch with multiple elements, this might also improve the runtime.
  2. please see my last question about hardware

yes trying to get more info about hardware, will share in a minute

ARMv7 processor rev 0 (v7l)

No, we didn't compile in parallel mode. We are not using TensorFlow.

How can we use parallel mode with the C++ test program?

parallel mode is only implemented for TF, and it mainly makes sense when a batch is processed. Do you use batches, or do you process single elements (e.g., just one input image at a time)?

No, we are not using batches; we are directly feeding logits. (The input to the model is an array of (x, y) pen coordinates; we are not using images.) Any suggestions to optimize for the single-element case?

Start with the simple things:

  1. search for a good beam width, which is both fast and accurate.
  2. make sure to use a fast way to pass the data into the C++ program. CSV files are used in the test program, but they are not a good idea if runtime matters (see the sketch after this list).
  3. Ideally, plug the C++ code into the main program, as I did for TF. Then, the data is directly passed from TF to the C++ program, instead of writing the data to a file.
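To avoid the CSV round trip (points 2 and 3), a rough sketch: keep the logits in the contiguous buffer the model already produced and hand that buffer to the decoder directly. decodeSample and its signature are illustrative only; the repo's real C++ decoder uses its own matrix type.

#include <cstddef>
#include <vector>

// Hypothetical decoder entry point; replace with the repo's actual word beam search call.
std::vector<int> decodeSample(const float* logits, std::size_t timeSteps, std::size_t numChars)
{
    (void)logits; (void)timeSteps; (void)numChars;
    return {};
}

// Instead of writing the logits to a CSV file and parsing it again, pass the
// row-major buffer (timeSteps x numChars) that the model already holds in memory.
std::vector<int> decodeInMemory(const std::vector<float>& logits,
                                std::size_t timeSteps, std::size_t numChars)
{
    return decodeSample(logits.data(), timeSteps, numChars);
}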

If this does not help, then there is no way around profiling the program on the hardware and searching for performance bottlenecks.

We tried profiling the program and found that push_back on a vector<vector<>> (e.g. wordList) is taking the majority of the time. Is there any alternative to this, or any way we can make it faster?

the variable wordList is only used when the language model is created, which should only happen once (during initialization). Are you creating the language model for each sample you decode? Or are you even starting the program for each input file, running it to decode the sample, and then terminating it again?
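In other words, the structure should look roughly like the sketch below: build the language model (the step that fills wordList) exactly once at startup, then only run the search per sample. The types and the decodeSample call are placeholders, not the repo's real names.

#include <memory>
#include <vector>

// Illustrative placeholder for the language model; building it (corpus loading,
// word list, ...) is the expensive step that should run exactly once.
struct LanguageModel
{
    LanguageModel() { /* load corpus, build word list, ... */ }
};

// Hypothetical decode call; replace with the actual word beam search entry point.
std::vector<int> decodeSample(const std::vector<std::vector<float>>& logits,
                              const LanguageModel& lm, int beamWidth)
{
    (void)logits; (void)lm; (void)beamWidth;
    return {};
}

int main()
{
    // build the language model once, at startup ...
    const auto lm = std::make_shared<LanguageModel>();

    // ... then reuse it for every sample, so per-sample work is only the search itself.
    std::vector<std::vector<std::vector<float>>> samples; // logits produced by the pen-stroke model
    for (const auto& logits : samples)
        decodeSample(logits, *lm, 30);
    return 0;
}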

We are creating the language model only once. newBeam->m_wordHist.push_back(newBeam->m_wordDev)
is getting called approx. 160000 times, so this is taking more time, while wordList is getting called around 65000 times (approx.).

you said that you use Words mode. But the code newBeam->m_wordHist.push_back(newBeam->m_wordDev) is not called in Words mode. Please clarify.

Sorry, previously we were using Words mode but have now switched to NGrams mode.

you could try to move m_wordDev instead of copying it - add a std::move and comment out the second line:

newBeam->m_wordHist.push_back(std::move(newBeam->m_wordDev));
//newBeam->m_wordDev.clear();
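The std::move transfers m_wordDev's internal buffer into m_wordHist in constant time instead of copying every element, which is where the push_back cost in your profile comes from; after the move m_wordDev no longer owns the old buffer (in practice it is left empty), which is why the clear() on the next line is commented out rather than kept.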

closing because of inactivity.