JaneliaSciComp/G4_Display_Tools

Requested `display start` time and actual sequence response differ

floesche opened this issue · 25 comments

The command start display takes an argument that defines how long the display runs. In a series of 165 test cases, the measured run time matched the requested time in only about 43% of cases.

Here is how I generated the data:

  1. start a new instance of G4 Main Host.exe
  2. connect via TCP
  3. set root directory
  4. set controller mode to 2
  5. set pattern ID to 1
  6. generate a list of wait times
    1. in most cases randomly between 0 and 65534
    2. at the end of the attached list I used the times "1878 637 1265 1748" repeatedly
  7. send "start display" (0x03 0x21) + wait time
  8. check for the immediate response that confirms the command was accepted
  9. wait for the "Sequence completed in ... ms" response; the elapsed time until this message arrives is the response time.

In the attached file, "test time" is when I recorded the experiment (rows with the same "test time" were run in one block). Mode and pattern ID are the same as above. "responses" is 1 if I saw the "sequence completed" message and 0 if my wait timed out. "runTime" is the requested time, "seqTime" is the actually measured response time, and "timeout" is the time after which my script stops waiting for a response and moves on.

startDisplayTimes.xlsx

The following code shows how the data above was collected. You won't be able to run this code right now, because the updated panelsController is not yet published on GitHub.

What startDisplay does: once the command 0x03 0x21 0xAA 0xBB is sent (with 0xAA and 0xBB being the low and high byte of the runtime duration; see the byte-order confirmation further down in this thread), it waits for 0xLL 0xEE 0x21 ... (with LL being the length of the message). If EE is 00, the command was received successfully. In that case startDisplay waits for a second response from the Main Host, which is another 0xLL 0x00 0x21 and contains the characters Sequence completed in %d ms. %d is replaced by the runtime duration * 100, since the duration is given in deciseconds.
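To make the byte layout concrete, here is a minimal sketch in MATLAB. It is not the panelsController implementation (which is unpublished); the variable names are mine, and the packet layout and the deciseconds-to-milliseconds conversion are taken from the description above:

    % Minimal sketch of the start-display packet described above (not the
    % panelsController implementation). 16-bit values go LO byte first.
    duration = 1878;                          % requested runtime in deciseconds
    loByte   = uint8(mod(duration, 256));     % 0x56
    hiByte   = uint8(floor(duration / 256));  % 0x07
    packet   = [uint8(3), uint8(33), loByte, hiByte];  % 0x03 0x21 LO HI

    % The second response contains "Sequence completed in %d ms", where
    % %d = duration * 100 (deciseconds -> milliseconds):
    expectedMs = duration * 100;              % 1878 -> 187800 ms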

That means that even for the example with index 154 in the 'startDisplayTimes.xlsx' file, where the requested runtime was 1878 and the actual seqTime was 11661 ms, the Main Host returned the message containing Sequence completed in 187800 ms.

Below is the unit test that I used to record the data.

function sendStartDisplayXLSX(testCase)
    testCase.panelsController.setRootDirectory("C:\matlabroot\G4");
    testCase.panelsController.setControlMode(2);
    testCase.panelsController.setPatternID(1);
    rsps = [];
    runTime = [];
    seqTime = [];
    % random durations used for the long-running collections (see below):
    % for i = [randi([0, 2500], 1, 50) randi([0, 65534], 1, 20)]
    for i = [1878 637 1265 1748]   % durations in deciseconds
        disp(i);
        startTime = tic;
        rsp = testCase.panelsController.startDisplay(i);  % blocks until response or timeout
        seqComplete = toc(startTime);                     % measured response time in seconds
        rsps = [rsps; rsp];
        runTime = [runTime; i];
        seqTime = [seqTime; seqComplete];
    end
    T = table(rsps, runTime, seqTime);
    writetable(T, "startDisplayTimes2.xlsx", "Sheet", "2022-03-25")
end

(response via email on 2022-03-30T09:55):

I did a quick test with some random runtimes and the response times are lining up with the runtime. I basically did a start-display command and then waited for the response message before sending the next start-display command. I did run it in mode 1, but I don't think that should make a difference. I will try it in mode 2 just as a sanity check.

[screenshot of the test results]

Thanks for your response. Sorry that you can't replicate the issue right now. I can also run many hundreds of tests where the requested and the actual response times align, which makes the intermittent cases where the error does occur all the more severe.

I was also under the impression that once a Main Host instance returns the correct time for the first few start-displays, the following ones are correct, too. Conversely, if one of the first few start-displays has the wrong timing, the others are also wrong. ("Of course" the attached example from just now seems to be an exception to that rule 🙂)

I added more information in the comment above to explain how the data was collected. The code I shared there won't run on your system, but I explained as well as I can what the code does. Also, the attached screenshots should demonstrate that the problem is real. I took the first screenshot while the test script was still running, right after the second start-display was issued (second screenshot attached). It is not always easy to replicate the issue, and it took me some patience to get these screenshots just now (I had to try about 10 times).

On the left side (highlighted in green), you see the runtime of 1878 that I hand over to the startDisplay method. The Main Host confirms receiving this at 3/30/22 11:06:02.505. Only roughly 11 seconds later, at 3/30/22 11:06:13.007, the Main Host prints the sequence-complete message and also sends it via TCP. For this example the other response times are correct, but the first one is clearly wrong: 1878 deciseconds should take about 188 seconds, not 11.

I realize replicating this issue is probably not easy, so please let me know if you need anything else from me that would make this easier for you.

[screenshot]

(response via email on 2022-03-30T13:55):

You had Deque Timeout and Request FIFO TO errors on the right. Are you always getting this?

I am not sure about "always", but right now I had another incorrectly timed sequence-complete message -- and both lights were on, again.

Running the test again, I just noticed another error condition: the sequence-complete message was sent via TCP, but not displayed on the Main Host GUI -- although the timing appears to be correct:

image

Timing (runTime in deciseconds, seqTime in seconds):

rsps  runTime  seqTime
1     1878      10.7231562
0      637      64.7010958
0     1265     127.5009502
0     1748     175.8042722

In the GUI, in addition to the Deque Timeout and the Request FIFO TO, there is an SPI error with numbers from the int16 range changing all the time.

As you can see from the screenshot in this comment, the lights stay on even when the timing is correct later (for the second to fourth start-display command).

Do those lights mean anything for the operation of the arena or are they purely for debugging?

(response via email 2022-03-30T15:47):

The SPI Error doesn't matter, but getting a Dequeue timeout is not good. Can you send me your pattern and HHMI Panels Configuration.ini file?

Find attached the configuration file and one of the patterns.

To be clear, the goal is not to debug this one system, but to find errors that regularly happen across different setups.

If the Dequeue timeout is relevant for the validity of the data, does it get logged in the TDMS files? I also can't find it in the TCP responses...

config.zip

(response via email 2022-03-30T16:46):

The dequeue timeout indicates that there is no data on the DMA buffer when it is trying to read it so it is definitely an indication that the data being sent to the display is now incorrect. Is the application starving for resources in any way? You could get that error if that application can't keep up with putting new data into the DMA for whatever reason. You would need to restart the application once that error is received. It is not currently logged, but I can add it in the future. Is there any way to run your test application on my computer because I am having a hard time replicating the issue?

Here some quick responses to your comment:

The dequeue timeout indicates that there is no data on the DMA buffer when it is trying to read it so it is definitely an indication that the data being sent to the display is now incorrect

Does "now incorrect" mean that all the following data will be incorrect? Or does the system recover after some messages (my 2nd to 4th start display seemed fine after the initial error)?

Is the application starving for resources in any way?

I am not aware of resource starvation. It's a relatively new computer with our standard Windows installation for lab computers. What kind of resources do you mean?

You could get that error if that application can't keep up with putting new data into the DMA for whatever reason.

What are possible reasons? Slow hard drive, memory boundaries, PCI issues, network issues (firewall...)? Is there some way of testing or benchmarking a system before installing or running the Main Host?

You would need to restart the application once that error is received.

If that's the case, would it be a solution for the Main Host to return that information via TCP, or to stop accepting new commands?

In everyday experiments the Main Host GUI is running somewhere in the background, and I don't think anyone checks whether a light came on at some point.

It is not currently logged, but I can add it in the future.

I think logging any type of error, especially if it is relevant for the validity of the experimental data, would be good.

Is there any way to run your test application on my computer because I am having a hard time replicating the issue?

I can prepare a branch on GitHub with the current state of my code.

(response via email on 2022-03-31T14:17):

The dequeue timeout indicates that there is no data on the DMA buffer when it is trying to read it so it is definitely an indication that the data being sent to the display is now incorrect

Does "now incorrect" mean that all the following data will be incorrect? Or does the system recover after some messages (my 2nd to 4th start display seemed fine after the initial error)?

So that error occurs when the FPGA is trying to read data from the DMA, but there is none available. At this point partial frame data has probably been sent to the arena panels already, so the arena is now in a bad state. The software might recover for the subsequent commands, but the arena is already in a bad state and needs to be reset.

Is the application starving for resources in any way?

I am not aware of resource starvation. It's a relatively new computer with our standard Windows installation for lab computers. What kind of resources do you mean?

We have seen issues with changing PCs in the past and the DMA performance dropping because of it. I don't know if that has anything to do with the issue you are seeing, but you can run the tester linked below to measure the DMA speed and see what benchmarks you are getting. This application can also be CPU intensive, because we configure smaller DMA FIFO sizes so our latency is shorter, but that also requires the software to run faster to keep those buffers filled and ensure no delay between frames.

https://www.dropbox.com/s/ajngpvx5d6ulb5n/Bandwidth%20Tester%20%28LV2016%29.zip?dl=0 (local mirror)

You could get that error if that application can't keep up with putting new data into the DMA for whatever reason.

What are possible reasons? Slow hard drive, memory boundaries, PCI issues, network issues (firewall...)? Is there some way of testing or benchmarking a system before installing or running the Main Host?

See above

You would need to restart the application once that error is received.

If that's the case, would it be a solution that the Main Host returns that information via TCP or doesn't accept new commands?

In everyday experiments the Main Host GUI is running somewhere in the background and I don't think anyone checks if a light came on at some point.

I can add a feature to send a message back on the TCP interface when this occurs, and you can then send a "Display Reset", which should trigger a hardware reset on the display.

It is not currently logged, but I can add it in the future.

I think logging any type of error, especially if it is relevant for the validity of the experimental data, would be good.

I can add it to the log.

Running the benchmark, I am seeing the following output. Is that telling me things are good or bad?

I am guessing the Bytes/sec is the critical value? What is a threshold for the Main Host application?

[screenshot of the bandwidth tester output]

How "expensive" is the reset? Until we defined the behavior (just a TCP response, or should future commands just fail...?), would a 'reset' between experiments be a valid workaround in your opinion?

Find my current codebase at https://github.com/floesche/LED-Display_G4_Display-Tools/tree/feature-panelscontroller. The actual test is currently commented out since it is not a real unit test, but you can just uncomment L214..232 to activate it. So far the new code is not well tested, so take it with a grain of salt...

(response via email 2022-03-31T16:37):

Your speed should be fine. I was getting around 25 MB/sec on my old PC.

The reset command probably takes a few seconds, but sending the reset command in between experiments is probably not a valid workaround, because the issue is more likely to happen during an experiment while it is streaming. Another thing I found was that the All-off command was essentially the same as the Stop-display. I think I should actually modify it to send blank frames instead, so it doesn't try to empty out the DMA. I started running the test script and it is doing the All-off and All-on sequence now; how long does it take to completely run this script, and is there a result file?

Good to know that resource starvation isn't the cause of the timing issues for the start-display command. Do you have a lead on what else it could be, and are you able to replicate the error?

By All-on All-off sequence you mean the function sendAllOnAndOff()? That takes less than a second – it takes longer to start the Main Host than to run the function.

The whole file is just a class of unit tests, comparing actual results with expected results, and throwing an error if they don't match – so the tests don't produce result files themselves (I actually deactivated the function sendStartDisplayXLSX because it does create a file…).
You could collect the unit test results via results = runtests(….) and then generate a table (and a spreadsheet) from it via table(results) if needed (also see the MATLAB documentation)... Does that help?
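For example, a minimal sketch of collecting the results this way, assuming the test class is called PanelsControllerFunctionalTest (the name that appears in the console output quoted further down):

    % a minimal sketch, assuming the test class name PanelsControllerFunctionalTest
    results = runtests("PanelsControllerFunctionalTest");
    T = table(results);   % columns include Name, Passed, Failed, Incomplete, Duration
    disp(T);
    % for a spreadsheet, drop the struct-valued Details column first:
    % writetable(removevars(T, "Details"), "unitTestResults.xlsx");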

(response via email 2022-03-31T17:50):

It was taking so long because it was trying to run this test 1,000,000 times

for i = 1:1000000
    while toc(boff) < delayOffOn

Looks like I uploaded the weekend edition of the file – the test you ran will only become relevant in the future, once we look at issue #57.

You can "Run current test" (there is a button for that in MATLAB when your cursor is inside the unit test) -- running the whole test suite takes too long and most of tests pass successfully.

For the current issue only the function sendStartDisplayXLSX should be relevant.
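If the command line is easier, runtests can also select a single test method by name; again a sketch, assuming the class name PanelsControllerFunctionalTest:

    % run only the test method relevant for this issue
    results = runtests("PanelsControllerFunctionalTest", ...
        "ProcedureName", "sendStartDisplayXLSX");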

(response via email 2022-04-01T09:24):

How are you calculating this number below, because it does not match the command I received? The start command below had a runtime of byte 1 = 55 and byte 2 = 8, which equals 14088.
Running PanelsControllerFunctionalTest
2133

Just to confirm my own understanding of the number format: most (all?) of the Main Host commands that use 16-bit numbers have the low byte first (LO in the documentation), then the high byte (HI). This would mean the two bytes 0x55 0x08 represent the number 0x0855 (= 2133). Did I understand this correctly? I just want to make sure I am not misreading your "TCP Commands.xlsx" spreadsheet…
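A quick worked check of that interpretation in MATLAB:

    % sanity check for the LO-then-HI byte order described above
    lo = hex2dec('55');    % first byte on the wire, 0x55 = 85
    hi = hex2dec('08');    % second byte, 0x08 = 8
    value = lo + 256*hi    % = 2133 = 0x0855, matching the test output above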

(response via email 2022-04-01T11:46):

You are correct in your understanding of the number format.

By the way, how long is the sendStartDisplayXLSX test supposed to last? I've been running it for a couple hours now.

Thanks for the confirmation.

To your question about sendStartDisplayXLSX: the durations I described in #59 (comment) are actually [1878 637 1265 1748] (also at the end of line 221). If you want to replicate the issue, change that line 221 to for i = [1878 637 1265 1748], as shown in the issue's comment #59 (comment).

As I mentioned earlier, if the problem doesn't happen for the first start-display it might still happen later, but that seems less likely. This means that if you don't see an error within ~3 minutes, it is most efficient to restart the experiment.

The sendStartDisplayXLSX as-is uses the random numbers generated by [randi([0, 2500], 1, 50) randi([0, 65534], 1, 20)] for the durations sent to your start-display command (i.e. 50 values between 0 and 2500 deciseconds, then 20 values between 0 and 65534 deciseconds). On average this takes almost two hours for the first 50 numbers and about 18 hours for the second 20 numbers (this was meant for long-term data collection over the weekend).
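Those estimates follow from the expected values of the two randi ranges; a quick back-of-the-envelope check:

    % expected total run time of the random duration list (deciseconds -> hours)
    n1 = 50; mean1 = 2500/2;          % first block: 50 draws from [0, 2500] ds
    n2 = 20; mean2 = 65534/2;         % second block: 20 draws from [0, 65534] ds
    hours1 = n1 * mean1 / 10 / 3600   % ~1.7 h for the first 50 numbers
    hours2 = n2 * mean2 / 10 / 3600   % ~18.2 h for the second 20 numbers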

If you want to save the current state of your experiment, set a breakpoint, e.g. at line 222, and wait until execution stops. Then run the following two commands:

T = table(rsps, runTime, seqTime);
writetable(T, "startDisplayTimes2.xlsx", "Sheet", "2022-03-25");

and you should have the xlsx file. Move it to a different location before running the method again; the code doesn't overwrite an existing xlsx file.

Sorry that the code itself is not that well documented at this point, but we tried to be as clear as possible in the issue descriptions.

(response via email 2022-04-01T14:23):

Can we do a remote session on the PC you are running the tests on? I am available the rest of the day or early next week.

Sure, we can try. I can be there in 5min.

(response via email 2022-04-04T14:24):

This build 237 (GitHub mirror) tries to address issues #59 and #21. For issue #21, I suggested waiting for the return message on completion. I also suggested sending control mode 0 when stopping the start-display run prematurely. I realized that doing so causes a reconfiguration of the FIFO size (there are two different FIFO sizes, one for stream and one for non-stream mode), which could cause a blank display. I disabled that reconfiguration and will just use the stream FIFO size. The initial intent was to have a larger FIFO for modes where latency wasn't as big of a deal, since a larger FIFO is a bit safer against possible overrun, but I think the software can probably run fine on the smaller FIFO size that is used in stream mode.

(response via email 2022-04-07T16:47):

I forgot to remove some debugging code that can grow indefinitely when receiving TCP commands. Here is version 238 (GitHub mirror) with it removed.

After updating the G4 Host to version 239, the error seemed to have reappeared:

The G4 Host reports completing the start-display sequence after 2300 ms:

[screenshot: gapsInLog-v239-test4]

A look at the TDMS file for the same trial shows something different:

Time (µs)   Command                  Data
88468486    Set Pattern ID           B/0
88469250    Set Pattern Function ID  B/0
88469646    Set-Position X           B/0
88469861    Start-Display            17/0
90346437    Stop Log

Here the start-display command with value 0x17 (= 23 deciseconds, i.e. 2300 ms) is sent at timestamp 88469861. Then the next command, stop-log, is sent at timestamp 90346437. That is 1876.576 ms later, not the expected 2300 ms.
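The arithmetic behind those numbers, taking the TDMS timestamps to be in microseconds:

    % checking the TDMS timestamps from the table above
    elapsed_us  = 90346437 - 88469861;   % = 1876576 us = 1876.576 ms
    expected_ms = hex2dec('17') * 100;   % 0x17 = 23 deciseconds -> 2300 ms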

Upon closer inspection, the screenshot shows the same problem: the start-display command is sent (for example) at 16:12:23.980, and the sequence-completed statement is sent at 16:12:25.855, less than 2000 ms later.

The previous issue was probably caused by a wrong configuration: the ini file specified 3×12 panels, while the pattern had 4×12 panels.

It seems that this issue is fixed by v238 (and following).