sinara-hw/Booster

ERROR LED

Closed this issue · 25 comments

We got a red ERROR LED on one of our Booster v1.4s. @dnadlinger will post details soon.

@wizath what diagnostics should we be using to determine the source of this error?

First of all you can try i2cdetect <ch number>

If you've got something like this

> i2cdetect 5
[i2c_scan] start
[i2c_scan] end

Most likely the I2C bus on that channel is shorted (the most probable cause being a dead temperature sensor).

The temperature suddenly went to 63C when the channel died. So an I2C issue was my top suspicion. Power cycling the device cleared it.

Protection is set to 60 degrees. Please always check the logstash command before power cycling. There should be an error message.

Ah, sorry, I didn't think about logstash. Here is the status output before power-cycling:

> status 5
[status] e=0 s=0 r1=12 r2=350 tx=0.000 rf=0.000 curr=0.000 t=63.00 i=1.03 ip=-nan
PGOOD: 1
FAN SPEED: 100 %
AVG TEMP: 63.00 CURRENT: 63.00
CHANNELS INFO
==============================================================================
                #0      #1      #2      #3      #4      #5      #6      #7
DETECTED        1       1       1       1       1       1       1       1
HWID            02:22   CD:F9   7B:9F   20:8D   E8:27   4F:79   21:12   C6:3A
INPWR [V]       0.00    0.13    0.00    0.07    0.63    1.03    0.00    0.61
TXPWR [V]       0.01    0.01    0.15    0.12    1.69    0.01    0.01    0.02
RFLPWR [V]      0.04    0.01    0.01    0.01    0.60    0.22    0.12    0.07
INPWR [dBm]     -nan    -nan    -nan    -nan    -nan    -nan    -nan    -nan
TXPWR [dBm]     5.00    5.00    5.00    5.00    26.49   0.00    5.00    5.00
RFLPWR [dBm]    -4.40   -4.53   -4.25   -3.55   9.62    0.00    -0.01   -4.17
I30V [A]        0.044   0.046   0.049   0.046   0.095   0.000   0.048   0.001
I6V0 [A]        0.243   0.252   0.243   0.247   0.247   0.000   0.251   0.254
5V0MP [V]       4.924   4.922   4.930   4.938   4.910   0.000   4.930   4.950
ON              1       1       1       1       1       0       1       1
SON             1       1       1       1       1       0       1       0
IINT            0       0       0       0       0       0       0       0
OINT            0       0       0       0       0       0       0       1
SINT            0       0       0       0       0       0       0       0
ADC1            14      13      250     191     2765    12      13      34
ADC2            59      15      15      13      990     356     203     111
INTSET [dBm]    35.00   38.00   37.99   37.99   37.99   37.99   37.00   36.00
DAC1            4095    4095    4095    4095    4095    4095    4095    4095
DAC2            3245    3268    3415    3341    3322    3252    3385    3683
SCALE1          83      85      82      83      87      88      85      87
OFFSET1         470     446     727     619     460     375     565     571
BIASCAL         1865    1539    1879    1935    1527    1939    1761    1929
HWIS            82.08   84.33   83.17   82.83   85.17   83.92   83.00   85.25
HWIO            865.08  823.33  1003.17 939.83  852.17  818.92  978.00  957.25
LTEMP           30.25   32.00   32.00   32.00   32.50   63.00   32.50   32.00
RTEMP           30.00   32.25   31.00   31.00   32.50   63.00   30.00   30.00
==============================================================================
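A quick way to spot the odd channel in a dump like this is to compare each channel's temperature against the median of all channels. The following is only a sketch: the parser assumes the whitespace-separated layout shown above, and the function names and the 10 degree margin are made up for illustration, not part of the Booster firmware or tooling.

```python
from statistics import median

# Two rows copied from the status output above, plus the header line.
STATUS_ROWS = """\
                #0      #1      #2      #3      #4      #5      #6      #7
LTEMP           30.25   32.00   32.00   32.00   32.50   63.00   32.50   32.00
RTEMP           30.00   32.25   31.00   31.00   32.50   63.00   30.00   30.00
"""

def parse_channel_table(text):
    """Return {row label: [per-channel floats]} for every numeric row."""
    n_channels = None
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("#"):
            # Header line: one "#N" token per channel.
            n_channels = len(parts)
            continue
        if n_channels is None or len(parts) <= n_channels:
            continue
        # Labels may contain spaces (e.g. "INPWR [V]"), so take values from the right.
        label = " ".join(parts[:-n_channels])
        try:
            rows[label] = [float(v) for v in parts[-n_channels:]]
        except ValueError:
            continue  # non-numeric rows such as HWID
    return rows

def temperature_outliers(temps, margin=10.0):
    """Channels whose reading differs from the median by more than `margin` degrees."""
    mid = median(temps)
    return [ch for ch, t in enumerate(temps) if abs(t - mid) > margin]

rows = parse_channel_table(STATUS_ROWS)
for label in ("LTEMP", "RTEMP"):
    print(label, temperature_outliers(rows[label]))
```

With the readings above, both rows flag only channel 5.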

@wizath do you think the reading of 63.00 degrees could be an I2C error, or do you think something went wrong that pushed the temperature that high?

@wizath I added a troubleshooting page to the wiki. Can you check that the advice I gave there is correct please?

AFAIK it's hard to heat up a module even to 50 degrees. Maybe without cooling and with all channels at maximum power.

Yes, but (@dnadlinger correct me if I'm wrong) the fans were spinning, and all other channels were at 30C, so it seems hard to believe that one channel could actually be drawing that much current -- particularly if the 30V foldback limiting was working.

So, I assume this has to be some kind of issue with the temperature measurement, doesn't it?

The status from the previous comment stated that fan speed was at 100% with 63 degrees.

Right, so assuming nothing went wrong with the fan controller, this seems like an issue with the temperature measurements, I think.

@dnadlinger if we have more problems like this, let's log the Booster statuses to InfluxDB. That will show up things like sudden (non-physical) changes in temperature.
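As a rough sketch of what such logging could look like: the first function formats one reading as an InfluxDB line-protocol record, and the second flags sample-to-sample changes that are too fast to be physical. The measurement name `booster_temp`, the field names, the 1 degree per second threshold and the sample data are all hypothetical placeholders, not an existing setup.

```python
def to_line_protocol(channel, ltemp, rtemp, timestamp_ns):
    """Format one reading as an InfluxDB line-protocol record.

    "booster_temp" and the field names are placeholder names.
    """
    return f"booster_temp,channel={channel} ltemp={ltemp},rtemp={rtemp} {timestamp_ns}"

def sudden_jumps(samples, max_rate=1.0):
    """Return indices of samples reached at more than `max_rate` degrees C per second.

    `samples` is a list of (time in seconds, temperature in degrees C) tuples.
    """
    flagged = []
    for i in range(1, len(samples)):
        (t0, c0), (t1, c1) = samples[i - 1], samples[i]
        dt = t1 - t0
        if dt > 0 and abs(c1 - c0) / dt > max_rate:
            flagged.append(i)
    return flagged

# Made-up history for illustration: steady around 32 C, then a step to 63 C.
history = [(0, 32.0), (10, 32.25), (20, 63.0), (30, 63.0)]
print(sudden_jumps(history))
print(to_line_protocol(5, 63.0, 63.0, 1))
```

A step of roughly 30 degrees between two polls ten seconds apart would be flagged here, whereas a slow thermal drift would not.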

@wizath, 63.00 C seems like an odd temperature measurement for a fault condition, doesn't it? IIRC last time I had I2C issues it read 150.25. Any ideas about potential causes?

@wizath if you can think of anything at all that could cause an erroneous temperature reading of 63C please let me know.

Also, can you remind me what the difference between LTEMP and RTEMP is?

LTEMP is the temperature sensor's own internal temperature and RTEMP is the remote diode temperature.

I see. You mean the detector has its own internal diode, but can also use a diode-connected external transistor? Out of curiosity, what's the point of having both? Are they in different places (e.g. the diode nearer the FET) or something?

Is there anything special about 63.00?

@wizath: Re "all channels at maximum power", the output above was the steady state, i.e. low power on channel 4, all others idle. Fan exhaust was cool to the skin too.

That's why I think it was an I2C error. Tomorrow I'll review the i2cerror command to provide more information about bus errors.

@wizath could this have been a bit flip error on the temperature bus or something like that?
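One way to sanity-check the bit-flip idea is to list every temperature that is a single bit away from 63.00. This is only a sketch under assumed encoding: the 0.25 degree step is inferred from readings such as 30.25 and 32.50 in the status output, and the 10-bit unsigned word is a guess, not taken from the sensor's datasheet or the firmware.

```python
LSB_C = 0.25  # assumed resolution, inferred only from readings like 30.25 and 32.50
WIDTH = 10    # assumed width of the raw temperature word (hypothetical)

def single_bit_neighbours(temp_c, lsb=LSB_C, width=WIDTH):
    """Temperatures whose raw code differs from temp_c's code in exactly one bit."""
    raw = round(temp_c / lsb)
    return sorted((raw ^ (1 << bit)) * lsb for bit in range(width))

print(single_bit_neighbours(63.0))
```

Under these assumptions 31.00 is one bit away from 63.00, which is close to what the other channels were reading. That does not show a bit flip actually happened, and it would still need to explain why LTEMP and RTEMP both read exactly 63.00.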

I can't think of anything other than a bus error that could have caused this problem, since the air was cool and the channel was running at low power.

Next time, can you use the i2cerr command? It lists the bus error count for each of the channels.

> i2cerr
                #0      #1      #2      #3      #4      #5      #6      #7
I2C ERR         0       0       0       3       0       0       0       0
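If the i2cerr output is being captured by a script, the counts can be pulled out with a few lines of parsing. This is a minimal sketch that assumes the exact two-line layout shown above; the function name is illustrative only.

```python
# Output copied from the i2cerr example above.
I2CERR_OUTPUT = """\
                #0      #1      #2      #3      #4      #5      #6      #7
I2C ERR         0       0       0       3       0       0       0       0
"""

def channels_with_errors(text):
    """Return {channel index: error count} for channels with a nonzero count."""
    for line in text.splitlines():
        if line.strip().startswith("I2C ERR"):
            # The label is two tokens ("I2C", "ERR"); the rest are per-channel counts.
            counts = [int(v) for v in line.split()[2:]]
            return {ch: n for ch, n in enumerate(counts) if n > 0}
    raise ValueError("no 'I2C ERR' row found")

print(channels_with_errors(I2CERR_OUTPUT))
```

For the output above this reports three errors on channel 3 and nothing on the other channels.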

Nice, thanks!

Out of curiosity, why doesn't #282 produce anything in the logstash?

Updated the Troubleshooting page on the wiki, but can you add this to the VCP page as well please?

Out of curiosity, why doesn't #282 produce anything in the logstash?

Only one command that lights up the red LED does not produce an error message, and that's the check for whether a channel already has an error. But the generation of that error status should log an error message to the stash.

eth-scpi.dfu.zip

I've added a log message to this sequence. Can you test this now?

Okay, will reflash and try to reproduce.

@dnadlinger I'm assuming this is a duplicate of #282