beagleboard/beaglebone-black

8710A Reset Design Flaw

Closed this issue · 35 comments

There is an issue in the way the reset circuitry on the BBB resets the 8710A ethernet PHY.

When coming out of reset, the 8710A can end up in an indeterminate state as a result. This is problematic because you can occasionally lose the ability to communicate with the BBB, and it requires a physical reset to clear the problem. For a remote system in a difficult-to-reach location this creates a major problem.

We have found that you can hack the current pcb (Rev B or Rev C) and disconnect the reset line from the 8710A. Then, hack a wire from the 8710A reset pin to a GPIO on the BBB. Now, when the ethernet PHY hangs it can be reset by asserting the GPIO pin.

I would like to recommend a revision to the PCB to address this issue. At the moment the reset trace near the 8710A is on an inner layer and not easily accessible. It would be a big help if the trace was more easily accessible. It would be an even bigger help if there was a jumper and a couple pads.

We would be happy to contribute the firmware driver we developed for this.

This is the exact same fix that was done on later boards. On mainline, the kernel mdio device tree binding supports the concept of a gpio-reset line:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/Documentation/devicetree/bindings/net/mdio.txt?id=69226896ad636b94f6d2e55d75ff21a29c4de83b

If we ever did a rev D pcb, this is something i'd request..

Regards,

@RobertCNelson

Do you know if anyone has converted the schematic and pcb from Allegro format to either Kicad or Altium?

Also, on my wish list would be to increase the amount of RAM.

On Rev D, would it be more appropriate to use a reset controller rather than an RC circuit?

@rowsail we have been using a kernel mdio address fixup patch mentioned in that link you show. A dedicated gpio to reset the phy is the best option.

Regards,

So until a design with this fix is released (I have the design in Altium BTW) would an external reset controller on a cape (or equivalent) fix the issue?

@rowsail, in your design based off the beagle, cut the reset line so they aren't shared, wire a spare gpio to it with a pull-up/pull-down (haven't looked at the phy reset logic in awhile) and use the phy-reset binding, to control the gpio.

Regards,

Thanks for your help Robert. That's a very short trace on Pin 19 before it gets to the via carrying the regular reset signal, but if this goes into production, it's not really the sort of thing I want to do on 100+ units! It's difficult to find out what the actual issue is from the posts re the fix: is it that the reset duration is not long enough, or something else? If the former, I can fix it simply with a reset controller on my main board that can drive the reset low for longer and when the voltage is at a sufficiently high level. If it is a problem with the PHY (a silicon issue) then sure, the only fix might be to additionally control its reset line by GPIO.

@rowsail, i haven't looked at the actual timing signals, but from what i've been told, the sys_resetn is long enough to reset the am335x but not long enough to correctly reset the phy on 100% of boards.

Regards,

OK - great info - thank you. I will check the timing requirements but if that is the case it sounds like adding an external reset controller should work.

@rowsail. It is a timing issue and power sequencing issue ... not just duration of the reset. So, unfortunately, simply adding an external reset controller will not do the job.

It is an expensive hack. To do it reliably you need a laser to cut the reset line.

Can you convert the design files to Kicad? If so, I can propose some changes.

I updated my comment to be more specific. The main cause is the power sequencing. As I recall (it has been a while), I think I concluded the 8710a is likely on the wrong power bus. In theory, separating the reset and using a reset controller for the 8710a should work. But the underlying issue remains if you do not address the power sequencing. So, likely the issue is best resolved by addressing the power sequencing issue and separating the reset.

Attached should be a PDF of the schematic. I have attempted to create a structured schematic which I think is easier to understand for someone looking at it anew (i.e. me!). I have also made the changes which brought it to Rev C and also added a link between a GPIO pin and the reset of the PHY. I'd be grateful for any input good or bad.

BEAGLEBONEBLACK.pdf

I am designing a board based on BeagleCore and would add an ethernet PHY similar to the one on the BBB.

Is Rowsail's solution correct, together with the fix in the mdio driver? (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/Documentation/devicetree/bindings/net/mdio.txt?id=69226896ad636b94f6d2e55d75ff21a29c4de83b)

@RobertCNelson is there a safe gpio line we can use for the MDIO reset? We will eventually need to address the microSD card cage issue and therefore do a hardware rev.

I think the reasonable way to do this (for as much compatibility as possible) is to find an unused GPIO that is able to float high with a pull-up and add an AND gate ahead of the reset to the PHY MDIO reset. That way, old code will still rely on the SYS_RESETn, but new code can reset the PHY even after SYS_RESETn is high.

Fixing the power sequencing would be nice such that this device is powered early enough for the SYS_RESETn to do the trick, but I'm worried about the validation process (risk) for that.

I took a look at the technical reference manual and found this:

Caution must be used when implementing the nRESETIN_OUT as an bi-directional reset signal. Because of the short maximum time allowed using RSTTIME1, it does not supply an adequate debounce time for an external push button circuit. The processor could potentially start running while external components are still in reset. It is recommended that this signal be used as input only (do not connect to other devices as a reset) to implement a push button reset circuit to the AM335x, or an output only to be able to reset other devices after an AM335x reset completes.

It appears the BBB does not adhere to this caution. Moreover, the large capacitance on this signal line may cause issues due to the very slow signal rise time. It would likely be a good idea to properly debounce the reset button and remove the capacitor from this reset signal line.

I recall that when we did some work on this a few years ago we tested whether or not a reset controller would resolve the issue. If my recollection is accurate, what was reported to me was that a reset controller did not resolve the issue. Unfortunately I did not perform the testing so I am not certain of how the tests were carried out. I think there are reports from others elsewhere that attempted to use a reset controller and observed the same outcome.

I took a closer look today on the power sequencing and the TPS65217. The PMIC's default sequence has about 26ms of delay from the rise of LDO4. The LAN8710a datasheet indicates a minimum reset of 25ms is required. So, that delay plus the delay associated with the time constant should be adequate to reset the PHY.

Another thought is to increase PGDLY. It can be changed from 25ms to 100, 200 or 400ms.

image

Fix in version C3.

That looks like a lot of capacitance. Might want to double check the rise time spec on the 8710A.

Did anyone try changing PGDLY?

Where is PGDLY set?

I'm pretty sure PGDLY is in the TPS65217C, not sure we can reprogram that..

Yeah, it is a TPS65217C register. I looked into this and I think it can be changed but I was not able to determine which driver to make the change to.

There might be a fix here: https://wp.josh.com/2018/06/04/a-software-only-solution-to-the-vexing-beagle-bone-black-phy-issue/

This is not a good fix: it causes most processor supplies to briefly power off while SYS_5V remains powered, so the VDD_3V3B regulator bug will cause the 3.3V supply on the P9 header to remain powered. Depending on what's connected to the beaglebone externally, this can easily cause external hardware to fry the AM335x's I/O.

Changing the value of PGDLY at runtime is pointless since that's too late; you'd need to change its non-volatile programming. This is presumably possible using the programming sequence documented for the TPS652170 (section 7.6.1.1); however, this requires putting 8V on PWR_EN and therefore cannot be done in-situ.

As I also commented on that article linked above, my own testing results show that the primary cause of the phy problems is not the reset time but the rise time of the reset signal:

I’ve seen the inverted link led thing as well. It suggests the logic level of the led pin was somehow incorrectly recorded at reset, which is also the strapping option for REGOFF, hence the phy will not work in that case. In general, all of the phy problems (ranging from having an incorrect phy address to not working at all) appear to be due to incorrect strapping options being latched at reset.

Based on testing I’ve done the primary cause seems to be the slow rise of the reset line, which is caused by a 2.2μF capacitor on it (C24), apparently to ensure the phy’s specified reset timing is met, and to lesser extent by a 0.1μF capacitor (C30).

I’ve done some tests on a beaglebone (known to be susceptible to the phy issue) with a reset extender added to ensure reset timing is met and additional pull-up to decrease the rise time on reset deassertion. The impact on the phy failure rate was pretty clear:

2.4% (34/1431) with no external pull-up (just the on-board 10K).
1.0% (12/1189) with 1K pull-up.
0.4% (5/1153) with 240Ω pull-up.
0.15% (2/1354) with 1K pull-up and C24 removed.
0 failures in 16901 power cycles with both caps (C24 and C30) removed.

In other words, the faster the reset rise time, the less frequently it failed.
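
As a rough cross-check of the numbers above, the sketch below recomputes the failure rates from the raw counts and estimates the RC time constant of the reset line for each configuration. It assumes a simple first-order RC model in which the external pull-up sits in parallel with the on-board 10K and both C24 (2.2 μF) and C30 (0.1 μF) load the same node; that model and the helper names are illustrative assumptions, not measurements. The run with both caps removed (0 failures in 16901 cycles) is left out because no capacitance value remains to model.

```python
# Values as reported in the comment above. The lumped RC model is an assumption.
ONBOARD_PULLUP_OHM = 10_000.0
C24_F = 2.2e-6
C30_F = 0.1e-6

# (label, failures, power cycles, external pull-up in ohms or None, capacitance in farads)
RUNS = [
    ("no external pull-up", 34, 1431, None, C24_F + C30_F),
    ("1K pull-up", 12, 1189, 1_000.0, C24_F + C30_F),
    ("240 ohm pull-up", 5, 1153, 240.0, C24_F + C30_F),
    ("1K pull-up, C24 removed", 2, 1354, 1_000.0, C30_F),
]


def parallel_ohm(r_onboard, r_external):
    """Resistance of the on-board pull-up in parallel with an optional external one."""
    if r_external is None:
        return r_onboard
    return r_onboard * r_external / (r_onboard + r_external)


def failure_rate_percent(failures, cycles):
    """Observed failure rate as a percentage of power cycles."""
    return 100.0 * failures / cycles


def rc_time_ms(r_external, capacitance_f):
    """RC time constant of the reset line in milliseconds (ideal first-order model)."""
    return parallel_ohm(ONBOARD_PULLUP_OHM, r_external) * capacitance_f * 1e3


for label, failures, cycles, pullup, cap in RUNS:
    rate = failure_rate_percent(failures, cycles)
    tau = rc_time_ms(pullup, cap)
    print(f"{label}: {rate:.2f}% failures, tau ~ {tau:.3f} ms")
```

Under these assumptions the time constant drops from roughly 23 ms to well under 1 ms across the four runs, which lines up with the falling failure rate.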

How or why the phy is managing to misread the strapping options is still a mystery to me. We tried shorting the link led to make REGOFF pulled down more convincingly and reduce the opportunity for noise pickup, but it did absolutely nothing. Adding 0.25s delay between bootrom and U-Boot SPL, just in case the AM335x is released from reset earlier than the phy, likewise had zero impact. Perhaps the phy is just really intolerant of a slow-rising reset, but that seems very odd given that the datasheet actually suggests using an RC-circuit on the reset input to generate the required reset timing.

In short summary, the phy just sucks. Has it ever been considered to just swap it out for one that doesn't suck?

Just to add, the fixes that I believe would solve the problem are:

  1. replace the phy by one that doesn't have this problem
  2. use a GPIO to reset the phy instead of using the processor reset signal
  3. extend the reset time (by reprogramming PGDLY or using an external reset extender) and remove the capacitance on the reset line to ensure a sharp rising edge

is there a safe gpio line we can use for the MDIO reset? We will eventually need to address the microSD card cage issue and therefore do a hardware rev.

You could reuse the eMMC reset line, since this line has not worked for the intended purpose (keeping eMMC in reset to ensure it does not cause problems when reusing the eMMC pins) since the Micron eMMC (whose reset input is low-level-triggered) was swapped out for Kingston eMMC (whose reset input is rising-edge-triggered).

Resolved in 5b06500. Reset GPIO is GPIO1_8.

I only just noticed in the C3 schematic that the phy reset is being driven by an AND-gate with open-drain output and weak pull-up (10KΩ) and large capacitance (4.7μF) on its output, yielding a 47 ms RC-time, which seems like a bad idea considering slow rise time on the reset line appears to be the main cause for the phy problems in the first place. A push-pull output would have been more appropriate (and eliminates 2 components). It's not a huge deal since the reset gpio allows for multiple attempts at resetting the phy if necessary, but it is a bit of a weird choice.
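
The 47 ms figure follows from τ = R·C. A minimal sketch, assuming an ideal first-order RC charge from 0 V; the threshold and supply voltages passed to `time_to_threshold_s` are caller-supplied placeholders, not values taken from the LAN8710A datasheet:

```python
import math


def rc_time_constant_s(resistance_ohm, capacitance_f):
    """Time constant tau = R * C in seconds."""
    return resistance_ohm * capacitance_f


def time_to_threshold_s(resistance_ohm, capacitance_f, v_threshold, v_supply):
    """Time for an ideal RC node charging from 0 V to reach v_threshold."""
    if not 0.0 < v_threshold < v_supply:
        raise ValueError("threshold must lie strictly between 0 V and the supply")
    tau = rc_time_constant_s(resistance_ohm, capacitance_f)
    return -tau * math.log(1.0 - v_threshold / v_supply)


# 10 kOhm pull-up and 4.7 uF on the AND-gate output, as described above.
tau = rc_time_constant_s(10e3, 4.7e-6)
print(f"tau = {tau * 1e3:.0f} ms")
```

With a push-pull output the pull-up resistance drops out of the picture, which is why the rise would be much sharper in that case.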

Is C3 already in production btw? There's no EEPROM identifier listed for it yet, will it be A335BNLT00C3? I noticed the BBBs currently produced by Seeed erroneously identify themselves as A335BNLTEIA0 (Element14 BBB Industrial A0).

It seems that completely removing C24 fixes the issue for us (powering the board through the cape connectors).
If I understand correctly the side effect should be that the reset button does not work reliably anymore, but we are not using that anyway.
Can anybody see a problem with this approach I am missing?

@svdmark As shown in my comment earlier, in my experience removing C24 helped a lot but did not fix it completely. The sensitivity to this issue varies per board though, and the one I tested on was particularly sensitive. Also, the purpose of C24 is to ensure the minimum reset time for the phy is met, and removing it without somehow extending the power-on-reset duration will violate the reset timing requirements specified in the phy's datasheet. Whether or not that will cause problems in practice, I do not know.

And of course having to patch the board is not exactly an ideal workaround.

It seems the C3 revision with the new phy reset gpio is either currently shipping or about to be, based on this forum thread. If a small bit of logic is added to u-boot to reset the phy until it's working properly, this problem will finally be fixed.
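
The "reset until it's working" logic could look roughly like the sketch below. It is written in Python purely to show the control flow; a real implementation would live in U-Boot's C code. `pulse_reset` and `phy_is_healthy` are hypothetical callables standing in for "toggle the reset GPIO" and "check that the phy responds as expected" (for example, at its expected MDIO address), and the attempt limit is an arbitrary choice.

```python
def reset_phy_until_healthy(pulse_reset, phy_is_healthy, max_attempts=5):
    """Pulse the phy reset line until the phy looks healthy.

    pulse_reset    -- hypothetical callable that asserts and releases the reset GPIO
    phy_is_healthy -- hypothetical callable returning True once the phy responds correctly
    Returns the number of reset pulses that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        pulse_reset()
        if phy_is_healthy():
            return attempt
    raise RuntimeError(f"phy still not healthy after {max_attempts} resets")


if __name__ == "__main__":
    # Simulated phy that only comes up correctly on the third reset.
    state = {"resets": 0}

    def fake_pulse_reset():
        state["resets"] += 1

    def fake_phy_is_healthy():
        return state["resets"] >= 3

    print(reset_phy_until_healthy(fake_pulse_reset, fake_phy_is_healthy))
```

Resetting unconditionally on every boot keeps the logic simple and is harmless on boards where the GPIO is not wired to the phy, provided the health check does not depend on the GPIO having any effect.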

@jadonk This issue should perhaps remain open until that software fix is actually implemented? The reset gpio by itself doesn't fix the problem if it's not being used.

@RobertCNelson do you know if this has been implemented yet?

@jadonk this is a todo... i think it's best to implement the "reset" in u-boot... mainline linux gpio-reset for phy's cpsw, is ongoing (last i checked)..

We should just do this in u-boot, blindly reset the gpio, connected or not on the "bone-black" target..