esprfid/esp-rfid

Still cannot restore users with over 100

windy54 opened this issue · 32 comments

I have re-opened this issue, see 448.

Background.
We currently use the software with a 522 reader and have 120 Hackspace members.
A member has just donated several Wiegand readers and I have been investigating what we have to do.
For information, it is a Wiegand 32-bit reader and I have had to modify the Wiegand library. The 522 reader outputs the UID in little-endian format, the Wiegand reader in big-endian format. So I have created a script to read in the existing database, convert the UIDs and write it back out. When I try to restore it using the web interface, the system crashes.
I have not tried it yet but I have been able to update this number of users in the past over MQTT.
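
For anyone who needs the same conversion, here is a minimal sketch of the byte swap in plain C++. It assumes UIDs are 4 bytes long and stored as hex strings without leading zeros (as in the log further down); `swapBytes32` and `convertUid` are hypothetical names, not part of esp-rfid or the script mentioned above.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Reverse the byte order of a 32-bit value (little-endian <-> big-endian).
std::uint32_t swapBytes32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

// Convert a hex UID string from one byte order to the other.
// Assumption: the UID fits in 32 bits and is written in hexadecimal.
std::string convertUid(const std::string& hexUid) {
    const auto value = static_cast<std::uint32_t>(std::stoul(hexUid, nullptr, 16));
    char buf[9];  // 8 hex digits plus the terminating NUL
    std::snprintf(buf, sizeof buf, "%x", static_cast<unsigned>(swapBytes32(value)));
    return std::string(buf);
}
```

With these assumptions, `convertUid("f479340")` treats the input as `0x0f479340` and returns `"4093470f"`.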

So how to proceed?

I have investigated this in the past and there seems to be some handshaking going on, i.e. a user is read from the file and the next one is transmitted only when a response is received.
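
To make the handshake concrete, here is a rough sketch of that stop-and-wait pattern in plain C++. It is only an illustration of the idea: `StopAndWaitSender` is a hypothetical class, and the real exchange happens between the browser and the ESP over a websocket.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sender: releases the next record only after the previous
// one has been acknowledged by the receiver.
class StopAndWaitSender {
public:
    explicit StopAndWaitSender(std::vector<std::string> records)
        : records_(std::move(records)) {}

    // Returns true and fills `out` if a record may be sent right now.
    bool nextToSend(std::string& out) {
        if (awaitingAck_ || index_ >= records_.size()) return false;
        out = records_[index_];
        awaitingAck_ = true;
        return true;
    }

    // Called when the receiver confirms that the last record was saved.
    void onAck() {
        if (!awaitingAck_) return;
        awaitingAck_ = false;
        ++index_;
    }

    // True once every record has been sent and acknowledged.
    bool done() const { return !awaitingAck_ && index_ >= records_.size(); }

    // Number of records acknowledged so far.
    std::size_t ackedCount() const { return index_; }

private:
    std::vector<std::string> records_;
    std::size_t index_ = 0;
    bool awaitingAck_ = false;
};
```

If an acknowledgement is lost, a sender like this waits forever, which would match a restore that stalls part-way through.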

@windy54 are you using the backup/restore functionality? I've never tried it, I'll try with a test file of around 100 users and I'll let you know

So I am assuming this has not been fixed in the latest demo build, apologies if it has.

I am investigating this myself and after loading (say) 10 users an exception (3) is generated, stack overflow.
```
[ DEBUG ] userfile received
{"command":"userfile","uid":"f479340","user":"Phillip Hayward No.45","acctype":1,"validuntil":2145916800}[ DEBUG ] userfile saved

Exception (3):
epc1=0x40101199 epc2=0x00000000 epc3=0x00000000 excvaddr=0x4000f230 depc=0x00000000
```

I will try re-building with an increased stack size, once I find out where it is set :)

Steve

@windy54 I'm working on a PR that might fix this! More details here: #572

Please stay tuned, I might be able to publish it this week :)

@windy54 I think this PR: #577 should fix your issue.

I'm going to close this, but please reopen if it's not fixed. Thank you very much!

Great! Thank you for the feedback :)

ahaha, I was surprised that it went so smooth :)

Yes, if you can get me some logs using the debug build and the stack trace I'm going to check

Hey Steve, thank you for helping out.

Just to double check, are you using the code from the PR #577, not dev, right?

Then, can you please paste the stacktrace and what you see in the logs before the reboot?

From what you are sharing looks like the watchdog reset, maybe I need to try with more users to try and replicate what you see. I couldn't hit the watchdog anymore with my change.

I'll try with more users and I'll report back!

I have downloaded the source code from the dev branch fix_websockets. I could not see any code under 557, only the binaries which I originally used.
```
ets Jan 8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1392, room 16
tail 0
chksum 0xd0
csum 0xd0
v3d128e5c
~ld

[ INFO ] ESP RFID v2.0.0-dev
Flash real id: 001640E0
Flash real size: 4194304

Flash ide size: 4194304
Flash ide speed: 40000000
Flash ide mode: DIO
Flash Chip configuration ok.
```

So I am getting the same error message, which looks like the watchdog.

If I can spot where this is set up I will try changing it.

Steve

hey @windy54 sorry, but just to be extra sure, I've increased the version number in my test. Can you please get the build from here: https://github.com/esprfid/esp-rfid/actions/runs/4028030821 and try again? It should show ESP RFID v2.0.0-dev.1 as version number.

As for the watchdog, it's the system watchdog that gets triggered, and you cannot do anything about it. The problem with the websocket is that it runs in a callback, meaning outside of the main loop. This causes problems when the watchdog fires because it can mess with memory, and outside of the loop there's no guarantee about what happens. That's why it sometimes breaks and the stack trace is always different.

My solution of moving the logic from the callback to the main loop should fix this problem; I think it's the only way to solve the issue properly. If there's still a problem it might be something else that I haven't caught yet.
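
Roughly, the idea of moving work out of the callback looks like the sketch below. It is not the code from the PR: `DeferredMessages` is a hypothetical helper written in plain C++, and it assumes the callback and the main loop never run at the same time, so it does no locking.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: messages received in the callback wait here until
// the main loop picks them up.
class DeferredMessages {
public:
    // Called from the (simulated) websocket callback: store and return quickly.
    void onMessage(std::string payload) { pending_.push(std::move(payload)); }

    // Called from the main loop: handle at most one message per pass so a
    // single loop iteration stays short. `handled` stands in for real work.
    bool processOne(std::vector<std::string>& handled) {
        if (pending_.empty()) return false;
        handled.push_back(std::move(pending_.front()));
        pending_.pop();
        return true;
    }

    // Number of messages still waiting for the main loop.
    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::queue<std::string> pending_;
};
```

The callback then only calls `onMessage`, while the heavier work (parsing, writing the user file) would happen inside `processOne` on the next loop pass.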

Let me know if this build still breaks and please keep sharing the logs as you've done before, it's very helpful. Possibly share also the full stacktrace that you get when it breaks. Now I'm going to test with 150+ users and I'll report how it goes.

actually, wait, I've been able to reproduce! Thank you :) I'll try to understand more about the error and hopefully fix it :)

(I've linked the wrong build before, the right one is this one: https://github.com/esprfid/esp-rfid/actions/runs/4028030821)

I think now it's not the watchdog anymore, it's something else. Which is good news, but it means it might take a while.

hey @windy54 I've changed the pagination system so that it now fetches one page at a time when loading the users (still to be done for logs). Can you please check if that works? Here's the build: https://github.com/esprfid/esp-rfid/actions/runs/4033629622
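
As an illustration of what "one page at a time" means, here is a small sketch in plain C++. The names `pageCount` and `getPage` and the 1-based page numbering are assumptions for the example, not the project's actual API.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Number of pages needed for `total` items at `pageSize` items per page.
std::size_t pageCount(std::size_t total, std::size_t pageSize) {
    if (pageSize == 0) return 0;
    return (total + pageSize - 1) / pageSize;  // round up
}

// Return the items on 1-based page `page`, or an empty list if out of range.
std::vector<std::string> getPage(const std::vector<std::string>& items,
                                 std::size_t page, std::size_t pageSize) {
    if (page == 0 || pageSize == 0) return {};
    const std::size_t start = (page - 1) * pageSize;
    if (start >= items.size()) return {};
    const std::size_t end = std::min(start + pageSize, items.size());
    return std::vector<std::string>(
        items.begin() + static_cast<std::ptrdiff_t>(start),
        items.begin() + static_cast<std::ptrdiff_t>(end));
}
```

Each request then carries only `pageSize` users instead of the whole list, which keeps individual websocket messages small.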

I have ideas to make it more robust if you click too quickly and queue too many requests, but for now if you use it normally it should work... Hopefully! It works well for me with more users, but let me know if that works for you too!

thank you @windy54 I can reproduce it some of the time :( I'll get back here as soon as I have news!

unfortunately I have bad news for now :( From some digging I'm pretty sure that the problem here is in the websocket implementation of ESPAsyncWebServer, a dependency of this project that has some instability issues with a somewhat heavy usage of websockets.

I've managed to import a long list of users by adding some delay between one socket message and another, but it's really annoying :( And it's still not a real solution, as sometimes it breaks doing something else.

Having said that, I don't know if there's a real solution with the current stack, we can mitigate the problem as much as possible, but the crashes and resets are still going to happen when using the web UI.

On a positive note, MQTT seems pretty stable instead, so I would recommend moving as much logic as possible to MQTT in order to minimise the restarts.

I'm going to ship some mitigations here and there in the near future, but still I don't think I'm going to change the library anytime soon, and the development there seems stalled.

Hello,

Thanks for the cool project.
I also encountered the user file restore issue. To further isolate the problem I switched to a fork of ESPAsyncWebServer, this fork contains many improvements.

After the switch I immediately noticed that the free heap increased from 31,032 bytes to 38,984 bytes.
In addition, the stability of sending the user list to the WebSocket client improved.
Despite these improvements, however, I encountered some limitations when trying to restore many users.
Occasionally, messages sent from the ESP were not received by the socket client, causing the restore process to stall.
In addition, the ESP would occasionally experience crashes, as before the change.

To solve this problem (for the user restore), I implemented a ticker that sends the WebSocket messages to retrieve the next entry outside of the asynchronous context.
With this change, I was able to successfully restore about 200 users.

```cpp
tickerGetNextUserEntry.once_ms_scheduled(5, []() {
    ws.textAll("{\"command\":\"result\",\"resultof\":\"userfile\",\"result\": true}");
});
```

Please note that if you want to switch to the fork, you will need to make changes to the SPIFFSEditor.cpp file.
You have to comment out lines 10 and 12 to get it to compile.

```cpp
//#ifdef ESP32
 #define fullName(x) name(x)
//#endif
```

Stability has improved, but crashes still occur.
I hope this info helps.

Best regards
Renstec

hey @Renstec thank you very much for the feedback. I've tried implementing your changes, which improved things a bit, and together with my latest changes here: #577 I think I'm happy with how it works.

Now if the ESP breaks, the browser waits 5 seconds and then tries sending the last message again. This should help with long imports.
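
The retry idea can be sketched like this. In the project the retry lives in the browser, so this plain C++ version only illustrates the logic; `RetryingSender` is a hypothetical name and the timestamps are passed in by the caller in milliseconds.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical helper: remembers the last message sent and decides when it
// should be sent again because no reply arrived in time.
class RetryingSender {
public:
    explicit RetryingSender(std::uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Record a freshly sent message and the time it went out.
    void send(std::string message, std::uint32_t nowMs) {
        last_ = std::move(message);
        sentAtMs_ = nowMs;
        waiting_ = true;
    }

    // The other side answered, so there is nothing left to retry.
    void onReply() { waiting_ = false; }

    // Returns true and fills `resend` once the timeout has elapsed without a
    // reply. The timer restarts, so retries repeat once per timeout period.
    // Unsigned subtraction keeps this correct across counter wrap-around.
    bool poll(std::uint32_t nowMs, std::string& resend) {
        if (!waiting_ || nowMs - sentAtMs_ < timeoutMs_) return false;
        resend = last_;
        sentAtMs_ = nowMs;
        return true;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t sentAtMs_ = 0;
    std::string last_;
    bool waiting_ = false;
};
```

With a 5000 ms timeout, a message sent at t=1000 would be offered for resending at t=6000 unless `onReply` was called first.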

@windy54 I'm not sure if you still care about this project, but if you do and if you want to give this a try it would be very helpful! :)

Also, I've changed how the users table works. Now it fetches only one page at a time, not the full list of users, making the table a lot faster if you have a lot of users. Try that as well; it's a bit hacky, but it should work well enough.

Excellent! No rush :)

I'm trying to release the V2 by mid-September after which I'm going to only focus on bug-fixing for a while.

This was a pretty significant effort, on which I plan to only do minor improvements if necessary. Unfortunately I think I cannot do much better with what we have at the moment :(

Hello,

I wanted to make a suggestion about the challenges we are facing with WebSockets stability. Instead of constantly trying to work around the WebSocket server errors, have you thought about switching to Server-Sent Events (SSE) as well as to the fetch API?

Using SSE and fetch could potentially provide a more reliable solution that completely avoids the crashes caused by WebSocket communication.

It might be worth investigating this switch further.

Thanks
Renstec

Hey @Renstec if I had to build this from scratch, for sure I would not use websockets for everything. Moreover, I would not use ESPAsyncWebServer in general, as the problem is this library simply breaking under moderate usage. The less free memory you have, the more easily it breaks, which is why it started becoming a bigger problem recently, after more functionality was added.

If my latest PR: #577 works well enough, I'm going to stop there and issue only bugfixes for this project after having released V2.

If the release goes well and there is some interest, I'm willing to spend some time re-implementing everything for ESP32, since this project is not worth porting. Too much work and too many existing issues that are difficult to solve.

If you can test the new PR it would be great! Thank you :)

Hey @windy54 I have merged in dev my work to improve stability for the websockets.

It's a bit better, not massively, but I think it's better than before.

In any case, use MQTT to import/export users, it's far more stable and I think you end up wasting less time.

I'm closing this for now as I think there's not much else that I can do with the current set of libraries and with ESP8266 :)