Peter-van-Tol/LiteX-CNC

LinuxCNC crashes on exit when component `litexcnc_eth` is used

Peter-van-Tol opened this issue · 11 comments

Describe the bug
As noted in #28, LinuxCNC crashes on exit with the following error:

Shutting down and cleaning up LinuxCNC...
Running HAL shutdown script
task: 603 cycles, min=0.000041, max=0.012258, avg=0.009716, 0 latency excursions (> 10x expected cycle time of 0.010000s)
mb2hal quit_signal DEBUG: signal [15] received
mb2hal quit_cleanup DEBUG: started
mb2hal quit_cleanup DEBUG: unloading HAL module [16] ret[0]
mb2hal quit_cleanup DEBUG: done OK
mb2hal main OK: going to exit!
litexcnc: LitexCNC etherbone driver unloaded 
rtapi_app: caught signal 11 - dumping core
free(): invalid pointer
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
Waited 3 seconds for master.  giving up.
Note: Using POSIX realtime
motmod: not loaded
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
Note: Using POSIX realtime
trivkins: not loaded
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
<commandline>:0: unloadrt failed
Note: Using POSIX realtime

To Reproduce
This error is caused by outdated loadrt statements in your HAL files. You currently have:

loadrt litexcnc
loadrt litexcnc_eth connection_string="192.168.178.15

The statements above have been replaced by a single one:

loadrt litexcnc connections="eth:192.168.178.150"

Expected behavior
An error message that the component litexcnc_eth does not exist (as it cannot be used as stand-alone).

Additional context
Why does this error emerge at this moment? The FPGA is reset to its safe state when LinuxCNC is unloaded, which means that litexcnc sends one last message to the FPGA. When the components are loaded using two separate statements, the etherbone driver has already been unloaded at that point (and its memory freed). Writing to a closed device whose memory is no longer allocated then leads to a core dump.
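The ordering problem described above can be sketched in C. This is only an illustration under stated assumptions: `transport_t`, `transport_open`, `send_reset` and `transport_close` are hypothetical names and do not come from the LiteX-CNC sources. It shows one way to make a late "reset to safe state" call fail cleanly instead of touching freed memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the transport driver state
 * (not the real LiteX-CNC structs). */
typedef struct {
    int is_open;         /* 1 while the transport may still be used */
    unsigned char *buf;  /* transmit buffer owned by the transport */
} transport_t;

/* Allocate the transport and its buffer; returns NULL on failure. */
transport_t *transport_open(size_t buf_size) {
    transport_t *t = calloc(1, sizeof(*t));
    if (t == NULL) return NULL;
    t->buf = calloc(buf_size, 1);
    if (t->buf == NULL) { free(t); return NULL; }
    t->is_open = 1;
    return t;
}

/* Send the final "reset to safe state" message.
 * Refuses to touch a transport that has already been closed. */
int send_reset(transport_t *t) {
    if (t == NULL || !t->is_open || t->buf == NULL) {
        fprintf(stderr, "transport already closed, skipping reset\n");
        return -1;
    }
    t->buf[0] = 0;  /* placeholder payload */
    return 0;
}

/* Release the buffer and mark the transport closed, so that a late
 * caller sees a closed transport rather than a dangling pointer.
 * The struct itself is freed by its owner afterwards. */
void transport_close(transport_t *t) {
    if (t == NULL) return;
    free(t->buf);
    t->buf = NULL;
    t->is_open = 0;
}
```

In this sketch, calling `send_reset` after `transport_close` returns -1 and prints a message rather than writing into freed memory.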

Removing the component registration from LinuxCNC leads to incomprehensible error messages. Instead, the component litexcnc_eth will now produce the following message before it stops LinuxCNC:

litexcnc: ERROR: Direct usage of the module `litexcnc_eth` is not supported
litexcnc: This is caused by the following loadrt-commands in your HAL-file:
litexcnc:     loadrt litexcnc
litexcnc:     loadrt litexcnc_eth connection_string="10.0.0.10"
litexcnc: Please use the folllowing single command in your hal-file instead:
litexcnc:     loadrt litexcnc connections="eth:10.0.0.10"
litexcnc: For more information, see: https://github.com/Peter-van-Tol/LiteX-CNC/issues/32 

Users can easily switch to the new standard.

Sorry.
That doesn't work for me.
The terminal keeps printing "errors" and the Z-axis moves by itself at very low speed, even when LinuxCNC is in emergency stop mode.

Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Running HAL shutdown script
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
task: 543 cycles, min=0.000044, max=0.024437, avg=0.009821, 0 latency excursions (> 10x expected cycle time of 0.010000s)
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
Unexpected read length: -1, expected 88
mb2hal quit_signal DEBUG: signal [15] received
mb2hal quit_cleanup DEBUG: started
mb2hal quit_cleanup DEBUG: unloading HAL module [16] ret[0]
mb2hal quit_cleanup DEBUG: done OK
mb2hal main OK: going to exit!
Unexpected read length: -1, expected 88
litexcnc: LitexCNC driver unloaded 
free(): invalid pointer
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
Waited 3 seconds for master.  giving up.
Note: Using POSIX realtime
trivkins: not loaded
<commandline>:0: exit value: 255
<commandline>:0: rmmod failed, returned -1
<commandline>:0: unloadrt failed
Note: Using POSIX realtime

It seems you've lost communication with the card. "Unexpected read length" indicates that no data has been received from the FPGA.
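As a rough illustration of where such a message can come from: POSIX calls such as `recv` return -1 on error (for example a timeout) and set `errno`. The helper below is a minimal sketch, assuming the driver compares the returned length with the expected packet size; `check_read_length` is a hypothetical name, not a function from the LiteX-CNC sources.

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>  /* ssize_t */

/* Compare the result of a read with the expected packet size.
 * Returns 0 when exactly `expected` bytes arrived, -1 otherwise.
 * A negative `got` means the read itself failed. */
int check_read_length(ssize_t got, size_t expected) {
    if (got < 0 || (size_t)got != expected) {
        fprintf(stderr, "Unexpected read length: %zd, expected %zu\n",
                got, expected);
        return -1;
    }
    return 0;
}
```

With the values from the log, `check_read_length(-1, 88)` reports a failure, which is consistent with a read that returned an error rather than a short packet.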

Most likely this is due to a malformed connection string, which would mean that the error message is wrong. To verify this, can you add your HAL file here?

I have to leave until Saturday. Then I can upload the HAL file.

It is the same as in #28

Just deleted the two lines and added

loadrt litexcnc connections="eth:192.168.178.150"

Just tested
loadrt litexcnc connections="eth:10.10.10.10"

Works fine.

@ozzyrob : thanks for testing! Same behavior here. I will merge and close this issue.

@OJthe123 : there is most likely a mistake somewhere. I will help you resolve it this coming weekend.

@Peter-van-Tol : you will merge?...So that could be the problem... I did not pull the "32" branch 😆

@OJthe123 : This branch only added the error message. Nothing has changed in the communications, and the HAL command loadrt litexcnc connections="eth:10.10.10.10" has been supported for a long time, since #11. I guess that there is an issue in your HAL files or that the communication was disrupted in some other way.
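For readers unsure about the shape of the connection string, here is a minimal sketch of splitting a value such as "eth:10.10.10.10" into a scheme and an address. It assumes a plain `scheme:address` layout; `split_connection` is a hypothetical helper and not the parser used by the driver.

```c
#include <string.h>

/* Split "scheme:address" into its two parts.
 * Returns 0 on success, -1 if the string is malformed
 * or a part does not fit into the supplied buffer. */
int split_connection(const char *conn,
                     char *scheme, size_t scheme_len,
                     char *address, size_t address_len) {
    const char *colon = conn ? strchr(conn, ':') : NULL;
    /* Require a non-empty scheme and a non-empty address. */
    if (colon == NULL || colon == conn || colon[1] == '\0') return -1;
    size_t n = (size_t)(colon - conn);
    if (n >= scheme_len || strlen(colon + 1) >= address_len) return -1;
    memcpy(scheme, conn, n);
    scheme[n] = '\0';
    strcpy(address, colon + 1);
    return 0;
}
```

Under this assumption, a string without the `eth:` prefix, or with nothing after the colon, is rejected instead of being passed on as an address.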

@OJthe123 : Basically you can use the version now from pypi.org:

pip install litexcnc

If the error persists, please start a Q&A discussion and we will fix your config (or the code, for that matter).

Semse.hal.txt
Hi.
Here is my HAL file. This is the working version.
When I change to connections="eth:...." it doesn't work.
And yes, I comment out the two other loadrt lines when trying.

Now, when I try to install the driver after pulling the latest "11", I get this:

INFO: Compiling LitexCNC driver...
Compiling realtime litexcnc.c
Linking litexcnc.so
sudo cp litexcnc.so /usr/lib/linuxcnc/modules/
[sudo] password for oj: 
Compiling realtime litexcnc_eth.c
Linking litexcnc_eth.so
sudo cp litexcnc_eth.so /usr/lib/linuxcnc/modules/
Compiling realtime litexcnc_stepgen.c
In file included from litexcnc_stepgen.c:44:
/tmp/tmppg5hjrun/litexcnc_stepgen.h:153:5: error: unknown type name ‘litexcnc_stepgen_pin_t’
     litexcnc_stepgen_pin_t *instances;
     ^~~~~~~~~~~~~~~~~~~~~~
litexcnc_stepgen.c: In function ‘litexcnc_stepgen_config’:
litexcnc_stepgen.c:144:80: error: ‘stepgen->hal’ is a pointer; did you mean to use ‘->’?
 *(stepgen->data.clock_frequency) / (1 << (shift + 1)) > stepgen->hal.param.max_driver_freq) {
                                                                     ^
                                                                     ->
litexcnc_stepgen.c:170:49: warning: initialization of ‘litexcnc_stepgen_instance_t *’ {aka ‘struct <anonymous> *’} from incompatible pointer type ‘int *’ [-Wincompatible-pointer-types]
         litexcnc_stepgen_instance_t *instance = &(stepgen->instances[i]);
                                                 ^
litexcnc_stepgen.c: In function ‘litexcnc_stepgen_prepare_write’:
litexcnc_stepgen.c:260:18: warning: assignment to ‘litexcnc_stepgen_instance_t *’ {aka ‘struct <anonymous> *’} from incompatible pointer type ‘int *’ [-Wincompatible-pointer-types]
         instance = &(stepgen->instances[i]);
                  ^
litexcnc_stepgen.c: In function ‘litexcnc_stepgen_process_read’:
litexcnc_stepgen.c:450:18: warning: assignment to ‘litexcnc_stepgen_instance_t *’ {aka ‘struct <anonymous> *’} from incompatible pointer type ‘int *’ [-Wincompatible-pointer-types]
         instance = &(stepgen->instances[i]);
                  ^
litexcnc_stepgen.c: In function ‘litexcnc_stepgen_init’:
litexcnc_stepgen.c:586:17: error: ‘stepgen->hal’ is a pointer; did you mean to use ‘->’?
     stepgen->hal.param.max_driver_freq = 400e3;
                 ^
                 ->
litexcnc_stepgen.c:590:24: warning: assignment to ‘int *’ from incompatible pointer type ‘litexcnc_stepgen_instance_t *’ {aka ‘struct <anonymous> *’} [-Wincompatible-pointer-types]
     stepgen->instances = (litexcnc_stepgen_instance_t *)hal_malloc(stepgen->num_instances * sizeof(litexcnc_stepgen_instance_t));
                        ^
litexcnc_stepgen.c:599:49: warning: initialization of ‘litexcnc_stepgen_instance_t *’ {aka ‘struct <anonymous> *’} from incompatible pointer type ‘int *’ [-Wincompatible-pointer-types]
         litexcnc_stepgen_instance_t *instance = &(stepgen->instances[i]);
                                                 ^
make: *** [/usr/share/linuxcnc/Makefile.modinc:115: litexcnc_stepgen.o] Error 1
Error: Compilation of the driver failed.