Attempting to remove a variable ... that is used by existing constraints ...
Closed this issue · 11 comments
When running a fixed-lag smoother, the following error occasionally occurs at random times, even though no client-side code removes any variables, so I assume it comes from the marginalization. The error cannot be caught in the sensor model code. Could it be changed to a warning message so that it does not crash the program, or does this error mean a total failure of the problem being optimized? Is it possible to have a setting so that, when this happens, the offending constraints are removed instead?
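For readers unfamiliar with the error, here is a minimal sketch of the invariant behind it: a variable cannot be dropped from the graph while a constraint still refers to it. The `ToyGraph` class, its method names, and the variable names are hypothetical and written in Python for brevity; this is an illustration under those assumptions, not fuse's actual graph implementation.

```python
class ToyGraph:
    """Hypothetical stand-in for a factor graph; not the fuse API."""

    def __init__(self):
        self.variables = set()
        self.constraints = {}  # constraint id -> tuple of variable ids

    def add_variable(self, var):
        self.variables.add(var)

    def add_constraint(self, cid, var_ids):
        missing = [v for v in var_ids if v not in self.variables]
        if missing:
            raise ValueError(f"constraint {cid} uses unknown variables {missing}")
        self.constraints[cid] = tuple(var_ids)

    def remove_constraint(self, cid):
        del self.constraints[cid]

    def remove_variable(self, var):
        # Refuse to remove a variable that any constraint still references.
        users = sorted(c for c, vs in self.constraints.items() if var in vs)
        if users:
            raise RuntimeError(
                f"Attempting to remove a variable ({var}) "
                f"that is used by existing constraints ({users})")
        self.variables.discard(var)


graph = ToyGraph()
for name in ("pose_1", "pose_2", "landmark_7"):
    graph.add_variable(name)
graph.add_constraint("reproj_a", ["pose_1", "landmark_7"])
graph.add_constraint("reproj_b", ["pose_2", "landmark_7"])

# pose_1's only constraint is removed first, so removing pose_1 succeeds.
graph.remove_constraint("reproj_a")
graph.remove_variable("pose_1")

# landmark_7 is still referenced by reproj_b, so removing it fails.
try:
    graph.remove_variable("landmark_7")
    removal_error = None
except RuntimeError as exc:
    removal_error = str(exc)
```

In this toy model the failure means the bookkeeping of which constraints touch which variables has gone out of sync, which is why it is treated as an error rather than something to ignore.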
This sounds like a bug in the marginalization code. Is there any way you can share your fuse configuration and maybe a bagfile that I can replay to reproduce the problem? If I can reproduce the error, I'll get the bug fixed as soon as possible.
I have it configured to throw because this error simply should not happen. But I will review the code to determine if this error is truly catastrophic, or if I can reduce that exception to a warning/error log.
I'm not able to share the fuse config, as I am currently running custom sensor models in a private repository. However, the error seems to occur more frequently when the lag duration is small (~5 seconds) and the optimization period is low (~0.5 seconds). I'm currently adding new position/orientation variables at a rate of 5 Hz.
That makes finding the issue harder. Is there anything else you can tell me about your setup:
- How long the system runs between errors
- Number of active sensors
- What variables are involved in the sensor constraints? e.g. A fuse_models IMU sensor only uses the current angular velocity. A custom odometry constraint might use the current position and orientation and an older position and orientation. And a visual odometry system might use the current position and orientation as well as some unstamped visual landmarks.
- Are you using the fuse_models motion model, or a custom motion model? If custom, can you tell me anything about the involved variables?
If I know more details, I could create some trivial sensors that roughly mimic the connectivity of your configuration. I can then run that setup with some simulated data and see if I can reproduce your issue. At a minimum, I'd be able to share that configuration with you, and we could iterate on that.
- Completely different each time. Sometimes it occurs after about 5 seconds; sometimes it does not occur at all
- Currently running with just one visual odom sensor model
- The variables added at each keyframe timestep (5 Hz) are position, orientation, and any unstamped position visual landmarks that it sees. Each pose variable has a reprojection constraint to an unstamped position variable. I am tracking around 100 landmarks, so a rough estimate is about 70 reprojection constraints per keyframe. Each new keyframe adds anywhere from 0 to 30 new landmarks
- Using a 3D version of the unicycle 2D motion model, which has all the same variables as the 2D version, just extended to 3D
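As a rough sense of scale, the figures quoted in this thread (5 Hz keyframes, a lag of about 5 seconds, an optimization period of about 0.5 seconds, roughly 70 reprojection constraints per keyframe) can be combined as below. This is back-of-the-envelope arithmetic only: the variable names are made up for the example, and it ignores motion-model constraints and landmark variables.

```python
# Approximate figures taken from the comments above.
keyframe_rate_hz = 5.0
lag_duration_s = 5.0
optimization_period_s = 0.5
reprojection_per_keyframe = 70

# Keyframes held in the smoother window at any one time.
keyframes_in_window = keyframe_rate_hz * lag_duration_s
# New keyframes arriving between two consecutive optimizations.
keyframes_per_cycle = keyframe_rate_hz * optimization_period_s
# Reprojection constraints held in the window.
reprojection_in_window = keyframes_in_window * reprojection_per_keyframe
```

Under these assumptions the window holds about 25 keyframes and on the order of 1750 reprojection constraints, with two or three new keyframes arriving per optimization cycle.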
Perfect. Thanks. It may take me a few days to put something together. But I'll let you know when I have something -- either a fix or a test system for you to try out.
One last question -- what ROS version are you using? Kinetic, Melodic, or Noetic?
kinetic
Actually, I've changed the algorithm for selecting the variables to be marginalized in the melodic/noetic release. I've created a branch with that change backported into kinetic. Would you mind testing if that branch fixes your issue? If it still fails, then I'll create a simulated environment and attempt to reproduce the issue.
branch: RST-3240-backport-kinetic
PR: #249
I'll start testing this and get back to you.
I was still able to get this error with this fix.
This error occurs more frequently when I let the bag run. When I pause the bag every once in a while, the error seems not to occur.
As part of a recent ROSCon talk, I added a fuse_tutorials package. One of those tutorials creates a "range to beacon" sensor. I have modified that tutorial in the branch "backport-issue248-landmark-tracking-example" to better approximate what you have described:
- A robot drives around a simulated space in a big circle
- There are a set of "beacons" at fixed locations within the space
- The robot can measure the range to each beacon within the range sensor's field of view (180deg)
- When the range sensor detects a new/inactive landmark, it adds a prior constraint on the landmark position from a database of landmark positions. This new landmark is now "active".
- As long as the landmark is continuously observed, the landmark remains active and range constraints are added for each landmark.
- When a landmark leaves the field of view, it is no longer considered "active" by the range sensor.
- The optimizer will keep the landmarks as part of the graph until N seconds after the last robot pose observed the landmark, where N is the fixed-lag duration (5s).
So, new landmarks are added to the graph as soon as they enter the field of view. Roughly 5s after the robot passes by a landmark, it will be marginalized out from the graph. Since the simulated robot is driving in a circle, it will eventually encounter the landmark again. It will get re-added to the graph at that point. I think this is a rough approximation/simplification of your visual odometry/slam system. At least, that is my intent.
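The behaviour described above can be sketched roughly as follows. This is a simplified illustration in Python, not the tutorial's sensor code or the fuse marginalization algorithm: the function names, the `readings` table, and the timestamps are all made up for the example.

```python
def landmarks_to_marginalize(last_seen, now, lag_duration):
    """Return ids of landmarks last observed before the lag window started.

    last_seen maps landmark id -> timestamp (s) of the last observing pose.
    """
    cutoff = now - lag_duration
    return sorted(lm for lm, stamp in last_seen.items() if stamp < cutoff)


def update_active(active, visible):
    """Return (new active set, landmarks that just became active)."""
    visible = set(visible)
    newly_activated = visible - active  # these receive a prior constraint
    return visible, newly_activated


LAG = 5.0  # fixed-lag duration in seconds
# (timestamp, landmarks currently in the field of view); invented data
readings = [(0.0, {"A", "B"}), (2.0, {"B", "C"}), (6.0, {"C"}), (12.0, {"A"})]

last_seen = {}  # landmarks currently in the graph
active = set()  # landmarks the sensor considers continuously observed
events = []

for stamp, visible in readings:
    # Drop landmarks whose last observation fell out of the lag window.
    for lm in landmarks_to_marginalize(last_seen, stamp, LAG):
        del last_seen[lm]
        events.append((stamp, "marginalized", lm))
    # Newly seen (or re-seen) landmarks get a prior from the database.
    active, newly_activated = update_active(active, visible)
    for lm in sorted(newly_activated):
        events.append((stamp, "prior added", lm))
    # Every visible landmark gets a range constraint at this pose.
    for lm in visible:
        last_seen[lm] = stamp
```

In this run landmark "A" is marginalized at t = 6 s and re-added with a fresh prior at t = 12 s, mirroring a robot that passes the same beacon again on its next lap.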
If you check out the branch "backport-issue248-landmark-tracking-example" and run "roslaunch fuse_tutorials range_sensor_tutorial.launch", you should be able to watch this in action. The ground-truth beacon positions are displayed in rviz as red dots, and the set of landmarks contained in the graph are the yellow dots. You can see new landmarks being added to the graph and old landmarks getting removed as the robot drives in a circle.
I've left that running for over 12 hours now without any issues at all. I'm hoping this example will either point out an issue in your code, or that you can modify it in some way to reproduce the issue you are having.
Let me know if you have any problems running the example, and whether you are able to reproduce your graph errors somehow using this example. If we can make something that reproduces your issue, I'll have a shot at fixing it.