Amount of nsms increase rapidly
Closed this issue · 12 comments
Setup
Backends: 1x HDD backend, 3x flash backend (accelerated backend)
Each volumedriver has its own accelerated backend.
Number of namespaces:
root@pocops-voldrv01:~# alba list-namespaces --config arakoon://config/ovs/arakoon/flash-10-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini | grep Namespace.id | wc -l
236
Number of volumes in the volumedriver:
In [3]: client.list_volumes()
Out[3]: ['243269cd-9a50-46ca-90ad-bab8dbdd4cb9']
The maintenance agent is still running, but I am not sure whether the thread that cleans up the namespaces is hanging.
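For anyone who wants to track this count from a script instead of the shell pipeline, here is a minimal Python sketch. It assumes only what the command above shows: that `alba list-namespaces --config <url>` prints one line containing `Namespace.id` per namespace. The function names and the sample text are hypothetical, and the real output layout may differ.

```python
import subprocess


def count_namespaces(cli_output, marker="Namespace.id"):
    """Count lines containing the marker, similar to `grep Namespace.id | wc -l`.

    Note: this is a literal substring match, whereas grep treats '.' as
    "any character"; for this marker the difference should not matter.
    """
    return sum(1 for line in cli_output.splitlines() if marker in line)


def list_namespaces_output(config_url):
    """Run `alba list-namespaces --config <config_url>` and return its stdout."""
    result = subprocess.run(
        ["alba", "list-namespaces", "--config", config_url],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical sample text, only used to exercise the counter.
    sample = "{ Namespace.id = 1;\n  name = a }\n{ Namespace.id = 2;\n  name = b }\n"
    print(count_namespaces(sample))
```

Comparing this number with `len(client.list_volumes())` gives a quick indication of how many namespaces on the flash backend no longer belong to a volume.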
alba version:
root@pocops-voldrv01:~# alba version
1.3.3
git_revision: "tags/1.3.3-0-g43d384e"
git_repo: "https://github.com/openvstorage/alba.git"
compile_time: "10/01/2017 16:28:22 UTC"
machine: "62bac6200957 4.4.0-36-generic x86_64 x86_64 x86_64 GNU/Linux"
model_name: "Quad-Core AMD Opteron(tm) Processor 2350"
compiler_version: "4.03.0"
The main issue here is that deleted namespaces are removed from the HDD backend but not from the flash backend. This is what causes the growing number of NSMs.
The healthcheck (HC) creates multiple namespaces every 5 minutes (CheckMK), for example for a volume on the HDD backend, a proxy test object, a backend test object ... (contact @openvstorage/qa for more details on this).
The healthcheck also removes these namespaces, but because of the 90% eviction on the flash backend, deleting a namespace on the HDD backend does not trigger a delete of its child namespaces on the flash backend.
How was the delete namespace triggered: using proxy-delete-namespace or using delete-namespace?
The healthcheck creates an XML file in the volumedriver mountpoint and then removes it:
return subprocess.check_output('touch /mnt/{0}/{1}.xml'.format(vp_name, test_name), stderr=subprocess.STDOUT, shell=True)
subprocess.check_output('rm -f /mnt/{0}/ovs-healthcheck-test-*.xml'.format(vp_name), stderr=subprocess.STDOUT, shell=True)
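As a side note, the same create-and-remove cycle can be done without `shell=True`. The sketch below is not the healthcheck's actual code. The function names are hypothetical, and the only things taken from the snippet above are the `.xml` suffix and the `ovs-healthcheck-test-*.xml` cleanup pattern.

```python
import tempfile
from pathlib import Path


def create_test_file(mount_dir, test_name):
    """Create an empty <test_name>.xml in mount_dir, like `touch` does."""
    path = Path(mount_dir) / "{0}.xml".format(test_name)
    path.touch()
    return path


def remove_test_files(mount_dir, pattern="ovs-healthcheck-test-*.xml"):
    """Remove all files matching the pattern and return their names, sorted."""
    removed = []
    for path in Path(mount_dir).glob(pattern):
        path.unlink()
        removed.append(path.name)
    return sorted(removed)


if __name__ == "__main__":
    # A temporary directory stands in for /mnt/<vpool name> in this demo.
    with tempfile.TemporaryDirectory() as mount_dir:
        create_test_file(mount_dir, "ovs-healthcheck-test-demo")
        print(remove_test_files(mount_dir))
```

Returning the list of removed names makes it easier to log which test files, and therefore which namespaces, were expected to disappear.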
I suppose the volumedriver calls the proxy to delete the namespace.
@JeffreyDevloo see Jan's question above ...
using proxy-delete-namespace
Creating the volumes/xml has been disabled as of the last two releases of the healthcheck though.
I can reproduce this, and have a hypothesis... further investigating, hope to have a fix available soon.
I was a bit optimistic about being able to reproduce ... but thinking about it some more, I can imagine a few scenarios where this might happen in combination with the bug in ensure single from earlier on.
Is it possible to re-enable these checks on a testing environment that has a correctly functioning ensure single decorator, and then keep a close eye on the number of namespaces in the fragment cache to see if this still happens?
(Regardless, there are probably still some improvements that could be made to alba with regard to this issue.)
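To make "keeping a close eye" on the namespace count a bit more concrete, a simple heuristic over periodic samples could look like the sketch below. The function name and the `tolerance` parameter are hypothetical, and the samples are assumed to be namespace counts collected at regular intervals (for example with the `alba list-namespaces ... | grep Namespace.id | wc -l` command from the first post).

```python
def looks_like_leak(samples, tolerance=0):
    """Return True when the counts never shrink and grew by more than tolerance.

    samples: namespace counts in chronological order.
    tolerance: growth that is still considered normal (assumed value).
    """
    if len(samples) < 2:
        return False
    # A healthy cache backend should show counts going down again after deletes.
    never_shrinks = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_shrinks and (samples[-1] - samples[0]) > tolerance


if __name__ == "__main__":
    print(looks_like_leak([24, 60, 236]))  # counts only ever go up
    print(looks_like_leak([24, 23]))       # count drops after a delete
```

This is only a rough signal: a short sampling window or a burst of new volumes can also produce a steadily growing series.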
@domsj any update on this? Is this still an issue ( @jeroenmaelbrancke @matthiasdeblock ) ?
I vaguely recall making some improvement related to this subject, but can't remember if it was before or after this ticket ... so I'm wondering too whether this is still an issue (@openvstorage/operations ?)
Ran a test on the pocops environment, and it looks like the number of namespaces on the cache backend decreases when the file on the volumedriver is deleted.
for i in {0..100}
do
echo "file: $i"
dd if=/dev/urandom of=/mnt/vmstor/test_$i.log bs=1M count=2
rm /mnt/vmstor/test_$i.log
done
Result on the cache backend:
root@pocops-voldrv01:~# alba list-namespaces --config arakoon://config/ovs/arakoon/flash-10-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini | grep Namespace.id | wc -l
2018-01-11 15:17:53 375791 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 0 - info - Albamgr_client.make_client :flash-10-abm
2018-01-11 15:17:53 379540 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 1 - info - connect_with : 10.100.200.12 26404 None Net_fd.TCP (fd:3)
2018-01-11 15:17:53 379706 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 2 - info - connect_with 10.100.200.12 26404 None Net_fd.TCP (fd:3) succeeded
2018-01-11 15:17:53 381368 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 3 - info - closing (fd:3)
24
root@pocops-voldrv01:~# alba list-namespaces --config arakoon://config/ovs/arakoon/flash-10-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini | grep Namespace.id | wc -l
2018-01-11 15:17:55 420875 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 0 - info - Albamgr_client.make_client :flash-10-abm
2018-01-11 15:17:55 424287 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 1 - info - connect_with : 10.100.200.12 26404 None Net_fd.TCP (fd:7)
2018-01-11 15:17:55 424850 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 2 - info - connect_with 10.100.200.12 26404 None Net_fd.TCP (fd:7) succeeded
2018-01-11 15:17:55 426775 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 3 - info - closing (fd:7)
23
Looks like the problem has been solved.
Tested with Alba version 1.6.1.