openvstorage/alba

Number of NSMs increases rapidly

Closed this issue · 12 comments

Setup
Backends: 1x HDD backend, 3x flash backend (accel backends)
Each volumedriver has its own accel backend.

NSM load:
[image: nsms_load]

Number of namespaces:

root@pocops-voldrv01:~# alba list-namespaces --config arakoon://config/ovs/arakoon/flash-10-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini | grep Namespace.id | wc -l
236

Number of volumes in the volumedriver:

In [3]: client.list_volumes()
Out[3]: ['243269cd-9a50-46ca-90ad-bab8dbdd4cb9']
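To make the mismatch above easier to track over time, here is a minimal sketch that does the same counting from Python. `count_namespaces` mirrors `grep Namespace.id | wc -l`, and `list_namespaces` runs the same `alba list-namespaces --config ...` command shown above. The function names and the "one namespace per volume" baseline are my own assumptions, not something alba or the framework provides; other namespaces may legitimately exist on the backend.

```python
import re
import subprocess

# Same pattern as `grep Namespace.id`: the dot matches any single character.
NAMESPACE_LINE = re.compile(r"Namespace.id")


def count_namespaces(listing: str) -> int:
    """Count the lines that `grep Namespace.id | wc -l` would count."""
    return sum(1 for line in listing.splitlines() if NAMESPACE_LINE.search(line))


def list_namespaces(config_url: str) -> str:
    """Run `alba list-namespaces` and return its stdout (stderr is not captured)."""
    result = subprocess.run(
        ["alba", "list-namespaces", "--config", config_url],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


def excess_namespaces(namespace_count: int, volume_count: int) -> int:
    """Namespaces beyond one per volume (assumed baseline); never negative."""
    return max(0, namespace_count - volume_count)
```

With the numbers reported here, `excess_namespaces(236, 1)` gives 235.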

The maintenance agent is still running, but I am not sure whether the thread that cleans up the namespaces is hanging.

alba version:

root@pocops-voldrv01:~# alba version
1.3.3
git_revision: "tags/1.3.3-0-g43d384e"
git_repo: "https://github.com/openvstorage/alba.git"
compile_time: "10/01/2017 16:28:22 UTC"
machine: "62bac6200957 4.4.0-36-generic x86_64 x86_64 x86_64 GNU/Linux"
model_name: "Quad-Core AMD Opteron(tm) Processor 2350"
compiler_version: "4.03.0"

The main issue here is that deleted namespaces are removed from the HDD backend but not from the flash backend. This is what causes the growing number of NSMs.

The healthcheck (HC), run through CheckMK, creates multiple namespaces every 5 minutes. For example: a volume on the HDD backend, a proxy test object, a backend test object ... (contact @openvstorage/qa for more details on this).

The healthcheck also removes these namespaces, but because of the 90% eviction on the flash backend, deleting a namespace on the HDD backend does not trigger deletion of its child namespaces on the flash backend.

domsj commented

how was the delete namespace triggered?
using proxy-delete-namespace or using delete-namespace?

The healthcheck creates an XML file in the volumedriver mountpoint and then removes it:

return subprocess.check_output('touch /mnt/{0}/{1}.xml'.format(vp_name, test_name), stderr=subprocess.STDOUT, shell=True)
subprocess.check_output('rm -f /mnt/{0}/ovs-healthcheck-test-*.xml'.format(vp_name), stderr=subprocess.STDOUT, shell=True)
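For reference, the same create-and-remove step can be done without going through a shell. This is only a sketch of an equivalent, not the healthcheck's actual code: the function names and the `mount_root` parameter (which stands in for `/mnt`) are hypothetical.

```python
from pathlib import Path


def touch_test_file(mount_root: Path, vp_name: str, test_name: str) -> Path:
    """Create <mount_root>/<vp_name>/<test_name>.xml, like `touch` does."""
    path = mount_root / vp_name / "{0}.xml".format(test_name)
    path.touch()
    return path


def remove_test_files(mount_root: Path, vp_name: str) -> int:
    """Remove ovs-healthcheck-test-*.xml under the vPool mountpoint.

    Returns the number of files removed.
    """
    removed = 0
    for path in (mount_root / vp_name).glob("ovs-healthcheck-test-*.xml"):
        path.unlink()
        removed += 1
    return removed
```

Avoiding `shell=True` means `vp_name` and `test_name` are never interpreted by a shell, and the removal reports how many files it actually deleted.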

I suppose the volumedriver calls the proxy to delete the namespace.

@JeffreyDevloo see Jan's question above ...

using proxy-delete-namespace

Creating the volumes/xml has been disabled as of the last two releases of the healthcheck though.

@domsj what's next?

domsj commented

I can reproduce this, and have a hypothesis... further investigating, hope to have a fix available soon.

domsj commented

I was a bit optimistic about being able to reproduce ... but thinking about it some more, I can imagine a few scenarios where this might happen in combination with the bug in ensure single from earlier on.

Is it possible to re-enable these checks on a testing environment which has a correctly functioning ensure single decorator, and then keep a close eye on the number of namespaces in the fragment cache to see if this still happens?
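Keeping an eye on the count could be as simple as sampling it at intervals and checking the trend. A rough sketch, assuming some `get_count` callable that returns the current number of namespaces in the fragment cache (for example by counting `alba list-namespaces` output); all names here are hypothetical.

```python
import time
from typing import Callable, List


def sample_counts(get_count: Callable[[], int], samples: int, interval_s: float) -> List[int]:
    """Call get_count `samples` times, sleeping `interval_s` between calls."""
    counts = []
    for i in range(samples):
        counts.append(get_count())
        if i < samples - 1:
            time.sleep(interval_s)
    return counts


def keeps_growing(counts: List[int]) -> bool:
    """True when the count never drops and ends higher than it started."""
    never_drops = all(b >= a for a, b in zip(counts, counts[1:]))
    return len(counts) >= 2 and never_drops and counts[-1] > counts[0]
```

A count that only ever goes up while the healthcheck keeps creating and deleting test objects would point at the leak; a count that goes up and down again would not.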

(Regardless, there are probably still some improvements that could be made to alba with regard to this issue.)

@domsj any update on this? Is this still an issue ( @jeroenmaelbrancke @matthiasdeblock ) ?

domsj commented

I vaguely recall making some improvement related to this subject, but can't remember if it was before or after this ticket ... so I'm wondering too whether this is still an issue (@openvstorage/operations ?)

Ran a test on the pocops environment, and it looks like the number of namespaces on the cache backend decreases when a file is deleted on the volumedriver.

for i in {0..100}
do
    echo "file: $i"
    dd if=/dev/urandom of=/mnt/vmstor/test_$i.log bs=1M count=2
    rm /mnt/vmstor/test_$i.log
done

result on the cachebackend:

root@pocops-voldrv01:~# alba list-namespaces --config arakoon://config/ovs/arakoon/flash-10-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini | grep Namespace.id | wc -l
2018-01-11 15:17:53 375791 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 0 - info - Albamgr_client.make_client :flash-10-abm
2018-01-11 15:17:53 379540 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 1 - info - connect_with : 10.100.200.12 26404 None Net_fd.TCP (fd:3)
2018-01-11 15:17:53 379706 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 2 - info - connect_with 10.100.200.12 26404 None Net_fd.TCP (fd:3) succeeded
2018-01-11 15:17:53 381368 +0100 - pocops-voldrv01 - 10508/0000 - alba/cli - 3 - info - closing (fd:3)
24
root@pocops-voldrv01:~# alba list-namespaces --config arakoon://config/ovs/arakoon/flash-10-abm/config?ini=%2Fopt%2FOpenvStorage%2Fconfig%2Farakoon_cacc.ini | grep Namespace.id | wc -l
2018-01-11 15:17:55 420875 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 0 - info - Albamgr_client.make_client :flash-10-abm
2018-01-11 15:17:55 424287 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 1 - info - connect_with : 10.100.200.12 26404 None Net_fd.TCP (fd:7)
2018-01-11 15:17:55 424850 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 2 - info - connect_with 10.100.200.12 26404 None Net_fd.TCP (fd:7) succeeded
2018-01-11 15:17:55 426775 +0100 - pocops-voldrv01 - 10549/0000 - alba/cli - 3 - info - closing (fd:7)
23

Looks like the problem has been solved.
Tested with Alba version 1.6.1.