cea-hpc/shine

shine start sometimes fails to start some targets

Closed this issue · 15 comments

When starting a filesystem with shine, some targets sometimes fail to start with error 524.
We have investigated the issue, and it appears that this error comes from the parallel loading of the 'fsfilt_ldiskfs' module on I/O nodes.

Since 'fsfilt_ldiskfs' is not loaded until a first target has been started on the node, starting multiple targets at once leads to the mount commands trying to load 'fsfilt_ldiskfs' several times in parallel, and this sometimes ends in a race condition that causes one of the loading processes to fail.

[root@valx0 models]# shine start -f fs1test
[15:45] In progress for 3 component(s) on valx7 ...
valx7: Failed to start fs1test-OST0000 (/dev/ldn.da2.dost1)
valx7: >> mount.lustre: mount /dev/mapper/mpathb at /mnt/fs1test/ost/0 failed: Unknown error 524
[15:45] In progress for 3 component(s) on valx7 ...
[15:45] In progress for 1 component(s) on valx10 ...
Tuning skipped.
= FILESYSTEM STATUS (fs1test) =
TYPE # STATUS NODES
---- - ------ -----
MDT 1 online valx10
OST 1 offline valx7
OST 2 online valx7
[root@valx0 models]#

The workaround we found is to load the 'fsfilt_ldiskfs' module manually before starting.
But since version 1.3 now allows some server actions to be performed, we have set up a fix that loads 'fsfilt_ldiskfs' automatically at target start.
We made the modules load sequentially because otherwise we run into the same kind of race condition.

Index: b/lib/Shine/Lustre/FileSystem.py
===================================================================
--- a/lib/Shine/Lustre/FileSystem.py
+++ b/lib/Shine/Lustre/FileSystem.py
@@ -442,7 +442,11 @@
 
         # Add module loading, if needed.
         if need_load and first_comps is not None:
-            first_comps.depends_on(localsrv.load_modules())
+            lustremodule = localsrv.load_modules()
+            # fsfilt_ldiskfs is not installed on clients
+            if action == 'start':
+                lustremodule.depends_on(localsrv.load_modules(modname='fsfilt_ldiskfs'))
+            first_comps.depends_on(lustremodule)
         # Add module unloading to last component group, if needed.
         if need_unload and last_comps is not None:
             localsrv.unload_modules().depends_on(last_comps)

Regards.

Reported by: theryf

  • status: new --> accepted

Original comment by: degremont

I agree we could work around this issue with something similar, but there are several things to take care of:

  • _prepare() should be generic and should not make decisions based on the action name. It will become a real mess if behaviour depends on flags set in start(), stop(), ... and inside _prepare(). Instead of adding a second module-load action, you could use 'lustre ldiskfs' instead of just 'lustre' by default.
  • I think it is better to load ldiskfs instead of fsfilt_ldiskfs.

Original comment by: degremont

modprobe does not accept multiple modules:

[root@lama7 ~]# modprobe lustre ldiskfs
keep_port_988: no process killed
FATAL: Error inserting lustre (/lib/modules/2.6.32-358.6.2.el6.Bull.40.x86_64/extra/kernel/fs/lustre/lustre.ko): Unknown symbol in module, or unknown parameter (see dmesg)
FATAL: Error running install command for lustre
[root@lama7 ~]# 
[root@lama7 ~]# modprobe ldiskfs lustre
FATAL: Error inserting ldiskfs (/lib/modules/2.6.32-358.6.2.el6.Bull.40.x86_64/extra/kernel/fs/lustre-ldiskfs/ldiskfs.ko): Unknown symbol in module, or unknown parameter (see dmesg)
[root@lama7 ~]#

That is why we wrote the patch the way we did.

Original comment by: theryf

You can use "modprobe -a ldiskfs lustre".
Maybe LoadModules should be modified to support a list of module names (and add -a when needed).
But, in all cases, _prepare() should not make decisions based on action names.
Either we add a new option to pass the module list, or a new flag to indicate that ldiskfs should be loaded. It should be possible to load only "ldiskfs".
(This is also useful for tunefs/fsck.)
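The `-a` handling could be sketched like this. This is a hypothetical helper written for illustration, not the actual LoadModules code from Shine:

```python
def modprobe_command(modules):
    """Build a modprobe command line for one or more modules.

    Plain `modprobe mod1 mod2` treats the second word as a parameter
    for mod1 (as the transcript above shows), so '-a' is required
    when several modules must be loaded in one call.
    """
    if len(modules) > 1:
        return "modprobe -a %s" % " ".join(modules)
    return "modprobe %s" % modules[0]
```

With this, a single-module call keeps its current form and only multi-module calls gain the `-a` flag.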

Original comment by: degremont

modprobe -a could be a good option, thanks for pointing it out.
Regarding the modules, 'fsfilt_ldiskfs' is actually the one that fixes the issue.
I've run some tests loading only ldiskfs, but it is not enough, as seen in the following output:

# shine start -f fs1
Updating configuration file `tuning.conf' on lama[5-10]
lama10: start of fs1-OST0006 (/dev/ldn.cook.ost6) failed
lama10: >> mount.lustre: mount /dev/mapper/mpathj at /mnt/fs1/ost/6 failed: Unknown error 524
lama8: start of fs1-OST0003 (/dev/ldn.cook.ost4) failed
lama8: >> mount.lustre: mount /dev/mapper/mpathh at /mnt/fs1/ost/3 failed: Unknown error 524
[13:35] In progress for 1 component(s) on lama6 ...
Tuning skipped.
====== FILESYSTEM STATUS (fs1) =====
TYPE # STATUS                  NODES
---- - ------                  -----
MDT  1 online                  lama6
OST  2 offline                 lama[8,10]
OST  3 online                  lama[8-10]
OST  2 recovering for 0s (0/1) lama7

Here are the syslog lines corresponding to device fs1-OST0003 on lama8:

1383741322 2013 Nov  6 13:35:22 lama8 kern err kernel LustreError: 315:0:(fsfilt.c:122:fsfilt_get_ops()) Can't find fsfilt_ldiskfs interface
1383741322 2013 Nov  6 13:35:22 lama8 kern err kernel LustreError: 315:0:(osd_handler.c:5366:osd_mount()) fs1-OST0003-osd: Can't find fsfilt_ldiskfs
1383741322 2013 Nov  6 13:35:22 lama8 kern err kernel LustreError: 315:0:(obd_config.c:572:class_setup()) setup fs1-OST0003-osd failed (-524)
1383741322 2013 Nov  6 13:35:22 lama8 kern err kernel LustreError: 315:0:(obd_mount.c:201:lustre_start_simple()) fs1-OST0003-osd setup error -524
1383741322 2013 Nov  6 13:35:22 lama8 kern err kernel LustreError: 315:0:(obd_mount_server.c:1666:server_fill_super()) Unable to start osd on /dev/mapper/mpathh: -524
1383741322 2013 Nov  6 13:35:22 lama8 kern err kernel LustreError: 315:0:(obd_mount.c:1275:lustre_fill_super()) Unable to mount  (-524)

fsfilt_ldiskfs depends on ldiskfs, so ldiskfs gets loaded when loading fsfilt_ldiskfs.

I understand that the _prepare() method must remain as generic as possible.
But then, how can we differentiate actions that must behave differently depending on whether we are on a server or a client node?

  • the server needs 'fsfilt_ldiskfs' to be loaded prior to the first
    target start.
  • the client only needs the 'lustre' module to be loaded prior to mounting.
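One way to express this difference is a per-component module table. The sketch below is only an illustration of the idea; the table name, the function, and the exact module lists are assumptions drawn from this thread, not Shine's actual API:

```python
# Hypothetical table: which kernel modules each component type needs
# before its start/mount action (module names taken from this thread).
REQUIRED_MODULES = {
    'MDT': ['fsfilt_ldiskfs', 'lustre'],  # servers need the ldiskfs backend
    'OST': ['fsfilt_ldiskfs', 'lustre'],
    'client': ['lustre'],                 # clients only mount
    'router': ['lustre'],                 # routers have no local storage
}

def modules_for(comp_types):
    """Union of the modules needed by the component types present on
    one node, keeping first-seen order so dependencies load first."""
    modules = []
    for ctype in comp_types:
        for mod in REQUIRED_MODULES.get(ctype, ['lustre']):
            if mod not in modules:
                modules.append(mod)
    return modules
```

A node hosting only clients would then get just 'lustre', while a node hosting targets would also get 'fsfilt_ldiskfs'.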

Original comment by: theryf

Regarding fsfilt_ldiskfs, I've double-checked the issue we experienced a couple of months/years ago and the analysis that was done at that time, and everything points to ldiskfs. Since then, we always load the lustre and ldiskfs modules before starting Shine, and we no longer experience the problem.
This is the case for our 2.1 configuration. Maybe something changed in 2.4. Could you try this on 2.1?

I do not like using fsfilt_ldiskfs, as this module will surely disappear in the future.

Regarding the _prepare() modification:
If two commands have different needs, _prepare() should take an additional parameter to account for this difference.
Here is a simple example of this idea. I think it needs to be improved before landing, but here is the principle:

diff --git a/lib/Shine/Lustre/FileSystem.py b/lib/Shine/Lustre/FileSystem.py
index 9d1b8fb..4fa5d62 100644
--- a/lib/Shine/Lustre/FileSystem.py
+++ b/lib/Shine/Lustre/FileSystem.py
@@ -442,7 +442,10 @@ class FileSystem:
 
         # Add module loading, if needed.
         if need_load and first_comps is not None:
-            first_comps.depends_on(localsrv.load_modules())
+            modlist = ['lustre'] # Use default
+            if type(need_load) is list:
+                modlist = need_load
+            first_comps.depends_on(localsrv.load_modules(" ".join(modlist)))
         # Add module unloading to last component group, if needed.
         if need_unload and last_comps is not None:
             localsrv.unload_modules().depends_on(last_comps)
@@ -468,7 +471,7 @@ class FileSystem:
     def tunefs(self, comps=None, **kwargs):
         """Modify component option set at format."""
         comps = (comps or self.components).managed(supports='tunefs')
-        actions = self._prepare('tunefs', comps, **kwargs)
+        actions = self._prepare('tunefs', comps, need_load=['ldiskfs'], **kwargs)
         actions.launch()
         self._run_actions()
 
@@ -511,7 +514,7 @@ class FileSystem:
                                                OST.START_ORDER, MDT.START_ORDER
 
         actions = self._prepare('start', comps, groupby='START_ORDER',
-                                need_load=True, **kwargs)
+                                need_load=['lustre', 'ldiskfs'], **kwargs)
         actions.launch()
         self._run_actions()
 

There are 2 ways to manage this module loading:

  • either LoadModules supports a list of modules to be loaded,
  • or _prepare() loops over the module list passed as an argument and instantiates a LoadModule action for each of them, with dependencies between them.

LoadModules._already_done() should be fixed.

The non-regression tests should be adapted too.
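The second option (one load action per module, chained by dependencies) could look roughly like this. `Action` here is a minimal stand-in written for the sketch, not Shine's real action class:

```python
class Action(object):
    """Minimal stand-in for a Shine action (sketch only)."""
    def __init__(self, name):
        self.name = name
        self.deps = []

    def depends_on(self, other):
        self.deps.append(other)

def chained_loads(modnames):
    """Create one load action per module and chain them so each load
    waits for the previous one, avoiding the parallel-modprobe race
    described in this issue."""
    actions = [Action('load %s' % name) for name in modnames]
    for act1, act2 in zip(actions, actions[1:]):
        act2.depends_on(act1)
    return actions
```

The first action of the chain has no dependency; each subsequent one depends on its predecessor, so the loads are serialized.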

Original comment by: degremont

Your patch has a defect. It loads fsfilt_ldiskfs unconditionally at start, even on router nodes, where fsfilt_ldiskfs is useless and possibly not even installed.

After further thought, I think we need a bigger patch to manage this.
I plan to have a component list per Server. This is needed for several things, and it needs to be done before coding this patch.

Then, ServerAction could analyse this list to decide which commands to run.

Original comment by: degremont

I've set up a patch that relies on the fact that the Action subclasses know both the action to perform and the component type on which the action is performed.
Based on this, since the _prepare() method knows the list of local actions to run (saved in compgrp), it can build the list of modules to load by asking each action about its module-loading needs.
This also handles the format/tunefs case.
I've added this patch as an attachment.
Could you please have a look at it ?
Thanks.

Original comment by: theryf

Thanks for this interesting patch. Several comments:

  • The idea of having Actions return the list of modules they need is interesting. I need to work more on this to be sure it is THE right approach. For example, what about module unloading? It will be handled differently than module loading.

  • Most of the time, avoid prefixing methods with "get_". This is not Java code :). I think something like "needed_modules()" or "required_modules()" is better here.

  • But the big problem with this patch is your usage of compgrp. First, it is out of scope where you are using it. Then, it does not contain the full list of actions; the full list is contained in the ActionGroup instances held in 'subgraph'. You need to crawl the full graph to be sure you get all the actions. With this code, if you do a 'start' on a node with an MDT and an OST, you will only check the OST (the last compgrp for a 'start').

  • To chain all the ModuleLoad actions you are creating, please use this kind of construct:

      for act1, act2 in zip(subparts, subparts[1:]):
          act2.depends_on(act1)
    

Original comment by: degremont

Please find a new patch based on your comments at the following location:
bullxpfs/lustre-shine@af5f21c

I've added test cases in tests/Lustre/Actions/CmdTest.py and slightly modified
one test in tests/Lustre/Actions/LaunchTest.py to take into account the addition of fsfilt_ldiskfs to Server's lustre_check method.

Regards.

Original comment by: theryf

Hi, I've modified the patch based on your inline comments.
You can find the new one at the following URL:
bullxpfs/lustre-shine@57ccd58

Regards.

Original comment by: theryf

I forgot to tell you to also run pylint on your patch.
I've also identified one point of improvement in your patch. It would be nice to update it if you have time to address this.

Original comment by: degremont

Based on your comments, here is a version of the patch using an iterator over ActionGroup members:
bullxpfs/lustre-shine@c0405dd

Regards.

Original comment by: theryf

  • status: accepted --> closed
  • Milestone: 1.3 --> 1.3.1
  • Resolution: --> fixed

Original comment by: degremont

Fixed in [608f07] and [7dfe33]

Original comment by: degremont