wtsi-ssg/pcp

preserve stripe information option?

Closed this issue · 11 comments

It seems like the stripe preservation options for Lustre are simply stripe=1 or stripe=-1. Would it be possible to get an option to preserve striping as it was on the source?

Yes; at the moment the stripe logic is "all or nothing" (modulo the size filtering options). I can look at preserving other layouts. (The reason I originally went with -1/1 was that the number of OSTs was different on the source and destination, and so preserving exact stripe information did not make sense.)

Can you give me some examples for stripe layouts other than stripe=1 / stripe=-1 that you use? It will help me come up with a workable solution.

Thanks,

Guy

(Forgot to mention:) The code in the xlstripe branch has the "-lc" option that allows you to specify the amount of data to write per OST, which gives a variable stripe count depending on the total size of the file. Would that be useful?
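
Roughly speaking, the calculation would be something like this (an illustrative sketch with made-up names, not the actual pcp code):

import math

def stripes_for_file(filesize, data_per_ost, max_stripes):
    # One stripe per data_per_ost bytes of file content, capped at the
    # destination filesystem's maximum stripe count.
    if filesize <= 0:
        return 1
    return min(max(1, int(math.ceil(float(filesize) / data_per_ost))), max_stripes)

So with 1 GiB of data per OST, a 10 GiB file would get 10 stripes and a 100 MiB file would get 1.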

I'm migrating data between filesystems as they need to be cleared to reformat and also upgrade to gridraid, which will actually create a scenario where I have differing OST counts available. In one example it will be 144 OSTs down to 36. In another, it will be 1440 OSTs down to 720, then later from 720 to 180. I might be wrong, but my impression was that -1 forced an object creation per OST, so when that width wasn't needed, it would create unnecessary overhead for several operations.
In addition to that, the 1440->720 migration is to a target that doesn't support wide striping yet, so -1 breaks because the stripe count can't go beyond 160 (until after the upgrade is performed).
To answer your question about examples, 32-, 64-, or 128-wide stripes are pretty common. Since I'm doing a migration, my initial goal was to preserve what the user originally set where possible and reduce it when necessary. I had also thought about using the opportunity to "restripe" to appropriate levels, but later decided to defer that to another operation. I had thought about two options:
A) Preserve stripe widths where possible, but just go to -1 when the target filesystem's max width is exceeded. (The max could be supplied by a command-line argument if it isn't efficient to detect automatically for some reason.)
B) An alternative feature that would scale stripe counts by some specified ratio. E.g. since I'm migrating to OSTs with 4x the capacity and drives, it's conceivable that I might translate stripe counts according to that ratio: 32 would become 8, etc. Single-stream tests on the "fatter" OSTs are not showing 4x performance though, so I think that option is less likely. A rough sketch of both ideas is below.
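
Pseudocode only (the helper name and arguments are made up, not part of pcp; ratio would only apply for option B):

def map_stripe_count(src_count, dest_max, ratio=None):
    # Option A: keep the source stripe count where possible, fall back to -1
    # once the destination's maximum width would be exceeded.
    # Option B: first scale the count down by a capacity ratio (e.g. 4x
    # fatter OSTs -> divide by 4), then apply the same cap.
    if src_count == -1:
        return -1                       # already "stripe across everything"
    count = src_count
    if ratio:
        count = max(1, src_count // ratio)
    return count if count <= dest_max else -1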

So option A is the preferred one, I think. I was very excited to have happened across this tool. I haven't tested it yet, but if it's speedy, it could really be a lifesaver. I've got nearly 20 PB of data to move around between filesystems that are getting upgraded, a large pool of data-movement nodes, and not much time to do it. I'm aiming to move a copy while the filesystem is in use at first, then do a final resynchronization pass while it's offline.

Hi Jeremy,

If you want the closest possible replica of your source filesystem, the "xlstripe" branch includes a "-l" option, which copies Lustre stripe settings from source to destination. Unlike the "-l" option in the master branch, this version copies the stripe count and size from the source file, rather than simply classifying each file as striped or unstriped. The number of stripes is capped at the maximum value supported by the destination filesystem.
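
In effect it does something along these lines (a simplified sketch using the lfs command-line tools rather than the liblustreapi bindings pcp actually uses, so treat it as illustrative only):

import subprocess

def copy_stripe_settings(src, dest, dest_max_stripes):
    # Read the source stripe count and size, then create the destination
    # with the same layout, capping the count at the destination's maximum.
    count = int(subprocess.check_output(["lfs", "getstripe", "-c", src]).strip())
    size = int(subprocess.check_output(["lfs", "getstripe", "-S", src]).strip())
    subprocess.check_call(["lfs", "setstripe",
                           "-c", str(min(count, dest_max_stripes)),
                           "-S", str(size), dest])

(lfs setstripe has to run before any data is written to the destination, since the layout is fixed at file creation.)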

Guy mentioned the "-lc" option, which is also in the "xlstripe" branch. The "-lc" option only has an effect if used with the "-lf" option. These options will force restriping of every file, with the number of stripes calculated so that the content of each file is divided amongst OSTs in slabs of a specified size. If your Lustre filesystems are anything like those that I have seen, many users do not give a lot of thought to stripe settings, so the "-lc" option may help to streamline the system and improve overall performance.

I suggest that you try the "xlstripe" branch to see if it does what you need. But please bear in mind that this branch has not been widely used, so if there are any problems, please report them so that we can try to fix them! As a precaution against data corruption, I recommend that you use the "-c" option to detect any differences between the source and destination files.

Regards,
Milton.

I'll try this asap and will let you know how it goes- this sounds very promising.

Finally got a chance to try out xlstripe branch, and this is the first I've used mpi4py as well. I'm up to the point of hitting a failure, and I'm not sure what to make of it.

jenos@nid24518:~/build/pcp> mpirun -np 20 -genvlist PATH,LD_LIBRARY_PATH -f ~/mnf ~jenos/build/pcp/pcp -l -c ~jenos/build /scratch/staff/jenos/test/
R0: All workers have reported in.
Starting 20 processes.
Files larger than 500 Mbytes will be copied in parallel chunks.
Will copy lustre stripe information.
Will md5 verify copies.
Maximum number of destination stripes is 1440.

Starting phase I: Scanning and copying directory structure...
Exception on rank 0 host nid24518:
Traceback (most recent call last):
File "/u/staff/jenos/build/pcp/pcp", line 1607, in
scantree(sourcedir, destdir, statedb)
File "/u/staff/jenos/build/pcp/pcp", line 369, in scantree
listofpaths = walker.Execute(sourcedir)
File "/mnt/a/u/staff/jenos/build/pcp/pcplib/parallelwalk.py", line 260, in Execute
self._ProcessNode()
File "/mnt/a/u/staff/jenos/build/pcp/pcplib/parallelwalk.py", line 175, in _ProcessNode
self.ProcessDir(filename)
File "/u/staff/jenos/build/pcp/pcp", line 1401, in ProcessDir
copyDir(directoryname, newdir)
File "/u/staff/jenos/build/pcp/pcp", line 1103, in copyDir
stripesize=stripesize, stripeoffset=stripeoffset)
File "/mnt/a/u/staff/jenos/build/pcp/pcplib/lustreapi.py", line 214, in setstripe
raise IOError(err, os.strerror(err))
IOError: [Errno 17] File exists

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
jenos@nid24518:~/build/pcp>

Hi,

I suspect this is the Lustre API throwing an error because the code is trying to set striping on a file that already exists or already has some striping set. I'll take a look at it.
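
If that is what is happening, the fix on the copyDir path may be as simple as tolerating EEXIST from setstripe; something like the sketch below (the module path and the stripecount keyword are guessed from the traceback, so treat it as a rough illustration, not the actual patch):

import errno
from pcplib import lustreapi  # pcp's liblustreapi wrapper (import path assumed)

def setstripe_tolerant(path, stripecount, stripesize, stripeoffset):
    try:
        lustreapi.setstripe(path, stripecount=stripecount,
                            stripesize=stripesize, stripeoffset=stripeoffset)
    except IOError as e:
        if e.errno != errno.EEXIST:
            raise  # only swallow "File exists"; re-raise anything else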

Thanks,

Guy

I suppose I could/should have included some extra detail... the target "test" directory is empty. I thought maybe it wanted to create the target itself, so I also attempted a .../test/dir (where dir doesn't exist). I got an error (a different error) when I tried that, but that would be expected. Thanks for your extremely rapid response!

(Just to help my troubleshooting; does pcp from the main branch work for you?)

Arrgh, that helped my troubleshooting as well. The main branch had the same issue, and when I went back to add the extra detail about the target that I'd meant to include before, I found the problem: "test" was a file, not a directory, so pcp was telling me exactly what was going on and I wasn't listening. Sorry, working too late at night.
The first test went very well. I have a lot to verify yet, but signs are positive. Sorry for the false alarm.

I have verified that stripe information is preserved. I didn't have the opportunity to test the scenario where the source stripe count is wider than the target can accommodate, but the lfs setstripe command seems to handle that OK, and I presume the same library function is used.
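
For reference, stripe preservation can be spot-checked with something like this (the paths below are placeholders):

import subprocess

def stripe_count(path):
    # 'lfs getstripe -c' prints just the stripe count of a file
    return int(subprocess.check_output(["lfs", "getstripe", "-c", path]).strip())

print(stripe_count("/old_fs/dir/somefile"), stripe_count("/new_fs/dir/somefile"))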