scarlehoff/pyHepGrid

Issue with SyncJobs

Opened this issue · 5 comments

As per #82, you shouldn't automate proxy renewal unless you use MyProxy.

Also, the CE list should be configurable, or the sync step made somehow much smarter than this.

An arcsync is a heavy operation, and trying to sync against 12 different CEs that you're not running on is costly in time and network; there's also no check that the servers are even active, so you're relying on timeouts.

I'm not sure of the best way to improve this, but presumably you could write the list of CEs somewhere as part of the submission and read it back in later?
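For what it's worth, a minimal sketch of that idea: record each CE at submission time, then loop the sync only over those. The file path `~/.arc/submitted_ces.txt` and both function names are made up for illustration, not existing pyHepGrid interfaces.

```shell
#!/bin/sh
# Hypothetical record of CEs used at submission time.
CE_LIST="${CE_LIST:-$HOME/.arc/submitted_ces.txt}"

# Call at submission time with the CE the job was sent to (deduplicated).
record_ce() {
    ce="$1"
    touch "$CE_LIST"
    grep -qxF "$ce" "$CE_LIST" || echo "$ce" >> "$CE_LIST"
}

# Call from cron: only contact CEs we actually submitted to.
sync_known_ces() {
    [ -s "$CE_LIST" ] || return 0
    while read -r ce; do
        arcsync -f -c "$ce" -j ~/jobs.xml
    done < "$CE_LIST"
}
```

That way the cron job never touches a CE that no submission ever used, regardless of how long the remote databases are.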

Hi Adam,

I assume the loop over arcsync was adapted from Jeppe's tutorial here:

```shell
for i in {arc-ce0{1..4}.gridpp.rl.ac.uk,svr0{09,10,11,19}.gla.scotgrid.ac.uk,ce{1..4}.dur.scotgrid.ac.uk}; do
    arcsync -f -c $i -j ~/jobs.xml
done
# for i in {ce{1..4}.dur.scotgrid.ac.uk}; do arcsync -f -c $i -j ~/jobs.xml; done
(nohup arcrenew -a -j ~/jobs.xml >& ~/renew.res &)
```

I would assume (and hope) that if the intersection of jobs present in both the local and remote databases is empty or very short, the arcsync command terminates quickly, without imposing a heavy load at either end.

It presumably becomes heavy when both lists (and their intersection) are long, but that load is unavoidable (since they're the ones we need to update whatever else we do).

So I don't think this loop is necessarily problematic, if arcsync indeed exits cleanly when there are no jobs to update.

I have never tried testing this - is the load of an 'empty' arcsync substantial?

arcsync checks the remote system to update the local database, so for the likes of larger sites this can take a very long time to search their database; needless syncs from many users at similar intervals will cause memory, disk and network loading issues.

arcrenew only imposes a load when it actually renews, but you do that blindly for every job in the database. If this runs every hour it should really renew only when the proxy lifetime is too short, or else renew only once every ~12 hours. You would have to assume that the proxy in the running jobs is the same age as the one locally.

The renew is ok if the job database is empty as it poses only a little load locally.
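A sketch of that "only renew when the proxy is short" check. The `MIN_LEFT` threshold is a made-up knob, and it assumes your ARC client's `arcproxy -i validityLeft` prints the remaining proxy lifetime in seconds; verify that on your installation before relying on it.

```shell
#!/bin/sh
# Renew only when the local proxy is close to expiry, instead of blindly
# running arcrenew on every cron tick.
MIN_LEFT="${MIN_LEFT:-43200}"   # 12 hours, in seconds

proxy_needs_renewal() {
    # Assumption: arcproxy -i validityLeft prints seconds remaining.
    left="$(arcproxy -i validityLeft 2>/dev/null)" || left=0
    [ "${left:-0}" -lt "$MIN_LEFT" ]
}

# e.g. from the cron script:
#   if proxy_needs_renewal; then
#       nohup arcrenew -a -j ~/jobs.xml > ~/renew.res 2>&1 &
#   fi
```

This keeps the hourly cron cheap: most ticks do nothing, and arcrenew only fires when the local proxy (and, by the same-age assumption above, the job proxies) is actually running out.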

You could use something like this to pull the CEs from the job database, although it expects you to be using the central ARC job database as well as XML; it also expects xmllint to be available. It might cause issues with file locking, although the cat is fast so that's unlikely. This could be too many assumptions.

```shell
cat ~/.arc/jobs.dat | xmllint --format - | grep "JobID" | awk '{print $1}' \
    | sed 's/<JobID>//g; s/<\/JobID>//g; s/gsiftp:\/\///g; s/https:\/\///g;' \
    | awk -F '/' '{print $1}' | awk -F ':' '{print $1}' | sort -u
```

This is highly unoptimised, feel free to laugh at my awk and sed usage 😆

> arcsync checks the remote system to update the local database, so for the likes of larger sites this can take a very long time to search their database; needless syncs from many users at similar intervals will cause memory, disk and network loading issues.

So the server side of arcsync searches its (potentially very large) database dynamically on every sync request rather than maintaining a list of users with active jobs and short-circuiting out if a user doesn't have any?

If so I see the problem. We only submit to a subset of these servers in pyHepGrid so can probably remove the others.

> You could use something like this to pull the CEs from the job database

If our local DB is missing jobs it may also be missing CEs, so I think the grep approach is likely just to lead to persistently missing jobs in the local database?
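One way to hedge against that would be to take the union of a hand-maintained CE list and whatever CEs the local database knows about, so a lossy local DB can only ever add sites to the sync loop, never hide them. A sketch, assuming the configured-list path is something you invent yourself and that `~/.arc/jobs.dat` contains plain `<JobID>` elements:

```shell
#!/bin/sh
# Union of a configured CE list (hypothetical file) and the CEs that
# appear as JobID hosts in the local ARC job database.
known_ces() {
    conf="${1:-$HOME/.arc/ce_list.txt}"
    db="${2:-$HOME/.arc/jobs.dat}"
    {
        [ -f "$conf" ] && cat "$conf"
        [ -f "$db" ] && grep -o '<JobID>[^<]*</JobID>' "$db" \
            | sed -e 's#<JobID>##g' -e 's#</JobID>##g' -e 's#^[a-z]*://##' \
            | cut -d/ -f1 | cut -d: -f1
    } | sort -u
}
```

Syncing every CE in that union should then pick the lost jobs back up, since a job missing from the local DB still leaves its CE in the configured list.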

> this should really include a renew if the proxy is too short if running every hour, or only renew say once every 12 hours?

The whole script only runs periodically as a cronjob (exactly as in Jeppe's tutorial) - I think we usually have it set to once every ~12 hours, Jeppe's example in his grid tutorial for PhD students looks like it's every 4 hours.

> arcsync checks the remote system to update the local database, so for the likes of larger sites this can take a very long time to search their database; needless syncs from many users at similar intervals will cause memory, disk and network loading issues.

> So the server side of arcsync searches its (potentially very large) database dynamically on every sync request rather than maintaining a list of users with active jobs and short-circuiting out if a user doesn't have any?

From my usage experience, even an arcsync to a CE with zero knowledge of my user or jobs still sits and waits; from the admin side of things we just see ARC spike when people arcsync, although we can't see the output ourselves.
But even if it ended quickly, it's still a waste of communication between systems that a client-based submission management system should be able to reduce by simply keeping a list of the places it knows jobs went to.

Really pyHepGrid shouldn't require an arcsync at all if it's keeping its own database/log of job IDs, as arcsync is really only to be used if you lose your job ID list or need another copy.

> If so I see the problem. We only submit to a subset of these servers in pyHepGrid so can probably remove the others.

Yes

> If our local DB is missing jobs it may also be missing CEs, so I think the grep approach is likely just to lead to persistently missing jobs in the local database?

That would be the downside of the grep approach; however, if pyHepGrid keeps its own database/logs then you could use that. If not, there isn't an easy solution to reduce the load if jobs are often lost from the local database.

> The whole script only runs periodically as a cronjob (exactly as in Jeppe's tutorial) - I think we usually have it set to once every ~12 hours, Jeppe's example in his grid tutorial for PhD students looks like it's every 4 hours.

That's ok; Jeppe's tutorial runs it a bit often, but that's to reduce issues for new users rather than the power users who might be using this tool.

Many of the issues we've seen are purely due to scaling; the ease and speed of pyHepGrid means users don't hit the problems until they're submitting thousands of jobs, but some careful changes could reduce them.