Opening a write-locked file, but can't find contents (when opening a pickled pandas DataFrame)
mar-ses opened this issue · 5 comments
Hello, I posted a stackoverflow post about this: https://stackoverflow.com/questions/53044203/write-locked-file-sometimes-cant-find-contents-when-opening-a-pickled-pandas-d
Basically, I need to write-lock a .pickle file, read it, add a row, and save it again. This is done by up to 300 simultaneous processes on the same file, each adding its own data.
Anyway, the issue I'm getting is that as I increase the number of simultaneous processes, I start to get an error where a process obtains the write lock but somehow can't find the contents, so the read fails. Here's my code:
import pandas as pd
import portalocker

with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
    file.seek(0)
    df = pd.read_pickle(file)

    # ADD A ROW TO THE DATAFRAME

    # The following part might not be great,
    # I'm trying to remove the old contents of the file first so I overwrite
    # and not append, not sure if this is required or if there's
    # a better way to do this.
    file.seek(0)
    file.truncate()
    df.to_pickle(file)
It fails at pd.read_pickle. I get a convoluted traceback from pandas, and the following error:
EOFError: Ran out of input
The contents are there afterwards (after all the processes finish, I have no problem reading the DataFrame). Not to mention, some (most) of the processes find the contents and update the DataFrame without a hitch. But with 300 simultaneous processes, up to 30-40% end up failing.
Since it works sometimes but not all the time, I assumed it must be some problem where the previous process saves the file and releases the write lock, but the contents don't get saved in time, or for some reason can't be read if the next process opens the file too early. Is this possible in any way?
Also, since you're the experts here, perhaps my code above could use improvements; I'd be glad to hear if there's a better way of doing it.
EDIT: Another thing I wanted to ask: what happens if the write lock waits too long and times out? I gave it 120 seconds, which seemed enough to me (I estimate on average about 2.5 writes per second, for a 300KB pickle file). I tried adding a flag that would trip if the write lock timed out, but is there a way to make portalocker.Lock return an error if it times out, just to be sure?
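Something like the following is what I had in mind; I'm not sure whether portalocker.exceptions.LockException is the right exception to catch on a timeout, which is part of what I'm asking:

import portalocker

try:
    with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
        ...  # read, add row, write back as above
except portalocker.exceptions.LockException:
    # Lock could not be acquired within 120 seconds; fail loudly
    # instead of silently continuing without the lock.
    raise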
One option could be that somehow it didn't sync yet, in which case a file.flush() followed by os.fsync(file.fileno()) could fix it.
Another option could be the underlying filesystem. Since you're talking about 300 processes in a cluster, I'm thinking they might not be on the same system and are writing to a networked filesystem. With those it can sometimes be too much to expect 2.5 writes per second; it's really hard to say.
In general, when you have many concurrent reads and writes it's a good idea to funnel them through a single process so you can avoid these types of issues. Ideally this sounds like a job for a database, which does exactly that.
I'm not sure what options you have available, but another option could be to have a single Python process that receives all data through a network socket and writes it for you. Or simply go for a map/reduce approach: write many separate files in the map stage and combine them afterwards in the reduce phase (see the sketch below).
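For the map/reduce idea, something along these lines could work (just a sketch; the directory and filename pattern are placeholders):

import glob
import os

import pandas as pd

# Map stage: every worker writes its own file, so no locking is needed.
def write_partial(df, worker_id, out_dir='/path/to/parts'):
    df.to_pickle(os.path.join(out_dir, f'part-{worker_id}.pickle'))

# Reduce stage: a single process combines the partial files afterwards.
def combine(out_dir='/path/to/parts'):
    parts = [pd.read_pickle(p)
             for p in sorted(glob.glob(os.path.join(out_dir, 'part-*.pickle')))]
    return pd.concat(parts, ignore_index=True)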
Thanks for the help. Yeah, this is all happening across a number of "nodes" in the cluster, so maybe it takes more time to sync everything. First of all, would I put the flush and fsync after the write, i.e. after to_pickle?
Also, regarding your last paragraph, is there anything I can do? One thing I was wondering: if I introduce a "pause", i.e. a time.sleep(), after the file is write-locked but not synced yet, should that in principle help? Or would I need to wait for it to sync up before taking the write lock, which would be harder? Like, once I write-lock it, are the contents of the binary file completely fixed as far as my process is concerned?
Like the following:
with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
    # file is not synced yet
    time.sleep(1)
    # if I do file.read here, would it be synced? Or does the "syncing"
    # need to happen before we portalocker.Lock?
    file.seek(0)
    df = pd.read_pickle(file)
Or perhaps it would be better to do it after the write part at the end.
Also, any idea how long this type of syncing across filesystems could take?
And do you have any pointers on where I could look for the database option, or on how to write a "receiving" Python process?
Yes, exactly. The flush and fsync should be right after the to_pickle().
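In other words, something like this for the write side (a sketch based on your snippet above):

import os

import pandas as pd
import portalocker

with portalocker.Lock('/path/to/file.pickle', 'rb+', timeout=120) as file:
    df = pd.read_pickle(file)
    # ... add the new row ...
    file.seek(0)
    file.truncate()
    df.to_pickle(file)
    file.flush()             # push Python's buffers to the OS
    os.fsync(file.fileno())  # ask the OS to push the data to storage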
First of all, I think you need to test whether the locking works at all in your setup (see the sketch below). Locking relies on the underlying filesystem, and for performance reasons many networked filesystems either don't implement locking at all or ignore it by default. If that's the case you will have no direct locking method available and will have to fall back to multiple files and writing in an atomic manner.
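A quick way to test it (just a sketch; the shared path is a placeholder and I'm assuming portalocker.lock raises LockException when the lock can't be taken): run the first script on one node, and while it is sleeping run the second script on another node. If the second script also gets the lock, the filesystem isn't enforcing locks at all.

# holder.py -- run on node A and leave it running
import time
import portalocker

with open('/shared/path/lock_test', 'a') as fh:
    portalocker.lock(fh, portalocker.LOCK_EX)
    print('holding an exclusive lock...')
    time.sleep(300)

# checker.py -- run on node B while holder.py is still sleeping
import portalocker

with open('/shared/path/lock_test', 'a') as fh:
    try:
        portalocker.lock(fh, portalocker.LOCK_EX | portalocker.LOCK_NB)
        print('also got the lock -- locking is NOT enforced on this filesystem')
    except portalocker.exceptions.LockException:
        print('lock refused -- locking appears to work')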
As for writing atomically, writing to a temporary file and renaming the file once you're done writing is generally considered the safest method.
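For example (a sketch; it assumes the temporary file is created in the same directory, and therefore on the same filesystem, so the rename stays atomic):

import os
import tempfile

import pandas as pd

def atomic_to_pickle(df, path):
    # Write to a temporary file next to the target...
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path), suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            df.to_pickle(tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # ...then atomically move it into place.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise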
If the locking does work but is simply too slow, a sleep before reading might help. But thinking about it, the most likely cause of your problems is that your filesystem does not do any actual locking.
When writing to a local hard disk you can expect about 25ms per write; for an SSD this can be about 2-3ms. With networked filesystems there's no clear answer: it depends on whether the storage is disk or SSD based, whether there's a RAID system with or without a battery-backed write cache, the load on the system, the operating system running on top of it, the type of filesystem and the filesystem settings. Way too many variables to make a useful guess; it could be 100ms, it could be about 10 seconds. I've seen both.
A really easy option, assuming the file won't be too large for memory, is to use Redis. It offers many useful data structures such as sorted sets and bitmaps: https://redis.io/topics/data-types
Just about any type of database server will probably work though, the most important factor is that you need to have a single process responsible for writing to the filesystem.
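As an illustration of the Redis route (a sketch using the redis-py client; the host, key name and row format are placeholders): every worker pushes its row onto a Redis list, and a single consumer builds the DataFrame afterwards.

import pickle

import pandas as pd
import redis

r = redis.Redis(host='your-redis-host', port=6379)

# Each worker: append one row to a shared list; Redis serialises the writes.
def push_row(row):
    r.rpush('results', pickle.dumps(row))

# A single consumer: fetch all rows and build the DataFrame in one place.
def collect():
    rows = [pickle.loads(item) for item in r.lrange('results', 0, -1)]
    return pd.DataFrame(rows)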
Thanks for the help, I'll test this out further to make sure, but at the moment it seems to be working. I put the flush and fsync after saving, and then added a time.sleep just in case (I'll also check whether that's actually required). Like I said, it currently seems to work with a 1 second sleep for me, though it might be different for others.
So thanks for the help. Perhaps I could suggest adding a note about this to the documentation? I suspect others might also try to use write-locking the same way I did; it was actually suggested to me by someone more experienced in HPC.
So maybe a note about the possibility of problems with cluster filesystems not syncing up could help out another poor soul. Hopefully my stackoverflow question comes up on Google; since no one answered it, I'll answer it there myself and link back here.
Cheers for the help mate.
It might be that the fsync forces the networked filesystem to properly flush where it normally wouldn't. But I would still suggest testing the locking to see if it actually works across multiple systems.
If the locking doesn't work you'll most likely get corrupted files at some point.