PetrGlad/python-prevayler

data loss: need to flush logFile

Closed this issue · 1 comments

Hi,

I have a unit test that inserts 40,000+ zip codes. To simulate a crash mid-run, I hacked psys.exe() to raise a ValueError on the 30,000th transaction.
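The hack was roughly along these lines (a sketch only; the wrapper and counter names are mine, and I'm guessing at the exe() signature, since the real test just edited psys.exe() in place):

_calls = 0
_real_exe = psys.exe

def failing_exe(*args, **kwargs):
    """Delegate to the real exe(), but die on the 30,000th call."""
    global _calls
    _calls += 1
    if _calls == 30000:
        raise ValueError("simulated crash on transaction 30,000")
    return _real_exe(*args, **kwargs)

psys.exe = failing_exe

Log file with the existing code: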

$ ls -l dat/0000000001.log
-rw-rw-rw-  1 markbucciarelli  staff  5337088 Sep 28 08:24 dat/0000000001.log

Then I ran it again with flush():

$ ls -l dat/0000000001.log
-rw-rw-rw-  1 markbucciarelli  staff  5340077 Sep 28 08:29 dat/0000000001.log

With the ASCII pickle protocol, you can also diff the two log file versions and see that, without the flush(), the log is incomplete.
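Instead of eyeballing a diff, you can also count the complete records by unpickling the log until the stream ends (a sketch; it assumes the log is a plain sequence of pickled values, which matches the put() in the patch below):

import pickle

def count_records(path):
    """Count fully written pickled records in the log. pickle.load()
    raises before returning on a truncated final record, so n only
    counts complete transactions."""
    n = 0
    with open(path, "rb") as f:
        while True:
            try:
                pickle.load(f)
            except Exception:  # EOFError at a clean end of log, or an
                break          # unpickling error on a truncated tail
            n += 1
    return n

print(count_records("dat/0000000001.log"))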

diff -r c5bc24a68c12 topics/tornado/pv/core.py
--- a/topics/tornado/pv/core.py Wed Sep 28 08:16:01 2011 -0400
+++ b/topics/tornado/pv/core.py Wed Sep 28 08:34:34 2011 -0400
@@ -86,6 +86,7 @@
     def put(self, value):
         self.serialId += 1
         pickle.dump(value, self.logFile, PICKLE_PROTOCOL)
+        self.logFile.flush()

     def putSnapshot(self, root):
         # TODO refine error handling 

I observe about a 4% decrease in throughput with flush() in this simple test: from 7,000 inserts/sec down to 6,700 inserts/sec :).
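A crude way to reproduce this kind of measurement (a sketch; throughput() is my name, and the transaction passed in is a stand-in, not part of the library):

import time

def throughput(fn, n=40000):
    """Time n calls to fn and return calls per second."""
    start = time.time()
    for i in range(n):
        fn(i)
    return n / (time.time() - start)

# e.g. print(throughput(lambda i: psys.exe(InsertZip(i))))
# where InsertZip is a hypothetical transaction class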

This is really a trade-off between performance and durability.
I'll apply the change, but be aware that flush() does not guarantee the data is physically on disk: AFAIK it only empties the application's write buffer, and the data may still sit in the OS cache.

http://stackoverflow.com/questions/3167494/how-often-does-python-flush-to-a-file
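For a stronger guarantee, the usual recipe is to pair flush() with os.fsync(), which asks the kernel to write its cached pages out to the device, at a further throughput cost. A sketch of what put() could look like with that (same method as in the patch above; note that even fsync() can be defeated by a drive's own write cache):

import os

def put(self, value):
    self.serialId += 1
    pickle.dump(value, self.logFile, PICKLE_PROTOCOL)
    self.logFile.flush()             # empty the application-level buffer
    os.fsync(self.logFile.fileno())  # ask the OS to write its cache to the device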