All local file revisions hardlink to the latest revision
zbitouzakaria opened this issue · 3 comments
I tested this out by creating and tracking a single file through multiple revisions.
Let's say we have a big_file.csv whose content looks like this:
a, b, c
1, 2, 3
We first track it using this script:
import pandas as pd
from lazydata import track

# store the file when loading it
df = pd.read_csv(track("big_file.csv"))
print("Data shape:" + str(df.shape))
Then change the file content several times, for example:
a, b, c
1, 2, 3
4, 5, 6
And run the script again after each revision:
(dev3.5) ~/test_lazydata > python my_script.py
LAZYDATA: Tracking new file `big_file.csv`
Data shape:(1, 3)
(dev3.5) ~/test_lazydata > vim big_file.csv # changing file
(dev3.5) ~/test_lazydata > python my_script.py
LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
Data shape:(2, 3)
(dev3.5) ~/test_lazydata > vim big_file.csv # changing file
(dev3.5) ~/test_lazydata > python my_script.py
LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
Data shape:(3, 3)
(dev3.5) ~/test_lazydata > vim big_file.csv # changing file
(dev3.5) ~/test_lazydata > python my_script.py
LAZYDATA: Tracked file `big_file.csv` changed, recording a new version...
Data shape:(4, 3)
A simple ls afterwards points to the mistake:
(dev3.5) ~/test_lazydata > ls -lah
total 20
drwxrwxr-x 2 zakaria zakaria 4096 sept. 5 16:14 .
drwxr-xr-x 56 zakaria zakaria 4096 sept. 5 16:14 ..
-rw-rw-r-- 5 zakaria zakaria 44 sept. 5 16:14 big_file.csv
-rw-rw-r-- 1 zakaria zakaria 482 sept. 5 16:14 lazydata.yml
-rw-rw-r-- 1 zakaria zakaria 158 sept. 5 16:12 my_script.py
Notice the number of hardlinks to big_file.csv (the 5 in the second column). There should be only one. What is happening is that all the revisions point to the same file.
You can also check ~/.lazydata/data directly for the content of the different files. It's all the same.
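For anyone who wants to check this without reading ls output, here is a minimal sketch in plain Python. It does not use lazydata at all; the file names and the temporary directory are made up for illustration, and the "cache" is just a second name created with os.link. It only shows how the link count and inode identity can be inspected:

```python
import os
import tempfile


def link_count(path):
    """Return the number of hard links pointing at the file's inode."""
    return os.stat(path).st_nlink


# Self-contained demo in a temporary directory (all paths are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "big_file.csv")
    with open(original, "w") as f:
        f.write("a, b, c\n1, 2, 3\n")

    before = link_count(original)

    # Simulate a cache that stores a revision as a hard link.
    cached = os.path.join(tmp, "cached_revision")
    os.link(original, cached)

    after = link_count(original)
    # samefile() is True when both names refer to the same inode.
    shared = os.path.samefile(original, cached)

print(before, after, shared)
```

A link count above 1 on the working file, together with samefile() returning True, is the same signal as the 5 in the ls listing above.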
Yes, you are right. Because the files are hardlinked, editing one of them also edits the cached file. One would need to overwrite one of the files to get a new inode. I guess this means the files do need to be copied, unless the user specifically wants to use hardlinking.
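To make the inode point concrete, here is a small sketch, again in plain Python and independent of lazydata (the names working, cached and the .new suffix are illustrative assumptions). It contrasts an in-place edit, which writes through the shared inode, with replacing the file by renaming a fresh one over it, which gives the working name a new inode:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    working = os.path.join(tmp, "big_file.csv")
    cached = os.path.join(tmp, "cached_v1")

    with open(working, "w") as f:
        f.write("a, b, c\n1, 2, 3\n")
    os.link(working, cached)  # cache revision 1 as a hard link

    # In-place edit: appending writes to the existing, shared inode.
    with open(working, "a") as f:
        f.write("4, 5, 6\n")
    with open(cached) as f:
        cached_after_inplace = f.read()

    # Replace instead: write a new file, then rename it over the old name.
    new_version = os.path.join(tmp, "big_file.csv.new")
    with open(new_version, "w") as f:
        f.write("a, b, c\n7, 8, 9\n")
    os.replace(new_version, working)
    with open(cached) as f:
        cached_after_replace = f.read()

    # After the rename the two names no longer share an inode.
    still_same_inode = os.path.samefile(working, cached)

print(repr(cached_after_inplace))
print(repr(cached_after_replace))
print(still_same_inode)
```

The in-place edit leaks into the cached name, while the rename leaves it untouched. Whether an editor edits in place or replaces the file depends on the editor and its settings, so a cache cannot rely on either behaviour.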
Thanks for this bug report!
I've now switched to using copy instead of hardlink as the default. I will probably add hardlinking as an option, and I probably still need to write a test case for this specific scenario.
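As a rough idea of what such a test case could look like, here is a sketch. It is not lazydata's actual code: store_revision, its use_hardlink flag, and the choice to name cached files by SHA-256 digest are all assumptions made up for this example. The test reproduces the steps from the report (store, edit in place, store again) and checks that the first revision survives:

```python
import hashlib
import os
import shutil
import tempfile


def store_revision(path, cache_dir, use_hardlink=False):
    """Hypothetical helper: store `path` in `cache_dir` under its SHA-256 digest."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    target = os.path.join(cache_dir, digest)
    if not os.path.exists(target):
        if use_hardlink:
            os.link(path, target)       # shares the inode with the working file
        else:
            shutil.copy2(path, target)  # independent copy with its own inode
    return target


def test_revisions_stay_distinct():
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = os.path.join(tmp, "cache")
        os.mkdir(cache_dir)
        data = os.path.join(tmp, "big_file.csv")

        with open(data, "w") as f:
            f.write("a, b, c\n1, 2, 3\n")
        first = store_revision(data, cache_dir)

        # Edit the tracked file in place, as in the bug report.
        with open(data, "a") as f:
            f.write("4, 5, 6\n")
        second = store_revision(data, cache_dir)

        assert first != second
        assert os.stat(data).st_nlink == 1  # no extra hard links were created
        with open(first) as f:
            assert f.read() == "a, b, c\n1, 2, 3\n"
        with open(second) as f:
            assert f.read() == "a, b, c\n1, 2, 3\n4, 5, 6\n"


test_revisions_stay_distinct()
```

With use_hardlink=True passed to the same helper, the first cached revision would pick up the appended row, which is the behaviour described in this issue.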
The latest version, lazydata 1.0.16, should have this bug solved.
... and further fixes in lazydata 1.0.17