Diffing large files
Hello,
I'm trying to diff two big csv files (each a million rows) with daff. Daff is doing a great job on regular-sized files, but handling those big files takes a lot of time and (even worse) a lot of memory.
By looking at the Haxe-generated Java sources ("obfuscated", since they're hard to read), I haven't found a way of using streams with daff/coopy. It seems as if every table has to be read into one huge string before the comparison starts.
Is there a way of comparing large csv files in a memory efficient manner?
I've also tried the python utility, but it seems to handle table data the same way. After allocating more than 1.5 GiB of RAM, Linux started swapping on my small ARM machine.
Is daff able to render tdiff files or do I have to use coopy for that? DiffKit (written by the co-author of the tdiff format) seems to do better on large files, but doesn't seem to have output as fancy as daff's.
I'd love to use daff (especially the tiny Python lib) since it fits my needs almost perfectly. But when it comes to those big files, I haven't found a way. And I'm not really keen on using multiple tools (daff, coopy, DiffKit).
Maybe you can help me out.
Thanks in advance!
Alex
For time, I'd suggest the node/javascript version of daff (npm install daff). It is a lot faster than Python, particularly Python 2.
For memory, there's no quick fix in the general case. Daff has to do a lot of comparisons that are awkward to structure to work on streams. If you happen to have data with a clear primary key, then things are simpler. For Sqlite databases, daff can be configured to be a lot more memory efficient. But I haven't set that up for csv files.
Well, see if the node/js version is usable in your situation and if it helps with speed, and then we'll see what we can do from there.
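In case it helps, a quick way to try it (assuming a global npm install puts a daff command on your PATH; you may need sudo depending on your npm setup):
# global install of the JS build
npm install -g daff
# same command-line interface as the Python version
daff diff old.csv new.csv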
Thanks for your quick reply. Speed isn't the primary target. RAM is the bigger problem.
I've tried the npm package and I'm stunned: JS really is a lot faster than Python 3. Here are some metrics for two csv files (about 2 MB each):
Python:
daff diff --id 0 --input-format ssv --output-format csv --output 1-2.diff 1.csv 2.csv
real 4m 11.33s
user 3m 42.94s
sys 0m 0.37s
JS:
daff diff --id 0 --input-format ssv --output-format csv --output 1-2.diff 1.csv 2.csv
real 0m 4.61s
user 0m 5.71s
sys 0m 0.22s
Looking at the top stats, I could tell that NodeJS is using both cores; Python just one.
Python:
top - 10:54:10 up 13:34, 13 users, load average: 3,47, 2,63, 2,24
Tasks: 178 total, 2 running, 176 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32,5 us, 1,8 sy, 0,0 ni, 65,6 id, 0,1 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 2063412 total, 1766136 used, 297276 free, 12980 buffers
KiB Swap: 4177916 total, 18092 used, 4159824 free. 375608 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7069 alex 20 0 846892 844340 3348 R 96,5 40,9 82:22.95 daff
7445 alex 20 0 5132 2708 2396 R 11,4 0,1 0:00.04 top
1 root 20 0 23584 4224 3024 S 0,0 0,2 0:08.78 systemd
2 root 20 0 0 0 0 S 0,0 0,0 0:00.02 kthreadd
[...]
JS:
top - 09:17:24 up 1 day, 11:57, 13 users, load average: 2,06, 1,16, 1,08
Tasks: 180 total, 2 running, 178 sleeping, 0 stopped, 0 zombie
%Cpu(s): 18,9 us, 1,8 sy, 0,0 ni, 79,1 id, 0,1 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem: 2063412 total, 1705356 used, 358056 free, 4488 buffers
KiB Swap: 4177916 total, 56376 used, 4121540 free. 354424 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11327 root 20 0 821780 713184 25064 R 164,8 34,6 1:09.12 node
11339 alex 20 0 5132 2560 2248 R 11,0 0,1 0:00.04 top
1 root 20 0 23584 3676 2488 S 0,0 0,2 0:11.54 systemd
2 root 20 0 0 0 0 S 0,0 0,0 0:00.06 kthreadd
[...]
(BTW: I had to run NodeJS as root. Is that necessary? I wasn't able to start it as a normal user.)
To be honest, I don't think the JS interpreter is that much better than Python's. It's more likely that the Haxe-generated Python code doesn't perform well, so it's rather a Haxe issue.
As you can see in the example above, I've tried to use the --id option. Does it do anything for csv files, or is it only usable with SQLite? It seems to make no difference in performance.
But, as I've said, performance is less of a problem than memory. Handling big data is still a problem: NodeJS also grabs all the memory it can get, which led to a core dump ("JavaScript heap out of memory").
<--- Last few GCs --->
[40:0x23bf000] 104533 ms: Mark-sweep 699.1 (716.1) -> 698.9 (716.6) MB, 8148.5 / 0.0 ms (+ 0.2 ms in 9 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 8170 ms) (average mu = 0.070, current mu = 0.009) allocation failu
<--- JS stacktrace --->
==== JS stack trace =========================================
0: ExitFrame [pc: 0x3e00615e]
1: StubFrame [pc: 0x3e0092a1]
Security context: 0x45d12715 <JSObject>
2: parseCellPart [0x4dd5b9f5] [/usr/local/lib/node_modules/daff/bin/daff.js:~2507] [pc=0x9c1065a0](this=0x3cc7fdd5 <JSObject>,txt=0xa0584101 <Very long string[106781486]>)
3: parseTable [0x4dd5b955] [/usr/local/lib/node_modules/daff/bin/daff.js:~2459] [pc=0x9c106ef4](this=0x3cc7fdd5 <JSObject>,txt=0xa0584101 <Very long strin...
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Command terminated by signal 6
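I could probably work around the crash by raising the V8 heap limit (an untested sketch; the daff.js path is taken from the stack trace above), but that only postpones the memory problem:
# --max-old-space-size is in MB; run the installed daff script directly through node
node --max-old-space-size=2048 /usr/local/lib/node_modules/daff/bin/daff.js diff --output 1-2.diff 1.csv 2.csv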
Too bad I'm unable to write Haxe code. I'd love to do something about it. I've tried to deal with the generated Java sources and reading them is hard enough.
But from a more abstract point of view: if I'm telling the program that the input is ordered/sorted and there is a primary key, why is so much calculation necessary?
Alex
@osgger it is definitely true that if the input is sorted and there's a primary key, daff is doing a lot of unnecessary work.
Right now, the most practical way to get what you want would be to convert the csv files to sqlite, and then diff the sqlite files. On a test I did just now comparing a pair of million-row tables, daff took 2 seconds and peak memory usage was 29 MB.
If you have sqlite3 installed, here is a basic script for converting a csv file to a sqlite file with the first column used as a primary key:
#!/bin/bash
if [[ -z "$2" ]]; then
echo "Call as: $0 foo.csv foo.sqlite"
exit 1
fi
# take the header row and mark the first column as the primary key
header=$(head -n1 "$1")
header=$(echo "$header" | sed "s/,/ primary key,/")
# create the table, then import everything after the header row
tail -n +2 "$1" | sqlite3 "$2" "create table sheet($header);" ".mode csv" ".import /dev/stdin sheet"
For me, this takes a few seconds.
In principle daff could do all this for you, but it doesn't today.
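For example, assuming the script above is saved as csv2sqlite.sh (the name is just a placeholder), the whole conversion-plus-diff would look something like:
./csv2sqlite.sh 1.csv 1.sqlite
./csv2sqlite.sh 2.csv 2.sqlite
# the .sqlite extension tells daff to read the files as sqlite databases
daff diff --output 1-2.diff 1.sqlite 2.sqlite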
Thanks for your help.
If I wanted to compare two tables (PK is first column) within one sqlite file, what's the call?
daff diff --id 0 --table old --table new ab.db
Hmm, that wasn't possible, but it seemed like a good thing to add. So as of 1.3.39, you can do:
daff --input-format sqlite --table old:new ab.db ab.db
- You need to give the name of the database twice.
- If the name of the sqlite db does not end in .sqlite, you need to specify --input-format sqlite.
- You need to use an old:new format for the table pair. Using multiple --table options would just make daff compare multiple pairs of tables.
- If you don't have a primary key on your tables, you can add --id id (repeat for multiple columns; use the names of columns rather than integers), as in the example below. Comparisons will be a lot faster if you have a primary key set up. If not, I'd suggest at least adding an index.
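Putting that together, a hypothetical call for a pair of tables keyed on two columns named id and region would be:
# table and column names here are just examples
daff --input-format sqlite --table old:new --id id --id region ab.db ab.db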
Thanks again for adding this new feature.
I wasn't even able to compare two database files. But your answer gave me a hint: you called daff to compare sqlite databases with the parameter --input-format sqlite. I didn't know that, and I used the file extension ".db". This info is missing in the help (1.3.38):
--input-format [csv|tsv|ssv|psv|json]: set format to expect for input
The only hint given that it's even possible to compare database tables is:
--table NAME: compare the named table, used with SQL sources
So you could advertise this (in my opinion) great feature a little more.
I'll do some testing and will keep you informed.
Hi,
sorry for this late reply, but I've been struggling with sqlite.
The application I wrote imports data from csv into an Oracle database, does some conversions and calculations, and finally exports it back to csv. (Oracle, because it was there and I had nothing to set up. Just connect and use it. I would prefer an Open Source RDBMS, of course. :-) )
The last step is creating a wonderful visual diff to compare the input and output csv files and also the current export files to the last export files.
Comparing the csv files worked well using daff, except that the large files (> 500,000 rows) caused problems. But you gave me the hint to use sqlite.
Then I thought: "Hey, if I have to use sqlite for the comparison, why not use it for the calculations and cut out Oracle." After spending the last few days trying to do so, I'm giving up.
Sqlite has a very "special" way of handling data types: they call it a dynamic type system with type affinity. That means a single column can hold values of multiple data types. And the only data type for storing exact decimal values is text, since REAL is inaccurate.
See this example:
sqlite> CREATE TABLE JDBC (R REAL, DM DECIMAL(3, 2), N NUMERIC(3, 2), F FLOAT, D DOUBLE, S VARCHAR(20));
sqlite> INSERT INTO JDBC VALUES(9.95, 9.95, 9.95, 9.95, 9.95,'9.95');
sqlite> SELECT * FROM JDBC;
9.95|9.95|9.95|9.95|9.95|9.95
Looks good so far. But then:
sqlite> SELECT ROUND(R, 1) FROM JDBC;
9.9
Dump shows the reason:
sqlite> .dump JDBC
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE JDBC (R REAL, DM DECIMAL(3, 2), N NUMERIC(3, 2), F FLOAT, D DOUBLE, S VARCHAR(20));
INSERT INTO JDBC VALUES(9.9499999999999992894,9.9499999999999992894,9.9499999999999992894,9.9499999999999992894,9.9499999999999992894,'9.95');
COMMIT;
sqlite> SELECT TYPEOF(R), TYPEOF(DM), TYPEOF(N), TYPEOF(F), TYPEOF(D), TYPEOF(S) FROM JDBC;
real|real|real|real|real|text
Checking the length of strings at the time of import doesn't seem to be possible either:
sqlite> CREATE TABLE ABC (V VARCHAR(20));
sqlite> INSERT INTO ABC VALUES ('01234567890123456789012345');
sqlite> SELECT TYPEOF(V) FROM ABC;
text
Therefore using sqlite for business-critical data just isn't possible.
So I'll have to stick to the big RDBMS (or maybe HSQLDB, which seems very promising) for converting and calculating, and use sqlite for the diffs.
But here's my question to you: if I import the csv columns as text into sqlite, then the diffs created by daff should be fine. Only if I compared a table read straight from a csv file against a sqlite table that doesn't use text as its data type could the diff come out wrong. (Or is it not even possible to mix the input types sqlite and csv?)
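For reference, the TEXT-only import I have in mind would look roughly like this (table and column names are just placeholders):
# declare every column as TEXT so sqlite stores the values verbatim
sqlite3 export.sqlite "create table sheet(ID TEXT primary key, AMOUNT TEXT, NOTE TEXT);"
# skip the header row, as in your conversion script above
tail -n +2 export.csv | sqlite3 export.sqlite ".mode csv" ".import /dev/stdin sheet"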
Thanks again!