paulfitz/daff

Diffing large files

Opened this issue · 8 comments

Hello,

I'm trying to diff two big csv files (each a million rows) with daff. Daff does a great job on regular-sized files, but handling those big files takes a lot of time and (even worse) a lot of memory.

Looking at the Haxe-generated Java sources ("obfuscated", in the sense that they're hard to read), I haven't found a way of using streams with daff/coopy. It seems as if every table has to be read into one huge string before the comparison.

Is there a way of comparing large csv files in a memory efficient manner?

I've also tried the python utility, but it seems to handle table data the same way. After allocating more than 1.5 GiB of RAM, Linux started swapping on my small ARM machine.

Is daff able to render tdiff files, or do I have to use coopy for that? DiffKit (from Paul's co-author of the tdiff format) seems to do better on large files, but doesn't seem to produce output as fancy as daff's.

I'd love to use daff (especially the tiny Python lib) since it fits my needs almost perfectly. But when it comes to those big files, I haven't found a way. And I'm not really keen on using multiple tools (daff, coopy, DiffKit).

Maybe you can help me out.

Thanks in advance!

Alex

For speed, I'd suggest the node/javascript version of daff (npm install daff). It is a lot faster than the Python version, particularly Python 2.

For memory, there's no quick fix in the general case. Daff has to do a lot of comparisons that are awkward to restructure as streaming operations. If you happen to have data with a clear primary key, things are simpler. For sqlite databases, daff can be configured to be a lot more memory-efficient, but I haven't set that up for csv files.

Well, see if the node/js version is usable in your situation and if it helps with speed, and then we'll see what we can do from there.
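
Once the data is in sqlite files, the comparison itself is just (a rough sketch; the file names are placeholders):

daff old.sqlite new.sqlite

Getting csv data into sqlite is the part that isn't automated yet.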

Thanks for your quick reply. Speed isn't the primary target. RAM is the bigger problem.

I've tried the npm package and I'm stunned, because JS really is a lot faster than Python 3. Here are some metrics for two csv files (about 2 MB each):

Python:

daff diff --id 0 --input-format ssv --output-format csv --output 1-2.diff 1.csv 2.csv
real	4m 11.33s
user	3m 42.94s
sys	0m 0.37s

JS:

daff diff --id 0 --input-format ssv --output-format csv --output 1-2.diff 1.csv 2.csv
real	0m 4.61s
user	0m 5.71s
sys	0m 0.22s

Looking at the top stats, I could tell that NodeJS was using both cores, while Python used just one.

Python:

top - 10:54:10 up 13:34, 13 users,  load average: 3,47, 2,63, 2,24
Tasks: 178 total,   2 running, 176 sleeping,   0 stopped,   0 zombie
%Cpu(s): 32,5 us,  1,8 sy,  0,0 ni, 65,6 id,  0,1 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem:   2063412 total,  1766136 used,   297276 free,    12980 buffers
KiB Swap:  4177916 total,    18092 used,  4159824 free.   375608 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 7069 alex      20   0  846892 844340   3348 R  96,5 40,9  82:22.95 daff
 7445 alex      20   0    5132   2708   2396 R  11,4  0,1   0:00.04 top
    1 root      20   0   23584   4224   3024 S   0,0  0,2   0:08.78 systemd
    2 root      20   0       0      0      0 S   0,0  0,0   0:00.02 kthreadd
[...]

JS:

top - 09:17:24 up 1 day, 11:57, 13 users,  load average: 2,06, 1,16, 1,08
Tasks: 180 total,   2 running, 178 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18,9 us,  1,8 sy,  0,0 ni, 79,1 id,  0,1 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem:   2063412 total,  1705356 used,   358056 free,     4488 buffers
KiB Swap:  4177916 total,    56376 used,  4121540 free.   354424 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11327 root      20   0  821780 713184  25064 R 164,8 34,6   1:09.12 node
11339 alex      20   0    5132   2560   2248 R  11,0  0,1   0:00.04 top
    1 root      20   0   23584   3676   2488 S   0,0  0,2   0:11.54 systemd
    2 root      20   0       0      0      0 S   0,0  0,0   0:00.06 kthreadd
[...]

(BTW: I had to run NodeJS as root. Is that necessary? I wasn't able to start it as a normal user.)

To be honest, I don't think the JS interpreter itself is that much better than Python's. It's probably because the Haxe-generated Python code performs poorly, so it's rather a Haxe issue.

As you can see in the example above, I've tried the --id option. Does it do anything for csv files, or is it only usable with SQLite? It seems to make no difference in performance.

But, as I've said, performance is less of a problem than memory. Handling big data is still an issue. NodeJS also grabs all the memory it can get, which led to a core dump ("JavaScript heap out of memory"):

<--- Last few GCs --->

[40:0x23bf000]   104533 ms: Mark-sweep 699.1 (716.1) -> 698.9 (716.6) MB, 8148.5 / 0.0 ms  (+ 0.2 ms in 9 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 8170 ms) (average mu = 0.070, current mu = 0.009) allocation failu

<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x3e00615e]
    1: StubFrame [pc: 0x3e0092a1]
Security context: 0x45d12715 <JSObject>
    2: parseCellPart [0x4dd5b9f5] [/usr/local/lib/node_modules/daff/bin/daff.js:~2507] [pc=0x9c1065a0](this=0x3cc7fdd5 <JSObject>,txt=0xa0584101 <Very long string[106781486]>)
    3: parseTable [0x4dd5b955] [/usr/local/lib/node_modules/daff/bin/daff.js:~2459] [pc=0x9c106ef4](this=0x3cc7fdd5 <JSObject>,txt=0xa0584101 <Very long strin...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Command terminated by signal 6

Too bad I'm unable to write Haxe code. I'd love to do something about it. I've tried to deal with the generated Java sources and reading them is hard enough.

But from a more abstract point of view: if I'm telling the program that the input is ordered/sorted and there is a primary key, why is so much computation necessary?

Alex

@osgger it is definitely true that if the input is sorted and there's a primary key, daff is doing a lot of unnecessary work.

Right now, the most practical way to get what you want would be to convert the csv files to sqlite, and then diff the sqlite files. On a test I did just now comparing a pair of million-row tables, daff took 2 seconds and peak memory usage was 29MB.

If you have sqlite3 installed, here is a basic script for converting a csv file to a sqlite file with the first column used as a primary key:

#!/bin/bash
if [[ -z "$2" ]]; then
  echo "Usage: $0 foo.csv foo.sqlite"
  exit 1
fi
# take the csv header line and tag the first column as the primary key
header=$(head -n1 "$1")
header=$(echo "$header" | sed "s/,/ primary key,/")
# create the table from the header, then import everything after the header row
tail -n +2 "$1" | sqlite3 "$2" "create table sheet($header);" ".mode csv" ".import /dev/stdin sheet"

For me, this takes a few seconds.
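
A minimal usage sketch, assuming the script above is saved as csv2sqlite.sh (the file name is just an example):

bash csv2sqlite.sh 1.csv 1.sqlite
bash csv2sqlite.sh 2.csv 2.sqlite
daff 1.sqlite 2.sqlite

Since the output files end in .sqlite, daff should pick up the format without an extra --input-format flag.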

In principle daff could do all this for you, but it doesn't today.

Thanks for your help.

If I wanted to compare two tables (PK is first column) within one sqlite file, what's the call?

daff diff --id 0 --table old --table new ab.db

Hmm, that wasn't possible, but it seemed like a good thing to add. So as of 1.3.39, you can do:

daff --input-format sqlite --table old:new ab.db ab.db
  • You need to give the name of the database twice.
  • If the name of the sqlite db does not end in .sqlite, you need to specify --input-format sqlite.
  • You need to use an old:new format for the table pair. Using multiple --table options would just make daff compare multiple pairs of tables.
  • If you don't have a primary key on your tables, you can add --id id (repeat it for multiple columns; use column names rather than integers), as in the sketch after this list. Comparisons will be a lot faster if you have a primary key set up; if not, I'd suggest at least adding an index.
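
Putting it together for tables without a primary key (customer_id and region are made-up column names for the example):

daff --input-format sqlite --table old:new --id customer_id --id region ab.db ab.db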

Thanks again for adding this new feature.

I wasn't even able to compare two database files, but your answer gave me a hint: you called daff with the parameter --input-format sqlite when comparing sqlite databases. I didn't know about that, and since my files use the extension ".db", daff didn't treat them as sqlite. This info is missing in the help (1.3.38):

--input-format [csv|tsv|ssv|psv|json]: set format to expect for input

The only hint that it's even possible to compare database tables is:

--table NAME: compare the named table, used with SQL sources

So you could advertise this (in my opinion) great feature a little more.

I'll do some testing and will keep you informed.

Good point @osgger, I've talked up sqlite support in #129.

Hi,

sorry for this late reply, but I've been struggling with sqlite.

The application I wrote imports data from csv into an Oracle database, does some conversions and calculations, and finally exports it back to csv. (Oracle, because it was already there and I had nothing to set up: just connect and use it. I would prefer an open source RDBMS, of course. :-) )

The last step is creating a wonderful visual diff comparing the input and output csv files, and also comparing the current export files with the previous ones.

Comparing the csv files with daff worked well, except that the large files (> 500,000 rows) caused problems. But you gave me the hint to use sqlite.

Then I thought: "Hey, if I have to use sqlite for the comparison, why not use it for the calculations and cut out Oracle?" After spending the last few days trying to do so, I'm giving up.

Sqlite has a very "special" way of handling data types: they call it a dynamic type system with type affinity. That means a column can hold values of different data types, and the only data type for storing exact decimal values is string, since real is inexact.

See this example:

sqlite> CREATE TABLE JDBC (R REAL, DM DECIMAL(3, 2), N NUMERIC(3, 2), F FLOAT, D DOUBLE, S VARCHAR(20));

sqlite> INSERT INTO JDBC VALUES(9.95, 9.95, 9.95, 9.95, 9.95,'9.95');

sqlite> SELECT * FROM JDBC;
9.95|9.95|9.95|9.95|9.95|9.95

Looks good so far. But then:

sqlite> SELECT ROUND(R, 1) FROM JDBC;
9.9

Dump shows the reason:

sqlite> .dump JDBC
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE JDBC (R REAL, DM DECIMAL(3, 2), N NUMERIC(3, 2), F FLOAT, D DOUBLE, S VARCHAR(20));
INSERT INTO JDBC VALUES(9.9499999999999992894,9.9499999999999992894,9.9499999999999992894,9.9499999999999992894,9.9499999999999992894,'9.95');
COMMIT;

sqlite> SELECT TYPEOF(R), TYPEOF(DM), TYPEOF(N), TYPEOF(F), TYPEOF(D), TYPEOF(S) FROM JDBC;
real|real|real|real|real|text

Checking the length of strings at the time of import doesn't seem to be possible:

sqlite> CREATE TABLE ABC (V VARCHAR(20));

sqlite> INSERT INTO ABC VALUES ('01234567890123456789012345');

sqlite> SELECT TYPEOF(V) FROM ABC;
text

Therefore, using sqlite for business-critical data just isn't possible.

So I'll have to stick to the big RDBMS (or maybe HSQLDB, which seems very promising) for converting and calculating, and use sqlite for the diffs.

But here is my question to you: if I import the csv columns as text into sqlite, the diffs created by daff should be fine. Only if I compared a table straight from a csv file with a sqlite table that doesn't use text as its data type could the diff come out wrong. (Or is it not possible to mix the input types sqlite and csv?)
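
For illustration, this is what I mean by importing everything as text (table and column names are made up):

sqlite> CREATE TABLE CSVDATA (ID TEXT PRIMARY KEY, AMOUNT TEXT);
sqlite> INSERT INTO CSVDATA VALUES ('1', '9.95');
sqlite> SELECT TYPEOF(ID), TYPEOF(AMOUNT), AMOUNT FROM CSVDATA;
text|text|9.95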

Thanks again!