Clarification of embedded newlines in TSV files
copiousfreetime opened this issue · 1 comments
The keywords data file appears to have an embedded newline in one of the records. I just want to clarify if this is expected or not. It looks like the given psql loading instructions do account for newlines in the TSV file, but if folks are processing the file outside of that without using quote-escaping rules they may process the data incorrectly.
To Reproduce
% wc -l *.tsv000
1646598 collections.tsv000
4075505 conversions.tsv000
2689741 keywords.tsv000
25001 photos.tsv000
8436845 total
Load the according to the documented instructions:
% psql -h localhost -U jeremy -d unsplash_lite -f load-data-client.sql
COPY 25000
COPY 2689739 # <-- Hmm.. this one is NOT 1 less than keywords.tsv000 above
COPY 1646597
COPY 4075504
Check the db row count
unsplash_lite=# select count(*) from unsplash_keywords;
count
---------
2689739
(1 row)
Expected behavior
I initially expected there to be 1 record for each non-header line of TSV, this appears to be an incorrect assumption. It looks like the psql commandline parsed the TSV according to quoted escape rules, so that is good.
I wrote a program to check the keywords file and it reports
% ruby check-tsv.rb keywords.tsv000
Headers: photo_id -- keyword -- ai_service_1_confidence -- ai_service_2_confidence -- suggested_by_user
[1590611 - PF4s20KB678-"fujisan] parts count 2 != 5
[1590612 - mount fuji"-] parts count 4 != 5
lines in file : 2689741
data lines : 2689740
unique row count: 2689740
Then looking at the lines around line 1590610 we see:
% sed -n '1590610,1590615p' keywords.tsv000
PF4s20KB678 night 22.3271160125732 f
PF4s20KB678 "fujisan
mount fuji" t
PF4s20KB678 pier 22.6900939941406 f
PF4s20KB678 viaduct 30.6490669250488 f
PF4s20KB678 architecture 33.084938049316399 f
And the db reports that row and the preceding and following rows correctly loaded.
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%fujisan%';
photo_id | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+------------+-------------------------+-------------------------+-------------------
PF4s20KB678 | fujisan +| | | t
| mount fuji | | |
(1 row)
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%pier%';
photo_id | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
PF4s20KB678 | pier | 22.6900939941406 | | f
(1 row)
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%night%';
photo_id | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
PF4s20KB678 | night | 22.3271160125732 | | f
If folks are processing these TSV simplistically without using quote-escaping logic then they may process the files incorrectly. I don't want folks to encounter that. And maybe this points to and upstream data input issue, if users are entering newlines in the keyword input - how are they getting processed in the main app.
We may just want to document that there can be embedded newlines in the TSV files.
Thanks!
Thanks a lot for the detailed troubleshooting!
I think we definitely need some cleanup here. I don't think a newline character is to be expected in the keywords although it's understandable how and why it can be there (this is a keyword suggested by a user, so user input).
I'll make sure we do some cleanup and maybe just transform newline characters into spaces for the next version. 👍