
Corrupt WOF records are not skipped during import

There were quite a few corrupt WOF postal code records during download, e.g.

error downloading whosonfirst-data-postalcode-ca-latest.tar.bz2 bundle: Error: Command failed: curl | tar -xj --strip-components=1 --exclude=README.txt -C /srv/pelias_importer_ext/data/wof && mv /srv/pelias_importer_ext/data/wof/whosonfirst-data-postalcode-ca-latest.csv /srv/pelias_importer_ext/data/wof/meta

That was a few days ago. Weirdly,
curl | tar -xj --strip-components=1 --exclude=README.txt
works now when run manually, even though the timestamp of that record hasn't changed since July.

There's a bunch of bzip2 and tar errors in the logs for the above command. Same for the Japan, Portugal and GB postal codes. When downloading is finished and the importer tries to insert the records into ES, the following fatal error occurs:

2018-09-27T19:29:31.322Z - info: [whosonfirst] Loading whosonfirst-data-postalcode-ca-latest.csv records from /srv/pelias_importer_ext/data/wof/meta
      throw er; // Unhandled 'error' event

Error: ENOENT: no such file or directory, open '/srv/pelias_importer_ext/data/wof/meta/whosonfirst-data-postalcode-ca-latest.csv'
npm ERR! errno 1
npm ERR! pelias-whosonfirst@0.0.0-development start: `node import.js`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the pelias-whosonfirst@0.0.0-development start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/ubuntu/.npm/_logs/2018-09-27T19_29_31_479Z-debug.log

Then the WOF importer ends on error instead of skipping the Canada postal codes.
Any chance of improving that behavior?

And if this would be a good first issue, let me know. I'm serious about contributing. At some point I wanna have another Pelias importer for custom user data anyways. Getting to know the stack with smaller issues would be helpful. But then a quick pointer in the right direction would be appreciated:)

Hi @nilsnolde. What do you think the issue is here?

Do you think maybe it was just that the server was having a bad day and now it works, or is there a difference between executing the commands manually on the CLI vs. using the Pelias scripts?

Well, it has to be a network issue. I'm not saying there's a flaw in the code, obviously I executed the exact same command manually without problems.
It does happen consistently though: Canada won't get processed when using the importer, and it has always been the above error over the last 2-3 weeks. The real problem is likely curl and its GnuTLS. So, yes, I have to dig in there.

My question here is: couldn't the importer skip files it doesn't find, instead of failing entirely?

We had a bug a while back where one layer's data was being downloaded twice, and the two downloads interacted badly together, so something like that could happen still.

However, in general I think all our importers need to support a missingFilesAreFatal flag everywhere, like this importer already does in some places.

Sometimes, you want a build to fail if anything at all goes wrong. Sometimes you want to ignore failures to download/parse data. Most Pelias code was written to fail if anything goes wrong, since that's what we wanted for Mapzen Search, but I think we need to support both modes everywhere.
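The two modes described above can be sketched as follows. This is only an illustration, not the importer's actual code: `selectMetaFiles` and its parameters are hypothetical names, and the only thing taken from this thread is the idea of a `missingFilesAreFatal` switch.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of the importer): returns the meta files that
// exist on disk. A missing file either aborts the run or is logged and
// skipped, depending on the missingFilesAreFatal flag.
function selectMetaFiles(metaDir, csvFilenames, missingFilesAreFatal, log = console.warn) {
  const present = [];
  for (const name of csvFilenames) {
    const fullPath = path.join(metaDir, name);
    if (fs.existsSync(fullPath)) {
      present.push(fullPath);
    } else if (missingFilesAreFatal) {
      // strict mode: fail the whole build on the first missing file
      throw new Error(`missing WOF meta file: ${fullPath}`);
    } else {
      // lenient mode: report the problem and carry on with the other files
      log(`skipping missing WOF meta file: ${fullPath}`);
    }
  }
  return present;
}
```

With the flag set to false, a missing Canada postal code CSV would only produce a warning and the remaining files would still be returned for import.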

Ah, damn it, there's the switch right there!! And of course it's set to true in our pelias.json..

Thanks @orangejulius again!

And for the record: I agree, having a switch is really good. At least for importers that import heaps of files, like WOF and OA.

Oh, I didn't know it would actually work :) There's a lot of ways to download files in this repo, and I bet in at least some of the code paths, failures are still not ignored. So if you do see that, let us know.

Hm ok, sorry, apparently setting missingFilesAreFatal=false didn't fix the issue:

Any idea?

I also hit a download error (in a file that wget downloads in a few seconds) and so set missingFilesAreFatal to false, after which the downloads seemed to be fine, either due to the switch or a serendipitously faster connection, but this didn't help matters. I did see that tar took a long time (more than 5 min) once I had the file from wget. So possibly the line in download_data_all.js

`curl -s ${wofDataHost}/bundles/${bundle} | tar -xj --strip-components=1 --exclude=README.txt -C ` +
`${directory} && mv ${path.join(directory, csvFilename)} ${path.join(directory, 'meta')}`;

should be broken into several lines. I don't know if wget is more reliable than curl.
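A rough sketch of what breaking that line up might look like: download to a local file first, then extract, then move the CSV, so a failure can be pinned on one step. `buildSteps` and `runSteps` are hypothetical names, the `wofDataHost` value is passed in as a parameter here, and keeping the archive inside `directory` is an assumption, not something the existing script does.

```javascript
const path = require('path');
const { execFileSync } = require('child_process');

// Hypothetical helper: describes the download, extract and move steps
// separately instead of chaining them in one shell pipeline.
function buildSteps(wofDataHost, bundle, directory) {
  const archive = path.join(directory, bundle); // assumed location for the local copy
  const csvFilename = bundle.replace(/-\d{8}T\d{6}-/, '-latest-') // support timestamped downloads
                            .replace('.tar.bz2', '.csv');
  return [
    // -f makes curl exit non-zero on HTTP errors, -o writes to a local file
    { cmd: 'curl', args: ['-s', '-f', '-o', archive, `${wofDataHost}/bundles/${bundle}`] },
    { cmd: 'tar', args: ['-xjf', archive, '--strip-components=1', '--exclude=README.txt', '-C', directory] },
    { cmd: 'mv', args: [path.join(directory, csvFilename), path.join(directory, 'meta')] }
  ];
}

// Runs the steps in order; execFileSync throws as soon as one of them exits
// with a non-zero status, so the failing step shows up in the error.
function runSteps(steps) {
  for (const { cmd, args } of steps) {
    execFileSync(cmd, args, { stdio: 'inherit' });
  }
}
```

Because each step is a separate process, a truncated download would surface as a curl or tar failure on its own, rather than as a later ENOENT when the importer looks for the CSV.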

OK, by changing to wget and local files this runs to completion. I will do a PR.

  function generateCommand(bundle, directory) {
    const csvFilename = bundle.replace(/-\d{8}T\d{6}-/, '-latest-') // support timestamped downloads
                              .replace('.tar.bz2', '.csv');
//    return `curl -s ${wofDataHost}/bundles/${bundle} | tar -xj --strip-components=1 --exclude=README.txt -C ` +
//      `${directory} && mv ${path.join(directory, csvFilename)} ${path.join(directory, 'meta')}`;
     var command=`wget ${wofDataHost}/bundles/${bundle} && tar -xvj --strip-components=1 --exclude=README.txt -C ${directory} -$
     return command;