pelias/whosonfirst

Download sqlite database without storing temporary archive

Closed this issue · 2 comments

The sqlite download currently downloads the bz2 archive to a temporary file, and then extracts the database from that local file. This is not ideal for two reasons:

  • it increases the disk space needed, as there has to be, at least momentarily, enough space to hold both the compressed archive and the uncompressed database
  • it increases the time needed to download. Ideally the file would be decompressed as it is downloaded.

It appears this was done because the timestamp of the archive file is generated after it's downloaded and is then compared on later runs to avoid re-downloading identical files.

We could probably streamline this by using curl to get the remote Last-Modified time via a HEAD request, and then downloading the archive immediately after, without a temporary file.
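
A minimal sketch of that idea, not the project's actual code. The helper names (`headCommand`, `parseLastModified`, `needsDownload`, `streamCommand`) are hypothetical, and `bunzip2` is assumed to be the extractor. The helpers only build command strings and parse header text, so the caller decides how to run them:

```javascript
// Hypothetical helpers for illustration; none of these names come from pelias/whosonfirst.

// Build a curl command that sends a HEAD request (-I) silently (-s).
function headCommand(url) {
  return `curl -sI ${url}`;
}

// Pull the Last-Modified header out of raw HEAD output.
// Returns a Date, or null if the header is missing or unparseable.
function parseLastModified(rawHeaders) {
  const match = rawHeaders.match(/^last-modified:\s*(.+)$/im);
  if (!match) return null;
  const date = new Date(match[1].trim());
  return Number.isNaN(date.getTime()) ? null : date;
}

// Decide whether to download: yes when either timestamp is unknown,
// or when the remote archive is newer than the local database.
function needsDownload(remoteDate, localDate) {
  if (remoteDate === null || localDate === null) return true;
  return remoteDate.getTime() > localDate.getTime();
}

// Build a command that streams the archive straight into the extractor,
// so the compressed file never touches the disk.
function streamCommand(url, target) {
  return `curl -s ${url} | bunzip2 > ${target}`;
}
```

A caller could run `headCommand(url)` with `child_process.execSync`, pass the output to `parseLastModified`, compare against the local file's mtime with `needsDownload`, and only then run `streamCommand`.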

We have had issues piping curl into bunzip in the past. The temporary file isn't ideal, but it has proven itself to be stable.

Joxit commented

Fixed since #417 (comment).

return `curl -s ${wofDataHost}/sqlite/${sqlite.name_compressed} | ${extract} > ${path.join(directory, 'sqlite', sqlite.name)}`;
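
To make the one-liner above concrete, here is a sketch of what the template expands to. All values passed in (host, directory, file names, and `bunzip2` as the extractor) are placeholders assumed for illustration, and `buildDownloadCommand` is a hypothetical wrapper, not a function from the repository:

```javascript
const path = require('path');

// Hypothetical wrapper around the template from the comment above.
// Every argument is supplied by the caller; nothing here reads the real config.
function buildDownloadCommand(wofDataHost, directory, extract, sqlite) {
  return `curl -s ${wofDataHost}/sqlite/${sqlite.name_compressed} | ${extract} > ${path.join(directory, 'sqlite', sqlite.name)}`;
}

// Placeholder values, assumed for illustration only.
const command = buildDownloadCommand(
  'https://example.com',
  '/data/whosonfirst',
  'bunzip2',
  { name_compressed: 'example.db.bz2', name: 'example.db' }
);

console.log(command);
```

One thing to keep in mind with this shape: by default a shell pipeline reports the exit status of its last command, so a failed `curl` can go unnoticed unless the command is run in a shell with something like bash's `set -o pipefail`.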