dathere/datapusher-plus

URL parameters can break temp files

GordianDziwis opened this issue · 6 comments

Describe the bug

2023-04-13 15:51:05,819 INFO Fetching from: https://statistik.leipzig.de/opendata/api/values?kategorie_nr=5&rubrik_nr=2&periode=y&format=csv...
[pid: 6|app: 0|req: 2/2] 127.0.0.1 () {32 vars in 501 bytes} [Thu Apr 13 15:51:05 2023] GET /job/54ebecf7-2a52-4347-a241-e6f651b7f9e0 => generated 930 bytes in 2 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 1)
2023-04-13 15:51:06,124 ERROR Job "push_to_datastore (trigger: date[2023-04-13 15:51:05 UTC], next run at: 2023-04-13 15:51:05 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/lib/datapusher-plus/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/datapusher-plus/src/datapusher-plus/datapusher/jobs.py", line 471, in push_to_datastore
    tmp = tempfile.NamedTemporaryFile(suffix="." + resource.get("format").lower())
  File "/usr/lib/python3.10/tempfile.py", line 698, in NamedTemporaryFile
    file = _io.open(dir, mode, buffering=buffering,
  File "/usr/lib/python3.10/tempfile.py", line 695, in opener
    fd, name = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.10/tempfile.py", line 395, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppdo2j9wn.text/csv'

To Reproduce
Steps to reproduce the behavior:

  1. Create a resource with this link: https://statistik.leipzig.de/opendata/api/values?kategorie_nr=5&rubrik_nr=2&periode=y&format=csv
  2. Upload resource with datapusher-plus

Thanks for the report, @BonaBeavis.

Can confirm this is a bug.

Where in the code are the paths for the temp files created and read? I can look into this.

Looking at the error, it seems that we need to add some logic to sanitize the file extension.

The resource format attribute can sometimes be the full MIME type, which contains a slash, rather than just the extension.
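
For illustration, the failure can be reproduced with the stdlib alone - a suffix containing a slash makes tempfile try to create the file inside a directory that doesn't exist:

import tempfile

# resource.get("format") is "text/csv" here, so the computed suffix is
# ".text/csv"; tempfile joins it onto the random name and then tries to
# open "/tmp/tmpXXXXXXXX.text/csv", whose parent directory doesn't exist.
tempfile.NamedTemporaryFile(suffix="." + "text/csv".lower())
# -> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp....text/csv'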

@BonaBeavis, can you confirm and fix this?

Sure, can you point me to the line where the path is built?

Great!

The whole download section is here - check line 471, where we just grab resource.format.

Perhaps we should check that the resulting temp file is valid, and handle the case when the full MIME type is stored instead of just the file extension (e.g. application/csv and not just csv). A rough sketch of that follows the code below.

# ==========================================================================
# DOWNLOAD
# ==========================================================================
timer_start = time.perf_counter()
# fetch the resource data
logger.info("Fetching from: {0}...".format(url))
headers = {}
preview_rows = int(config.get("PREVIEW_ROWS"))
download_preview_only = config.get("DOWNLOAD_PREVIEW_ONLY")
if resource.get("url_type") == "upload":
    # If this is an uploaded file to CKAN, authenticate the request,
    # otherwise we won't get the file from private resources
    headers["Authorization"] = api_key
try:
    kwargs = {
        "headers": headers,
        "timeout": config.get("DOWNLOAD_TIMEOUT"),
        "verify": config.get("SSL_VERIFY"),
        "stream": True,
    }
    if USE_PROXY:
        kwargs["proxies"] = {"http": DOWNLOAD_PROXY, "https": DOWNLOAD_PROXY}
    response = requests.get(url, **kwargs)
    response.raise_for_status()
    cl = response.headers.get("content-length")
    max_content_length = int(config.get("MAX_CONTENT_LENGTH"))
    try:
        if (
            cl
            and int(cl) > max_content_length
            and (preview_rows > 0 and not download_preview_only)
        ):
            raise util.JobError(
                "Resource too large to download: {cl:.2MB} > max ({max_cl:.2MB}).".format(
                    cl=DataSize(int(cl)), max_cl=DataSize(int(max_content_length))
                )
            )
    except ValueError:
        pass
    # NB: this is the line 471 from the traceback - the suffix comes straight
    # from resource.format, so a format stored as a full MIME type such as
    # "text/csv" produces a suffix with a slash and breaks temp-file creation
    tmp = tempfile.NamedTemporaryFile(suffix="." + resource.get("format").lower())
    length = 0
    m = hashlib.md5()
    if download_preview_only and preview_rows > 0:
        # if download_preview_only and preview_rows is greater than zero
        # we're downloading the preview rows only, not the whole file
        if cl:
            logger.info(
                "Downloading only first {:,} row preview from {:.2MB} file...".format(
                    preview_rows, DataSize(int(cl))
                )
            )
        else:
            logger.info(
                "Downloading only first {:,} row preview of file of unknown size...".format(
                    preview_rows
                )
            )
        curr_line = 0
        for line in response.iter_lines(int(config.get("CHUNK_SIZE"))):
            curr_line += 1
            # add back the linefeed as iter_lines removes it
            line = line + "\n".encode("ascii")
            length += len(line)
            tmp.write(line)
            m.update(line)
            if curr_line > preview_rows:
                break
    else:
        # download the entire file otherwise
        # TODO: handle when preview_rows is negative (get the preview from the
        # end of the file) so we can use http range request to get the preview from
        # the end without downloading the entire file
        if cl:
            logger.info("Downloading {:.2MB} file...".format(DataSize(int(cl))))
        else:
            logger.info("Downloading file of unknown size...")
        for chunk in response.iter_content(int(config.get("CHUNK_SIZE"))):
            length += len(chunk)
            if length > max_content_length and not preview_rows:
                raise util.JobError(
                    "Resource too large to process: {cl} > max ({max_cl}).".format(
                        cl=length, max_cl=max_content_length
                    )
                )
            tmp.write(chunk)
            m.update(chunk)
    ct = response.headers.get("content-type", "").split(";", 1)[0]
except requests.HTTPError as e:
    raise HTTPError(
        "DataPusher+ received a bad HTTP response when trying to download "
        "the data file",
        status_code=e.response.status_code,
        request_url=url,
        response=e.response.content,
    )
except requests.RequestException as e:
    raise HTTPError(
        message=str(e), status_code=None, request_url=url, response=None
    )
file_hash = m.hexdigest()
tmp.seek(0)
if (
    resource.get("hash") == file_hash
    and not data.get("ignore_hash")
    and not config.get("IGNORE_FILE_HASH")
):
    logger.warning(
        "Upload skipped as the file hash hasn't changed: {hash}.".format(
            hash=file_hash
        )
    )
    return
resource["hash"] = file_hash
fetch_elapsed = time.perf_counter() - timer_start
logger.info(
    "Fetched {:.2MB} file in {:,.2f} seconds.".format(
        DataSize(length), fetch_elapsed
    )
)
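
For what it's worth, here is a rough sketch of the sanitization idea - the helper name and the mimetypes fallback are just illustrative, not the actual fix:

import mimetypes
import tempfile

def format_to_suffix(fmt):
    # The format attribute may be a plain extension ("csv") or a full
    # MIME type ("text/csv"); a slash in the suffix would make
    # NamedTemporaryFile try to open a path in a non-existent directory.
    fmt = (fmt or "").strip().lower()
    if "/" in fmt:
        # looks like a MIME type: try the stdlib mapping first,
        # then fall back to the subtype after the slash
        ext = mimetypes.guess_extension(fmt)
        return ext if ext else "." + fmt.rsplit("/", 1)[-1]
    return "." + fmt if fmt else ""

# format_to_suffix("text/csv") -> ".csv"
# format_to_suffix("application/csv") -> ".csv" (via the fallback)
# format_to_suffix("csv") -> ".csv"
tmp = tempfile.NamedTemporaryFile(suffix=format_to_suffix("text/csv"))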

Thanks!!