dathere/datapusher-plus

URL parameters can break temp files

GordianDziwis opened this issue · 6 comments

Describe the bug

2023-04-13 15:51:05,819 INFO Fetching from: https://statistik.leipzig.de/opendata/api/values?kategorie_nr=5&rubrik_nr=2&periode=y&format=csv...
[pid: 6|app: 0|req: 2/2] 127.0.0.1 () {32 vars in 501 bytes} [Thu Apr 13 15:51:05 2023] GET /job/54ebecf7-2a52-4347-a241-e6f651b7f9e0 => generated 930 bytes in 2 msecs (HTTP/1.1 200) 2 headers in 72 bytes (1 switches on core 1)
2023-04-13 15:51:06,124 ERROR Job "push_to_datastore (trigger: date[2023-04-13 15:51:05 UTC], next run at: 2023-04-13 15:51:05 UTC)" raised an exception
Traceback (most recent call last):
  File "/usr/lib/datapusher-plus/lib/python3.10/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/datapusher-plus/src/datapusher-plus/datapusher/jobs.py", line 471, in push_to_datastore
    tmp = tempfile.NamedTemporaryFile(suffix="." + resource.get("format").lower())
  File "/usr/lib/python3.10/tempfile.py", line 698, in NamedTemporaryFile
    file = _io.open(dir, mode, buffering=buffering,
  File "/usr/lib/python3.10/tempfile.py", line 695, in opener
    fd, name = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/usr/lib/python3.10/tempfile.py", line 395, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmppdo2j9wn.text/csv'

To Reproduce
Steps to reproduce the behavior:

  1. Create a resource with this link: https://statistik.leipzig.de/opendata/api/values?kategorie_nr=5&rubrik_nr=2&periode=y&format=csv
  2. Upload resource with datapusher-plus

Thanks for the report, @BonaBeavis.

Can confirm this is a bug.

Where in the code are the paths for the temp files created and read? I can look into this.

Looking at the error, it seems that we need to add some logic to sanitize the file extension.

The resource format attribute can sometimes be the full MIME type, which contains a slash, rather than just the extension.
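
For illustration, the failure can be reproduced with the stdlib alone - a suffix containing a slash makes tempfile try to create the file inside a directory that doesn't exist:

import tempfile

# resource.get("format") is "text/csv" here, so the computed suffix is
# ".text/csv"; tempfile joins it onto the random name and then tries to
# open "/tmp/tmpXXXXXXXX.text/csv", whose parent directory doesn't exist.
tempfile.NamedTemporaryFile(suffix="." + "text/csv".lower())
# -> FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp....text/csv'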

@BonaBeavis, can you confirm and fix this?

Sure, can you point me to the line where the path is built?

Great!

The whole download section is here - check line 471, where we just grab resource.format.

Perhaps we should check that the resulting temp file is valid, and handle the case when the full MIME type is stored instead of just the file extension (e.g. application/csv and not just csv). A rough sketch of that follows the code below.

# ==========================================================================
# DOWNLOAD
# ==========================================================================
timer_start = time.perf_counter()
# fetch the resource data
logger.info("Fetching from: {0}...".format(url))
headers = {}
preview_rows = int(config.get("PREVIEW_ROWS"))
download_preview_only = config.get("DOWNLOAD_PREVIEW_ONLY")
if resource.get("url_type") == "upload":
    # If this is an uploaded file to CKAN, authenticate the request,
    # otherwise we won't get the file from private resources
    headers["Authorization"] = api_key
try:
    kwargs = {
        "headers": headers,
        "timeout": config.get("DOWNLOAD_TIMEOUT"),
        "verify": config.get("SSL_VERIFY"),
        "stream": True,
    }
    if USE_PROXY:
        kwargs["proxies"] = {"http": DOWNLOAD_PROXY, "https": DOWNLOAD_PROXY}
    response = requests.get(url, **kwargs)
    response.raise_for_status()
    cl = response.headers.get("content-length")
    max_content_length = int(config.get("MAX_CONTENT_LENGTH"))
    try:
        if (
            cl
            and int(cl) > max_content_length
            and (preview_rows > 0 and not download_preview_only)
        ):
            raise util.JobError(
                "Resource too large to download: {cl:.2MB} > max ({max_cl:.2MB}).".format(
                    cl=DataSize(int(cl)), max_cl=DataSize(int(max_content_length))
                )
            )
    except ValueError:
        pass
    # NB: this is the line 471 from the traceback - the suffix comes straight
    # from resource.format, so a format stored as a full MIME type such as
    # "text/csv" produces a suffix with a slash and breaks temp-file creation
    tmp = tempfile.NamedTemporaryFile(suffix="." + resource.get("format").lower())
    length = 0
    m = hashlib.md5()
    if download_preview_only and preview_rows > 0:
        # if download_preview_only and preview_rows is greater than zero
        # we're downloading the preview rows only, not the whole file
        if cl:
            logger.info(
                "Downloading only first {:,} row preview from {:.2MB} file...".format(
                    preview_rows, DataSize(int(cl))
                )
            )
        else:
            logger.info(
                "Downloading only first {:,} row preview of file of unknown size...".format(
                    preview_rows
                )
            )
        curr_line = 0
        for line in response.iter_lines(int(config.get("CHUNK_SIZE"))):
            curr_line += 1
            # add back the linefeed as iter_lines removes it
            line = line + "\n".encode("ascii")
            length += len(line)
            tmp.write(line)
            m.update(line)
            if curr_line > preview_rows:
                break
    else:
        # download the entire file otherwise
        # TODO: handle when preview_rows is negative (get the preview from the
        # end of the file) so we can use http range request to get the preview from
        # the end without downloading the entire file
        if cl:
            logger.info("Downloading {:.2MB} file...".format(DataSize(int(cl))))
        else:
            logger.info("Downloading file of unknown size...")
        for chunk in response.iter_content(int(config.get("CHUNK_SIZE"))):
            length += len(chunk)
            if length > max_content_length and not preview_rows:
                raise util.JobError(
                    "Resource too large to process: {cl} > max ({max_cl}).".format(
                        cl=length, max_cl=max_content_length
                    )
                )
            tmp.write(chunk)
            m.update(chunk)
    ct = response.headers.get("content-type", "").split(";", 1)[0]
except requests.HTTPError as e:
    raise HTTPError(
        "DataPusher+ received a bad HTTP response when trying to download "
        "the data file",
        status_code=e.response.status_code,
        request_url=url,
        response=e.response.content,
    )
except requests.RequestException as e:
    raise HTTPError(
        message=str(e), status_code=None, request_url=url, response=None
    )
file_hash = m.hexdigest()
tmp.seek(0)
if (
    resource.get("hash") == file_hash
    and not data.get("ignore_hash")
    and not config.get("IGNORE_FILE_HASH")
):
    logger.warning(
        "Upload skipped as the file hash hasn't changed: {hash}.".format(
            hash=file_hash
        )
    )
    return
resource["hash"] = file_hash
fetch_elapsed = time.perf_counter() - timer_start
logger.info(
    "Fetched {:.2MB} file in {:,.2f} seconds.".format(
        DataSize(length), fetch_elapsed
    )
)
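
For what it's worth, here is a rough sketch of the sanitization idea - the helper name and the mimetypes fallback are just illustrative, not the actual fix:

import mimetypes
import tempfile

def format_to_suffix(fmt):
    # The format attribute may be a plain extension ("csv") or a full
    # MIME type ("text/csv"); a slash in the suffix would make
    # NamedTemporaryFile try to open a path in a non-existent directory.
    fmt = (fmt or "").strip().lower()
    if "/" in fmt:
        # looks like a MIME type: try the stdlib mapping first,
        # then fall back to the subtype after the slash
        ext = mimetypes.guess_extension(fmt)
        return ext if ext else "." + fmt.rsplit("/", 1)[-1]
    return "." + fmt if fmt else ""

# format_to_suffix("text/csv") -> ".csv"
# format_to_suffix("application/csv") -> ".csv" (via the fallback)
# format_to_suffix("csv") -> ".csv"
tmp = tempfile.NamedTemporaryFile(suffix=format_to_suffix("text/csv"))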

Thanks!!