Error on " in filename
gabmert opened this issue · 9 comments
I was running pferd as usual, but today I got this error:
Error An unexpected exception occurred
Traceback (most recent call last):
File "/usr/lib/python3.11/site-packages/PFERD/pferd.py", line 156, in run
await crawler.run()
File "/usr/lib/python3.11/site-packages/PFERD/crawl/http_crawler.py", line 193, in run
await super().run()
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
return await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 338, in run
await self._run()
File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 209, in _run
await self._crawl_course(self._target)
File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 223, in _crawl_course
await self._crawl_url(root_url, expected_id=course_id)
File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 285, in _crawl_url
await self.gather(tasks)
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 274, in gather
return await result
^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
return await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 345, in _crawl_ilias_page
await self.gather(tasks)
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 274, in gather
return await result
^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
return await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 111, in wrapper
return await f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 664, in _download_file
async with dl as (bar, sink):
File "/usr/lib/python3.11/site-packages/PFERD/utils.py", line 126, in __aenter__
result: T = await self._on_aenter()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 133, in _on_aenter
sink = await self._stack.enter_async_context(self._fs_token)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/contextlib.py", line 638, in enter_async_context
result = await _enter(cm)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/utils.py", line 126, in __aenter__
result: T = await self._on_aenter()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/output_dir.py", line 116, in _on_aenter
tmp_path, file = await self._output_dir._create_tmp_file(self._local_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/PFERD/output_dir.py", line 359, in _create_tmp_file
return tmp_path, open(tmp_path, "xb")
^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument: '<secret-path>/MT/Offers to Students/.Study "Mattering and motivation academic studies".pdf.tmp.jnnl5o'
My first guess is that either pferd or my filesystem can't handle the " in the filename.
This is the file in ilias:
Are you syncing into a OneDrive folder by chance? (See https://support.microsoft.com/en-us/office/restrictions-and-limitations-in-onedrive-and-sharepoint-64883a5d-228e-48f5-b3d2-eb39e07630fa#invalidcharacters)
Thank you for the hint. No, it's not OneDrive, but it is an NTFS partition mounted via ntfs3, so I'm not able to use this filename. Now I'm wondering what would happen if I downloaded this file via a browser on Windows. I will try that when I have time.
For now I'm trying this transform in my config, but I can't get the escaping to work. I need to do some more research:
transform =
# (.*) -re->> "{g1.replace('"', '_')}"
# (.*) -re->> "{g1.replace('\"', '_')}"
(.*) -re->> '{g1.replace("\"", "_")}'
For the future: Is it a possible feature to have pferd automatically rename filenames which aren't allowed on the location?
You can set windows_paths = yes in your default section (or for individual crawlers):
[DEFAULT]
windows_paths = yes
Then PFERD will apply the Windows (and OneDrive) escape rules. See line 36 in 533bc27.
PFERD will auto-detect if it is run on Windows, but it cannot auto-detect whether you are writing to some sink that eventually winds up in a Windows context. If you have a nice idea for that, that could be changed.
Maybe a good first change would be catching that OS error and suggesting the windows_paths option?
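A minimal sketch of what such a hint could look like, in plain Python. This is not PFERD's code: the function name open_tmp_file, the opener parameter and the hint text are hypothetical, and only the open(tmp_path, "xb") call and the windows_paths option name come from this thread. Note that Errno 22 (EINVAL) can have other causes, so the message is worded as a suggestion only.

```python
import errno

# Hypothetical hint text; only the option name windows_paths comes from the thread.
HINT = "If the target filesystem follows Windows naming rules, try windows_paths = yes."


def open_tmp_file(tmp_path, opener=open):
    """Create tmp_path exclusively; add a hint if the OS rejects the name."""
    try:
        return opener(tmp_path, "xb")
    except OSError as e:
        # EINVAL is what the traceback above shows for the quoted filename.
        if e.errno == errno.EINVAL:
            raise RuntimeError(f"Could not create {tmp_path!r}. {HINT}") from e
        raise  # any other OS error is passed through unchanged
```

The opener parameter exists only so the behaviour can be exercised without an NTFS mount; a real change would live wherever the temporary file is created.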
Thank you for pointing out that this feature already exists: windows_paths = yes. That solves my problem. I think it would be too much effort to detect a situation like mine.
Now that my issue is solved, I still have the question of how I would do this with a transform rule.
experiments in python:
d = 'Study "Mattering and motivation academic studies".pdf'
d.replace('"', '_')
# Out[6]: 'Study _Mattering and motivation academic studies_.pdf'
f"{d.replace('"','_')}"
# SyntaxError: unterminated string literal
f'{d.replace("\"","_")}'
# SyntaxError: f-string expression part cannot include a backslash
f"""{d.replace('"','_')}"""
# Out[28]: 'Study _Mattering and motivation academic studies_.pdf'
f'''{d.replace('"','_')}'''
# Out[30]: 'Study _Mattering and motivation academic studies_.pdf'
When I put the following into pferd.cfg
transform =
(.*) -re->> '''{g1.replace('"', '_')}'''
I receive
Error Error parsing rule on line 1:
(.*) -re->> '''{g1.replace('"', '_')}'''
^--- Expected end of line
and a similar result for triple double quotes.
https://realpython.com/python-f-strings/ seems to suggest that f"{d.replace('"','_')}" could work in Python 3.12.
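For plain Python before 3.12, one way around both errors is to keep the quote character and the backslash out of the f-string expression entirely, for example by spelling the double quote as chr(34). The sketch below only shows the Python side; the names QUOTE and replace_quotes are made up for illustration, and whether an expression like this is accepted inside a PFERD transform rule is not something this thread confirms.

```python
QUOTE = chr(34)  # the double-quote character, written without any quoting


def replace_quotes(name: str, replacement: str = "_") -> str:
    # Neither a quote character nor a backslash appears inside the braces,
    # so this parses on Python versions that restrict f-string expressions.
    return f"{name.replace(QUOTE, replacement)}"


d = 'Study "Mattering and motivation academic studies".pdf'
print(replace_quotes(d))
# Study _Mattering and motivation academic studies_.pdf
```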
There are two things:
- PFERD does not support nested quotations. This means that you need to escape the outer quotation mark whenever it appears in the string, i.e. 'hey\'there' or "hey\"there".
- PFERD uses f'{right!r}' to format its right-hand-side eval rule, which apparently uses a simple heuristic for choosing outer quotes: if a " appears in the string, it chooses ' for the outer quotes. This also replaces all inner single quotes with \', in order to generate a valid Python string. As you need to use either \" or '"' to represent your double quote in the replacement rule, your string will always contain either a backslash or a single quote. The !r then normalizes the single quote to \', creating a backslash within the f-string.
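The quoting heuristic described here is Python's own repr() for strings, which can be checked in isolation. The sketch below does not touch PFERD at all; the sample strings and the name repr_samples are just for illustration.

```python
def repr_samples():
    """Return repr() of a few strings to show how outer quotes are chosen."""
    samples = {
        "plain": "abc",                            # no quotes at all
        "single": "it's",                          # only a single quote
        "double": 'say "hi"',                      # only double quotes
        "both": """{g1.replace('"', '_')}""",      # both kinds, as in the rule
    }
    return {name: repr(text) for name, text in samples.items()}


quoted = repr_samples()
for name, text in quoted.items():
    print(f"{name:>6}: {text}")

# With both kinds of quotes present, repr() picks single outer quotes and
# escapes the inner single quotes, so the result contains backslashes.
print("\\" in quoted["both"])
```

The last line is the relevant one for the transform rule: once both quote kinds appear in the right-hand side, the formatted string carries a backslash into the f-string.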
I think the only solution within the current rules would be to just chain a few regex replacement rules and hope for the best, i.e.
'"([^"]+)' -re->> '_{g1}' # leading quotes
'([^"]+)"' -re->> '{g1}_' # trailing quotes
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # a
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # few
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # inner
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # quotes
In Python 3.12 there might be a nicer solution, but aiohttp is not yet compatible.
Thanks for the info on PFERD! Really appreciate that you took the time to explain.
I tried the regex chain, but couldn't get it to work. I'll stay with windows_paths = yes.
windows_paths is the intended solution and also guards against a few other Windows idiosyncrasies :) So that is the solution I'd advise you to use.
Yea, I was a bit stupid
'"([^"]+)' -re->> '_{g1}'
'([^"]+)"' -re->> '{g1}_'
'([^"]+)"(.*)' -re->> '{g1}_{g2}'
'([^"]+)"(.*)' -re->> '{g1}_{g2}'
'([^"]+)"(.*)' -re->> '{g1}_{g2}'
'([^"]+)"(.*)' -re->> '{g1}_{g2}'
You cannot require the second part to also be free of quotes 😅 This seems to work correctly:
hello "there my" dear
Testing rule 1: '"([^"]+)' -re->> '_{g1}'
Testing rule 2: '([^"]+)"' -re->> '{g1}_'
Testing rule 3: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
Match found, updated path to 'hello _there my" dear'
Testing rule 4: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
Match found, updated path to 'hello _there my_ dear'
Testing rule 5: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
Testing rule 6: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
Final result: 'hello _there my_ dear'
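The trace can be reproduced outside PFERD with a rough Python approximation of the rule chain. This is a sketch under assumptions, not PFERD's implementation: it assumes each -re->> rule must match the whole path and that a match hands the rewritten path on to the next rule, which is what the trace above suggests. The names RULES and apply_rules are made up.

```python
import re

# (pattern, template) pairs mirroring the six rules above.
RULES = [
    (r'"([^"]+)', "_{g1}"),
    (r'([^"]+)"', "{g1}_"),
    (r'([^"]+)"(.*)', "{g1}_{g2}"),
    (r'([^"]+)"(.*)', "{g1}_{g2}"),
    (r'([^"]+)"(.*)', "{g1}_{g2}"),
    (r'([^"]+)"(.*)', "{g1}_{g2}"),
]


def apply_rules(path: str) -> str:
    """Apply every rule in order; a matching rule rewrites the path."""
    for pattern, template in RULES:
        match = re.fullmatch(pattern, path)  # assumed: whole-path match
        if match:
            groups = {f"g{i}": g for i, g in enumerate(match.groups(), start=1)}
            path = template.format(**groups)
    return path


print(apply_rules('hello "there my" dear'))
# hello _there my_ dear
```

Each of the four repeated rules removes at most one quote, so under these assumptions the chain handles up to four quotes that are neither leading nor trailing in the path.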