Garmelon/PFERD

Error on " in filename

gabmert opened this issue · 9 comments

I was running pferd as usual, but today I got this error:

Error An unexpected exception occurred

Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/PFERD/pferd.py", line 156, in run
    await crawler.run()
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/http_crawler.py", line 193, in run
    await super().run()
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
    return await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 338, in run
    await self._run()
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 209, in _run
    await self._crawl_course(self._target)
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 223, in _crawl_course
    await self._crawl_url(root_url, expected_id=course_id)
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 285, in _crawl_url
    await self.gather(tasks)
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 274, in gather
    return await result
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
    return await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 345, in _crawl_ilias_page
    await self.gather(tasks)
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 274, in gather
    return await result
           ^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
    return await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 111, in wrapper
    return await f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 664, in _download_file
    async with dl as (bar, sink):
  File "/usr/lib/python3.11/site-packages/PFERD/utils.py", line 126, in __aenter__
    result: T = await self._on_aenter()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/crawl/crawler.py", line 133, in _on_aenter
    sink = await self._stack.enter_async_context(self._fs_token)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 638, in enter_async_context
    result = await _enter(cm)
             ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/utils.py", line 126, in __aenter__
    result: T = await self._on_aenter()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/output_dir.py", line 116, in _on_aenter
    tmp_path, file = await self._output_dir._create_tmp_file(self._local_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/PFERD/output_dir.py", line 359, in _create_tmp_file
    return tmp_path, open(tmp_path, "xb")
                     ^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument: '<secret-path>/MT/Offers to Students/.Study "Mattering and motivation academic studies".pdf.tmp.jnnl5o'

My first guess is that either PFERD or my filesystem can't handle the " in the filename.
This is the file in ilias:
[Screenshot_20231109_171655: screenshot of the file entry in ILIAS]

Thank you for the hint. No, it's not OneDrive; it's an NTFS partition mounted via ntfs3, so I'm not able to use this filename. Now I'm wondering what would happen if I downloaded this file via a browser on Windows. I will try when I have time.

For now I'm trying this transform in my config, but I can't get the escaping to work. I need to do some more research.

transform =
  # (.*) -re->> "{g1.replace('"', '_')}"
  # (.*) -re->> "{g1.replace('\"', '_')}"
  (.*) -re->> '{g1.replace("\"", "_")}'

For the future: would it be possible for PFERD to automatically rename files whose names aren't allowed at the target location?

You can set windows_paths = yes in your default section (or for individual crawlers):

[DEFAULT]
windows_paths = yes

Then PFERD will apply the windows (and OneDrive) escape rules. See

def _fixup_element(self, name: str) -> str:
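As a rough illustration of what this kind of escaping involves, here is a minimal Python sketch. It is not PFERD's actual `_fixup_element`; the names `sanitize_windows_name`, `FORBIDDEN_CHARS` and `RESERVED_NAMES` are made up for this example, and replacing with `_` is an assumption. The character and name restrictions themselves are the usual Windows ones.

```python
# Hypothetical sketch, NOT PFERD's implementation: replace characters that
# Windows does not allow in file names and dodge reserved device names.
FORBIDDEN_CHARS = set('<>:"/\\|?*') | {chr(i) for i in range(32)}
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def sanitize_windows_name(name: str) -> str:
    """Return a variant of a single path element that Windows should accept."""
    # Replace every forbidden character with an underscore (assumed choice).
    cleaned = "".join("_" if c in FORBIDDEN_CHARS else c for c in name)
    # Windows does not allow names that end in a dot or a space.
    cleaned = cleaned.rstrip(". ")
    # Device names such as CON or NUL are reserved, with or without extension.
    if cleaned.split(".", 1)[0].upper() in RESERVED_NAMES:
        cleaned = "_" + cleaned
    return cleaned or "_"


print(sanitize_windows_name('Study "Mattering and motivation academic studies".pdf'))
```

This only handles one path element at a time, so it would have to be applied to each directory and file name separately.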

PFERD auto-detects whether it is running on Windows, but it cannot detect that you are writing to a sink that eventually ends up in a Windows context. If you have a nice idea for that, it could be changed.
Maybe a good first change would be catching that OS error and suggesting the windows_paths option?
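A minimal sketch of what such a hint could look like, assuming the failure always shows up as `OSError` with errno 22 (`EINVAL`) as in the traceback above. The helper name `explain_os_error` and the wording of the message are hypothetical; this is not existing PFERD code.

```python
import errno
from typing import Optional


def explain_os_error(error: OSError) -> Optional[str]:
    """Return a user-facing hint for errors that look like invalid file names."""
    # Errno 22 (EINVAL) is what the traceback in this issue shows when the
    # target filesystem rejects a character in the file name.
    if error.errno == errno.EINVAL:
        return (
            "The target filesystem rejected the file name. "
            "If you are writing to NTFS or OneDrive, try setting "
            "'windows_paths = yes' in your config."
        )
    return None


# Usage sketch: wrap the call that creates the file.
try:
    raise OSError(errno.EINVAL, "Invalid argument")  # stand-in for open(tmp_path, "xb")
except OSError as e:
    hint = explain_os_error(e)
    print(hint if hint is not None else f"Unexpected error: {e}")
```

EINVAL can have other causes too, so the message is phrased as a suggestion rather than a diagnosis.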

Thank you for pointing out that this feature already exists as windows_paths = yes. That solves my problem. I think it would be too much effort to detect a situation like mine.

Now that my issue is solved, I still have the question of how I would do this with a transform rule.
Experiments in Python:

d = 'Study "Mattering and motivation academic studies".pdf'
d.replace('"', '_')
# Out[6]: 'Study _Mattering and motivation academic studies_.pdf' 
f"{d.replace('"','_')}"
# SyntaxError: unterminated string literal
f'{d.replace("\"","_")}'
# SyntaxError: f-string expression part cannot include a backslash

f"""{d.replace('"','_')}"""
# Out[28]: 'Study _Mattering and motivation academic studies_.pdf'
f'''{d.replace('"','_')}'''
# Out[30]: 'Study _Mattering and motivation academic studies_.pdf'

When I put the following into pferd.cfg

transform =
  (.*) -re->> '''{g1.replace('"', '_')}'''

I receive

Error Error parsing rule on line 1:
(.*) -re->> '''{g1.replace('"', '_')}'''
              ^--- Expected end of line

and a similar result for triple double quotes.

https://realpython.com/python-f-strings/ seems to suggest that f"{d.replace('"','_')}" could work in Python 3.12.

There are two things:

  1. PFERD does not support nested quotations. This means that you need to escape the outer quotation whenever it appears in the string, i.e. 'hey\'there' or "hey\"there".
  2. PFERD uses f'{right!r}' to format the right-hand side of an eval rule. The !r conversion apparently uses a simple heuristic for choosing the outer quotes: if a " appears in the string, it chooses ' for the outer quotes and replaces all inner single quotes with \' in order to generate a valid Python string. As you need to use either \" or '"' to represent your double quote in the replacement rule, your string will always contain either a backslash or a single quote. The !r then normalizes the single quote to \', creating a backslash within the f-string.
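The quoting behaviour described in point 2 can be reproduced in plain Python. This is only a sketch: the variable names are made up, and building the source as "f" + repr(rhs) is an assumption based on the description above, not a copy of PFERD's code.

```python
# The right-hand side as it would appear after PFERD has parsed the rule
# (hypothetical example text): {g1.replace('"', '_')}
rhs = """{g1.replace('"', '_')}"""

# repr() sees both quote types, picks ' as the outer quote and escapes
# every inner single quote with a backslash.
literal = repr(rhs)
print(literal)

# Assumed construction of the f-string source that would be evaluated.
source = "f" + literal

# The thread reports a SyntaxError for backslashes inside the expression
# part on Python 3.11; behaviour on other versions is not verified here.
try:
    print(eval(source, {"g1": 'a "b" c'}))
except SyntaxError as e:
    print("SyntaxError:", e.msg)
```

Whichever way the double quote is written in the rule, the text handed to `repr()` ends up containing a backslash, which is the point made above.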

I think the only solution within the current rules would be to just chain a few regex replacement rules and hope for the best, i.e.

'"([^"]+)' -re->> '_{g1}'            # leading quotes
'([^"]+)"' -re->> '{g1}_'            # trailing quotes
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # a
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # few
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # inner
'([^"]+)"([^"]+)' -re->> '{g1}_{g2}' # quotes

In Python 3.12 there might be a nicer solution, but aiohttp is not yet compatible.

Thanks for the info on PFERD! Really appreciate that you took your time to explain.
I tried the regex chain, but couldn't get it to work. I'll stay with windows_paths = yes

> Really appreciate that you took your time to explain.

🐞


windows_paths is the intended solution and also guards against a few other Windows idiosyncrasies :) So that is the solution I'd advise you to use.

Yea, I was a bit stupid

    '"([^"]+)' -re->> '_{g1}'
    '([^"]+)"' -re->> '{g1}_'
    '([^"]+)"(.*)' -re->> '{g1}_{g2}'
    '([^"]+)"(.*)' -re->> '{g1}_{g2}'
    '([^"]+)"(.*)' -re->> '{g1}_{g2}'
    '([^"]+)"(.*)' -re->> '{g1}_{g2}'

You cannot require the second part to also be free of quotes 😅 This seems to work correctly:

hello "there my" dear
  Testing rule 1: '"([^"]+)' -re->> '_{g1}'
  Testing rule 2: '([^"]+)"' -re->> '{g1}_'
  Testing rule 3: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
  Match found, updated path to 'hello _there my" dear'
  Testing rule 4: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
  Match found, updated path to 'hello _there my_ dear'
  Testing rule 5: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
  Testing rule 6: '([^"]+)"(.*)' -re->> '{g1}_{g2}'
  Final result: 'hello _there my_ dear'
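The difference between the two rule chains can be checked outside of PFERD with a small Python simulation. It assumes that a `-re->>` rule must match the whole path (modelled with `re.fullmatch`) and that processing continues with the next rule after a match, which is consistent with the trace above. `apply_rules`, `FIRST_ATTEMPT` and `CORRECTED` are names invented for this sketch.

```python
import re


def apply_rules(path, rules):
    """Apply (pattern, template) rules in order, continuing after each match."""
    for pattern, template in rules:
        match = re.fullmatch(pattern, path)  # assumed: whole path must match
        if match is None:
            continue
        # Expose capture groups as g1, g2, ... like in the rules above.
        groups = {f"g{i}": g for i, g in enumerate(match.groups(), start=1)}
        path = template.format(**groups)
    return path


HEAD = [(r'"([^"]+)', "_{g1}"), (r'([^"]+)"', "{g1}_")]
# First attempt: the part after the quote must itself be free of quotes.
FIRST_ATTEMPT = HEAD + [(r'([^"]+)"([^"]+)', "{g1}_{g2}")] * 4
# Corrected version: the part after the quote may contain anything.
CORRECTED = HEAD + [(r'([^"]+)"(.*)', "{g1}_{g2}")] * 4

name = 'hello "there my" dear'
print(apply_rules(name, FIRST_ATTEMPT))  # stays unchanged, no rule matches
print(apply_rules(name, CORRECTED))      # hello _there my_ dear
```

Each inner rule replaces one quote per pass, so four copies handle at most four remaining quotes in a name.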