Garmelon/PFERD

Throws Exception when downloading certain courses

Robotic-Brain opened this issue · 1 comment

Hi,
PFERD told me to report a bug so I'm doing that...

I'll give more information when I have time to investigate further.

Possibly related: #67

Version:
PFERD 3.4.2 (https://github.com/Garmelon/PFERD)

Config:

[DEFAULT]
working_dir = /data/pferd

redownload = never-smart
on_conflict = no-delete
tasks = 2
downloads = 1
task_delay = 0.1
links = plaintext
videos = yes
forums = yes
transform =
    (.*) -re->> "{g1.replace(' ', '_')}"

[crawl:HOC/WS22_ARS_Reflections_9003053]
type = kit-ilias-web
target = https://ilias.studium.kit.edu/goto.php?target=crs_1890802&client_id=produktiv

Output:

Loading crawl:HOC/WS22_ARS_Reflections_9003053
Warning Please avoid using too many parallel requests as these are the KIT ILIAS
instance's greatest bottleneck.

Running crawl:HOC/WS22_ARS_Reflections_9003053
Loading cookies
  Sharing cookies
  '/data/pferd/HOC/WS22_ARS_Reflections_9003053/.cookies' has newest mtime so far
  Loading cookies from '/data/pferd/HOC/WS22_ARS_Reflections_9003053/.cookies'
Creating base directory at '/data/pferd/HOC/WS22_ARS_Reflections_9003053'
Loading previous report from '/data/pferd/HOC/WS22_ARS_Reflections_9003053/.report'
  Failed to load report
  [Errno 2] No such file or directory: '/data/pferd/HOC/WS22_ARS_Reflections_9003053/.report'
Inferred crawl target: URL https://ilias.studium.kit.edu/goto.php?target=crs_1890802&client_id=produktiv
Decision: Crawl '.'
  Testing rule 1: (.*) -re->> "{g1.replace(' ', '_')}"
  Match found, updated path to '.'
  Final result: '.'
  Answer: Yes
Parsing HTML page for '.'
  URL: https://ilias.studium.kit.edu/goto.php?target=crs_1890802&client_id=produktiv
  Page is a normal folder, searching for elements
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type from <img alt="-obj_xoct-" class="icon xoct medium outlined" 
src="./Customizing/global/skin/kit/images/outlined/icon_default.svg"/> for card title <a 
href="ilias.php?baseClass=ilObjPluginDispatchGUI&amp;cmd=forward&amp;ref_id=1956243&amp;forwardCmd=showContent" 
id="il_ui_fw_6398c3348493a1_82967459">Einleitung &amp; Organisatorisches<span data-list-item-id="lg_div_1956243_pref_1890802"></span></a>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type for <a 
href="ilias.php?baseClass=ilObjPluginDispatchGUI&amp;cmd=forward&amp;ref_id=1956243&amp;forwardCmd=showContent" 
id="il_ui_fw_6398c3348493a1_82967459">Einleitung &amp; Organisatorisches<span data-list-item-id="lg_div_1956243_pref_1890802"></span></a>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type from <img alt="Datei" class="icon file medium outlined" 
src="./Customizing/global/skin/kit/images/outlined/icon_file.svg"/> for card title <button class="btn btn-link" data-action="" 
id="il_ui_fw_6398c334576cf0_75968013">Einleitung &amp; Organisatorisches - Folien.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type for <button class="btn btn-link" data-action="" id="il_ui_fw_6398c334576cf0_75968013">Einleitung &amp; 
Organisatorisches - Folien.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type from <img alt="Datei" class="icon file medium outlined" 
src="./Customizing/global/skin/kit/images/outlined/icon_file.svg"/> for card title <button class="btn btn-link" data-action="" 
id="il_ui_fw_6398c334590272_09102644">Einleitung &amp; Organisatorisches - Skript.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type for <button class="btn btn-link" data-action="" id="il_ui_fw_6398c334590272_09102644">Einleitung &amp; 
Organisatorisches - Skript.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type from <img alt="Datei" class="icon file medium outlined" 
src="./Customizing/global/skin/kit/images/outlined/icon_file.svg"/> for card title <button class="btn btn-link" data-action="" 
id="il_ui_fw_6398c3345a65b1_83325404">ARs ReflecTIonis - Kurshandbuch.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type for <button class="btn btn-link" data-action="" id="il_ui_fw_6398c3345a65b1_83325404">ARs ReflecTIonis - 
Kurshandbuch.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type from <img alt="Datei" class="icon file medium outlined" 
src="./Customizing/global/skin/kit/images/outlined/icon_file.svg"/> for card title <button class="btn btn-link" data-action="" 
id="il_ui_fw_6398c3345bea55_15233962">Übersicht Studienbereiche und Studiengänge.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type for <button class="btn btn-link" data-action="" id="il_ui_fw_6398c3345bea55_15233962">Übersicht Studienbereiche und
Studiengänge.pdf</button>
Warning Encountered unexpected HTML structure, ignoring element.
Could not extract type from <img alt="Datei" class="icon file medium outlined" 
src="./Customizing/global/skin/kit/images/outlined/icon_file.svg"/> for card title <button class="btn btn-link" data-action="" 
id="il_ui_fw_6398c33461eef2_39766784">Einleitung - Übungsaufgaben mit Lösung und Erläuterung.pdf</button>
Crawled     '.' 

Error An unexpected exception occurred

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/PFERD/pferd.py", line 156, in run
    await crawler.run()
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/http_crawler.py", line 193, in run
    await super().run()
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/crawler.py", line 85, in wrapper
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/crawler.py", line 338, in run
    await self._run()
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 208, in _run
    await self._crawl_url(self._target)
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 263, in _crawl_url
    await gather_elements()
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 104, in wrapper
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_web_crawler.py", line 258, in gather_elements
    elements.extend(page.get_child_elements())
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_html.py", line 102, in get_child_elements
    return self._find_normal_entries()
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_html.py", line 548, in _find_normal_entries
    result += self._find_cards()
  File "/usr/local/lib/python3.10/site-packages/PFERD/crawl/ilias/kit_ilias_html.py", line 688, in _find_cards
    description = caption_parent.find_next_sibling("div").getText().strip()
AttributeError: 'NoneType' object has no attribute 'getText'

╭──────────────────────────────────────────────────────────────────────────────╮
│ Please copy your program output and send it to the PFERD maintainers, either │
│ directly or as a GitHub issue: https://github.com/Garmelon/PFERD/issues/new  │
╰──────────────────────────────────────────────────────────────────────────────╯

Report for crawl:HOC/WS22_ARS_Reflections_9003053
  Error 'NoneType' object has no attribute 'getText'
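
The traceback above ends in `caption_parent.find_next_sibling("div").getText().strip()`, which fails when a card has no following `div`, because `find_next_sibling` then returns `None`. Below is a minimal sketch of a guard for that case. It is not the fix from the commits that closed this issue; `sibling_text` and `FakeTag` are hypothetical names, and `FakeTag` only stands in for a BeautifulSoup tag so the example runs without extra dependencies.

```python
def sibling_text(node, name="div", default=""):
    """Return the stripped text of the next sibling called `name`.

    Falls back to `default` when there is no such sibling, instead of
    raising AttributeError on None. `node` is assumed to expose
    find_next_sibling(), as a BeautifulSoup tag does.
    """
    sibling = node.find_next_sibling(name)
    if sibling is None:
        return default
    return sibling.get_text().strip()


class FakeTag:
    """Hypothetical stand-in for a BeautifulSoup tag, for illustration only."""

    def __init__(self, text="", next_div=None):
        self._text = text
        self._next_div = next_div

    def find_next_sibling(self, name):
        # The real method searches later siblings by tag name; the stub
        # simply returns whatever was configured (possibly None).
        return self._next_div

    def get_text(self):
        return self._text


# A card whose caption is followed by a description div.
with_description = FakeTag(next_div=FakeTag("  Slides for week 1  "))
# A card without a description div, as in the failing course page.
without_description = FakeTag()

print(repr(sibling_text(with_description)))
print(repr(sibling_text(without_description)))
```

With a guard like this, a card without a description would yield an empty string rather than aborting the whole crawl.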

Closed by 6f30c65 and 467fc52.