MarginaliaSearch/MarginaliaSearch

(crawler) thefossilforum.com doesn't get crawled properly

Closed · 1 comment

The crawler only fetches the index page. I don't see anything in robots.txt, the response headers, or the meta tags that would explain why it doesn't crawl any further.

Contents of 3f/12/3f123f37cf205631a7a953398aa30294-www.thefossilforum.com.zstd with body and IP redacted:

{
  "url": "http://www.thefossilforum.com/",
  "contentType": "text/html;charset=UTF-8",
  "timestamp": "2023-07-27T22:16:20.498185969",
  "httpStatus": 200,
  "crawlerStatus": "OK",
  "headers": "Date: Thu, 27 Jul 2023 20:16:20 GMT\nServer: Apache\nPragma: no-cache\nX-IPS-LoggedIn: 0\nContent-Encoding: gzip\nVary: cookie,Accept-Encoding\nX-XSS-Protection: 0\nX-Frame-Options: sameorigin\nX-IPS-Cached-Response: Thu, 27 Jul 2023 20:16:18 GMT\nExpires: Thu, 27 Jul 2023 20:16:50 GMT\nCache-Control: max-age=30, public\nConnection: close\nContent-Length: 24931\nLast-Modified: Thu, 27 Jul 2023 20:16:18 GMT\nExpires: max-age=29030400, public\nContent-Type: text/html;charset=UTF-8\n",
  "documentBody": "...",
  "documentBodyHash": "20d820427e20598470ff5d551852544e",
  "canonicalUrl": "http://www.thefossilforum.com/index.php",
  "recrawlState": "SAME-BY-COMPARISON"
}
{
  "url": "http://www.thefossilforum.com/index.php",
  "contentType": "text/html;charset=UTF-8",
  "timestamp": "2023-07-27T22:16:21.837639302",
  "httpStatus": 200,
  "crawlerStatus": "OK",
  "headers": "Date: Thu, 27 Jul 2023 20:16:21 GMT\nServer: Apache\nPragma: no-cache\nX-IPS-LoggedIn: 0\nContent-Encoding: gzip\nVary: cookie,Accept-Encoding\nX-XSS-Protection: 0\nX-Frame-Options: sameorigin\nExpires: Thu, 27 Jul 2023 20:16:51 GMT\nCache-Control: max-age=30, public\nConnection: close\nContent-Length: 24952\nLast-Modified: Thu, 27 Jul 2023 20:16:21 GMT\nExpires: max-age=29030400, public\nContent-Type: text/html;charset=UTF-8\n",
  "documentBody": "...",
  "documentBodyHash": "42d78ecbd431ae7a7ba14f0f1394c3dc",
  "canonicalUrl": "http://www.thefossilforum.com/index.php",
  "recrawlState": "SAME-BY-COMPARISON"
}
{
  "id": "3f123f37cf205631a7a953398aa30294",
  "domain": "www.thefossilforum.com",
  "crawlerStatus": "OK",
  "ip": "...",
  "doc": [],
  "cookies": [
    "ips4_IPSSessionFront=e75f89a3dc49ebff4ff8b73c5b15aa70; path=/; httponly",
    "ips4_guestTime=1690488978; path=/; httponly",
    "ips4_forum_view=table; expires=Sat, 27 Jul 2024 20:16:18 GMT; path=/; httponly"
  ]
}
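
For anyone wanting to poke at a dump like this, here is a minimal sketch that splits the decompressed text into its records and summarizes them. It assumes the decompressed file is simply a run of concatenated JSON objects, as pasted above; `iter_json_records` and the abbreviated `sample` are made up for illustration and are not part of the crawler.

```python
import json

def iter_json_records(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Abbreviated copy of the records above (headers, body, cookies left out).
sample = '''
{"url": "http://www.thefossilforum.com/", "httpStatus": 200,
 "crawlerStatus": "OK",
 "canonicalUrl": "http://www.thefossilforum.com/index.php",
 "recrawlState": "SAME-BY-COMPARISON"}
{"url": "http://www.thefossilforum.com/index.php", "httpStatus": 200,
 "crawlerStatus": "OK",
 "canonicalUrl": "http://www.thefossilforum.com/index.php",
 "recrawlState": "SAME-BY-COMPARISON"}
{"id": "3f123f37cf205631a7a953398aa30294",
 "domain": "www.thefossilforum.com", "crawlerStatus": "OK", "doc": []}
'''

records = list(iter_json_records(sample))
documents = [r for r in records if "url" in r]          # per-page records
domain_records = [r for r in records if "domain" in r]  # per-domain summary

for d in documents:
    print(d["httpStatus"], d["recrawlState"], d["url"], "->", d["canonicalUrl"])
print("document records:", len(documents))
```

Both document records point at the same canonical URL, so the dump really does contain nothing beyond the front page.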

robots.txt

User-Agent: *
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
Disallow: /profile/
Sitemap: http://thefossilforum.com/sitemap.php
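
As a sanity check on my reading of the rules above, here is a rough sketch using Python's standard `urllib.robotparser`. Caveats: the stdlib parser only does plain prefix matching and does not implement the `*` wildcard patterns, so this only says something about the plain-path rules; it is also not the parser the crawler itself uses. The `is_allowed` helper and the `/profile/1-example/` path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
Disallow: /profile/
Sitemap: http://thefossilforum.com/sitemap.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_allowed(path, agent="*"):
    """Return True if the plain-prefix rules permit fetching this path."""
    return parser.can_fetch(agent, "http://www.thefossilforum.com" + path)

for path in ["/", "/index.php", "/forum/2-fossil-news/", "/profile/1-example/"]:
    print(path, "allowed" if is_allowed(path) else "disallowed")
```

By this check the front page and the forum listing pages are fetchable, which matches the claim that robots.txt is not what stops the crawl.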

The sitemap is nested: a sitemap index pointing to child sitemaps, which are ordinary urlsets:

<sitemapindex>
<sitemap>
<loc>
http://www.thefossilforum.com/sitemap.php?file=sitemap_content_forums_Forum
</loc>
<lastmod>2023-09-03T09:43:25+02:00</lastmod>
</sitemap>
...
</sitemapindex>
<urlset>
<url>
<loc>http://www.thefossilforum.com/forum/2-fossil-news/</loc>
<lastmod>2023-09-02T20:19:41+01:00</lastmod>
</url>
<url>
<loc>
http://www.thefossilforum.com/forum/186-paleo-re-creations/
</loc>
<lastmod>2023-08-30T13:03:45+01:00</lastmod>
</url>
...
</urlset>
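
A minimal sketch of how the two sitemap levels could be told apart and read, using only `xml.etree.ElementTree`. This is not how the crawler handles sitemaps, just an illustration; `parse_sitemap` is a made-up helper. Note that some `<loc>` values are wrapped in newlines, so they need stripping, and that real sitemaps usually carry an XML namespace, which the sketch ignores when comparing tag names.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]

def parse_sitemap(xml_text):
    """Return (root kind, list of <loc> URLs) for a sitemap document.

    The kind is 'sitemapindex' when the URLs are further sitemaps to
    fetch, and 'urlset' when they are the pages themselves.
    """
    root = ET.fromstring(xml_text)
    locs = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text and el.text.strip()
    ]
    return local_name(root.tag), locs

index_xml = """<sitemapindex>
<sitemap>
<loc>
http://www.thefossilforum.com/sitemap.php?file=sitemap_content_forums_Forum
</loc>
<lastmod>2023-09-03T09:43:25+02:00</lastmod>
</sitemap>
</sitemapindex>"""

urlset_xml = """<urlset>
<url>
<loc>http://www.thefossilforum.com/forum/2-fossil-news/</loc>
<lastmod>2023-09-02T20:19:41+01:00</lastmod>
</url>
<url>
<loc>
http://www.thefossilforum.com/forum/186-paleo-re-creations/
</loc>
<lastmod>2023-08-30T13:03:45+01:00</lastmod>
</url>
</urlset>"""

for doc in (index_xml, urlset_xml):
    kind, locs = parse_sitemap(doc)
    print(kind, locs)
```

A crawler following this would fetch every URL from a `sitemapindex` as another sitemap and only treat `urlset` entries as crawl candidates.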

More curiously still, the search engine seems to have indexed documents from this forum, apparently without ever having fetched them? WTF