(crawler) thefossilforum.com doesn't get crawled properly
Closed this issue · 1 comment
vlofgren commented
The crawler only fetches the index page. I don't see anything in robots.txt, the response headers, or the meta tags that would explain why it doesn't crawl further.
Contents of 3f/12/3f123f37cf205631a7a953398aa30294-www.thefossilforum.com.zstd
with body and IP redacted:
{
"url": "http://www.thefossilforum.com/",
"contentType": "text/html;charset=UTF-8",
"timestamp": "2023-07-27T22:16:20.498185969",
"httpStatus": 200,
"crawlerStatus": "OK",
"headers": "Date: Thu, 27 Jul 2023 20:16:20 GMT\nServer: Apache\nPragma: no-cache\nX-IPS-LoggedIn: 0\nContent-Encoding: gzip\nVary: cookie,Accept-Encoding\nX-XSS-Protection: 0\nX-Frame-Options: sameorigin\nX-IPS-Cached-Response: Thu, 27 Jul 2023 20:16:18 GMT\nExpires: Thu, 27 Jul 2023 20:16:50 GMT\nCache-Control: max-age=30, public\nConnection: close\nContent-Length: 24931\nLast-Modified: Thu, 27 Jul 2023 20:16:18 GMT\nExpires: max-age=29030400, public\nContent-Type: text/html;charset=UTF-8\n",
"documentBody": "...",
"documentBodyHash": "20d820427e20598470ff5d551852544e",
"canonicalUrl": "http://www.thefossilforum.com/index.php",
"recrawlState": "SAME-BY-COMPARISON"
}
{
"url": "http://www.thefossilforum.com/index.php",
"contentType": "text/html;charset=UTF-8",
"timestamp": "2023-07-27T22:16:21.837639302",
"httpStatus": 200,
"crawlerStatus": "OK",
"headers": "Date: Thu, 27 Jul 2023 20:16:21 GMT\nServer: Apache\nPragma: no-cache\nX-IPS-LoggedIn: 0\nContent-Encoding: gzip\nVary: cookie,Accept-Encoding\nX-XSS-Protection: 0\nX-Frame-Options: sameorigin\nExpires: Thu, 27 Jul 2023 20:16:51 GMT\nCache-Control: max-age=30, public\nConnection: close\nContent-Length: 24952\nLast-Modified: Thu, 27 Jul 2023 20:16:21 GMT\nExpires: max-age=29030400, public\nContent-Type: text/html;charset=UTF-8\n",
"documentBody": "...",
"documentBodyHash": "42d78ecbd431ae7a7ba14f0f1394c3dc",
"canonicalUrl": "http://www.thefossilforum.com/index.php",
"recrawlState": "SAME-BY-COMPARISON"
}
{
"id": "3f123f37cf205631a7a953398aa30294",
"domain": "www.thefossilforum.com",
"crawlerStatus": "OK",
"ip": "...",
"doc": [],
"cookies": [
"ips4_IPSSessionFront=e75f89a3dc49ebff4ff8b73c5b15aa70; path=/; httponly",
"ips4_guestTime=1690488978; path=/; httponly",
"ips4_forum_view=table; expires=Sat, 27 Jul 2024 20:16:18 GMT; path=/; httponly"
]
}
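One oddity visible in both page records: the header string contains two `Expires` lines, and the second carries a `Cache-Control`-style value (`max-age=29030400, public`). Whether that has anything to do with the crawl stopping is not established here. A minimal Python sketch for flagging repeated headers in a record's `headers` string follows; `parse_headers` and `repeated_headers` are hypothetical helper names, and `RAW` is an abbreviated copy of the first record's headers.

```python
from collections import defaultdict

# Abbreviated copy of the "headers" field from the first record above.
RAW = (
    "Date: Thu, 27 Jul 2023 20:16:20 GMT\n"
    "Server: Apache\n"
    "Expires: Thu, 27 Jul 2023 20:16:50 GMT\n"
    "Cache-Control: max-age=30, public\n"
    "Expires: max-age=29030400, public\n"
    "Content-Type: text/html;charset=UTF-8\n"
)


def parse_headers(raw):
    """Group header values by lower-cased name, keeping repeats in order."""
    grouped = defaultdict(list)
    for line in raw.split("\n"):
        if not line.strip():
            continue  # skip the empty piece after the trailing newline
        # Split on the first colon only; date values contain colons too.
        name, _, value = line.partition(":")
        grouped[name.strip().lower()].append(value.strip())
    return dict(grouped)


def repeated_headers(raw):
    """Return only the headers that occur more than once."""
    return {k: v for k, v in parse_headers(raw).items() if len(v) > 1}


if __name__ == "__main__":
    for name, values in repeated_headers(RAW).items():
        print(name, values)
```

Running this against either record should list `expires` with both values, which makes it easy to check by hand how a given HTTP client resolves the conflict.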
robots.txt
User-Agent: *
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /*?advancedSearchForm=
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
Disallow: /*?sortby=
Disallow: /*?filter=
Disallow: /*?tab=
Disallow: /*?do=
Disallow: /*ref=
Disallow: /*?forumId*
Disallow: /profile/
Sitemap: http://thefossilforum.com/sitemap.php
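To double-check the claim that robots.txt does not block the forum pages, here is a rough sketch using Python's standard `urllib.robotparser`. It is an independent sanity check, not the crawler's own robots.txt handling. Only the plain-prefix rules are copied: the wildcard lines such as `/*?sortby=` are left out because, as far as I know, the stdlib parser matches paths as plain prefixes. The agent name `example-bot` and the profile URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Plain-prefix rules copied from the robots.txt above; wildcard rules omitted.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /startTopic/
Disallow: /discover/unread/
Disallow: /markallread/
Disallow: /staff/
Disallow: /online/
Disallow: /discover/
Disallow: /leaderboard/
Disallow: /search/
Disallow: /register/
Disallow: /lostpassword/
Disallow: /login/
Disallow: /profile/
"""


def is_allowed(url, agent="example-bot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # A forum URL taken from the sitemap, and a hypothetical profile URL.
    print(is_allowed("http://www.thefossilforum.com/forum/2-fossil-news/"))
    print(is_allowed("http://www.thefossilforum.com/profile/1-someone/"))
```

Under these rules the forum listing pages are not matched by any `Disallow` prefix, which is consistent with the observation that robots.txt alone does not explain the behaviour.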
Sitemap is nested (a sitemap index pointing at separate urlset files):
<sitemapindex>
<sitemap>
<loc>
http://www.thefossilforum.com/sitemap.php?file=sitemap_content_forums_Forum
</loc>
<lastmod>2023-09-03T09:43:25+02:00</lastmod>
</sitemap>
...
</sitemapindex>
<urlset>
<url>
<loc>http://www.thefossilforum.com/forum/2-fossil-news/</loc>
<lastmod>2023-09-02T20:19:41+01:00</lastmod>
</url>
<url>
<loc>
http://www.thefossilforum.com/forum/186-paleo-re-creations/
</loc>
<lastmod>2023-08-30T13:03:45+01:00</lastmod>
</url>
...
</urlset>
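For reference, a nested sitemap like this can be walked with a short recursive function. The following is a sketch only, not the crawler's actual sitemap code: `parse_sitemap` and `collect_urls` are hypothetical names, `fetch` is a caller-supplied function that returns the XML text for a URL (stubbed here with a dict so nothing touches the network), and the sample documents are the excerpts above with the `...` elisions dropped.

```python
import xml.etree.ElementTree as ET

INDEX_XML = """<sitemapindex>
<sitemap>
<loc>
http://www.thefossilforum.com/sitemap.php?file=sitemap_content_forums_Forum
</loc>
<lastmod>2023-09-03T09:43:25+02:00</lastmod>
</sitemap>
</sitemapindex>"""

URLSET_XML = """<urlset>
<url>
<loc>http://www.thefossilforum.com/forum/2-fossil-news/</loc>
<lastmod>2023-09-02T20:19:41+01:00</lastmod>
</url>
<url>
<loc>
http://www.thefossilforum.com/forum/186-paleo-re-creations/
</loc>
<lastmod>2023-08-30T13:03:45+01:00</lastmod>
</url>
</urlset>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, locs); kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if local_name(root.tag) == "sitemapindex" else "urlset"
    # <loc> text may be padded with newlines, as in the excerpts above.
    locs = [el.text.strip() for el in root.iter()
            if local_name(el.tag) == "loc" and el.text and el.text.strip()]
    return kind, locs


def collect_urls(sitemap_url, fetch, max_depth=3):
    """Follow sitemap indexes recursively and return the page URLs found."""
    kind, locs = parse_sitemap(fetch(sitemap_url))
    if kind == "urlset":
        return locs
    if max_depth <= 0:
        return []  # stop instead of following index files indefinitely
    urls = []
    for child in locs:
        urls.extend(collect_urls(child, fetch, max_depth - 1))
    return urls


if __name__ == "__main__":
    pages = {
        "http://thefossilforum.com/sitemap.php": INDEX_XML,
        "http://www.thefossilforum.com/sitemap.php?file=sitemap_content_forums_Forum": URLSET_XML,
    }
    print(collect_urls("http://thefossilforum.com/sitemap.php", pages.__getitem__))
```

The depth limit is a design choice for the sketch: it keeps a misconfigured index that points back at itself from recursing forever.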
vlofgren commented
More curiously still, the search engine seems to index documents from this forum, ostensibly without having fetched them? WTF