springload/madewithwagtail

Use consistent heuristics to tell if a site is made with Wagtail

thibaudcolas opened this issue · 11 comments

I would like to build something like http://isthissitebuiltwithdrupal.com/ or http://whatcms.org/, both to make it easier for us to validate sites, and also because it's cool.

A note: if someone wants to tackle this, it is related to Made with Wagtail but I don't think it should be built within it. MWW could potentially integrate with that thing though.

At the moment we use https://wappalyzer.com/, which detects Django / Python, but not with a great degree of confidence IIRC.

Here are the heuristics I could think of, in no particular order:

Drupal sets an X-Generator header with the major version (e.g. Drupal 8 (https://drupal.org)), and its default Expires header is Expires: Sun, 19 Nov 1978 05:00:00 GMT, which is the creator's birthday. Wagtail could do something similar.
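
For illustration, here is a minimal Python sketch of how a detector might check the two Drupal hints described above. The function name `drupal_hints` and the shape of its return value are hypothetical; the header values are the ones quoted in this comment.

```python
import re

# Default Expires value quoted above for Drupal.
DRUPAL_EXPIRES = "Sun, 19 Nov 1978 05:00:00 GMT"


def drupal_hints(headers):
    """Return the Drupal hints found in a mapping of response headers."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    hints = []

    # e.g. "Drupal 8 (https://drupal.org)" -> major version "8"
    match = re.match(r"Drupal (\d+)", normalised.get("x-generator", ""))
    if match:
        hints.append("x-generator: major version " + match.group(1))

    if normalised.get("expires") == DRUPAL_EXPIRES:
        hints.append("expires: Drupal default")

    return hints
```

Neither hint is conclusive on its own, since both headers can be removed or overridden by the site or a proxy in front of it.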

👍 Good approach, but I'm very keen to see how far this can go without changes in Wagtail itself first.

Another hint from Wagtail Slack:

  • Rendition signatures in the filenames of user-uploaded images, e.g. screenshot_2016-11-24_17.03.04.width-800.png (though beware of other systems that use the same image processors).
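
A rough Python sketch of that rendition check could look like the following. The list of filter names (width, height, scale, min, max, fill) is an assumption based on my recollection of Wagtail's image resizing filters, and `has_rendition_signature` is a hypothetical helper, not part of any library.

```python
import re

# Assumed layout of a rendition filename: <name>.<filter>[.<filter>...].<ext>
RENDITION_RE = re.compile(
    r"\.(?:"
    r"(?:width|height|scale)-\d+"   # width-800, height-600, scale-50
    r"|(?:min|max)-\d+x\d+"         # max-1200x800
    r"|fill-\d+x\d+(?:-c\d+)?"      # fill-300x200 or fill-300x200-c100
    r")"
    r"(?:\.[a-z]+-\d+)*"            # optional extra filters, e.g. .jpegquality-60
    r"\.(?:png|jpe?g|gif|webp)$",
    re.IGNORECASE,
)


def has_rendition_signature(url):
    """Return True if the URL path ends with a rendition-like filename."""
    # Ignore any query string or fragment before matching the filename.
    path = url.split("?", 1)[0].split("#", 1)[0]
    return RENDITION_RE.search(path) is not None
```

Because other image pipelines may produce similar suffixes, a match is better treated as one signal among several than as proof.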

Although it's not a default, we could ping the API at /api/v1/pages.
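
As a sketch only: once the JSON body from /api/v1/pages has been fetched and decoded, its shape could be checked as below. The expected shape (a top-level "meta" object with "total_count", a "pages" list, and a per-page "meta.type" such as "app.Model") is an assumption from memory of the v1 API documentation and should be verified against a real site. `looks_like_wagtail_api_v1` is a hypothetical name.

```python
def _has_model_type(page):
    """Check that a listed page carries an "app_label.ModelName" style type."""
    meta = page.get("meta") if isinstance(page, dict) else None
    return isinstance(meta, dict) and "." in str(meta.get("type", ""))


def looks_like_wagtail_api_v1(payload):
    """Guess whether a decoded JSON body looks like a v1 page listing."""
    if not isinstance(payload, dict):
        return False
    meta = payload.get("meta")
    pages = payload.get("pages")
    # Assumed shape: {"meta": {"total_count": int}, "pages": [...]}
    if not isinstance(meta, dict) or not isinstance(pages, list):
        return False
    if not isinstance(meta.get("total_count"), int):
        return False
    return all(_has_model_type(page) for page in pages)
```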

I'm not a big fan of the generator header or meta, especially not with the version number, since it gives attackers too much knowledge of the architecture and of potential open vulnerabilities.

Load /admin/login and check for "wagtail" in the response (either in the page title, or in a wagtailadmin JS file – that's a 100% tell I think).

This is not always the case. We use admin for serving the Django admin and cms for serving Wagtail.
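
Taking the two comments above together, a detector would probably need to try more than one admin path. The Python sketch below assumes a caller-supplied `fetch` function that returns the page HTML (or None on failure). The candidate paths `/admin/login/` and `/cms/login/` are guesses derived from this thread, and both function names are hypothetical.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_wagtail_login(html):
    """Look for "wagtail" in the page title or a wagtailadmin asset reference."""
    title = TITLE_RE.search(html)
    in_title = bool(title and "wagtail" in title.group(1).lower())
    in_assets = "wagtailadmin" in html.lower()
    return in_title or in_assets


def find_wagtail_login(fetch, base_url, paths=("/admin/login/", "/cms/login/")):
    """Return the first candidate path whose response looks like Wagtail, else None."""
    for path in paths:
        html = fetch(base_url.rstrip("/") + path)
        if html and looks_like_wagtail_login(html):
            return path
    return None
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library. Sites that mount the admin on an unusual path would still go undetected.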

I didn't realise Wappalyzer was open-source! Here is what they make available to detect technologies: https://github.com/AliasIO/Wappalyzer/wiki/Specification

Here is Django for example:

"Django": {
    "cats": [
        "18"
    ],
    "env": "^__admin_media_prefix__",
    "html": "(?:powered by <a[^>]+>Django ?([\\d.]+)?|<input[^>]*name=[\"']csrfmiddlewaretoken[\"'][^>]*>)\\;version:\\1",
    "icon": "Django.png",
    "implies": "Python",
    "website": "http://djangoproject.com"
},
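
To make the definition above more concrete, here is a Python sketch of how its "html" pattern might be applied. It assumes that `\;` separates the regular expression from trailing tags such as `version:\1`, which is my reading of the specification linked above. `parse_pattern` and `detect` are hypothetical helpers and not Wappalyzer's actual implementation.

```python
import json
import re

# The "html" entry from the Django definition, decoded from its JSON form.
DEFINITION = json.loads(
    r'''{"html": "(?:powered by <a[^>]+>Django ?([\\d.]+)?|<input[^>]*name=[\"']csrfmiddlewaretoken[\"'][^>]*>)\\;version:\\1"}'''
)


def parse_pattern(raw):
    """Split a pattern into a compiled regex and its trailing tags."""
    regex, *tags = raw.split("\\;")
    options = dict(tag.split(":", 1) for tag in tags)
    return re.compile(regex, re.IGNORECASE), options


def detect(raw, html):
    """Return None if there is no match, else a dict with the extracted version."""
    regex, options = parse_pattern(raw)
    match = regex.search(html)
    if match is None:
        return None
    # Replace back-references such as \1 with the captured group (or nothing).
    version = re.sub(
        r"\\(\d)",
        lambda ref: match.group(int(ref.group(1))) or "",
        options.get("version", ""),
    )
    return {"detected": True, "version": version}
```

With this sketch, a page containing only the CSRF token input is detected as Django with an empty version, while a "powered by" link also yields the version number.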

And here is a survey of what other CMSes are doing:

  • Drupal = lots of headers, generator, scripts
  • WordPress = generator, JS, patterns in the HTML
  • CraftCMS = "Set-Cookie": "CraftSessionId=", "X-Powered-By": "Craft CMS"
  • SilverStripe = "generator": "SilverStripe", "html": "Powered by <a href=\"[^>]+SilverStripe"
  • Sitecore = "Set-cookie": "SC_ANALYTICS_GLOBAL_COOKIE"
  • Sitefinity = "generator": "^Sitefinity (.+)$\\;version:\\1"
  • Hugo = "generator": "Hugo ([\\d.]+)?\\;version:\\1"
  • AEM = html patterns
  • Concrete5 = "generator": "concrete5 - ([\\d.ab]+)\\;version:\\1"
  • Dotclear = "X-Dotclear-Static-Cache": ""
  • Umbraco = "X-Umbraco-Version": "(.*)\\;version:\\1", generator

generator is actually standardised (at https://www.w3.org/TR/html5/document-metadata.html#standard-metadata-names): "The value must be a free-form string that identifies one of the software packages used to generate the document."
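
Since so many of the definitions above rely on the generator meta tag, here is a small Python sketch of reading it with the standard library's HTML parser. The class and function names are hypothetical; the version pattern is the Hugo one from the survey above.

```python
import re
from html.parser import HTMLParser

# Hugo pattern from the survey above, without the version tag.
HUGO_RE = re.compile(r"Hugo ([\d.]+)?")


class GeneratorParser(HTMLParser):
    """Collect the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def find_generators(html):
    """Return the list of generator strings declared in the document."""
    parser = GeneratorParser()
    parser.feed(html)
    return parser.generators


def hugo_version(generator):
    """Return the Hugo version from a generator string, or None if not Hugo."""
    match = HUGO_RE.match(generator)
    return (match.group(1) or "") if match else None
```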

Considering how much info is available anyway, not adding it for the sake of security just feels like security through obscurity. If someone (malicious or not) does want to do more rigorous fingerprinting, that won't make a big difference. That said, some obscurity does sound better than less obscurity 😅.

I'm not sure what to think of it. Having it as an option, perhaps?

@thibaudcolas @loicteixeira I tried to add a new definition for Wagtail in Wappalyzer but without being able to access the API or the admin URLs there's no way to detect whether it's a Wagtail site.
The only way I see is to do a PR to the Wagtail base to add a generator meta.

If the conclusion is that the "best" way to discriminate a Wagtail app is to add a generator, I reckon that's a discussion for the Wagtail project itself and an issue should be created there.

Absolutely, I'll try that route. Thanks for your feedback!

I understand that a Wagtail-specific HTTP header would make detection easier, but at what cost? These types of headers can leak sensitive information to a potential attacker. Django refused to add an x-powered header for that same security reason years ago: https://code.djangoproject.com/ticket/14431.
See https://pentest-tools.com/blog/essential-http-security-headers/#info-leak for more examples. Not having a Wagtail-specific header makes MWW's work a bit harder, but the actual website server is a bit safer.

As stated in my original comment, I'm not really in favour of a generator tag of any sort either.

However, I reckon this is a discussion to have on the Wagtail repo which is why I encouraged the discussion to move over there.

That being said, @mojeto is still part of the team in charge of Made With Wagtail (while @thibaudcolas and I no longer are). If they would rather continue without a generator tag, and given that the need for such a tag has only been voiced for MWW use (afaik, nobody else has ever mentioned it), then maybe it's not even worth opening that discussion.

I’ve finally taken the time to experiment with this and implemented detection based on image renditions. If anyone wants to try it out, it’s available at https://detect-wagtail.netlify.app/, and the implementation details are discussed further in https://thib.me/detecting-wagtail-in-the-wild. This will be released in Wappalyzer soon: https://github.com/AliasIO/wappalyzer/pull/3546.

Having spent the time implementing this I find the question of whether or not to have a generator header / meta tag a bit moot – there are so many ways to detect Wagtail anyway, so I don’t think it’s worth adding that to Wagtail, but equally the absence of such a tag shouldn’t be construed as having any security benefit whatsoever.