searx/searx-docker

Block Chrome user agents that don't send Sec-Fetch-X

unixfox opened this issue · 13 comments

Description

Since Google Chrome 76 (released 30 July 2019), 4 new headers are sent on every request: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site and Sec-Fetch-User. Even forks of Google Chrome like Edge, Opera, Brave and Vivaldi (still to be tested) send them.
Most if not all bots fake their user agent by pretending to be Google Chrome, so I think this could be a good new rule to investigate: it could significantly increase the number of bots blocked on the searx instances that use the searx-docker filtron rules.
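In other words, the proposed heuristic is: the user agent claims to be Chrome, but the fetch metadata headers are missing. A minimal Python sketch (illustrative only, not filtron syntax):

SEC_FETCH = ("Sec-Fetch-Dest", "Sec-Fetch-Mode", "Sec-Fetch-Site", "Sec-Fetch-User")

def looks_like_fake_chrome(headers):
    # Chrome >= 76 sends all four fetch metadata headers on every request,
    # so a "Chrome" user agent without them is suspicious.
    claims_chrome = "Chrome" in headers.get("User-Agent", "")
    return claims_chrome and not all(h in headers for h in SEC_FETCH)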

Test cases of user agents from real browsers that contain "Chrome"

Working results

  • Edge 81 (Windows 10): (screenshot)
  • Opera: (screenshot)
  • Chrome Android: (screenshot)
  • Naked Browser (Android WebKit): (screenshot)
  • Google Chrome/Chromium: (screenshot)

Failed results

The only exception is Samsung Browser: (screenshot)

How to implement the rule?

Because of Samsung Browser, the regex can't simply be "Header:User-Agent=(Chrome)": it would also block Samsung Browser, which is still a popular browser.
@dalf Do you have an idea for a regex that would match Chrome but not Samsung Browser?
Also, after matching Google Chrome, how do I implement a rule that blocks requests that don't have Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site and Sec-Fetch-User? Using "subrules"? But how?

I was able to work around the Samsung Browser issue by stopping rule evaluation when filtron finds SamsungBrowser in the user agent:

{
    "name": "chrome browser",
    "filters": [
        "Header:User-Agent=(Chrome)"
    ],
    "subrules": [
        {
            "name": "contains samsung",
            "stop": true,
            "filters": [
                "Header:User-Agent=(SamsungBrowser)"
            ],
            "actions": [
                {
                    "name": "shell",
                    "params": {
                        "cmd": "/bin/true"
                    }
                }
            ]
        },
        {
            "name": "doesnt contains sec-fetch-x headers",
            "stop": true,
            "filters": [
                "!Header:Sec-Fetch-Dest",
                "!Header:Sec-Fetch-Mode",
                "!Header:Sec-Fetch-Site",
                "!Header:Sec-Fetch-User"
            ],
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Please update Google Chrome to a newer version."
                    }
                }
            ]
        }
    ]
}

But the rule still logs a message, because I haven't found a way to stop filtron from logging the matched requests.
Also, the block message differs from the usual one because I'm unable to filter out the Google Chrome versions below 76.

New Update!

Thanks to StackOverflow: https://stackoverflow.com/questions/29977086/regex-how-can-i-match-all-numbers-greater-than-954/29977124
Here is the regex: https://regex101.com/r/wtfXpP/1
I was able to find a regex that matches Chrome 76 and above, and it seems to work perfectly! Moreover, Samsung Browser doesn't get filtered because its user agent contains Chrome/75.

{
    "name": "chrome browser",
    "filters": [
        "Header:User-Agent=(Chrome/([1-9]\\d{2,}|[8-9]\\d|[6-9]{2}))",
        "!Header:Sec-Fetch-Dest",
        "!Header:Sec-Fetch-Mode",
        "!Header:Sec-Fetch-Site",
        "!Header:Sec-Fetch-User"
    ],
    "stop": true,
    "actions": [
        {
            "name": "block",
            "params": {
                "message": "Rate limit exceeded"
            }
        }
    ]
}
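A quick way to sanity-check which versions the pattern accepts, outside filtron (an illustrative Python script; the pattern is copied from the rule above with the JSON double-escaping removed):

import re

pattern = re.compile(r"Chrome/([1-9]\d{2,}|[8-9]\d|[6-9]{2})")

for version in ("66.0.4044.122", "75.0.3770.143", "76.0.3809.89",
                "80.0.3987.122", "81.0.4044.122", "100.0.4896.60"):
    ua = f"Mozilla/5.0 (X11; Linux x86_64) Chrome/{version} Safari/537.36"
    print(version, "->", "matched" if pattern.search(ua) else "not matched")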

I guess it's time to submit a PR, but I'm still waiting for your feedback @dalf.

dalf commented

@unixfox testing

Oh, it also matches Chrome/66.0.4044.122. I guess my regex is not strict enough. I'll keep trying to find a better one.
Meanwhile, if you have a better one, let me know :).

EDIT: I deleted my newer comment because it doesn't match "Chrome/80". Still trying...
Here is the regex that I tried: https://regex101.com/r/wtfXpP/2

dalf commented

Am I missing something?

  • use
    https://gist.github.com/dalf/5a823b5aeae06e9e631b5721794ae514
  • run:

curl 'https://a.searx.space/?q=test&category_general=on&time_range=&language=en-US' \
  -H 'authority: a.searx.space' \
  -H 'pragma: no-cache' \
  -H 'cache-control: no-cache' \
  -H 'dnt: 1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'sec-fetch-modezz: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'accept-language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6' \
  -H 'cookie: categories=general; language=en-US; locale=fr; image_proxy=1; safesearch=0; results_on_new_tab=0; doi_resolver=oadoi.org; oscar-style=logicodev; disabled_plugins=; enabled_plugins=; maintab=on; enginetab=on; method=GET; autocomplete=duckduckgo; theme=oscar; disabled_engines="wikidata__general\054piratebay__videos\054bing__general"; enabled_engines="reddit__social media\054startpage__general\054duckduckgo__general\054ddg definitions__general"; tokens=' \
  --compressed

no blocking (note the deliberately misspelled sec-fetch-modezz header and the missing sec-fetch-site: the request still gets through).

OK, I found a simpler regex: https://regex101.com/r/EVjgjL/1
Here are the rules to test it: https://paste.ee/p/0COrc
This should not be blocked:

curl 'http://127.0.0.1:4004/' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: null' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Accept-Language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7' \
  --data 'q=ok&time_range=&language=fr-FR&category_general=on'

This should be blocked:

curl 'http://127.0.0.1:4004/' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: null' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Accept-Language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7' \
  --data 'q=ok&time_range=&language=fr-FR&category_general=on'

Feel free to try different combinations, bearing in mind that the rule as written blocks only if the Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site and Sec-Fetch-User headers are all absent at the same time. I don't know how to make it match when just one of the 4 headers is missing without replicating the rule 4 times, once per header.
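The difficulty is boolean: filters inside a single filtron rule are ANDed together, so the four negated filters fire only when all four headers are missing at once, whereas "block when any header is missing" is the De Morgan dual and apparently needs one subrule per header. A small Python sketch of the difference (illustrative only):

REQUIRED = ("Sec-Fetch-Dest", "Sec-Fetch-Mode", "Sec-Fetch-Site", "Sec-Fetch-User")

def single_rule_fires(headers):
    # Four "!Header:..." filters ANDed together: fires only when ALL are absent.
    return all(h not in headers for h in REQUIRED)

def one_subrule_per_header_fires(headers):
    # One negated filter per subrule: fires as soon as ANY header is absent.
    return any(h not in headers for h in REQUIRED)

partial = {"Sec-Fetch-Mode": "navigate"}       # a bot sending only one header
print(single_rule_fires(partial))              # False: single rule lets it through
print(one_subrule_per_header_fires(partial))   # True: subrule version blocks it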

dalf commented

[EDIT]

   "filters": [
       "Header:User-Agent=Chrome/(7[6-9]|[8-9][0-9]|[1-9][0-9][0-9])",
       "!Header:Sec-Fetch-Dest",
       "!Header:Sec-Fetch-Mode",
       "!Header:Sec-Fetch-Site",
       "!Header:Sec-Fetch-User"
   ],

but as soon as at least one of the Sec-Fetch-* headers is present, filtron forwards the request to searx.

With this more elaborate rule it blocks your curl command, but it's not pretty:

{
    "name": "chrome >=76 user agent",
    "filters": [
        "Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))"
    ],
    "subrules": [
        {
            "name": "No Sec-Fetch-Dest header",
            "filters": [
                "!Header:Sec-Fetch-Dest"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        },
        {
            "name": "No Sec-Fetch-Mode header",
            "filters": [
                "!Header:Sec-Fetch-Mode"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        },
        {
            "name": "No Sec-Fetch-Site header",
            "filters": [
                "!Header:Sec-Fetch-Site"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        },
        {
            "name": "No Sec-Fetch-User header",
            "filters": [
                "!Header:Sec-Fetch-User"
            ],
            "limit": 0,
            "stop": true,
            "actions": [
                {
                    "name": "block",
                    "params": {
                        "message": "Rate limit exceeded"
                    }
                }
            ]
        }
    ]
}

Complete rules.json file: https://paste.ee/p/zx1rw

EDIT: I just pushed the rules on my own instance and I can already see bots being blocked 😄!

dalf commented

I confirm it is working as intended.

I don't see a shorter way to write these rules without touching the filtron source code.


A format improvement idea:

{
    "name": "chrome >=76 user agent",
    "filters": {
        "and": {
            "Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))": true,
            "or": {
                "Header:Sec-Fetch-Dest": false,
                "Header:Sec-Fetch-Mode": false,
                "Header:Sec-Fetch-Site": false
            }
        }
    },
    ...
}
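For what it's worth, such a nested filter tree is straightforward to evaluate. A hypothetical Python sketch of how the proposed format could work (the evaluate function, node layout, and header-matching details are all assumptions; filtron does not implement this):

import re

# Hypothetical evaluator for the proposed nested and/or filter format.
# A node maps filter expressions to their expected boolean outcome, and
# may contain nested "and"/"or" nodes.
def evaluate(node, headers, combine=all):
    results = []
    for key, expected in node.items():
        if key == "and":
            results.append(evaluate(expected, headers, all))
        elif key == "or":
            results.append(evaluate(expected, headers, any))
        else:
            # "Header:Name" checks presence; "Header:Name=regex" matches a pattern.
            name, _, pattern = key[len("Header:"):].partition("=")
            value = headers.get(name)
            if pattern:
                matched = value is not None and re.search(pattern, value) is not None
            else:
                matched = value is not None
            results.append(matched == expected)
    return combine(results)

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/81.0.4044.122 Safari/537.36"}
filters = {"and": {"Header:User-Agent=(Chrome/([0-9][0-9][0-9]|[8-9][0-9]|7[6-9]))": True,
                   "or": {"Header:Sec-Fetch-Dest": False,
                          "Header:Sec-Fetch-Mode": False,
                          "Header:Sec-Fetch-Site": False}}}
print(evaluate(filters, headers))  # True -> the rule would fire (Sec-Fetch-* missing)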

I just pushed the rules on my own instance and I can already see bots being blocked 😄!

๐Ÿ‘ just be sure there are bots.


I'm monitoring their requests, and they don't follow normal human behavior.
Moreover, for the moment all of their IPs are blacklisted on https://mxtoolbox.com/blacklists.aspx, which is a very good sign that the requests come from bots.

dalf commented

Do you think it would be interesting to add a blacklist check as a filter in filtron?

As I understand it, that is one DNS lookup per IP and per list?
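(For reference, a standard DNSBL check is indeed a single DNS query per IP and per list: reverse the IPv4 octets, prepend them to the list's zone, and do an A-record lookup; any answer means "listed". A minimal Python sketch, with the zone name only as an example:)

import socket

def dnsbl_listed(ip, zone="zen.spamhaus.org"):
    # Reverse the octets and append the blacklist zone:
    # 1.2.3.4 -> 4.3.2.1.zen.spamhaus.org
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)  # any A record means the IP is listed
        return True
    except socket.gaierror:          # NXDOMAIN: not listed
        return False

print(dnsbl_listed("127.0.0.2"))  # conventional DNSBL test address, normally listed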

OK, first observation: not every Google Chrome (or fork of it) sends Sec-Fetch-User, for a reason I still need to investigate. I removed that filter because I feel like it's blocking real humans.
Second observation: some users use a user-agent randomizer, which can get their requests blocked if they browse with Firefox while presenting a Google Chrome user agent. I think it would be better to customize the message in order to alert the user about that, but without revealing the real reason, so that bot owners don't try to circumvent filtron 🤔


Do you think it would be interesting to add a blacklist check as a filter in filtron?

No. I would prefer to make a new program for that purpose: filtron is good at its header-filtering job, and adding unrelated features would defeat its initial purpose and probably make it slower.
Moreover, having your IP on a blacklist doesn't mean that you are a bot.
Some blacklists are really outdated, so I think this would do more harm than good: with dynamic IPs, a home connection can easily end up blacklisted. Filtering by ASN is a better idea, for example blocking requests coming from the ASNs of VPS providers.
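A possible starting point for ASN filtering, sketched with Team Cymru's public IP-to-ASN DNS service (assumes the third-party dnspython package; the blocked ASNs are purely illustrative):

import dns.resolver  # third-party package: dnspython

def asn_for_ip(ip):
    # Team Cymru's DNS interface: reversed octets + origin.asn.cymru.com.
    # The TXT answer looks like "13335 | 1.1.1.0/24 | AU | apnic | 2011-08-11".
    query = ".".join(reversed(ip.split("."))) + ".origin.asn.cymru.com"
    answer = dns.resolver.resolve(query, "TXT")
    return answer[0].to_text().strip('"').split("|")[0].strip()

# Illustrative: ASNs of two large VPS/dedicated-server providers.
BLOCKED_ASNS = {"16276",   # OVH
                "24940"}   # Hetzner

print(asn_for_ip("1.1.1.1") in BLOCKED_ASNS)  # False: Cloudflare, AS13335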

I disabled the rules because I'm seeing more blocked requests that look human than actual bots.
I guess a lot of users in the searx community use a user-agent randomizer.

I'm closing this issue because it's not worth investigating further, and filtron still does a very good job with the current rules.