This small library removes advertising and tracking query parameters from a given URL in Scala. It does not contain any filtering logic itself; instead, it uses part of the AdGuard JS adblocker engine under the hood. So, besides its main function, you can treat this library as a proof of concept showing that it is entirely possible to run (almost) any kind of JS code on the JVM. This approach can seem excessive, so read on to find out why it is done this way.
If you use sbt:
libraryDependencies += "me.seroperson" %% "urlopt4s" % "0.1.0"
If you use mill:
ivy"me.seroperson::urlopt4s::0.1.0"
The main object is UrlOptimizer[F]. It provides a Scala API to remove advertising and tracking query parameters from a given URL, and it interacts with a JS adblocker engine via GraalJS under the hood. For usage examples, you can check the tests and the example directory, but here is a short one too:
// ...
object ExampleApp extends IOApp {
  override def run(args: List[String]): IO[ExitCode] = UrlOptimizer[IO]()
    .use { urlOptimizer =>
      for {
        result <- urlOptimizer.removeAdQueryParams("https://www.google.com/?utm_source=test")
        _ <- IO.println(result) // "https://www.google.com/"
      } yield ExitCode.Success
    }
}
You can run a real example like so:
./millw 'example.cats[2.12.18].run' 'https://google.com/?utm_source=test'
The function removeAdQueryParams takes some time to execute, so you usually want to run your processing in the background. It can be used concurrently, as all the necessary locks have been implemented internally. Also, be sure not to block on UrlOptimizer resource initialization, as it may take some time for it to start.
You can also pass your own custom rule list to the UrlOptimizer.apply method if the default one (see urlopt4s/resources/rules.txt) does not cover your needs. You can also redefine the GraalJS context, but usually this is not necessary.
I ran into the need to remove advertising and tracking query parameters while developing my pet project, a Telegram bot called "the advanced link saver". I wanted to implement a feature that strips redundant query parameters from URLs, and the first thing I coded was a simple filter for a predefined set of commonly used tracking query parameters, such as utm_source, utm_medium, fbclid, etc.
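Such a first attempt might look roughly like the sketch below. It is not the code from the bot: NaiveFilter, trackingParams and strip are illustrative names, and the parameter set is a hypothetical sample.

```scala
object NaiveFilter {
  // Hypothetical blocklist; a hand-maintained set like this is never complete.
  val trackingParams: Set[String] = Set("utm_source", "utm_medium", "fbclid")

  /** Drops every query parameter whose name is listed in `trackingParams`. */
  def strip(url: String): String = {
    // Split off the fragment first, then the query string.
    val (beforeFragment, fragment) = url.span(_ != '#')
    val (base, rawQuery) = beforeFragment.span(_ != '?')
    val kept = rawQuery
      .drop(1) // skip the leading '?'
      .split('&')
      .toList
      .filter(_.nonEmpty)
      .filterNot(pair => trackingParams.contains(pair.takeWhile(_ != '=')))
    val query = if (kept.isEmpty) "" else kept.mkString("?", "&", "")
    base + query + fragment
  }
}
```

With this sketch, NaiveFilter.strip("https://www.google.com/?utm_source=test") returns "https://www.google.com/", while any parameter missing from the set passes through untouched.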
Quite quickly, I realized that this method didn't work well enough. There are a great many such parameters in use, and you can't cover all of them. For example, a Google Search URL typically looks like this:
https://www.google.com/search?q=hello&sca_esv=494940dbc25649b8&source=hp&ei=rmEhZuPhF6eLxc8P6-C_mAI&iflsig=ANes7DEAAAAAZiFvvg9IypzVMAznAHWL3LCM0tiJHpsL&udm=&ved=0ahUKEwjj8PzlrcyFAxWnRfEDHWvwDyMQ4dUDCA0&uact=5&oq=hello&gs_lp=Egdnd3Mtd2l6IgVoZWxsbzIIEC4YgAQYsQMyCxAuGIAEGLEDGNQCMggQABiABBixAzIIEC4YgAQYsQMyCBAuGIAEGLEDMgsQLhiABBixAxiDATILEC4YgAQYsQMY1AIyCBAAGIAEGLEDMggQABiABBixAzIIEAAYgAQYsQNInAhQAFivBnAAeACQAQCYAUigAdwCqgEBNbgBA8gBAPgBAZgCBaAC5wLCAhEQLhiABBixAxjRAxiDARjHAcICDhAuGIAEGMcBGI4FGK8BwgIEEAAYA8ICCxAAGIAEGLEDGIMBwgIYEC4YgAQYARixAxjRAxiDARjHARiKBRgKwgIFEAAYgATCAhEQLhiABBixAxiDARjHARivAZgDAJIHATWgB5VV&sclient=gws-wiz
The only part that matters here is:
https://www.google.com/search?q=hello
Of course, you can manually collect a list of redundant query parameters by visiting the most popular websites and carefully picking out the truly ad-related ones. However, on closer inspection:
- It's a really time-consuming process and it's hard to do it properly.
- A parameter that you consider ad-related may indeed be so on one website but not on another, and you may never find out that something has gone wrong.
- Sometimes you want to filter parameters or match domain names using regular expressions.
- And probably some other points that aren't so obvious.
As you can see, this simple task becomes more and more difficult as you go along.
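To illustrate how quickly the hand-written approach grows, here is a rough sketch of rules scoped by host and matched with regular expressions. Rule, rules and strip are hypothetical names, and the patterns are sample values rather than a vetted list.

```scala
import scala.util.matching.Regex

object ScopedRules {
  /** Hypothetical rule: remove parameters matching `param` on hosts matching `host`. */
  final case class Rule(host: Regex, param: Regex)

  // Sample rules only; a real list would need constant maintenance.
  val rules: List[Rule] = List(
    Rule(".*".r, "utm_.*".r),                              // applies everywhere
    Rule("""(.*\.)?google\.com""".r, "sca_esv|ved|ei".r)   // applies on Google hosts only
  )

  // Full-string match; written via java.util.regex so it also works on Scala 2.12.
  private def fullMatch(regex: Regex, value: String): Boolean =
    regex.pattern.matcher(value).matches()

  /** Keeps only the parameters that no active rule removes for the given host. */
  def strip(host: String, params: List[(String, String)]): List[(String, String)] = {
    val active = rules.filter(rule => fullMatch(rule.host, host))
    params.filterNot { case (name, _) => active.exists(rule => fullMatch(rule.param, name)) }
  }
}
```

Even this small version already needs a rule format, a matching strategy and a maintained rule list, which is essentially what adblockers provide.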
The method I came up with is reusing the code that popular web adblockers already have. If you have a good adblocker installed as a browser extension, you may notice that it sometimes rewrites your URLs to get rid of advertising or tracking query parameters. This means that adblockers actually already manage a list of trashy query parameters and have all the necessary code to filter URLs. We just need to find this code and reuse it.
I chose the AdGuard ecosystem for this. They have very friendly documentation, most things are open source, and it is relatively easy to get things right with them. The tsurlfilter project is the core: it is responsible for the common logic and is used in all their adblockers. Using its API, we can initialize the adblocker engine, pass in some URLs, match them against the adblocker rules, and then perform blocking, filtering, redirecting and so on, depending on what matched.
As I said, an adblocker usually works according to a predefined set of rules. Therefore, we also need to create our own list of rules that contains only entries related to filtering query parameters. The FiltersRegistry repository allows you to do that. We will discuss this later.
Then, if our backend were written in JS, we would have no further problems: just add dependencies, maybe some polyfills, and run the code. But we're running on the JVM, and that is the other purpose of this library: to show that it's possible to run a large webpack bundle consisting of TypeScript libraries and modern JS APIs on the JVM.
Here is how I did it:
- We write the urlopt4s-js JS module, which interacts with the tsurlfilter library, initializes an engine and provides functions to be called from the JVM, like removeAdQueryParams(str).
- We build the urlopt4s-js bundle with webpack, adding some polyfills and some tricks to make JS-on-JVM work.
- We compile our custom set of rules, which contains only query parameter filtering rules.
- Finally, we write the urlopt4s Scala module and pack the JS bundle and rules inside. It initializes everything and then provides a Scala interface to call the JS code.
JS-on-JVM is implemented using GraalJS and works quite well, but a lot of tricky things were required to get everything working together.
Still, there is plenty of room for optimization, and I believe many things could be improved, but for now I am leaving it as is.
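For readers new to GraalJS, the basic mechanism of calling JS from the JVM can be sketched as follows. This is not the library's actual code: GraalJsSketch and removeUtm are hypothetical names, the inline JS is a toy stand-in for the real webpack bundle, and the sketch assumes the GraalVM polyglot API and the JS language are available on the classpath.

```scala
import org.graalvm.polyglot.Context

object GraalJsSketch {
  // Toy JS function standing in for the real bundle. It uses plain string
  // operations only, since browser APIs are generally not built into GraalJS
  // and would have to be polyfilled.
  private val jsSource: String =
    """(function (url) {
      |  var parts = url.split('?');
      |  if (parts.length < 2) return url;
      |  var kept = parts[1].split('&').filter(function (p) {
      |    return p.indexOf('utm_') !== 0;
      |  });
      |  return kept.length ? parts[0] + '?' + kept.join('&') : parts[0];
      |})""".stripMargin

  /** Evaluates the JS function in a fresh context and calls it with `url`. */
  def removeUtm(url: String): String = {
    val context = Context.create("js")
    try {
      val fn = context.eval("js", jsSource) // evaluates to a JS function value
      fn.execute(url).asString()
    } finally context.close()
  }
}
```

Creating a context per call keeps the sketch simple but is slow. A JS context must not be entered by several threads at once, so sharing one long-lived context means guarding access to it.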
First, you have to build the webpack bundle, which will be included in the final .jar. Go to urlopt4s-js and run:
npm exec webpack
Your bundle will be available at urlopt4s-js/dist/main-bundle.mjs. It should then be moved to urlopt4s/resources/urlopt4s.mjs.
Now you can compile and build the .jar:
./millw __.publishLocal
urlopt4s comes with a predefined set of rules: urlopt4s/resources/rules.txt. It was compiled using the FiltersRegistry repository and contains only $removeparam directives. You may use the default one or compile your own. The repository has pretty nice documentation, but compiling the list you see here requires some additional code. Here is the patch I used to do it:
diff --git a/scripts/build/build.js b/scripts/build/build.js
index 8f7332b7657..033c0a59c55 100755
--- a/scripts/build/build.js
+++ b/scripts/build/build.js
@@ -1,6 +1,7 @@
 const fs = require('fs');
 const path = require('path');
 const compiler = require('adguard-filters-compiler');
+const compilerOptimization = require('../../node_modules/adguard-filters-compiler/src/main/optimization.js');
 const customPlatformsConfig = require('./custom_platforms');
 const { formatDate } = require('../utils/strings');
 
@@ -72,6 +73,8 @@ const buildFilters = async () => {
         await fs.promises.cp(platformsPath, copyPlatformsPath, { recursive: true });
     }
 
+    compilerOptimization.disableOptimization();
+
     await compiler.compile(
         filtersDir,
         logPath,
diff --git a/scripts/build/custom_platforms.js b/scripts/build/custom_platforms.js
index 71dcb17cd00..867603af78b 100644
--- a/scripts/build/custom_platforms.js
+++ b/scripts/build/custom_platforms.js
@@ -533,7 +533,44 @@ const SAFARI_BASED_EXTENSION_PATTERNS = [
     ...JSONPRUNE_MODIFIER_PATTERNS,
 ];
 
+const ONLY_REMOVEPARAM_MODIFIER_PATTERNS = [
+    '^(?!.*(\\$(?!#|(path|domain)=.*]).*removeparam(,|=|$))).*$',
+];
+
+const SKIP_CONTENT_TYPE_PATTERNS = [
+    '\\$.*document',
+    '\\$.*subdocument',
+    '\\$.*font',
+    '\\$.*image',
+    '\\$.*media',
+    '\\$.*object',
+    '\\$.*other',
+    '\\$.*ping',
+    '\\$.*script',
+    '\\$.*stylesheet',
+    '\\$.*websocket',
+    '\\$.*xmlhttprequest'
+];
+
 module.exports = {
+    'LINK_OPTIMIZER': {
+        'platform': 'link_optimizer',
+        'path': 'link_optimizer',
+        'expires': '10 days',
+        'configuration': {
+            // removing everything except of $removeparam
+            'removeRulePatterns': [
+                ...ONLY_REMOVEPARAM_MODIFIER_PATTERNS,
+                ...SKIP_CONTENT_TYPE_PATTERNS
+            ],
+            'replacements': null,
+            'ignoreRuleHints': false,
+        },
+        'defines': {
+            'adguard': true,
+            'adguard_ext_chromium': true,
+        },
+    },
     'WINDOWS': {
         'platform': 'windows',
         'path': 'windows',
After compiling, you will have to concatenate all the output files and remove the duplicates.
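That last step can be done with any tool. As one possible sketch in Scala, the helper below merges several rule files into one. MergeRules and merge are hypothetical names, and dropping blank lines and "!" comment lines is my own assumption about what the final list should contain.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object MergeRules {
  /** Concatenates rule files, dropping blank lines, comments and duplicates.
    * The first occurrence of each rule is kept, so the input order is preserved.
    */
  def merge(inputs: Seq[Path], output: Path): Unit = {
    val lines = inputs
      .flatMap { path =>
        new String(Files.readAllBytes(path), StandardCharsets.UTF_8).split("\\r?\\n").toList
      }
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("!")) // "!" starts a comment in filter lists
      .distinct
    val content = lines.mkString("", "\n", "\n")
    Files.write(output, content.getBytes(StandardCharsets.UTF_8))
  }
}
```

The paths of the compiled files depend on your FiltersRegistry checkout, so they are left to the caller here.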
MIT License
Copyright (c) 2024 Daniil Sivak
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.