DanielSiepmann/tracking

Add "bot" (flags) detection

DanielSiepmann opened this issue · 9 comments

Right now the extension is not aware of bots.
It would make sense to add a flag to all records that defines whether the tracking was triggered by a bot.

It should then be possible to exclude flagged records from widgets. That way widgets can be created for humans as well as bots.
This way one can check which pages are called by humans and by bots, as well as which operating systems are used by bots and humans.

All bot logic should be removed from the existing rule.
Instead, a new option should be introduced which contains that logic.

Integrated into the existing update logic, existing records can be marked as bots afterwards. No existing data should be lost.

Open Todos:

  • Current inconsistency: I sometimes use "flag", sometimes "tag"; this should be streamlined.
  • Fix typo Unkown

That feature can be abstracted and combined with the existing operating system feature.
Instead of adding detection upon detection, each with dedicated database fields, here is another approach:
Use tags. Integrators should be able to define arbitrary tags, e.g. as keys in the configuration. Each tag has a rule as its value, which receives the same input as the "should track" rule.
Everyone can define the tags important for a project. Tags can be "bot:yes" or "bot:no" as well as "os:windows", "os:unix", etc.

We wouldn't need to add feature upon feature, but could implement a single flexible feature.
Widgets should be able to filter by those tags. E.g. show only page views with tag "bot:yes".
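To make the idea concrete, here is a minimal sketch in Python of tags defined as configuration keys with a rule as value, plus a widget-style filter. It is not the extension's implementation: the request is modelled as a plain dict, and the names `TAG_RULES`, `tags_for` and `filter_by_tag` as well as the substring-based rules are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List

# A request is modelled as a plain dict here (assumption for this sketch).
Request = Dict[str, str]
Rule = Callable[[Request], bool]

# Hypothetical configuration: tag name -> rule. Each rule receives the request,
# just like a "should track" rule would. The rules are deliberately naive.
TAG_RULES: Dict[str, Rule] = {
    "bot:yes": lambda r: "bot" in r.get("user_agent", "").lower(),
    "bot:no": lambda r: "bot" not in r.get("user_agent", "").lower(),
    "os:windows": lambda r: "windows" in r.get("user_agent", "").lower(),
    "os:unix": lambda r: "linux" in r.get("user_agent", "").lower(),
}


def tags_for(request: Request) -> List[str]:
    """Return every configured tag whose rule matches the request."""
    return [tag for tag, rule in TAG_RULES.items() if rule(request)]


def filter_by_tag(records: List[dict], tag: str) -> List[dict]:
    """What a widget would do: keep only records carrying the given tag."""
    return [record for record in records if tag in record["tags"]]


requests = [
    {"url": "/", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    {"url": "/blog", "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
]

# Every request is recorded; the tags are attached as extra information.
records = [{"url": r["url"], "tags": tags_for(r)} for r in requests]

# A "page views by bots" widget would then only see the second record.
bot_views = filter_by_tag(records, "bot:yes")
```

The point of the design is that adding a new dimension (e.g. a browser tag) only means adding another key to the configuration, not another database field or widget type.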

Blocks proper tests for this: TYPO3/testing-framework#256

Already worked on this, see the commits and branch feature/46-add-flags-feature (as well as a first attempt in branch feature/bot-support, which won't make it because the new branch takes a more flexible approach). Blocker right now: this will be a breaking change, and data migration takes way too long right now.

Further work (funding) is needed to provide a smoother migration.

What do you think about requiring/suggesting https://github.com/JayBizzle/Crawler-Detect and making it available in the Expression Language as detectCrawler.isBot() or similar? They are pretty quick with adding new bots' user-agents.

Absolutely wonderful project by the way :) Thank you!

I've integrated https://packagist.org/packages/matomo/device-detector within the feature branch already.
There is currently no plan to add it to the expression language, as the concept will change.

All requests will be tagged and can have arbitrary tags, e.g. a tag "isBot:yes" or a tag "botName:Google". See: https://github.com/DanielSiepmann/tracking/blob/feature/46-add-flags-feature/Documentation/Changelog/2.0.0.rst#features and https://github.com/DanielSiepmann/tracking/blob/feature/46-add-flags-feature/Documentation/Tags.rst

Widgets will be extended to allow filtering by tags. That way the extension is not limited to one use case, e.g. bots, but open to anything. Developers can add further extractors that extract tags from the request, which will be attached as well. Integrators can then create fine-grained widgets, e.g. top bots, top pages by bots, etc.
Existing information such as the operating system is also moved to those tags via extractors.

Developers are also able to replace extractors, e.g. if you prefer another crawler library.
The current extractor allows adding further YAML files to the Matomo bot detection, e.g. if you have very specific bots from third parties or your own.
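The extractor concept described above could look roughly like the following Python sketch. It does not reflect the actual classes in the feature branch: `TagExtractor`, `RegexBotExtractor`, `OperatingSystemExtractor` and `collect_tags` are hypothetical names, and a simple regex list stands in for a real crawler library such as the Matomo device detector.

```python
import re
from typing import Dict, Iterable, Protocol

Request = Dict[str, str]


class TagExtractor(Protocol):
    """Anything that can derive tags from a request (hypothetical interface)."""

    def extract(self, request: Request) -> Dict[str, str]: ...


class RegexBotExtractor:
    """Stand-in for a crawler library; the patterns are illustrative only."""

    def __init__(self, patterns: Dict[str, str]) -> None:
        self._patterns = {
            name: re.compile(pattern, re.IGNORECASE)
            for name, pattern in patterns.items()
        }

    def add_patterns(self, extra: Dict[str, str]) -> None:
        # Mirrors the idea of loading additional bot definitions,
        # e.g. for project-specific or third-party bots.
        for name, pattern in extra.items():
            self._patterns[name] = re.compile(pattern, re.IGNORECASE)

    def extract(self, request: Request) -> Dict[str, str]:
        user_agent = request.get("user_agent", "")
        for name, pattern in self._patterns.items():
            if pattern.search(user_agent):
                return {"isBot": "yes", "botName": name}
        return {"isBot": "no"}


class OperatingSystemExtractor:
    """Existing information such as the OS becomes just another tag."""

    def extract(self, request: Request) -> Dict[str, str]:
        user_agent = request.get("user_agent", "").lower()
        if "windows" in user_agent:
            return {"os": "windows"}
        if "linux" in user_agent:
            return {"os": "linux"}
        return {"os": "unknown"}


def collect_tags(request: Request, extractors: Iterable[TagExtractor]) -> Dict[str, str]:
    """Run all configured extractors and merge their tags."""
    tags: Dict[str, str] = {}
    for extractor in extractors:
        tags.update(extractor.extract(request))
    return tags


bot_extractor = RegexBotExtractor({"Google": r"googlebot"})
bot_extractor.add_patterns({"InternalMonitor": r"acme-monitor"})  # hypothetical own bot

extractors = [bot_extractor, OperatingSystemExtractor()]
google_tags = collect_tags({"user_agent": "Googlebot/2.1"}, extractors)
human_tags = collect_tags({"user_agent": "Mozilla/5.0 (X11; Linux x86_64)"}, extractors)
```

Because the tracking code only depends on the extractor interface, swapping in a different crawler library means replacing one entry in the list of extractors.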

The only issue left is a proper migration which doesn't take ages on large datasets. And proper documentation, especially on how to migrate the whole YAML setup.

I worked on a command which will migrate a configurable number of records per run.
Still, that would leave the dashboard in a broken state until all data is migrated. Not sure if that is sufficient. Maybe there should be a transition phase where both ways are supported. That way each integrator is free to use the new feature or keep the old behaviour, and can already switch to the new one, turn on migration and define their own transition phase.
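A batched migration of this kind could be sketched as follows in Python with an in-memory SQLite database. This is only an illustration of the batching idea, not the actual command: the table names `pageview` and `tag`, the `migrated` marker column, the function `migrate_batch` and the naive bot check are all assumptions.

```python
import sqlite3


def migrate_batch(conn: sqlite3.Connection, limit: int) -> int:
    """Tag up to `limit` not yet migrated records; return how many were processed."""
    rows = conn.execute(
        "SELECT id, user_agent FROM pageview WHERE migrated = 0 ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    for record_id, user_agent in rows:
        # Naive placeholder for the real extractors.
        is_bot = "yes" if "bot" in user_agent.lower() else "no"
        conn.execute(
            "INSERT INTO tag (pageview_id, name, value) VALUES (?, 'isBot', ?)",
            (record_id, is_bot),
        )
        # Existing rows are only marked, never deleted, so no data is lost.
        conn.execute("UPDATE pageview SET migrated = 1 WHERE id = ?", (record_id,))
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pageview (
        id INTEGER PRIMARY KEY,
        user_agent TEXT,
        migrated INTEGER DEFAULT 0
    );
    CREATE TABLE tag (pageview_id INTEGER, name TEXT, value TEXT);
    """
)
conn.executemany(
    "INSERT INTO pageview (user_agent) VALUES (?)",
    [("Mozilla/5.0",), ("Googlebot/2.1",), ("Bingbot/2.0",)],
)
conn.commit()

# Each loop iteration stands for one run of the command, e.g. via a scheduler.
processed_per_run = []
while (count := migrate_batch(conn, limit=2)) > 0:
    processed_per_run.append(count)

remaining = conn.execute("SELECT COUNT(*) FROM pageview WHERE migrated = 0").fetchone()[0]
```

Between runs, records without tags are still present, which is exactly the intermediate state discussed in this thread: widgets filtering by tag would not see them until a later run has processed them.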

On the other hand … it's up to everyone to stay on v1, and we could just release v2 with the migration path and new features. Maybe people give it a try and provide feedback if that approach doesn't work … we could then still provide a v2.x which is compatible with both and allows a smoother transition.

I am not sure I understand the problem.
Is the concern that "unprocessed" records would be visible in the Dashboard, just untagged until the "extractors" have run? Then I would say this is absolutely no problem.

Yes, that's the "big" problem I see.

Furthermore, one has to adjust the Services.yml configuration. But that shouldn't be a big problem. The default shipped configuration will be adapted, and I'll add proper documentation for the migration.

Let's see when I find time to finish. I'll then use the new version on my own site for a while before I'll merge and release the new version.

We need to ensure that existing ignores are kept, e.g. #105