HiveMinds/browse-like-us

Create a scraped, curated uBlock Origin filter dataset as a baseline.

Opened this issue · 0 comments

a-t-0 commented
  1. Determine how to find personal uBlock Origin filter lists to create a dataset, and find the existing ones on GitHub. Allow users to follow/like/try/find:
    1.1 Direct/complete uBlock lists.
    1.2 uBlock filter exports.
    1.3 uBlock backups.
    Tip: these can be found by searching for the (default) filenames of these file types.
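The search-by-filename idea above could be sketched against the GitHub code-search API. The filename fragments below are assumptions about uBlock Origin's default export/backup names and should be verified before use:

```python
# Sketch: build GitHub code-search queries for candidate uBlock Origin
# export/backup filenames. The filename patterns below are assumptions,
# not confirmed defaults.

from urllib.parse import urlencode

# Hypothetical default filename fragments for the three file types.
CANDIDATE_FILENAMES = [
    "my-ublock-backup",          # full settings backup (assumed prefix)
    "my-ublock-static-filters",  # "My filters" export (assumed)
    "ublock-filters",            # direct/complete filter lists (assumed)
]

API = "https://api.github.com/search/code"

def search_url(filename_fragment: str) -> str:
    """Return a GitHub code-search API URL for one filename fragment."""
    return f"{API}?{urlencode({'q': f'filename:{filename_fragment}'})}"

for name in CANDIDATE_FILENAMES:
    print(search_url(name))
```

Note the code-search endpoint requires an authenticated request in practice; this only constructs the query URLs.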

  2. Make applying uBlock Origin filters easy and modular using that dataset.
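One way the "easy and modular" application could work is composing a single importable list from selected modules in the dataset. This is a minimal sketch; the module names and filters are made up:

```python
# Sketch: compose one importable filter list from selected dataset modules.

def compose_filter_list(modules: dict[str, list[str]], selected: list[str]) -> str:
    """Concatenate the chosen filter modules, each under a title comment.

    uBlock/ABP-style lists use '!' for comment lines.
    """
    parts = []
    for name in selected:
        parts.append(f"! >>> {name}")  # section header as a filter comment
        parts.extend(modules[name])
    return "\n".join(parts)

# Hypothetical modules from the dataset:
modules = {
    "stackexchange": ["stackoverflow.com##.bottom-bar"],
    "news": ["example-news.com##.paywall-overlay"],
}
print(compose_filter_list(modules, ["stackexchange"]))
```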

  3. Automatically cluster the filter list files based on websites and/or themes. Determine how to distinguish generic filter comments from comments specific to a website, and how to handle each type. For example, if someone writes:

# Some comment for stack exchange.
some filter for stackoverflow
some filter for askUbuntu

Then the comments get refactored separately into:

# Stackoverflow.com
some filter for stackoverflow
# AskUbuntu.com
some filter for askUbuntu

Where does the "# Some comment for stack exchange." comment go? Propose making manual merging easy: give the user a prompt with the two (or n) categories and ask whether the comment belongs in any of them, should be changed, or should be deleted.

3.1 I created some filters for the Stack Exchange network which, for example, hide the bottom bar on all websites in the Stack Exchange universe. However, instead of writing:

askubuntu.com##Drop_bottom_bar
stackoverflow.com##Drop_bottom_bar
...
etc.

I wrote:

##Drop_bottom_bar

So now if some other random website has an element matching that selector, and I actually need it, it still gets filtered out. So determine for which sites specifically each filter should apply, and make that complete and structured within the filtering dataset. (Automatically send pull requests asking for clarifying information if it cannot be derived from comments and/or user input.)

3.2 Allow group/cluster filters, e.g. one for the Stack Exchange network (instead of one for askubuntu.com and another for stackoverflow.com). In the dataset, include the cluster/group relation, and store the filters separately per website.
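The cluster relation described above could be kept as a simple mapping, with cluster-level filters expanded to per-website filters on export. The cluster contents here are illustrative assumptions:

```python
# Sketch: store one filter per cluster and expand it to per-site filters
# using a cluster -> member-domain mapping kept in the dataset.

CLUSTERS = {  # hypothetical cluster relation
    "stackexchange": ["askubuntu.com", "stackoverflow.com", "superuser.com"],
}

def expand_cluster_filter(cluster: str, selector: str) -> list[str]:
    """Turn one cluster-level cosmetic filter into per-website filters."""
    return [f"{domain}##{selector}" for domain in CLUSTERS[cluster]]

print(expand_cluster_filter("stackexchange", "Drop_bottom_bar"))
```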

  1. Make a list of:
    • Filter lists
    • Custom Filters:
      • Available generic filters (e.g. for the Stack Exchange universe, instead of for askubuntu.com, stackoverflow.com, etc.)
      • Available website specific filters
    • Custom Rules
    • Trusted Sites
  2. Determine useful dataset categories
    • Input: HTML source, element filter set (specific to a website, or to an overarching group)
    • Label/score: user votes/user usage
    • If a classifier exists for how readable a website is, it may be used to generate uBlock Origin filters via some genetic algorithm/evolutionary strategy. (Be careful it doesn't converge to "no content is easiest to read", because then there is no difficult content left to read.)
    • Doubt: perhaps include JavaScript/dependencies in the inputs so it learns better how to filter more complicated web content.
    • Perhaps train on a visual image of the website. However, that would need some reliable score on:
      • Is the relevant/minimally required content present?
      • How minimalist/nice/easy-to-read/calm is it? (relative to without filters)