/PublicData

Public data sets for Marginalia Search

Data Sets

These are data sets used by Marginalia Search. If you think something is missing, or something is listed that doesn't belong, feel free to open a pull request. Contributions are welcome.

  • blogs.txt is a list of websites that are blogs (or close enough). Websites on this list receive slightly preferential treatment during processing, and are processed on the assumption that they are blogs. blogs.txt is also the list of domains that appear in the new 'Blogosphere' filter.

  • docs.txt is not yet in use. The idea is to gather as many good documentation sites as possible and build a filter from them.

  • random-domains.txt lists the domains included in the random exploration mode.
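
Before opening a pull request, it can help to check whether a domain is already on one of these lists. The following is a minimal sketch, not part of Marginalia Search itself. It assumes each file holds one domain per line; the skipping of blank lines and '#' comment lines is a guess about the format, and load_domains and is_listed are hypothetical helper names.

```python
from pathlib import Path


def load_domains(path):
    """Read a domain list, assuming one domain per line (an assumption about the format)."""
    domains = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.strip().lower()
        # Skipping blank lines and '#' comment lines is a guess, not a documented rule.
        if entry and not entry.startswith("#"):
            domains.add(entry)
    return domains


def is_listed(domain, domains):
    """Return True if the domain is in the loaded set (exact, case-insensitive match)."""
    return domain.strip().lower() in domains
```

For example, is_listed("example.com", load_domains("blogs.txt")) reports whether example.com is already in blogs.txt. The match is exact, so subdomains such as www.example.com are treated as separate entries.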