Inversion-DNSBL-Generator

Generate malicious URL blocklists for DNSBL applications like pfBlockerNG or Pi-hole by scanning various public URL sources using the Safe Browsing API from Google and/or Yandex.

Primary language: Python · License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause)

Inversion DNSBL (Domain Name System-based blackhole list) Generator

Built with: Python, SQLite, AIOHTTP, Ray

Table of Contents
  1. Blocklists available for download
  2. URL sources
  3. Safe Browsing API vendors
  4. Requirements
  5. Setup instructions
  6. Getting Started
  7. Other Examples
  8. Known Issues
  9. Disclaimer
  10. References

Blocklists available for download

You may download the blocklists here

URL sources

| Name | URL Count | Source | Description |
| --- | --- | --- | --- |
| Tranco TOP1M | 1M | https://tranco-list.eu | A Research-Oriented Top Sites Ranking Hardened Against Manipulation |
| DomCop TOP10M | 10M | https://www.domcop.com/top-10-million-domains | Top 10 million domains based on Open PageRank data |
| Registrar R01 | 6M | https://r01.ru | Zone files for .ru .su .rf domains |
| CubDomain.com | 196M | https://cubdomain.com | Aggregator that tracks newly registered domains daily |
| ICANN CZDS (Centralized Zone Data Service) | 247M | https://czds.icann.org | ICANN's centralized point for interested parties to request access to Zone Files provided by participating Top Level Domain Registries |
| Domains Project | 2.1B | https://domainsproject.org | World’s single largest Internet domains dataset |
| Amazon Web Services EC2 | 57M | https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-hostnames | Amazon Elastic Compute Cloud hostnames |
| Google Compute Engine | 11M | https://www.gstatic.com/ipranges/cloud.json | Google Compute Engine |
| OpenINTEL.nl | 6M | https://openintel.nl | Zone files for .se .nu .ee domains |
| Switch.ch | 3.3M | https://switch.ch/open-data | Zone files for .ch .li domains |
| AFNIC.fr | 7M | https://www.afnic.fr/en/products-and-services/fr-and-associated-services/shared-data-reuse-fr-data | Daily newly registered .fr .re .pm .tf .wf .yt domains |
| Internet.ee | 153K | https://www.internet.ee/domains/ee-zone-file | Estonian Internet Foundation (.ee) |
| Internetstiftelsen | 1.7M | https://zonedata.iis.se | Swedish Internet Foundation |
| SK-NIC.sk | 400K | https://sk-nic.sk/subory/domains.txt | Domain Registry of the Slovak Republic (.sk) |
| Google TAG IOCs | 200 | https://blog.google/threat-analysis-group | Google Threat Analysis Group Indicators of Compromise |
| IPv4 Addresses | 4.2B | 0.0.0.0 - 255.255.255.255 | Exhaustive list of all IPv4 addresses |
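
The IPv4 source covers the whole 32-bit address space, roughly 4.2 billion entries, so it is better generated lazily than held in memory as a list. A minimal sketch using Python's standard `ipaddress` module follows; the function name `iter_ipv4` is an illustrative assumption, not the project's actual code.

```python
import ipaddress
from itertools import islice
from typing import Iterator


def iter_ipv4(start: int = 0, stop: int = 2**32) -> Iterator[str]:
    """Lazily yield dotted-quad IPv4 addresses for the integers in [start, stop)."""
    for n in range(start, stop):
        yield str(ipaddress.IPv4Address(n))


if __name__ == "__main__":
    # Print only the first three addresses; the full range has 2**32 entries.
    for address in islice(iter_ipv4(), 3):
        print(address)
```

Because the generator never materializes the full range, it can be consumed in batches of whatever size the downstream scanner prefers.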

Safe Browsing API vendors

| Google Safe Browsing API | Yandex Safe Browsing API |
| --- | --- |
| Google | Yandex |
| Terms-of-Service | Terms-of-Service |

Requirements

System (mandatory)

  • Linux or macOS
  • Python 3.10+
  • Multi-core x86-64 CPU (needed for Python Ray support)
  • RAM: At least 32GB
  • SSD Storage Space: At least 700GB required to process all URL sources

Safe Browsing API Access (mandatory)

Choose at least one

URL feed access (optional)

  • ICANN Zone Files: Sign up for an ICANN CZDS account
  • Once registered, turn off email notifications in the user settings (otherwise you will receive hundreds of acknowledgement emails), then select Create New Request on the Dashboard to request zone file access.

Uploading blocklists to GitHub (optional)

Download limits

  • ICANN CZDS (Centralized Zone Data Service): Once every 24 hours per zone file
  • Switch.ch: Once every 24 hours per zone file
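
Because both feeds allow one download per zone file every 24 hours, a fetcher has to remember when each file was last downloaded. One way this could be tracked is sketched below; `may_download` and the `last_downloaded` mapping are hypothetical names and are not taken from the project.

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(hours=24)  # limit stated above for ICANN CZDS and Switch.ch


def may_download(zone: str, last_downloaded: dict[str, datetime], now: datetime) -> bool:
    """Return True if `zone` was never fetched, or was last fetched at least 24 hours before `now`."""
    previous = last_downloaded.get(zone)
    return previous is None or now - previous >= COOLDOWN


if __name__ == "__main__":
    now = datetime(2023, 8, 4, 12, 0, tzinfo=timezone.utc)
    history = {"ch": now - timedelta(hours=23), "li": now - timedelta(hours=25)}
    for zone in ("ch", "li", "com"):
        print(zone, may_download(zone, history, now))
```

Passing `now` in explicitly keeps the check easy to test; a real run would supply the current UTC time.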

Setup instructions

git clone and cd into the project directory

Declare environment variables

cp .env-dev .env

In .env, fill in the following variables

# Mandatory: At least one of the following Safe Browsing API keys
GOOGLE_API_KEY=
YANDEX_API_KEY=

# Optional: ICANN zone file access
ICANN_ACCOUNT_USERNAME=
ICANN_ACCOUNT_PASSWORD=
# Some registrars will not accept your request reason unless you include your Name, Email, IP Address, Physical Address (Building, Street, Postcode etc.), and Phone Number
ICANN_REQUEST_REASON='Detection of potentially malicious domains for cybersecurity research. Name: _ Email: _ IP Address: _ Physical Address: _ Phone Number: _'

# Optional: Upload generated blocklists to your GitHub repository
GITHUB_ACCESS_TOKEN=
BLOCKLIST_REPOSITORY_NAME=
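
Since at least one Safe Browsing API key is mandatory, it can help to check the environment before starting a long run. The following is a minimal sketch using only the standard library; `configured_vendors` is a hypothetical helper, and it assumes the variables from .env have already been loaded into the process environment.

```python
import os
from typing import Mapping

# Vendor name -> environment variable holding its API key (variable names from .env).
API_KEY_VARIABLES = {"google": "GOOGLE_API_KEY", "yandex": "YANDEX_API_KEY"}


def configured_vendors(env: Mapping[str, str]) -> list[str]:
    """Return the vendors whose API key is set to a non-blank value."""
    vendors = [v for v, var in API_KEY_VARIABLES.items() if env.get(var, "").strip()]
    if not vendors:
        raise RuntimeError("Set GOOGLE_API_KEY and/or YANDEX_API_KEY in .env")
    return vendors


if __name__ == "__main__":
    try:
        print(configured_vendors(os.environ))
    except RuntimeError as error:
        print(error)
```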

Install dependencies

According to PEP 668, use of a virtual environment is strongly recommended as of 2023.

python3 -m venv venv
venv/bin/python3 -m pip install --upgrade pip
venv/bin/python3 -m pip install -r requirements.txt

Download Domains Project URLs (optional)

# Dataset size ~49 GB
cd ../
git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install # you will need to install Git LFS first (https://git-lfs.github.com)

Edit unpack.sh and remove combine from the last line, then run:

./unpack.sh

Getting Started

Download Google Safe Browsing API hashes

⚠️ As of 4 August 2023, the following command makes around 9,000 calls to the Google Safe Browsing API (the exact number depends on the number of hashes in Google's dataset). As the daily limit is 10,000 calls, --update-hashes should be run no more than once every 24 hours.

venv/bin/python3 main.py --update-hashes --vendors google
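
The once-per-day guidance follows from simple arithmetic on the figures in the warning: one update uses most of the daily quota, so a second run on the same day would exceed it. A small sketch, with the constants taken from the warning and the helper name `runs_per_day` purely illustrative:

```python
DAILY_CALL_LIMIT = 10_000  # daily Safe Browsing API quota quoted in the warning
CALLS_PER_UPDATE = 9_000   # approximate calls made by one --update-hashes run


def runs_per_day(limit: int = DAILY_CALL_LIMIT, per_run: int = CALLS_PER_UPDATE) -> int:
    """Number of complete update runs that fit inside the daily quota."""
    return limit // per_run


if __name__ == "__main__":
    print(runs_per_day())                       # 1
    print(DAILY_CALL_LIMIT - CALLS_PER_UPDATE)  # 1000 calls left for the rest of the day
```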

Download and identify malicious URLs from Tranco TOP1M

  • ✔️ Add Tranco TOP1M URLs to database
  • ✔️ Identify malicious URLs from database using Safe Browsing API hashes, and generate a blocklist
  • ✔️ Update database with latest malicious URL statuses
  • 📝 Sources: Tranco TOP1M
  • 🛡️ Vendors: Google

venv/bin/python3 main.py --fetch-urls --identify-malicious-urls --sources top1m --vendors google
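
To illustrate what identifying URLs from Safe Browsing API hashes involves: the Google Safe Browsing Update API distributes prefixes of SHA-256 hashes, and a URL is a candidate match when the hash of one of its expressions starts with a listed prefix. The sketch below is heavily simplified and is not the project's implementation. It assumes 4-byte prefixes, hashes only the bare `domain/` expression, and skips the URL canonicalization and full-hash confirmation that the real protocol requires. `hash_prefix` and `find_candidates` are hypothetical names.

```python
import hashlib


def hash_prefix(domain: str, length: int = 4) -> bytes:
    """SHA-256 the expression 'domain/' and keep the first `length` bytes."""
    return hashlib.sha256(f"{domain}/".encode("utf-8")).digest()[:length]


def find_candidates(domains: list[str], prefixes: set[bytes]) -> list[str]:
    """Return the domains whose hash prefix appears in the local prefix set."""
    return [d for d in domains if hash_prefix(d) in prefixes]


if __name__ == "__main__":
    # Pretend the vendor's list contains the prefix of one made-up domain.
    listed = {hash_prefix("malware.example")}
    print(find_candidates(["good.example", "malware.example"], listed))
```

Keeping the prefixes in a set makes each lookup constant time on average, which matters when the database holds hundreds of millions of URLs.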

Other Examples

Download DomCop TOP10M URLs

  • ✔️ Add DomCop TOP10M URLs to database (no blocklist will be generated)
  • 📝 Sources: DomCop TOP10M
  • 🛡️ Vendors: Not Applicable

venv/bin/python3 main.py --fetch-urls --sources top10m

Download and identify malicious URLs from all sources

⚠️ Requires at least 700GB free space.

ℹ️ If you have not downloaded any Safe Browsing API hashes yet, add the --update-hashes flag to the following command.

  • ✔️ Add URLs from all sources to database
  • ✔️ Identify malicious URLs from database using Safe Browsing API hashes, and generate a blocklist
  • ✔️ Update database with latest malicious URL statuses
  • 📝 Sources: Everything
  • 🛡️ Vendors: Google

venv/bin/python3 main.py --fetch-urls --identify-malicious-urls --vendors google

Retrieve URLs marked as malicious in past scans from the database

  • ✔️ Retrieve URLs with malicious statuses (obtained from past scans) from the database, and generate a blocklist
  • 📝 Sources: DomCop TOP10M, Domains Project
  • 🛡️ Vendors: Google

venv/bin/python3 main.py --retrieve-known-malicious-urls --sources top10m domainsproject --vendors google
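
Conceptually, this step is a query over stored scan results followed by writing the matches out as a blocklist. The sketch below uses an in-memory SQLite database with a made-up `urls(url, source, vendor, malicious)` table. The project's real schema and output format are not shown in this README, so every table, column, and function name here is an assumption.

```python
import sqlite3


def known_malicious(conn: sqlite3.Connection, sources: list[str], vendor: str) -> list[str]:
    """Return sorted URLs from the given sources that `vendor` flagged as malicious."""
    placeholders = ",".join("?" for _ in sources)
    query = (
        "SELECT DISTINCT url FROM urls "
        f"WHERE malicious = 1 AND vendor = ? AND source IN ({placeholders}) "
        "ORDER BY url"
    )
    return [row[0] for row in conn.execute(query, [vendor, *sources])]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE urls (url TEXT, source TEXT, vendor TEXT, malicious INTEGER)")
    conn.executemany(
        "INSERT INTO urls VALUES (?, ?, ?, ?)",
        [
            ("bad.example", "top10m", "google", 1),
            ("fine.example", "top10m", "google", 0),
            ("evil.example", "domainsproject", "google", 1),
        ],
    )
    # One entry per line is a common plain-text blocklist layout.
    print("\n".join(known_malicious(conn, ["top10m", "domainsproject"], "google")))
```

No Safe Browsing API calls are needed here, since the statuses were recorded by earlier scans.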

Display help message

venv/bin/python3 main.py --help

Known Issues

  • The Yandex Safe Browsing Update API appears to be out of service. Yandex technical support has been notified.

Disclaimer

  • This project is not sponsored, endorsed, or otherwise affiliated with Google and/or Yandex.

  • Google works to provide the most accurate and up-to-date information about unsafe web resources. However, Google cannot guarantee that its information is comprehensive and error-free: some risky sites may not be identified, and some safe sites may be identified in error.

  • A malicious verdict returned by the Safe Browsing API is usually valid for only about 5 minutes. As the blocklists are updated only once every 24 hours, they must not be used to display user warnings.

More information on Google Safe Browsing API usage limits: https://developers.google.com/safe-browsing/v4/usage-limits

References