Inversion-DNSBL-Generator: A Python repository from elliotwutingfeng

Inversion DNSBL (Domain Name System-based blackhole list) Generator

Generate malicious URL blocklists for DNSBL applications like pfBlockerNG or Pi-hole by scanning various public URL sources using the Safe Browsing API from Google and/or Yandex.

Table of Contents

Blocklists available for download
URL sources
Safe Browsing API vendors
Requirements
Setup instructions
Getting Started
- Download Google Safe Browsing API hashes
- Download and Identify malicious URLs from Tranco TOP1M
Other Examples
Known Issues
Disclaimer
References

Blocklists available for download

You may download the blocklists here

URL sources

Name	URL Count	Source	Description
Tranco TOP1M	1M	https://tranco-list.eu	A Research-Oriented Top Sites Ranking Hardened Against Manipulation
DomCop TOP10M	10M	https://www.domcop.com/top-10-million-domains	Top 10 million domains Based on Open PageRank data
Registrar R01	6M	https://r01.ru	Zone files for .ru .su .rf domains
CubDomain.com	196M	https://cubdomain.com	Aggregator that tracks newly registered domains daily
ICANN CZDS (Centralized Zone Data Service)	247M	https://czds.icann.org	ICANN's centralized point for interested parties to request access to Zone Files provided by participating Top Level Domain Registries
Domains Project	2.1B	https://domainsproject.org	World’s single largest Internet domains dataset
Amazon Web Services EC2	57M	https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#vpc-dns-hostnames	Amazon Elastic Compute Cloud hostnames
Google Compute Engine	11M	https://www.gstatic.com/ipranges/cloud.json	Google Compute Engine
OpenINTEL.nl	6M	https://openintel.nl	Zone files for .se .nu .ee domains
Switch.ch	3.3M	https://switch.ch/open-data	Zone files for .ch .li domains
AFNIC.fr	7M	https://www.afnic.fr/en/products-and-services/fr-and-associated-services/shared-data-reuse-fr-data	Daily newly registered .fr .re .pm .tf .wf .yt domains
Internet.ee	153K	https://www.internet.ee/domains/ee-zone-file	Estonian Internet Foundation (.ee)
Internetstiftelsen	1.7M	https://zonedata.iis.se	Swedish Internet Foundation
SK-NIC.sk	400K	https://sk-nic.sk/subory/domains.txt	Domain Registry of the Slovak Republic (.sk)
Google TAG IOCs	200	https://blog.google/threat-analysis-group	Google Threat Analysis Group Indicators of Compromise
IPv4 Addresses	4.2B	0.0.0.0 - 255.255.255.255	Exhaustive list of all IPv4 addresses
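
The 4.2B figure for the IPv4 source is simply the size of the 32-bit address space. The following is a minimal sketch using Python's standard `ipaddress` module; the helper name `iter_ipv4` is hypothetical and is not taken from this project.

```python
import ipaddress
from itertools import islice

# The whole IPv4 space, 0.0.0.0 - 255.255.255.255, expressed as one /0 network.
ALL_IPV4 = ipaddress.ip_network("0.0.0.0/0")


def iter_ipv4(start: int = 0, stop: int = 2**32):
    """Lazily yield dotted-quad strings for the integer range [start, stop)."""
    for n in range(start, stop):
        yield str(ipaddress.IPv4Address(n))


print(ALL_IPV4.num_addresses)        # 4294967296
print(list(islice(iter_ipv4(), 3)))  # ['0.0.0.0', '0.0.0.1', '0.0.0.2']
```

Generating addresses lazily avoids holding all 4.2 billion strings in memory at once.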

Safe Browsing API vendors


Google	Yandex
Terms-of-Service	Terms-of-Service

Requirements

System (mandatory)

Linux or macOS
Python 3.10+
Multi-core x86-64 CPU (required for Python Ray support)
RAM: At least 32GB
SSD Storage Space: At least 700GB required to process all URL sources
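
Some of the requirements above can be checked before starting a long run. This is a rough pre-flight sketch, not part of the project: the function name `check_requirements` is hypothetical, the 700GB threshold only applies when processing all URL sources, and RAM is not checked because the standard library has no portable way to query it.

```python
import os
import shutil
import sys

MIN_PYTHON = (3, 10)
MIN_FREE_BYTES = 700 * 10**9  # 700GB, needed only to process all URL sources


def check_requirements(path: str = ".") -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python 3.10+ is required")
    free = shutil.disk_usage(path).free
    if free < MIN_FREE_BYTES:
        problems.append(f"only {free / 10**9:.0f}GB free at {path!r}; 700GB needed for all sources")
    if (os.cpu_count() or 1) < 2:
        problems.append("a multi-core CPU is required")
    return problems


for problem in check_requirements("."):
    print("WARNING:", problem)
```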

Safe Browsing API Access (mandatory)

Choose at least one vendor (Google or Yandex) and obtain an API key for it.

URL feed access (optional)

ICANN Zone Files: Sign up for an ICANN CZDS account
Once registered, turn off email notifications in the user settings (otherwise you will receive hundreds of acknowledgement emails), then select Create New Request on the Dashboard to request zone file access.

Uploading blocklists to GitHub (optional)

Create a GitHub API Personal Access Token

Download limits

ICANN CZDS (Centralized Zone Data Service): Once every 24 hours per zone file
Switch.ch: Once every 24 hours per zone file
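
One way to stay within a once-every-24-hours limit is to check the age of the previously downloaded file before fetching again. The sketch below assumes each zone file is saved to a local path and that its modification time reflects the last download; the function name `may_download` is hypothetical and does not describe how this project enforces the limit.

```python
import time
from pathlib import Path
from typing import Optional

DOWNLOAD_INTERVAL_SECONDS = 24 * 60 * 60  # once every 24 hours per zone file


def may_download(zone_file: Path, now: Optional[float] = None) -> bool:
    """Return True if zone_file is absent or was last modified at least 24 hours ago."""
    if not zone_file.exists():
        return True
    if now is None:
        now = time.time()
    return now - zone_file.stat().st_mtime >= DOWNLOAD_INTERVAL_SECONDS
```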

Setup instructions

git clone and cd into the project directory

Declare environment variables

cp .env-dev .env

In .env, fill in the following variables

# Mandatory: At least one of the following Safe Browsing API keys
GOOGLE_API_KEY=
YANDEX_API_KEY=

# Optional: ICANN zone file access
ICANN_ACCOUNT_USERNAME=
ICANN_ACCOUNT_PASSWORD=
# Some registrars will not accept your request reason unless you include your Name, Email, IP Address, Physical Address (Building, Street, Postcode etc.), and Phone Number
ICANN_REQUEST_REASON='Detection of potentially malicious domains for cybersecurity research. Name: _ Email: _ IP Address: _ Physical Address: _ Phone Number: _'

# Optional: Upload generated blocklists to your GitHub repository
GITHUB_ACCESS_TOKEN=
BLOCKLIST_REPOSITORY_NAME=
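
Since at least one Safe Browsing API key is mandatory, it can help to verify the `.env` file before running anything. This is a minimal sketch with a hand-written parser for simple `KEY=VALUE` lines; the function names `parse_env` and `configured_vendors` are hypothetical, and the project may load its environment differently.

```python
API_KEY_NAMES = ("GOOGLE_API_KEY", "YANDEX_API_KEY")


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding single or double quotes, as used for ICANN_REQUEST_REASON.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def configured_vendors(env: dict) -> list:
    """Return the vendors ('google', 'yandex') whose API key is non-empty."""
    return [name.split("_")[0].lower() for name in API_KEY_NAMES if env.get(name)]


sample = "GOOGLE_API_KEY=abc123\nYANDEX_API_KEY=\n"
print(configured_vendors(parse_env(sample)))  # ['google']
```

An empty result means neither key is set and the mandatory requirement is not met.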

Install dependencies

Use of a virtual environment is strongly recommended. On systems that adopt PEP 668, pip refuses to install packages into the system-managed Python environment.

python3 -m venv venv
venv/bin/python3 -m pip install --upgrade pip
venv/bin/python3 -m pip install -r requirements.txt

Download Domains Project URLs (optional)

# Dataset size ~49GB
cd ../
git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install # you will need to install Git LFS first (https://git-lfs.github.com)

Edit unpack.sh and remove combine from the last line, then run:

./unpack.sh

Getting Started

Download Google Safe Browsing API hashes

⚠️ As of 4 August 2023, the following command will make around 9000 calls (exact number depends on number of hashes in Google's dataset) to Google Safe Browsing API. As the daily limit is 10,000 calls, --update-hashes should be run no more than once every 24 hours.

venv/bin/python3 main.py --update-hashes --vendors google
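
The once-per-day advice follows from simple quota arithmetic. The sketch below uses the approximate figures quoted above (about 9000 calls per update, 10,000 calls per day); both numbers are estimates as of the stated date, and the function names are hypothetical.

```python
CALLS_PER_UPDATE = 9_000   # approximate, depends on the size of Google's dataset
DAILY_CALL_LIMIT = 10_000  # daily limit quoted above


def full_updates_per_day(calls_per_update: int = CALLS_PER_UPDATE,
                         daily_limit: int = DAILY_CALL_LIMIT) -> int:
    """Number of complete --update-hashes runs that fit in one day's quota."""
    return daily_limit // calls_per_update


def calls_left_after(updates: int,
                     calls_per_update: int = CALLS_PER_UPDATE,
                     daily_limit: int = DAILY_CALL_LIMIT) -> int:
    """Remaining quota after the given number of runs (negative means over the limit)."""
    return daily_limit - updates * calls_per_update


print(full_updates_per_day())  # 1
print(calls_left_after(1))     # 1000
print(calls_left_after(2))     # -8000
```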

Download and Identify malicious URLs from Tranco TOP1M

✔️ Add Tranco TOP1M URLs to database
✔️ Identify malicious URLs from database using Safe Browsing API hashes, and generate a blocklist
✔️ Update database with latest malicious URL statuses
📝 Sources: Tranco TOP1M
🛡️ Vendors: Google

venv/bin/python3 main.py --fetch-urls --identify-malicious-urls --sources top1m --vendors google
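
For context, the Google Safe Browsing v4 Update API distributes SHA-256 hash prefixes (4 to 32 bytes long) rather than plain URLs, so a client hashes each URL expression locally and looks for a matching prefix. The sketch below is a simplified illustration of that idea, not this project's implementation: it skips URL canonicalization and host-suffix/path-prefix expansion, and the function names are hypothetical.

```python
import hashlib

PREFIX_LENGTH = 4  # bytes; Safe Browsing v4 prefixes are 4 to 32 bytes long


def url_hash_prefix(expression: str, length: int = PREFIX_LENGTH) -> bytes:
    """SHA-256 the URL expression and keep the first `length` bytes."""
    return hashlib.sha256(expression.encode("utf-8")).digest()[:length]


def candidate_matches(hostnames, known_prefixes):
    """Return hostnames whose 'hostname/' expression matches a known hash prefix.

    A prefix match is only a candidate: short prefixes can collide, so a real
    client confirms candidates against the vendor's full-length hashes.
    """
    return [h for h in hostnames if url_hash_prefix(f"{h}/") in known_prefixes]


# Toy prefix set standing in for the hashes fetched by --update-hashes.
known = {url_hash_prefix("bad.example/")}
print(candidate_matches(["good.example", "bad.example"], known))
```

Storing prefixes in a set keeps each lookup constant-time, which matters when the database holds hundreds of millions of URLs.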

Other Examples

Download DomCop TOP10M URLs

✔️ Add DomCop TOP10M URLs to database (no blocklist will be generated)
📝 Sources: DomCop TOP10M
🛡️ Vendors: Not Applicable

venv/bin/python3 main.py --fetch-urls --sources top10m

Download and Identify malicious URLs from all sources

⚠️ Requires at least 700GB free space.

ℹ️ If you have not downloaded any Safe Browsing API hashes yet, add the --update-hashes flag to the following command.

✔️ Add URLs from all sources to database
✔️ Identify malicious URLs from database using Safe Browsing API hashes, and generate a blocklist
✔️ Update database with latest malicious URL statuses
📝 Sources: Everything
🛡️ Vendors: Google

venv/bin/python3 main.py --fetch-urls --identify-malicious-urls --vendors google

Retrieve URLs marked as malicious in past scans from the database

✔️ Retrieve URLs with malicious statuses (obtained from past scans) from database, and generate a blocklist
📝 Sources: DomCop TOP10M, Domains Project
🛡️ Vendors: Google

venv/bin/python3 main.py --retrieve-known-malicious-urls --sources top10m domainsproject --vendors google

Display help message

venv/bin/python3 main.py --help
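
To show how the flags used in the commands above combine, here is a hypothetical `argparse` reconstruction. Only the flag names come from this README; the parser itself, its defaults, and the choice of `None` to mean "everything" when `--sources` is omitted are assumptions, and `main.py` may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags that appear in this README."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--fetch-urls", action="store_true")
    parser.add_argument("--identify-malicious-urls", action="store_true")
    parser.add_argument("--retrieve-known-malicious-urls", action="store_true")
    parser.add_argument("--update-hashes", action="store_true")
    # One or more values, e.g. "--sources top10m domainsproject".
    parser.add_argument("--sources", nargs="+", default=None)
    parser.add_argument("--vendors", nargs="+", default=None)
    return parser


args = build_parser().parse_args(
    ["--retrieve-known-malicious-urls", "--sources", "top10m", "domainsproject",
     "--vendors", "google"]
)
print(args.sources)  # ['top10m', 'domainsproject']
print(args.vendors)  # ['google']
```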

Known Issues

The Yandex Safe Browsing Update API appears to be out of service. Yandex technical support has been notified.

Disclaimer

This project is not sponsored by, endorsed by, or otherwise affiliated with Google and/or Yandex.
Google works to provide the most accurate and up-to-date information about unsafe web resources. However, Google cannot guarantee that its information is comprehensive and error-free: some risky sites may not be identified, and some safe sites may be identified in error.
A malicious verdict from the Safe Browsing API is usually valid for only about 5 minutes. Because the blocklists are updated only once every 24 hours, they must not be used to display user warnings.

More information on Google Safe Browsing API usage limits: https://developers.google.com/safe-browsing/v4/usage-limits
