Domains Project: Processing petabytes of data so you don't have to
This public dataset contains a freely available, sorted list of Internet domains.
You can support this project by doing any combination of the following:
- Posting a link to the Domains Project on your website
- Sponsoring this project on Patreon
- Opening an issue and attaching other domain datasets that are not here yet (be sure to scroll through this README first)
Dataset milestones (domain count):
- 10 Million
- 20 Million
- 30 Million
- 50 Million
- 70 Million
- 100 Million
- 150 Million
- 200 Million
- 250 Million
- 300 Million
- 500 Million
- 750 Million
- 1 Billion
- 1.2 Billion
- 1.5 Billion
- 1.7 Billion

Crawler traffic milestones:
- 500TB
- 925TB
- 1PB
- 1.3PB
Random facts:
- More than 1TB of Internet traffic yields just 3 Mbytes of compressed domain data
- 1 million domains take up just 5 Mbytes compressed (see the compression sketch after this list)
- More than 1.3PB of Internet traffic is necessary to crawl 342 million domains (roughly 3.4TB per 1 million domains)
- Only 2.3 Gbytes of disk space is required to store 342 million domains in compressed form
- A fully saturated 1Gbit link is good for about 2 million new domains every day
- An 8-core/16-thread machine with 64 Gbytes of RAM is good for about 2 million new domains every day
- Two ISC BIND 9 instances (>400 Mbytes RSS each) are required to get 2 million new domains every day
- After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ tool to unpack them.
- After reaching 30 million records, the files were moved to /data so the repository doesn't have its README at the very bottom.
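To get a feel for those compression figures, here is a minimal Python sketch. It assumes you already have an unpacked list; the path below points at the sample file shown later in this README, and any one-domain-per-line text file works:

import lzma
import os

# Path to an unpacked domain list (one domain per line); adjust as needed.
PATH = "data/afghanistan/domain2multi-af.txt"

raw_size = os.path.getsize(PATH)

# Compress in memory with LZMA, the algorithm behind the repository's xz files.
with open(PATH, "rb") as f:
    compressed = lzma.compress(f.read())

print(f"raw:        {raw_size / 1024 / 1024:.1f} Mbytes")
print(f"compressed: {len(compressed) / 1024 / 1024:.1f} Mbytes")
print(f"ratio:      {raw_size / len(compressed):.1f}x")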
This repository employs Git LFS technology, therefore the user has to use both git lfs and xz to retrieve the data. The cloning procedure is as follows:
git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install   # set up Git LFS hooks so the large data files can be retrieved
./unpack.sh       # decompress the xz archives into plain-text domain lists
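The unpack.sh script above presumably handles the whole tree; if you only need a single file, plain xz -d (or Python's lzma module) is enough. A minimal sketch, with a hypothetical file name:

import lzma
import shutil

# Hypothetical compressed file inside the repository; adjust to a real one.
SRC = "data/afghanistan/domain2multi-af.txt.xz"
DST = SRC[:-len(".xz")]

# Stream-decompress so the whole file never has to fit in memory.
with lzma.open(SRC, "rb") as fin, open(DST, "wb") as fout:
    shutil.copyfileobj(fin, fout)

print("unpacked", DST)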
Raw data may be available at https://dataset.domainsproject.org, though using the Github repo is recommended. To mirror the raw dataset with wget:
wget -m https://dataset.domainsproject.org
After unpacking, the domain lists are just text files (~8.2 Gbytes at 342 million domains) with one domain per line. A sample from data/afghanistan/domain2multi-af.txt (a short parsing sketch follows the sample):
1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
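Since each file is just one domain per line, consuming the data needs nothing special. A minimal Python sketch using the sample file above (the .gov.af filter is only an example):

# Load one country file and apply a simple filter.
domains = []
with open("data/afghanistan/domain2multi-af.txt", encoding="utf-8") as f:
    for line in f:
        domain = line.strip()
        if domain:
            domains.append(domain)

print(f"{len(domains)} domains loaded")

# Example: keep only Afghan government domains.
gov = [d for d in domains if d.endswith(".gov.af")]
print(f"{len(gov)} .gov.af domains")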
Domains Project uses a crawler and DNS checks to get new domains.
The DNS checks client, called Freya, is in its early stages and is used by a select few; I'm working on making it stable and good enough for the general public.
The HTTP crawler is being rewritten as well; it is called Idun.
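Freya itself is not published here, but the idea behind a DNS check is simple: a candidate name is only interesting if it actually resolves. A rough sketch of that idea using only the Python standard library (not the actual Freya client; the candidate names are just examples):

import socket

def resolves(domain: str) -> bool:
    """Return True if the name has at least one A/AAAA record."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False

# Example candidates; real checks would run over large batches.
for name in ("1tv.af", "this-name-should-not-exist-123456.af"):
    print(name, "resolves" if resolves(name) else "does not resolve")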
The typical user agent for the Domains Project bot looks like this:
Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)
Some older versions have the user agent set to the Github repo:
Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)
All data in this dataset is gathered using the Scrapy and Colly frameworks.
Starting with version 1.0.7, the crawler has partial robots.txt support and rate limiting. Please open an issue if you experience any problems, and don't forget to include your domain.
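If you run a site and want a quick approximation of how your robots.txt rules apply to this bot, Python's standard urllib.robotparser can help. This is only a sketch: the site URL is a placeholder, and the crawler's own matching of the user-agent token may differ.

from urllib.robotparser import RobotFileParser

SITE = "https://example.com"     # placeholder: replace with your own site
USER_AGENT = "Domains Project"   # token checked against robots.txt rules

rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # fetch and parse the live robots.txt

print("can fetch /:", rp.can_fetch(USER_AGENT, SITE + "/"))
print("crawl delay:", rp.crawl_delay(USER_AGENT))  # None if not set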
Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231
Additional data sources:
- List of .FR domains from AfNIC.fr
- bigdatanews extract from Common Crawl (circa 2012)
- Common Crawl - March/April 2020
- The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019
This dataset can be used for research. There are papers covering related topics; links are left here for reference:
- Phishing Protection SPF, DKIM, DMARC
- The Internet of Names: A DNS Big Dataset
- Enabling Network Security Through Active DNS Datasets
- Analysis of the Internet Domain Names Re-registration Market
- Detection of malicious domains through lexical analysis
- Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification