
🤖 A curated list of websites that restrict access to AI Agents, AI crawlers and GPTs


The Great GPT Firewall 📛

This collection is a curated list of websites that employ the robots.txt file to restrict access to AI Agents, AI crawlers and GPTs.

It will be updated monthly.

We need a plan!

User agents & robots.txt

The robots.txt file lets website owners control which user agents (crawlers) may access which areas of their website by specifying rules and directives. The following rules disallow the known AI user agents site-wide:

# OpenAI's web crawler: GPT3.5, GPT4, ChatGPT
# https://platform.openai.com/docs/bots
User-agent: GPTBot

# ChatGPT plugins
# https://platform.openai.com/docs/bots
User-agent: ChatGPT-User

# OpenAI Search bot
# https://platform.openai.com/docs/bots
User-agent: OAI-SearchBot

# Google's web crawler: Bard, VertexAI, Gemini
# https://blog.google/technology/ai/an-update-on-web-publisher-controls/
User-agent: Google-Extended

# Apple's web crawler, dedicated to GenAI projects
# https://support.apple.com/en-us/119829
User-agent: Applebot-Extended

# Claude
User-agent: anthropic-ai

# Claude Bot
User-agent: ClaudeBot

# Claude web
User-agent: Claude-Web

# Cohere
User-agent: Cohere-ai

# Perplexity
User-agent: PerplexityBot

# Common Crawl
# https://commoncrawl.org/ccbot
User-agent: CCBot

# Omgilibot: webz.io
# https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
User-agent: Omgilibot
User-agent: Omgili
User-agent: Webzio-Extended

# Facebook: Llama
# https://developers.facebook.com/docs/sharing/bot/
User-agent: FacebookBot

# ByteDance: Doubao
User-agent: Bytespider

# Censorship area
Disallow: /
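
As a rough illustration of how such rules can be evaluated, the sketch below uses Python's standard `urllib.robotparser` to test which AI user agents a robots.txt body disallows. The sample robots.txt content, the shortened agent list and the `blocked_agents` helper are assumptions made for this example; it is not necessarily how `scrape.py` works.

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative subset of the user agents listed above.
AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "Google-Extended", "CCBot"]

# Hypothetical robots.txt body: two AI crawlers are blocked site-wide,
# every other user agent is allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents=AI_USER_AGENTS, path="/"):
    """Return the agents that may not fetch `path` according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT))
```

Parsing a string keeps the example offline. Against a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the file instead.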

Disclaimer

Please note that this blocklist is intended for informational purposes only. Despite the provocative project name, it is perfectly fine to disallow web crawling in order to protect content ownership.

2024-05 update

Category: Press

  • Scanned: 66
  • ✅ Passing: 41 %
  • 🔐 Blocked: 59 %
  • ❓ Unknown: 0 %
Name Country Status
The Times 🇬🇧 🔐
BBC 🇬🇧 🔐
The Guardian 🇬🇧 🔐
The Economist 🇬🇧 🔐
Financial Times 🇬🇧 🔐
The Independent 🇬🇧 ✅
The Telegraph 🇬🇧 🔐
Daily Mail 🇬🇧 🔐
The Sun 🇬🇧 🔐
Daily Mirror 🇬🇧 ✅
Daily Express 🇬🇧 ✅
Washington Post 🇺🇸 🔐
USA Today 🇺🇸 ✅
Fox News 🇺🇸 ✅
ABC News 🇺🇸 🔐
NBC News 🇺🇸 🔐
CBS News 🇺🇸 🔐
Los Angeles Times 🇺🇸 🔐
Chicago Tribune 🇺🇸 ✅
New York Post 🇺🇸 🔐
New York Daily News 🇺🇸 ✅
The New Yorker 🇺🇸 🔐
Vice 🇺🇸 ✅
New York Times 🇺🇸 🔐
Wall Street Journal 🇺🇸 🔐
CNN 🇺🇸 🔐
El País 🇪🇸 ✅
Süddeutsche Zeitung 🇩🇪 🔐
Der Spiegel 🇩🇪 🔐
Corriere della Sera 🇮🇹 🔐
La Repubblica 🇮🇹 🔐
Le Monde 🇫🇷 🔐
Libération 🇫🇷 🔐
Le Figaro 🇫🇷 🔐
20 Minutes 🇫🇷 🔐
Ouest France 🇫🇷 🔐
Le Parisien 🇫🇷 🔐
L'Equipe 🇫🇷 🔐
Le Point 🇫🇷 🔐
Marianne 🇫🇷 🔐
Le Nouvel Observateur 🇫🇷 🔐
L'Express 🇫🇷 🔐
France 24 🇫🇷 🔐
BFMTV 🇫🇷 🔐
CNews 🇫🇷 ✅
Le Monde Diplomatique 🇫🇷 ✅
Mediapart 🇫🇷 🔐
Courrier International 🇫🇷 🔐
Brut 🇫🇷 ✅
IMDB 🌍 ✅
Allocine 🇫🇷 ✅
Fakt 🇵🇱 ✅
Super Express 🇵🇱 ✅
Gazeta Wyborcza 🇵🇱 🔐
Rzeczpospolita 🇵🇱 ✅
Dziennik Gazeta Prawna 🇵🇱 ✅
Polityka 🇵🇱 ✅
Newsweek Polska 🇵🇱 ✅
Gość Niedzielny 🇵🇱 ✅
Sieci 🇵🇱 ✅
Do Rzeczy 🇵🇱 ✅
Twój Styl 🇵🇱 ✅
Zwierciadło 🇵🇱 ✅
Wysokie Obcasy Extra 🇵🇱 🔐
Pani 🇵🇱 ✅
Elle 🇵🇱 ✅

Category: Video on demand

  • Scanned: 9
  • ✅ Passing: 56 %
  • 🔐 Blocked: 44 %
  • ❓ Unknown: 0 %
Name Country Status
Prime Video 🌍 ✅
Netflix 🌍 ✅
Disney+ 🌍 🔐
Hulu 🇺🇸 🔐
HBO Max 🇺🇸 ✅
Canal+ 🇫🇷 🔐
FranceTV 🇫🇷 ✅
TF1 🇫🇷 🔐
6Play 🇫🇷 ✅

Category: Music

  • Scanned: 6
  • ✅ Passing: 67 %
  • 🔐 Blocked: 33 %
  • ❓ Unknown: 0 %
Name Country Status
Soundcloud 🌍 🔐
Youtube 🌍 ✅
Apple Music 🌍 ✅
Spotify 🌍 🔐
Deezer 🇫🇷 ✅
LastFM 🇬🇧 ✅

Category: Podcast

  • Scanned: 8
  • ✅ Passing: 75 %
  • 🔐 Blocked: 25 %
  • ❓ Unknown: 0 %
Name Country Status
Google Podcasts 🌍 ✅
Apple Podcast 🌍 ✅
Spotify Podcaster 🌍 🔐
Buzzsprout 🌍 ✅
Podbean 🌍 ✅
Acast 🇬🇧 ✅
AudioMeans 🇫🇷 ✅
Radio France 🇫🇷 🔐

Category: X

  • Scanned: 6
  • ✅ Passing: 67 %
  • 🔐 Blocked: 33 %
  • ❓ Unknown: 0 %
Name Country Status
PornHub 🌍 🔐
YouPorn 🌍 🔐
Xnxx 🌍 ✅
Xvideos 🌍 ✅
Xhamster 🌍 ✅
OnlyFans 🌍 ✅

Category: Religion

  • Scanned: 5
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
Bible 🇺🇸 ✅
Bible gateway 🇺🇸 ✅
Jehovah's Witnesses 🇺🇸 ✅
Vatican 🇻🇦 ✅
Islamweb 🌍 ✅

Category: Social media

  • Scanned: 13
  • ✅ Passing: 31 %
  • 🔐 Blocked: 62 %
  • ❓ Unknown: 8 %
Name Country Status
Facebook 🌍 🔐
Instagram 🌍 🔐
Reddit 🌍 ✅
Hacker News 🌍 ❓
Lobsters 🌍 🔐
Pinterest 🌍 🔐
TikTok 🌍 ✅
Twitter 🌍 🔐
LinkedIn 🌍 ✅
Quora 🌍 🔐
VK 🇷🇺 ✅
TripAdvisor 🌍 🔐
Yelp 🌍 🔐
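
The percentages in each category header can be reproduced from the status column. The sketch below is only an assumption about the arithmetic (each share rounded to the nearest integer, which is why the three values may sum to 101 %). The `summarize` helper and the status labels are hypothetical names used for illustration.

```python
from collections import Counter

STATUSES = ("passing", "blocked", "unknown")


def summarize(statuses):
    """Return the rounded percentage of each status in a list of statuses."""
    total = len(statuses)
    if total == 0:
        return {name: 0 for name in STATUSES}
    counts = Counter(statuses)
    return {name: round(100 * counts[name] / total) for name in STATUSES}


if __name__ == "__main__":
    # Social media table above: 4 passing, 8 blocked, 1 unknown out of 13.
    social_media = ["passing"] * 4 + ["blocked"] * 8 + ["unknown"]
    print(summarize(social_media))
```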

Category: Artist

  • Scanned: 42
  • ✅ Passing: 71 %
  • 🔐 Blocked: 19 %
  • ❓ Unknown: 10 %
Name Country Status
Michael Jackson 🇺🇸 ✅
Madonna 🇺🇸 ✅
Taylor Swift 🇺🇸 🔐
Rihanna 🇺🇸 ✅
Bruno Mars 🇺🇸 ✅
Justin Bieber 🇺🇸 🔐
Beyoncé 🇺🇸 ✅
Katy Perry 🇺🇸 🔐
Lady Gaga 🇺🇸 🔐
Hardwell 🇺🇸 ✅
Dimitri Vegas & Like Mike 🇺🇸 ✅
Kanye West 🇺🇸 ❓
Black Eyed Peas 🇺🇸 ❓
Imagine Dragons 🇺🇸 ✅
Twenty One Pilots 🇺🇸 ✅
Maroon 5 🇺🇸 🔐
Selena Gomez 🇺🇸 🔐
Usher 🇺🇸 🔐
Stromae 🇧🇪 ❓
Aya Nakamura 🇫🇷 ❓
Soprano 🇫🇷 ✅
Johnny Hallyday 🇫🇷 ✅
Grand Corps Malade 🇫🇷 ✅
Zaho 🇫🇷 ✅
Jean Louis Aubert 🇫🇷 ✅
Camelia Jordana 🇫🇷 ✅
Indochine 🇫🇷 ✅
Tryo 🇫🇷 ✅
David Guetta 🇫🇷 ✅
Mc Solaar 🇫🇷 ✅
Zaz 🇫🇷 ✅
Christine and the Queens 🇫🇷 ✅
Boulevard des Airs 🇫🇷 ✅
Calogero 🇫🇷 ✅
Hoshi 🇫🇷 ✅
Avicii 🇸🇪 ✅
Adele 🇬🇧 ✅
Calvin Harris 🇬🇧 ✅
Ed Sheeran 🇬🇧 ✅
Arctic Monkeys 🇬🇧 ✅
Coldplay 🇬🇧 ✅
The Weeknd 🇨🇦 🔐

Category: Gov

  • Scanned: 3
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
White House 🇺🇸 ✅
Elysée 🇫🇷 ✅
Europe 🇪🇺 ✅

Category: Science

  • Scanned: 28
  • ✅ Passing: 89 %
  • 🔐 Blocked: 11 %
  • ❓ Unknown: 0 %
Name Country Status
Google Scholar 🌍 ✅
Sci-Hub 🌍 ✅
PubPeer 🌍 ✅
Scopus 🇳🇱 🔐
Elsevier 🇳🇱 🔐
ScienceDirect 🇳🇱 ✅
MDPI 🇨🇭 ✅
Springer 🇩🇪 ✅
Wiley 🇺🇸 ✅
American Chemical Society 🇺🇸 ✅
PubMed 🇺🇸 ✅
Academia 🇺🇸 ✅
Science 🇺🇸 🔐
ArXiv 🇺🇸 ✅
American Physical Society 🇺🇸 ✅
Mendeley 🇬🇧 ✅
Nature 🇬🇧 ✅
Taylor & Francis 🇬🇧 ✅
Oxford University Press 🇬🇧 ✅
Cambridge University Press 🇬🇧 ✅
Royal Society of Chemistry 🇬🇧 ✅
ResearchGate 🇩🇪 ✅
BNF 🇫🇷 ✅
Cairn 🇫🇷 ✅
Persee 🇫🇷 ✅
Gallica 🇫🇷 ✅
HAL 🇫🇷 ✅
OpenEdition 🇫🇷 ✅

Category: Dev

  • Scanned: 3
  • ✅ Passing: 67 %
  • 🔐 Blocked: 33 %
  • ❓ Unknown: 0 %
Name Country Status
Github 🌍 ✅
Gitlab 🌍 ✅
Stack Overflow 🌍 🔐

Category: Other content

  • Scanned: 19
  • ✅ Passing: 74 %
  • 🔐 Blocked: 26 %
  • ❓ Unknown: 0 %
Name Country Status
Wikipedia 🌍 ✅
Medium 🌍 🔐
Substack 🌍 ✅
Common Crawl 🌍 ✅
Internet Archive 🌍 ✅
Wayback Machine 🌍 ✅
Notion 🌍 ✅
Weather 🇺🇸 🔐
AccuWeather 🇺🇸 ✅
Météo France 🇫🇷 ✅
Getty Images 🇺🇸 ✅
Shutterstock 🇺🇸 🔐
Adobe Stock 🇺🇸 🔐
Unsplash 🇨🇦 🔐
Pexels 🇩🇪 ✅
Pixabay 🇩🇪 ✅
Flickr 🇺🇸 ✅
500px 🇨🇦 ✅
Giphy 🇺🇸 ✅

Category: Other

  • Scanned: 1
  • ✅ Passing: 100 %
  • 🔐 Blocked: 0 %
  • ❓ Unknown: 0 %
Name Country Status
Indeed 🇺🇸 ✅

WTF list

A.k.a: do they understand their business model? 💸

Name Status
Getty Images ✅
Pexels ✅
500px ✅

Shame list

A.k.a: this is public interest. 🖕

Name Status
Medium 🔐
Quora 🔐
Elsevier 🔐
Scopus 🔐
Science 🔐

๐Ÿค Contributing

Looking for contributions:

  • Enrich website database
  • Chinese websites
  • New categories

Please open issues!

Don't hesitate ;)

Build

python -m venv venv
source ./venv/bin/activate
pip3 install -r requirements.txt
python3 scrape.py
# then copy the last version into readme

👤 Contributors

Contributors

💫 Show your support

Give a ⭐️ if this project helped you!

GitHub Sponsors

๐Ÿ“ License

Copyright ยฉ 2024 Samuel Berthe.

This project is MIT licensed.