The goal is to pick a domain and collect all relevant words from it.
One way to do this is to scrape all the text from heading and paragraph tags.
We can do that with Python and Beautiful Soup.
Done synchronously, this could take an hour or more, depending on the size of the domain.
For that reason we can do it with Python's asyncio package instead.
This way the bottleneck becomes the internet speed and the maximum number of open files.
On a Linux system the default limit is usually 1024 open files per process.
For the task queue and the in-memory set we can use Redis together with the asynchronous aioredis package.
Check out the implementation in scraper.py.
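
The approach described above can be sketched roughly as follows. This is not the code in scraper.py, only a minimal illustration under some assumptions: it uses the standard library's html.parser in place of Beautiful Soup, it takes a caller-supplied `fetch` coroutine in place of a real HTTP client, and it leaves out the Redis queue entirely. The names `extract_words`, `max_concurrency` and `scrape` are hypothetical.

```python
import asyncio
import resource  # Unix-only; used to read the open-file limit
from html.parser import HTMLParser

TEXT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p"}


class HeadingParagraphText(HTMLParser):
    """Collect text found inside heading and paragraph tags.

    Assumes reasonably well-formed markup (every <p> has a closing tag).
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of TEXT_TAGS we are currently inside
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_words(html):
    """Return the whitespace-separated words from headings and paragraphs."""
    parser = HeadingParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


def max_concurrency(reserve=64, fallback=1024):
    """Derive a concurrency cap from the soft open-file limit.

    `reserve` (an arbitrary choice) leaves room for stdin/stdout, log files, etc.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return fallback
    return max(1, soft - reserve)


async def scrape(urls, fetch, limit=None):
    """Fetch all urls concurrently, never running more than `limit` fetches at once."""
    urls = list(urls)
    semaphore = asyncio.Semaphore(limit or max_concurrency())

    async def worker(url):
        async with semaphore:  # bounds the number of open connections
            html = await fetch(url)
        return extract_words(html)

    results = await asyncio.gather(*(worker(url) for url in urls))
    return dict(zip(urls, results))
```

To use it, pass any coroutine that maps a URL to its HTML, for example a thin wrapper around an asynchronous HTTP client session. The semaphore is what keeps the number of simultaneously open sockets below the file-descriptor limit mentioned above.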
creating a mask
black & white: Create it by fitting the logo to the center of the image with GIMP, choosing "Alpha to Selection" on the layer and growing the selection by a couple of pixels.
colourful: Create it by fitting the logo to the center of the image with GIMP and then dilating it with ImageMagick.
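
Both recipes grow the logo outline by a few pixels, which is a morphological dilation. The following pure-Python sketch only illustrates what that operation does to a binary mask. It is not how GIMP or ImageMagick implement it (their kernels may have a different shape), and the function name `dilate` and the nested-list mask format are assumptions made for the example.

```python
def dilate(mask, radius=1):
    """Grow the True region of a 2-D boolean mask by `radius` pixels.

    Uses a square neighbourhood: every True pixel switches on all pixels
    within `radius` rows and columns of it.
    """
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            # Clamp the neighbourhood to the image borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = True
    return out


# A 5x5 mask with a single "logo" pixel in the middle.
mask = [[False] * 5 for _ in range(5)]
mask[2][2] = True
grown = dilate(mask, radius=1)
```

After the call, `grown` holds a 3x3 block of True values around the center, which leaves a small margin between the logo and the words placed around it.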
colourize
The simplest way to colour the word cloud is to use one of the predefined matplotlib colormaps.
A step up is to use the colourful mask we created.
The best option is to colour the text transparent and edit the rest in GIMP.