Twenkid/Vsy-Jack-Of-All-Trades-AGI-Bulgarian-Internet-Archive-And-Search-Engine

Initial Crawler Experiments. Data Formats. Technologies for Web Archiving & Serialization: Wget, Nutch, Common Crawl


  1. Crawler Tasks: Wget, Nutch, Common Crawl, Heritrix (archive.org), (...)

    1. Wget
    2. Nutch
    3. Common Crawl - instead of crawling ourselves, learn how to access the already-crawled data: https://commoncrawl.org/the-data/get-started/
  2. Data formats: WARC; ...

Crawler tasks

  1. Write a basic Python script that calls wget and crawls a list of online Bulgarian media. Extend the list with a few sample international media. Choose a sample crawl depth, etc.
  2. Have wget mirror the files to disk both as plain HTML and as WARC.
  3. Use BeautifulSoup or similar to extract plain-text versions of the HTML files and map them to their sources.
  4. Create a tool with a simple GUI for searching the files with a string or a regex.
  5. Store the files in an SQLite DB and allow querying with SQL.
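Tasks 1-2 could be sketched roughly as below. The seed list, output directory, and depth are placeholder assumptions, not a tested configuration; only standard wget options are used.

```python
import subprocess
from pathlib import Path

# Placeholder seed list -- an assumption, to be replaced with the real media list.
SEEDS = ["https://example.com/", "https://example.org/"]
DEPTH = 2                      # sample recursion depth
OUT = Path("archive")

def crawl_command(url: str) -> list[str]:
    """Build a wget command that mirrors `url` to disk as HTML files
    and simultaneously packs the responses into a WARC (wget >= 1.14)."""
    name = url.split("//", 1)[-1].strip("/").replace("/", "_")
    return [
        "wget",
        "--recursive", f"--level={DEPTH}",
        "--no-parent",                    # stay under the seed URL
        "--adjust-extension",             # save pages with an .html extension
        f"--warc-file={OUT / name}",      # also write <name>.warc.gz
        f"--directory-prefix={OUT}",
        url,
    ]

if __name__ == "__main__":
    OUT.mkdir(exist_ok=True)
    for seed in SEEDS:
        subprocess.run(crawl_command(seed), check=False)
```

Running one wget process per seed keeps each site's mirror and WARC separate, which simplifies later per-site processing.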

A more sophisticated future analysis should recognize titles, headings, and content blocks, and perform image recognition and classification on the images, illustrations, etc.
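Tasks 3-5, plus a crude first pass at the title/heading recognition just described, could look like this sketch. The table schema and field names are assumptions for illustration:

```python
import sqlite3
from bs4 import BeautifulSoup

def extract(html: str) -> dict:
    """Pull out the title, headings, and plain text of one HTML page --
    a crude first approximation of the structure recognition."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "headings": " | ".join(h.get_text(strip=True)
                               for h in soup.find_all(["h1", "h2", "h3"])),
        "body": soup.get_text(" ", strip=True),
    }

def store(db: sqlite3.Connection, url: str, html: str) -> None:
    """Store the extracted fields so they can be queried with plain SQL
    (task 5); the schema here is an assumption."""
    doc = extract(html)
    db.execute("CREATE TABLE IF NOT EXISTS pages"
               " (url TEXT PRIMARY KEY, title TEXT, headings TEXT, body TEXT)")
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
               (url, doc["title"], doc["headings"], doc["body"]))

# Example query -- every page whose text mentions a search string:
# rows = db.execute("SELECT url, title FROM pages WHERE body LIKE ?",
#                   ("%новини%",)).fetchall()
```

A GUI (task 4) would only need to wrap such a `LIKE` (or regex, via a registered SQLite function) query around a text box.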

  • Crawler: Heritrix (Java) - created by the Internet Archive (archive.org)
    arcreader IA-2006062.arc...

    WARC (and its predecessor ARC) - a collection of web resources packed into a single file to reduce the overhead of handling many small files. The sample below, from Wikipedia, is in the older ARC format:

filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length

http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
HTTP/1.1 200 OK
Date: Thu, 22 Jun 2006 19:01:15 GMT
Server: Apache
Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
Content-Length: 30
Content-Type: text/html

<html>
Hello World!!!
</html>
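The version-1 record header line in the sample above ("URL IP-address Archive-date Content-type Archive-length") can be split into fields with a few lines of Python. This only illustrates the layout of that one line, not a full ARC reader:

```python
from datetime import datetime
from typing import NamedTuple

class ArcHeader(NamedTuple):
    url: str
    ip: str
    date: datetime      # archive date, YYYYMMDDhhmmss
    content_type: str
    length: int         # archived record length in bytes

def parse_arc_header(line: str) -> ArcHeader:
    """Parse one version-1 ARC record header line:
    URL IP-address Archive-date Content-type Archive-length"""
    url, ip, date, ctype, length = line.split()
    return ArcHeader(url, ip,
                     datetime.strptime(date, "%Y%m%d%H%M%S"),
                     ctype, int(length))
```

For real archives a maintained library (e.g. warcio for WARC files) would be the safer choice than hand parsing.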

Common Crawl

It has been crawling billions of web pages on a regular basis, but working with the complete collection requires a serious big-data environment, as the datasets run to hundreds of TB. However, it is possible to download an index and select only subsets.
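Selecting a subset typically goes through Common Crawl's per-snapshot CDX index API at index.commoncrawl.org. A minimal sketch of building such a query follows; the crawl ID is only an example and newer snapshots supersede it:

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"

def index_query(url_pattern: str, crawl: str = "CC-MAIN-2024-10") -> str:
    """Build a CDX index API query URL for one Common Crawl snapshot.
    Each JSON line in the response locates a capture inside the large
    WARC files, so only the matching records need to be downloaded."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl}-index?{params}"

# e.g. urllib.request.urlopen(index_query("bnt.bg/*")) would list the
# captures of that Bulgarian site in the chosen snapshot.
```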

Some examples collected quickly so far from https://commoncrawl.org/the-data/examples/

https://newsfetch.tech/
"News Extraction
News data from CommonCrawl, parsed and converted to a structured JSON format."
...

Below: not tested yet - some of the years-old links may be obsolete: