Twenkid/Vsy-Jack-Of-All-Trades-AGI-Bulgarian-Internet-Archive-And-Search-Engine

Initial Crawler Experiments. Data Formats. Technologies for Web Archiving & Serialization: Wget, Nutch, Common Crawl


  1. Crawler Tasks: Wget, Nutch, Common Crawl, Heritrix (archive.org), (...)

    1. Wget
    2. Nutch
    3. Common Crawl - instead of crawling ourselves, learn how to access the already-crawled data: https://commoncrawl.org/the-data/get-started/
  2. Data formats: WARC; ...

Crawler tasks

  1. Write a basic Python script that calls wget and crawls a list of online Bulgarian media. Extend the list with a few sample international media. Choose a sample crawl depth, etc.
  2. Have wget mirror the files to disk both as plain HTML and as WARC.
  3. Use BeautifulSoup or similar to extract plain-text versions of the HTML files and map them to their sources.
  4. Create a tool with a simple GUI for searching the files with a string or a regex.
  5. Store the files in an SQLite DB and allow querying with SQL.
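Tasks 1-2 could be sketched roughly as below. The seed list, output directory, and depth are placeholder assumptions, not a tested configuration; only standard wget options are used.

```python
import subprocess
from pathlib import Path

# Placeholder seed list -- an assumption, to be replaced with the real media list.
SEEDS = ["https://example.com/", "https://example.org/"]
DEPTH = 2                      # sample recursion depth
OUT = Path("archive")

def crawl_command(url: str) -> list[str]:
    """Build a wget command that mirrors `url` to disk as HTML files
    and simultaneously packs the responses into a WARC (wget >= 1.14)."""
    name = url.split("//", 1)[-1].strip("/").replace("/", "_")
    return [
        "wget",
        "--recursive", f"--level={DEPTH}",
        "--no-parent",                    # stay under the seed URL
        "--adjust-extension",             # save pages with an .html extension
        f"--warc-file={OUT / name}",      # also write <name>.warc.gz
        f"--directory-prefix={OUT}",
        url,
    ]

if __name__ == "__main__":
    OUT.mkdir(exist_ok=True)
    for seed in SEEDS:
        subprocess.run(crawl_command(seed), check=False)
```

Running one wget process per seed keeps each site's mirror and WARC separate, which simplifies later per-site processing.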

A more sophisticated future analysis should recognize titles, headings, and content blocks, and perform image recognition and classification on the images, illustrations, etc.
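Tasks 3-5, plus a crude first pass at the title/heading recognition just described, could look like this sketch. The table schema and field names are assumptions for illustration:

```python
import sqlite3
from bs4 import BeautifulSoup

def extract(html: str) -> dict:
    """Pull out the title, headings, and plain text of one HTML page --
    a crude first approximation of the structure recognition."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "headings": " | ".join(h.get_text(strip=True)
                               for h in soup.find_all(["h1", "h2", "h3"])),
        "body": soup.get_text(" ", strip=True),
    }

def store(db: sqlite3.Connection, url: str, html: str) -> None:
    """Store the extracted fields so they can be queried with plain SQL
    (task 5); the schema here is an assumption."""
    doc = extract(html)
    db.execute("CREATE TABLE IF NOT EXISTS pages"
               " (url TEXT PRIMARY KEY, title TEXT, headings TEXT, body TEXT)")
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
               (url, doc["title"], doc["headings"], doc["body"]))

# Example query -- every page whose text mentions a search string:
# rows = db.execute("SELECT url, title FROM pages WHERE body LIKE ?",
#                   ("%новини%",)).fetchall()
```

A GUI (task 4) would only need to wrap such a `LIKE` (or regex, via a registered SQLite function) query around a text box.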

  • Crawler: Heritrix (Java) - created by the Internet Archive (archive.org)
    arcreader IA-2006062.arc...

    WARC (and its predecessor ARC) - a collection of web resources packed into a single file to reduce the overhead of handling many small files. The sample below, from Wikipedia, is in the older ARC format:

filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length

http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
HTTP/1.1 200 OK
Date: Thu, 22 Jun 2006 19:01:15 GMT
Server: Apache
Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
Content-Length: 30
Content-Type: text/html

<html>
Hello World!!!
</html>
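The version-1 record header line in the sample above ("URL IP-address Archive-date Content-type Archive-length") can be split into fields with a few lines of Python. This only illustrates the layout of that one line, not a full ARC reader:

```python
from datetime import datetime
from typing import NamedTuple

class ArcHeader(NamedTuple):
    url: str
    ip: str
    date: datetime      # archive date, YYYYMMDDhhmmss
    content_type: str
    length: int         # archived record length in bytes

def parse_arc_header(line: str) -> ArcHeader:
    """Parse one version-1 ARC record header line:
    URL IP-address Archive-date Content-type Archive-length"""
    url, ip, date, ctype, length = line.split()
    return ArcHeader(url, ip,
                     datetime.strptime(date, "%Y%m%d%H%M%S"),
                     ctype, int(length))
```

For real archives a maintained library (e.g. warcio for WARC files) would be the safer choice than hand parsing.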

Common Crawl

It has been crawling billions of web pages on a regular basis, but working with the complete collection requires a serious big-data environment, as the datasets run to hundreds of TB. However, it is possible to download an index and select only subsets.
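Selecting a subset typically goes through Common Crawl's per-snapshot CDX index API at index.commoncrawl.org. A minimal sketch of building such a query follows; the crawl ID is only an example and newer snapshots supersede it:

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"

def index_query(url_pattern: str, crawl: str = "CC-MAIN-2024-10") -> str:
    """Build a CDX index API query URL for one Common Crawl snapshot.
    Each JSON line in the response locates a capture inside the large
    WARC files, so only the matching records need to be downloaded."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl}-index?{params}"

# e.g. urllib.request.urlopen(index_query("bnt.bg/*")) would list the
# captures of that Bulgarian site in the chosen snapshot.
```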

Some examples collected quickly so far from https://commoncrawl.org/the-data/examples/

https://newsfetch.tech/
"News Extraction
News data from CommonCrawl, parsed and converted to a structured JSON format."
...

Below: not tested yet - some of the years-old links may be obsolete: