Initial Crawler Experiments. Data Formats. Technologies for Web Archiving & Serialization: Wget, Nutch, Common Crawl
-
Crawler tasks: Wget, Nutch, Common Crawl, Heritrix (archive.org), (...)
- Wget
- Nutch
- Common Crawl - instead of crawling, learn how to access the data that has already been crawled: https://commoncrawl.org/the-data/get-started/
-
Data formats: WARC; ...
Crawler tasks
- Wget supports WARC output: https://en.wikipedia.org/wiki/Wget, https://wiki.archiveteam.org/index.php/Wget_with_WARC_output
- ? Write a basic Python script that calls wget and crawls a list of online Bulgarian media. Extend the list with some international media. Pick a sample crawl depth, etc. (a sketch follows after this list)
- Wget mirrors the files to disk as HTML and as WARC
- Use BeautifulSoup etc. to extract plain-text versions of the HTML files (see the extraction sketch below)
- Create a tool with a simple GUI to search the files by string or regex (the underlying query logic is sketched below)
- Store the files in an SQLite DB and allow querying with SQL
A more sophisticated future analysis should recognize titles, headings, and content, and do image recognition and classification on the images, illustrations, etc.
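A minimal sketch of the wget-driving script mentioned above. The seed URLs, output directories and depth are placeholders, not real choices:

```python
import subprocess
from pathlib import Path

# Placeholder seed list - to be replaced with the real Bulgarian and international media URLs.
MEDIA_SITES = [
    "https://www.example-bg-media.bg/",    # hypothetical entry
    "https://www.example-int-media.com/",  # hypothetical entry
]

OUT_DIR = Path("crawl")   # mirrored HTML ends up here
WARC_DIR = Path("warc")   # one WARC file per site
DEPTH = 2                 # sample crawl depth

def crawl(url: str) -> None:
    """Mirror one site with wget and also record it as WARC (requires wget >= 1.14)."""
    warc_prefix = WARC_DIR / url.split("//")[1].strip("/").replace("/", "_")
    cmd = [
        "wget",
        "--recursive", f"--level={DEPTH}",
        "--adjust-extension",             # save pages with an .html extension
        "--wait=1",                       # be polite to the servers
        f"--warc-file={warc_prefix}",     # wget appends .warc.gz to this prefix
        f"--directory-prefix={OUT_DIR}",
        url,
    ]
    subprocess.run(cmd, check=False)      # wget returns non-zero on partial errors

if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    WARC_DIR.mkdir(exist_ok=True)
    for site in MEDIA_SITES:
        crawl(site)
```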
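A sketch of the BeautifulSoup extraction and SQLite steps, assuming the mirror directory from the wget sketch above; the database and table names are made up for illustration:

```python
import sqlite3
from pathlib import Path

from bs4 import BeautifulSoup

MIRROR_DIR = Path("crawl")   # where wget mirrored the HTML files
DB_PATH = "pages.db"         # hypothetical database name

def extract_text(html_path: Path) -> str:
    """Strip the tags and return the visible text of one HTML file."""
    soup = BeautifulSoup(html_path.read_text(errors="ignore"), "html.parser")
    return soup.get_text(separator=" ", strip=True)

def build_db() -> None:
    """Walk the mirror and store path + extracted text in a simple SQLite table."""
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, text TEXT)")
    for html_file in MIRROR_DIR.rglob("*.html"):
        con.execute(
            "INSERT OR REPLACE INTO pages (path, text) VALUES (?, ?)",
            (str(html_file), extract_text(html_file)),
        )
    con.commit()
    con.close()

if __name__ == "__main__":
    build_db()
```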
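The GUI itself is not sketched, but the search logic behind it could look like this (same hypothetical pages table as above): a SQL LIKE pass narrows the candidates, then Python's re does the exact regex match.

```python
import re
import sqlite3

DB_PATH = "pages.db"   # hypothetical database from the sketch above

def search(pattern: str, literal_hint: str = "") -> list[tuple[str, str]]:
    """Return (path, snippet) pairs whose text matches the regex.

    literal_hint is an optional plain substring used to pre-filter rows
    with SQL LIKE before the slower regex pass.
    """
    con = sqlite3.connect(DB_PATH)
    rows = con.execute(
        "SELECT path, text FROM pages WHERE text LIKE ?",
        (f"%{literal_hint}%",),
    ).fetchall()
    con.close()

    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path, text in rows:
        m = rx.search(text)
        if m:
            start = max(m.start() - 40, 0)
            hits.append((path, text[start:m.end() + 40]))
    return hits

if __name__ == "__main__":
    for path, snippet in search(r"избори|election"):
        print(path, "->", snippet)
```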
-
Crawler: Heritrix (Java) - created by Internet Archive (archive.org)
arcreader IA-2006062.arc ...
File format: WARC: https://en.wikipedia.org/wiki/Web_archiving
WARC - a collection of web resources packed into a single file to reduce the overhead of many small files. The sample below (from Wikipedia) is in the older ARC format that preceded WARC:
filedesc://IA-2006062.arc 0.0.0.0 20060622190110 text/plain 76
1 1 InternetArchive
URL IP-address Archive-date Content-type Archive-length
http://foo.edu:80/hello.html 127.10.100.2 19961104142103 text/html 187
HTTP/1.1 200 OK
Date: Thu, 22 Jun 2006 19:01:15 GMT
Server: Apache
Last-Modified: Sat, 10 Jun 2006 22:33:11 GMT
Content-Length: 30
Content-Type: text/html
<html>
Hello World!!!
</html>
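For reading the WARC files that wget or Heritrix produce from Python, a minimal sketch; it assumes the third-party warcio package, which is not part of the notes above:

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio (assumed dependency)

def list_responses(warc_path: str) -> None:
    """Print the URL and content type of every response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                ctype = record.http_headers.get_header("Content-Type") if record.http_headers else None
                print(url, ctype)

if __name__ == "__main__":
    list_responses("example.warc.gz")   # placeholder path
```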
Common Crawl
Common Crawl has been crawling billions of web pages on a regular basis, but the complete collection requires a serious big-data environment, as the datasets are hundreds of TBs. However, it is possible to query an index and download only selected subsets (a sketch follows below).
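A sketch of the "query the index, download only a subset" approach: ask the CDX index server at index.commoncrawl.org which WARC files contain captures of a URL pattern, then fetch one record by byte range. The crawl ID below is only an example (current IDs are listed at https://index.commoncrawl.org/), and the requests package is an assumed dependency.

```python
import gzip
import json

import requests

CRAWL_ID = "CC-MAIN-2023-50"   # example crawl id - check https://index.commoncrawl.org/ for current ones
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

def find_captures(url_pattern: str, limit: int = 5) -> list[dict]:
    """Ask the CDX index server which WARC files contain captures of the given URL pattern."""
    resp = requests.get(INDEX_URL, params={"url": url_pattern, "output": "json", "limit": limit})
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]

def fetch_record(capture: dict) -> bytes:
    """Download a single record from the referenced WARC file via an HTTP Range request."""
    start = int(capture["offset"])
    end = start + int(capture["length"]) - 1
    warc_url = "https://data.commoncrawl.org/" + capture["filename"]
    resp = requests.get(warc_url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return gzip.decompress(resp.content)   # each record is an independent gzip member

if __name__ == "__main__":
    for cap in find_captures("commoncrawl.org/*"):
        print(cap["url"], cap["filename"])
```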
Some examples collected quickly so far from https://commoncrawl.org/the-data/examples/
https://newsfetch.tech/
"News Extraction
News data from CommonCrawl, parsed and converted to a structured JSON format."
...
Below: not tested yet; some of these years-old projects could be obsolete:
-
Sample search engine in Scala, using some data from CC:
https://github.com/hannesrabo/simple-search-engine -
comcrawl is a Python package for easily querying and downloading pages from commoncrawl.org; a usage sketch follows below.
https://github.com/michaelharms/comcrawl -
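An untested usage sketch for comcrawl; the exact API should be verified against the package's own documentation, since it may have changed:

```python
from comcrawl import IndexClient   # pip install comcrawl

# Search the Common Crawl indexes for a URL pattern, then download the matching pages.
client = IndexClient()
client.search("commoncrawl.org/*")
client.download()

# Each result is expected to be a dict with the capture metadata plus the downloaded HTML.
for result in client.results[:3]:
    print(result.get("url"))
```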
Mind the guidelines for fair use: "Overloading index.commoncrawl.org and bulk index downloads" https://groups.google.com/g/common-crawl/c/3QmQjFA_3y4/m/vTbhGqIBBQAJ