pageglut: A Python repository from peterrenshaw

#=======
#                                  __      __ 
#    ____  ____ _____ ____  ____ _/ /_  __/ /_
#   / __ \/ __ `/ __ `/ _ \/ __ `/ / / / / __/
#  / /_/ / /_/ / /_/ /  __/ /_/ / / /_/ / /_  
# / .___/\__,_/\__, /\___/\__, /_/\__,_/\__/  
#/_/          /____/     /____/               
#
# This file is part of Page Glut.
#
# Page Glut is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Page Glut is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Page Glut.  If not, see <http://www.gnu.org/licenses/gpl-3.0.txt>.
#
# name: README
# date: 2018DEC31
#       2016DEC23
# prog: pr
# desc: Page Gluts: read docs/ABOUT.txt
#======


I wrote this quick hack to work out how heavy a page was. It works as shown below.
Only two options, the -v option (verbosity) tells you everything that is goint on,
the -u option (URL) lets you examine the index page found at this URL.


Requirements:
=============

Python3 is used here, though I use nothing (that I'm aware of) that could not be used in py2X.

# ------- start requirements -------------------------------------------------------------------
import bs4        # source: https://www.crummy.com/software/BeautifulSoup  
                  # docs:   http://beautiful-soup-4.readthedocs.io 


import requests   # source: https://github.com/kennethreitz/requests
                  # docs:   http://docs.python-requests.org/en/latest/

import html5lib   # source: https://github.com/html5lib/html5lib-python
                  # docs:   https://html5lib.readthedocs.io/en/latest/

import validators # source: https://github.com/kvesteri/validators 
                  # docs:   https://validators.readthedocs.io/en/latest/
# ------- end requirements --------------------------------------------------------------------


Example use:
============

$ python3.5 pageglut.py -v -u http://news.ycombinator.com

page	resource	total		url
(b)	(b)		(Kb)
----------------------------------------------------------
request <http://news.ycombinator.com>
	url ok
	status 200
	get img
	extract
	ex 'image' 'img' 'src'
	warning: problem url, needs fixing <y18.gif>
	warning: problem url, needs fixing <s.gif>
	extract image <['y18.gif', 's.gif']>
	get css
	extract
	ex 'css' 'link' 'href'
	warning: problem url, needs fixing <news.css?AMTnWwphWLGKuspwrcEL>
	warning: problem url, needs fixing <favicon.ico>
	warning: problem url, needs fixing <rss>
	extract css <['news.css?AMTnWwphWLGKuspwrcEL', 'favicon.ico', 'rss']>
	get js
	extract
	ex 'javascript' 'script' 'src'
	warning: problem url, needs fixing <hn.js?AMTnWwphWLGKuspwrcEL>
	extract javascript <['hn.js?AMTnWwphWLGKuspwrcEL']>
	url <y18.gif>
	warning: fixed url <http://news.ycombinator.com/y18.gif>
	status 200
	warning: content type not text
	size 100
	url <s.gif>
	warning: fixed url <http://news.ycombinator.com/s.gif>
	status 200
	warning: content type not text
	size 43
	url <news.css?AMTnWwphWLGKuspwrcEL>
	warning: fixed url <http://news.ycombinator.com/news.css?AMTnWwphWLGKuspwrcEL>
	status 200
	warning: content type not text
	warning: header.'content-length' no content size found
	url <favicon.ico>
	warning: fixed url <http://news.ycombinator.com/favicon.ico>
	status 200
	warning: content type not text
	warning: header.'content-length' no content size found
	url <rss>
	warning: fixed url <http://news.ycombinator.com/rss>
	status 200
	warning: content type not text
	warning: header.'content-length' no content size found
	url <hn.js?AMTnWwphWLGKuspwrcEL>
	warning: fixed url <http://news.ycombinator.com/hn.js?AMTnWwphWLGKuspwrcEL>
	status 200
	warning: content type not text
	warning: header.'content-length' no content size found
34184	143		34.327		<http://news.ycombinator.com>...
-----------------------------------------------------------


# vim: ff=unix:ts=4:sw=4:tw=78:noai:expandtab
peterrenshaw/pageglut