Redips

Redips is a Python-based web crawler, compatible with Python 2.7.x.

Redips can be used to generate:

  • An index that maps every word Redips encounters to the set of URLs the word is found on.
  • A graph that maps every URL Redips encounters to the list of URLs found on that page.

Index: {keyword1 : set([URL1, URL2,...]), keyword2 : set([URL3, URL4,...]), ...}

Graph: {URL1 : [outlink1, outlink2,...], URL2 : [outlink3, outlink4,...], ...}
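As a rough illustration of these two shapes, here is a minimal sketch of how they could be updated from a single crawled page. The helper functions and the page data below are hypothetical, not Redips internals:

```python
def update_index(index, url, words):
    # Map each word to the set of URLs it appears on.
    for word in words:
        index.setdefault(word, set()).add(url)
    return index

def update_graph(graph, url, outlinks):
    # Map the page URL to the list of outlinks found on it.
    graph[url] = list(outlinks)
    return graph

# Hypothetical page content for illustration.
index = update_index({}, 'http://foo.bar', ['hello', 'world'])
graph = update_graph({}, 'http://foo.bar', ['http://baz.qux'])
```

Crawling more pages would simply repeat these updates, growing the word-to-URLs sets and the URL-to-outlinks lists.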

To use:

>>> from redips import Redips
>>> redips = Redips('http://github.com')
>>> redips.crawl()

The Redips constructor takes two arguments, both of which are optional:

>>> from redips import Redips
>>> redips = Redips(seed_url='http://foo.bar', pickle='abc.pickle')

URLs can also be added to the list of URLs to be crawled:

>>> redips.add_url('http://google.com')
>>> redips.to_crawl
['http://google.com']
>>>

To save the state of a Redips object:

>>> redips.save()

To load a previously pickled Redips object:

>>> from redips import *
>>> redips = load('redips.pickle') # Or whichever pickle file your crawler is saved in
>>> redips.crawl() # Resume crawling from where it left off

To specify a file other than the default redips.pickle for pickling the crawler:

>>> from redips import *
>>> redips = Redips(pickle='foo.pickle')
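Under the hood, this kind of save/load cycle can be done with the standard pickle module. The sketch below uses a stand-in Crawler class to show the idea; it is an assumption about the mechanism, not the actual Redips implementation:

```python
import os
import pickle
import tempfile

# Stand-in crawler class to illustrate pickling; not the actual Redips code.
class Crawler(object):
    def __init__(self, pickle_file='redips.pickle'):
        self.pickle_file = pickle_file
        self.to_crawl = []

    def save(self):
        # Serialize the whole object, including its crawl state.
        with open(self.pickle_file, 'wb') as f:
            pickle.dump(self, f)

def load(pickle_file='redips.pickle'):
    # Restore a previously saved crawler.
    with open(pickle_file, 'rb') as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), 'redips_demo.pickle')
crawler = Crawler(pickle_file=path)
crawler.to_crawl.append('http://foo.bar')
crawler.save()
restored = load(path)
```

Because the entire object is serialized, the restored crawler keeps its pending URL list and can carry on where the saved one stopped.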

To crawl a single page:

>>> redips.crawl_page('http://foo.bar')

To access the index:

>>> redips.get_index()

To access the graph:

>>> redips.get_graph()

To reset the list of URLs to crawl:

>>> redips.reset_to_crawl()

To merge the index with another index:

>>> redips.merge_index(index)

To merge the graph with another graph:

>>> redips.merge_graph(graph)

To merge the data of another Redips object into your Redips object:

>>> redips.merge(anotherRedips)
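One plausible reading of these merge semantics, given the Index and Graph shapes above (an assumption for illustration, not Redips's actual code): indices merge by unioning the URL sets per keyword, and graphs merge by combining outlink lists without duplicates:

```python
# Hypothetical merge helpers matching the documented Index/Graph shapes.
def merge_index(a, b):
    # Union the URL sets keyword by keyword.
    merged = {k: set(v) for k, v in a.items()}
    for keyword, urls in b.items():
        merged.setdefault(keyword, set()).update(urls)
    return merged

def merge_graph(a, b):
    # Combine outlink lists per URL, skipping duplicates.
    merged = {k: list(v) for k, v in a.items()}
    for url, outlinks in b.items():
        seen = merged.setdefault(url, [])
        seen.extend(link for link in outlinks if link not in seen)
    return merged

i = merge_index({'py': {'http://a'}},
                {'py': {'http://b'}, 'web': {'http://c'}})
g = merge_graph({'http://a': ['http://b']},
                {'http://a': ['http://c']})
```

Merging two whole Redips objects would then amount to applying both of these to their respective structures.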