/scrapy-mongodb-cache-pipeline

A base pipeline and a decorator to allow you to cache item fields in MongoDB collections

Primary LanguagePython

Description

A base pipeline and a decorator to allow you to cache item fields in MongoDB collections.

Install

pip install scrapymongocache

Make your pipeline derive from MongoCacheBasePipeline and decorate the process_item method with the @cache decorator. The parameters for the decorator are: 1st the database and the collection to use in MongoDB, 2nd item's field to be used as a key and 3rd, array of fields to be used as cached values. Calls to process_item() aren't made if the key exists in cache. The cache assumes that MongoDB is installed in localhost. If it's installed somewhere else, then set the mongohost spider argument when you run scrapy (e.g. scrapy crawl myspider -a mongohost=username:password@host1:port1).

Example implementation

from geopy import geocoders
from scrapymongocache import *

class GeocodingPipeline(MongoCacheBasePipeline):
    def __init__(self):
        self.geo = geocoders.GoogleV3()

    @cache("loc_cache.cache", "address", ["loc", "geo_addr"])
    def process_item(self, item, spider):
        try:
            geo_addr, (lat, lng) = self.geo.geocode(item["address"], exactly_one = False)[0]
            item["loc"] = [lng, lat]
            item["geo_addr"] = geo_addr
        except:
            pass
        
        return item

Changelog

0.1.0 initial release

Licence

Copyright 2013 Dimitrios Kouzis-Loukas

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.