/couchbaseNoaaLoader

Scripts and Java implementation to load data from noaa ISH into Couchbase

Primary LanguageJava

Success with Python

First of all, you need scrapy, so install it with:

pip install scrapy
spider.py
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'www1.ncdc.noaa.gov/pub/data/noaa/'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        (URL + str(year)) for year in range(1901,2016)
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                url= response.url + url
            request = Request(url=url,callback=self.parse)
            print url
            yield request
100 connections to download files.
  scrapy runspider spider.py 2>&1 | grep '^http' | grep '\.gz$' > urls.txt;

It's possible to use gnu-parallel to parallelise the download.

On a mac, for example:

brew install parallel

wget with gnu-parallel
cat urls.txt | parallel -j 100 --gnu "wget {}"
Run command for .jar to insert station metadata
wget http://www1.ncdc.noaa.gov/pub/data/noaa/isd-history.csv
java -cp cb-noaa.jar org.couchbase.noaaLoader.Application isd-history.csv {CLUSTERIP} {BUCKET}
Run command for .jar to insert samples
java -cp cb-noaa.jar org.couchbase.noaaLoader.support.ishJava ./noaa 84.40.63.62 > cb_upload_log.txt &