/csvget

Uses parselets and rwget to generate csv files from websites

Primary LanguageRubyMIT LicenseMIT

csvget (also, jsonget)

What’s new in 0.3.0

  1. Added command-line option:

    --filter=RUBY_CODE RUBY_CODE will be eval'd in context of @row.is_a?(FasterCSV::Row)
    
    # Example:
    ./bin/csvget --require=chronic --require=time --filter "@row['time']=Chronic.parse(@row['time']).iso8601" # ...
  2. Added CHANGELOG

Dependencies

Running on EC2

> git clone git://github.com/fizx/csvget-ec2-recipe.git
> cd csvget-ec2-recipe
> ./boot.rb
> ssh YOUR_INSTANCE

Local Installation

  1. Install the dependencies.

  2. > gem sources -a gems.github.com

  3. > sudo gem install fizx-csvget

Example Usage

> cat hn.let
{
  "headlines":[{
    "title": ".title a",
    "link": ".title a @href",
    "comments": "match(.subtext a:nth-child(3), '\\d+')",
    "user": ".subtext a:nth-child(2)",
    "score": "match(.subtext span, '\\d+')",
    "time": "match(.subtext, '\\d+\\s+\\w+\\s+ago')"
  }]
}
> csvget --directory-prefix=./data  -A "/x" -w 5 --parselet=hn.let http://news.ycombinator.com/
> head data/headlines.csv
comments,title,time,link,score,user
4,Simpson's paradox: why mistrust seemingly simple statistics,2 hours ago,http://en.wikipedia.org/wiki/Simpson%27s_paradox,41,waldrews
67,America's unjust sex laws,2 hours ago,http://www.economist.com/opinion/displaystory.cfm?story_id=14165460,59,MikeCapone
23,Buy somebody lunch,3 hours ago,http://www.whattofix.com/blog/archives/2009/08/buy-somebody-lu.php,58,DanielBMarkham
10,A design pattern is an artifact of a missing feature in your chosen language,3 hours ago,http://www.snell-pym.org.uk/archives/2008/12/29/design-patterns/,31,bensummers
4,API changes in Snow Leopard,1 hour ago,http://developer.apple.com/mac/library/releasenotes/MacOSX/WhatsNewInOSX/Articles/MacOSX10_6.html#//apple_ref/doc/uid/TP40008898-SW1,14,pieter
16,How to run a linux based home web server,3 hours ago,http://stevehanov.ca/blog/index.php?id=73,28,RiderOfGiraffes
1,"OpenCL ""Hello World""",1 hour ago,"http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/Example:Hello,World/Example:Hello,World.html",8,pieter
15,US Senate bill allows White House to disconnect private computers from Internet,4 hours ago,http://news.cnet.com/8301-13578_3-10320096-38.html,35,drewr
1,Strategy: Solve Only 80 Percent of the Problem,47 minutes ago,http://highscalability.com/strategy-solve-only-80-percent-problem,6,alrex021
> csvget -h
Usage: ./bin/csvget [options] SEED_URL [SEED_URL2 ...]
        --parselet=JSON_FILE         JSON_FILE is a parselet.
    -w, --wait=SECONDS               wait SECONDS between retrievals.
    -P, --directory-prefix=PREFIX    save files to PREFIX/...
    -U, --user-agent=AGENT           identify as AGENT instead of RWget/VERSION.
    -A, --accept-pattern=RUBY_REGEX  URLs must match RUBY_REGEX to be saved to the queue.
        --time-limit=AMOUNT          Crawler will stop after this AMOUNT of time has passed.
    -R, --reject-pattern=RUBY_REGEX  URLs must NOT match RUBY_REGEX to be saved to the queue.
        --require=RUBY_SCRIPT        Will execute 'require RUBY_SCRIPT'
        --limit-rate=RATE            limit download rate to RATE.
        --http-proxy=URL             Proxies via URL
        --proxy-user=USER            Sets proxy user to USER
        --proxy-password=PASSWORD    Sets proxy password to PASSWORD
        --fetch-class=RUBY_CLASS     Must implement fetch(uri, user_agent_string) #=> [final_redirected_url, file_object]
        --store-class=RUBY_CLASS     Must implement put(key_string, temp_file)
        --dupes-class=RUBY_CLASS     Must implement dupe?(uri)
        --queue-class=RUBY_CLASS     Must implement put(key_string, depth_int) and get() #=> [key_string, depth_int]
        --links-class=RUBY_CLASS     Must implement urls(base_uri, temp_file) #=> [uri, ...]
    -S, --sitemap=URL                URL of a sitemap to crawl (will ignore inter-page links)
    -V, --version
    -Q, --quota=NUMBER               set retrieval quota to NUMBER.
        --max-redirect=NUM           maximum redirections allowed per page.
    -H, --span-hosts                 go to foreign hosts when recursive
        --connect-timeout=SECS       set the connect timeout to SECS.
    -T, --timeout=SECS               set all timeout values to SECONDS.
    -l, --level=NUMBER               maximum recursion depth (inf or 0 for infinite).
        --[no-]timestampize          Prepend the timestamp of when the crawl started to the directory structure.
        --incremental-from=PREVIOUS  Build upon the indexing already saved in PREVIOUS.
        --protocol-directories       use protocol name in directories.
        --no-host-directories        don't create host directories.
    -v, --[no-]verbose               Run verbosely
    -h, --help                       Show this message

Copyright © 2009 Kyle Maxwell. See LICENSE for details (MIT).