Scrape highlights from kindle.amazon.com

kindle-your-highlights

It scrapes highlights from the kindle.amazon.com website (https://kindle.amazon.com/your_highlights).

Required Gems

  • nokogiri
  • jsonify
  • selenium-webdriver

Dependency

Firefox is the default Selenium browser. It may be possible to select another one by passing the :driver_type option to the constructor.
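
As a minimal sketch, assuming the :driver_type symbol is handed to Selenium unchanged (this README only says it "may" work), selecting Chrome could look like the following. The kindle_options helper is hypothetical and not part of the library; the 5-second wait_time is the documented default.

```ruby
# Hypothetical helper (not part of the library): builds the options hash
# that is passed as the third argument to KindleYourHighlights.new.
def kindle_options(driver_type = :firefox, wait_time = 5)
  { :driver_type => driver_type, :wait_time => wait_time }
end

# Assumption: :chrome is accepted the same way :firefox is.
options = kindle_options(:chrome)

# Only attempt the login when the library has actually been loaded.
if defined?(KindleYourHighlights)
  kindle = KindleYourHighlights.new("foo@bar.com", "password", options)
end
```

The matching browser and its driver need to be installed locally for Selenium to start it.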

Usage

$ git clone git://github.com/parroty/kindle-your-highlights.git

$ cd kindle-your-highlights
$ bundle

$ export KINDLE_USERNAME="username"
$ export KINDLE_PASSWORD="password"

$ rake update:all
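
The two exported variables can also be read from your own scripts. The sketch below is an illustration only: kindle_credentials is a hypothetical helper, and how the Rakefile itself reads these variables is not shown in this README.

```ruby
# Hypothetical helper: reads the credentials exported above and fails fast
# with a KeyError when either variable is missing.
def kindle_credentials(env = ENV)
  [env.fetch("KINDLE_USERNAME"), env.fetch("KINDLE_PASSWORD")]
end

begin
  username, _password = kindle_credentials
  puts "Using Amazon account: #{username}"
rescue KeyError => e
  warn "Missing credential: #{e.message}"
end
```

Keeping the password in the environment avoids hard-coding it in a script that might be committed.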

Rake Command Usage

The default task is "rake update:recent".

rake convert
    call convert:all

rake convert:all
    load a local file and convert into xml/html format

rake convert:html
    load a local file and convert into html format

rake convert:xml
    load a local file and convert into xml format

rake open
    call open:html (TODO : mac only solution)

rake open:html
    open html file (TODO : mac only solution)

rake open:xml
    open xml file (TODO : mac only solution)

rake print
    load a local file and print highlight data

rake update
    call update:new

rake update:all
    retrieve all data from the Amazon server and store it in a local file

rake update:new
    retrieve only newly arrived items from the Amazon server and store them in a local file

rake update:recent
    retrieve the most recent month of data from the Amazon server and store it in a local file

Library Usage Examples

object operation

require 'kindle-your-highlights'

# to create a new KindleYourHighlights object, give it your Amazon email address and password
kindle = KindleYourHighlights.new("foo@bar.com", "password")

kindle.highlights.each do |highlight|
	highlight.annotation_id      # => a unique value for each highlight, generated by Amazon
	highlight.content            # => the actual highlight text
	highlight.asin               # => the Amazon ASIN for the highlight's product
	highlight.author             # => author of the book from which the highlight is taken
	highlight.title              # => title of the book from which the highlight is taken
	highlight.location           # => highlight location in the book
	highlight.note               # => the user's note added along with the highlight
end

kindle.books.each do |book|
	book.asin                    # => the Amazon ASIN for the book
	book.author                  # => author of the book
	book.title                   # => title of the book
	book.last_update             # => last update of the highlights for the book (last annotated at)
end

xml/html outputs

require 'kindle-your-highlights'

# to create a new KindleYourHighlights object, give it your Amazon email address and password
kindle = KindleYourHighlights.new("foo@bar.com", "password", { :page_limit => 100, :day_limit => 31, :wait_time => 2 }) do | h |
	puts "loading... [#{h.books.last.title}] - #{h.books.last.last_update}"
end

# xml output (the ./xml folder must be created in advance)
KindleYourHighlights::XML.new(:list => kindle.list, :file_name => "xml/out.xml").output

# html output (the ./html folder must be created in advance)
KindleYourHighlights::HTML.new(:list => kindle.list, :file_name => "html/out.html").output

differential save/load

require 'kindle-your-highlights'

# to create a new KindleYourHighlights object, give it your Amazon email address and password
kindle = KindleYourHighlights.new("foo@bar.com", "password", { :page_limit => 100, :wait_time => 2 }) do | h |
	puts "loading... [#{h.books.last.title}]"
end

# load previous file, merge with the new one, and dump it again.
if File.exist?("out.dump")
	list = KindleYourHighlights::List.load("out.dump")
	kindle.merge!(list)
end

KindleYourHighlights::HTML.new(:list => kindle.list, :file_name => "out.html").output
kindle.list.dump("out.dump")

options

  • page_limit : specifies the maximum number of pages (books) to be loaded
  • day_limit : specifies the maximum number of days to be retrieved, counted from the "Last annotated on" date to today
  • stop_date : specifies the "Last annotated on" date at which to stop collecting more data
  • wait_time : specifies the wait time between page loads in seconds (default is 5 seconds)
  • block : callback invoked each time a page load completes
  • driver_type : symbol identifying the Selenium driver
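
To make the two date options concrete, here is a rough sketch of how such a cut-off could be evaluated. It is not the library's code: within_limits? is a hypothetical predicate, and treating stop_date as inclusive is an assumption.

```ruby
require 'date'

# Hypothetical predicate (not the library's implementation): true while a
# book's "Last annotated on" date is still inside the requested window.
def within_limits?(last_annotated, today: Date.today, day_limit: nil, stop_date: nil)
  # day_limit: age in whole days, measured from today
  return false if day_limit && (today - last_annotated).to_i > day_limit
  # stop_date: assumed inclusive, so only strictly older dates are rejected
  return false if stop_date && last_annotated < stop_date
  true
end

today = Date.new(2013, 6, 30)
within_limits?(Date.new(2013, 6, 10), today: today, day_limit: 31)  # 20 days old
within_limits?(Date.new(2013, 4, 1),  today: today, day_limit: 31)  # 90 days old
```

With both options unset the predicate always returns true, which corresponds to loading everything up to page_limit.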

Output Examples

xml

XML output example

<?xml version="1.0"?>
<books>
	<book>
		<asin>ASIN</asin>
		<title>TITLE</title>
		<author>AUTHOR</author>
		<highlights>
			<annotation_id>ANNOTATION_ID1</annotation_id>
			<content>CONTENT1</content>
		</highlights>
		<highlights>
			<annotation_id>ANNOTATION_ID2</annotation_id>
			<content>CONTENT2</content>
		</highlights>
	</book>
</books>
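
A file in this layout can be read back with any XML parser. The sketch below uses REXML, which ships with Ruby, instead of nokogiri, purely to keep the example free of extra gems; read_books is a hypothetical helper, and the element names are taken from the example above.

```ruby
require 'rexml/document'

# Hypothetical helper: parses XML in the layout shown above and returns
# one hash per <book>, each with its list of highlights.
def read_books(xml)
  doc = REXML::Document.new(xml)
  doc.root.get_elements("book").map do |book|
    # Each <highlights> element holds a single highlight.
    highlights = book.get_elements("highlights").map { |h|
      { :annotation_id => h.elements["annotation_id"].text,
        :content       => h.elements["content"].text }
    }
    { :asin       => book.elements["asin"].text,
      :title      => book.elements["title"].text,
      :author     => book.elements["author"].text,
      :highlights => highlights }
  end
end

# Usage, assuming xml/out.xml was written by the XML output step:
if File.exist?("xml/out.xml")
  read_books(File.read("xml/out.xml")).each do |book|
    puts "#{book[:title]} (#{book[:highlights].size} highlights)"
  end
end
```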

html

[Image: HTML output example]

updates

  • 0.3.0
      • Changed the engine from Mechanize to Selenium, because Mechanize stopped working for unknown reasons.
  • 0.2.0
      • Added client-side features for HTML output (searching, highlighting)
      • Changed the output directory in the Rakefile (e.g. ../html -> output/html)
  • 0.1.0
      • Initial upload

notes

This library was originally based on "https://github.com/speric/kindle-highlights"; I created a separate project to add features and to change the code formatting.