Ever wanted to be able to rip through books like your name was Alex Ziegler? Well, now you can ZiegleIt! With our app you can quickly generate a summary of any article on the web to expedite your learning. When you're short on time, ZiegleIt.
##MVP
Our MVP will have the following features:
- Scrape web URLs and copy their content with Nokogiri
- Take the scraped content and generate a summary based on a fixed (?) compression ratio
- Deliver the content to a `.txt` file
- At this point in time we expect our algorithm to be optimized for Wikipedia articles
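
As a rough sketch of how a fixed compression ratio could drive the summary step, the snippet below keeps a fraction of the sentences and writes the result to a text file. Everything here is an assumption made for illustration: the helper names (`split_sentences`, `summary_length`, `summarize`, `deliver`), the default ratio of `0.25`, and the placeholder rule of keeping the first sentences, since the README has not yet settled how sentences are chosen.

```ruby
# Hypothetical sketch: none of these names come from the project itself.

# Naive sentence splitter: breaks after ., ! or ? followed by whitespace.
def split_sentences(text)
  text.split(/(?<=[.!?])\s+/)
end

# Number of sentences to keep for a given compression ratio.
# Always keeps at least one sentence when there is any input.
def summary_length(sentence_count, ratio)
  return 0 if sentence_count.zero?
  [(sentence_count * ratio).ceil, 1].max
end

# Placeholder selection rule (an assumption): keep the leading sentences.
# A scoring step would replace `first` once word scores exist.
def summarize(text, ratio: 0.25)
  sentences = split_sentences(text)
  sentences.first(summary_length(sentences.size, ratio)).join(" ")
end

# Writes the summary to a plain text file.
def deliver(summary, path)
  File.write(path, summary)
end
```

A call such as `deliver(summarize(article_text), "summary.txt")` would then cover the last two MVP bullets, with `article_text` standing in for whatever the Nokogiri scrape returns.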
##Document Structure
- Title (depth: 1)
  - Chapter (2)
  - Chapter (2)
  - Chapter (2)
    - Section (3)
      - Sub Section (4)
        - Paragraph (5)
          - Sentence (6)
            - Word (6.content)
          - Sentence (6)
        - Paragraph
        - Paragraph
        - Paragraph (5)
      - Sub Section
      - Sub Section
      - Sub Section (4)
    - Section
    - Section
    - Section (3)
  - Chapter
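
One way to hold this hierarchy in memory is a small tree of nodes that each record their depth. The `Node` struct below is a hypothetical sketch, not the project's actual data model. It assumes that words are not nodes of their own but are read from a sentence's `content`, which is one reading of the `6.content` note above.

```ruby
# Hypothetical node type for the outline above (names are assumptions).
Node = Struct.new(:kind, :depth, :content, :children) do
  # Appends a child one level deeper and returns it.
  def add(kind, content = nil)
    child = self.class.new(kind, depth + 1, content, [])
    children << child
    child
  end

  # Words live in a sentence's content rather than in separate nodes.
  def words
    content.to_s.split
  end

  # Renders the subtree as indented list lines, two spaces per level.
  def outline(lines = [])
    lines << "#{'  ' * (depth - 1)}- #{kind} (#{depth})"
    children.each { |child| child.outline(lines) }
    lines
  end
end

title     = Node.new(:title, 1, "Example article", [])
chapter   = title.add(:chapter)
section   = chapter.add(:section)
sub       = section.add(:sub_section)
paragraph = sub.add(:paragraph)
sentence  = paragraph.add(:sentence, "Nokogiri parses HTML.")

puts title.outline
```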
##Parsing Rules v1
- There are a lot of blank elements (I'm guessing closing tags?), so first and foremost we need a guard clause that prevents nodes with blank `inner_xml` from making their way into the content:
  `if node.inner_xml != ""`
- Break when `See also</span>` is reached in the current Nokogiri node's inner XML. This is checked by matching a `Regexp`:
  `break if node.inner_xml.match(/(See also<\/span>)/)`
- The current section's text is no longer meaningful once a `</span>` tag has been reached. This is checked via a `Regexp` as well, which captures the text immediately before the tag:
  `puts "Section: #{node.inner_xml.match(/[ \w]+(?=<\/span>)/)}"`
- The table of contents is ignored (we are building our own, after all). We achieve this by excluding any node that has an h2 parent with inner XML of `Contents`. We add this check to our guard clause:
  `if (node.inner_xml != "") && (node.inner_xml != "Contents")`
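
The four rules above can be combined into a single pass. The sketch below works on plain strings standing in for each node's `inner_xml`, so it runs without Nokogiri. The method name `parse_fragments`, the constant names, and the hash layout for a section are assumptions made for illustration. Only the two regular expressions and the guard conditions come from the rules themselves.

```ruby
# Hypothetical sketch of Parsing Rules v1 applied to inner_xml strings.
SEE_ALSO      = /See also<\/span>/
SECTION_TITLE = /[ \w]+(?=<\/span>)/

def parse_fragments(fragments)
  sections = []
  current  = { title: nil, content: [] }

  fragments.each do |xml|
    # Guard clause: skip blank nodes and the table of contents heading.
    next if xml == "" || xml == "Contents"

    # Stop entirely once the "See also" heading is reached.
    break if xml.match?(SEE_ALSO)

    if (match = xml.match(SECTION_TITLE))
      # A </span> closes a heading, so the current section ends here.
      sections << current unless current[:title].nil? && current[:content].empty?
      current = { title: match[0].strip, content: [] }
    else
      current[:content] << xml
    end
  end

  sections << current unless current[:title].nil? && current[:content].empty?
  sections
end
```

In the real scraper the strings would presumably come from something like `nodes.map(&:inner_xml)`. Text that appears before the first heading ends up in a section whose `title` is `nil`.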
##Algorithm
##Next Steps
- Start calculating some word scores
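
The README does not yet say how word scores will be calculated. As one possible starting point, and purely as an assumption, the sketch below scores each word by how often it appears, ignoring a tiny hand-picked stop-word list, and scores a sentence as the sum of its word scores. The names `STOP_WORDS`, `word_scores`, and `sentence_score` are hypothetical.

```ruby
# Hypothetical frequency-based scoring; not the project's chosen algorithm.
STOP_WORDS = %w[the a an of and to in is it].freeze

# Lowercases the text and pulls out runs of letters and apostrophes.
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Maps each non-stop word to the number of times it occurs.
def word_scores(text)
  tokenize(text).each_with_object(Hash.new(0)) do |word, scores|
    scores[word] += 1 unless STOP_WORDS.include?(word)
  end
end

# Sums the scores of the words in a sentence; unknown words count as 0.
def sentence_score(sentence, scores)
  tokenize(sentence).sum { |word| scores.fetch(word, 0) }
end
```

Sentences could then be ranked by `sentence_score` and the top ones kept until the compression ratio is met.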
##Resources