Primary language: Swift. License: MIT.

Scrape

Scrape is a cross-platform HTML and XML parser written in Swift.

This project was forked from Kanna, then completely rewritten and refined.

About

This framework wraps the well-known C library libxml2 and gives you a Swifty interface for HTML and XML parsing. You can use XPath queries or CSS selectors to search for elements in XML and HTML documents.

Important: the Linux version of Scrape relies on the open-source Foundation implementation, which is still immature, so CSS selectors are not currently supported on Linux. Once all the methods of NSRegularExpression are fully implemented there, CSS selector support will return. On macOS they work fine.
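To show why regular expressions matter for CSS selectors, here is a minimal sketch of how a selector such as `span.stars` might be split into tokens with `NSRegularExpression`. This is not Scrape's actual implementation; the helper name `selectorTokens` and the pattern are assumptions made for this example.

```swift
import Foundation

/// Splits a simple compound CSS selector into its tokens,
/// e.g. "span.stars" becomes ["span", ".stars"].
/// Hypothetical helper for illustration; not part of Scrape.
func selectorTokens(_ selector: String) -> [String] {
    // An optional "#" or "." prefix followed by an identifier.
    let pattern = "[#.]?[A-Za-z_][A-Za-z0-9_-]*"
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return []
    }
    let fullRange = NSRange(selector.startIndex..., in: selector)
    // Convert every NSRange match back into a Swift substring.
    return regex.matches(in: selector, range: fullRange).compactMap { match in
        Range(match.range, in: selector).map { String(selector[$0]) }
    }
}

print(selectorTokens("span.stars"))
// Prints: ["span", ".stars"]
```

A real selector engine also has to handle combinators, attribute selectors and pseudo-classes, which is where complete regular expression support becomes important.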

Requirements

  • macOS 10.9 or newer
  • iOS 8.0 or newer
  • watchOS 2.0 or newer
  • tvOS 9.0 or newer
  • Ubuntu 14.04, 15.10 or 16.10
  • Other Linux distributions have not been tested, but are expected to work.

Installation

For now, only the Swift Package Manager is supported. If you want CocoaPods or Carthage support, please feel free to submit a PR 😊

To use Scrape in your SPM project, add the following to your Package.swift file:

let package = Package(
    name: "YourPackageName",
    dependencies: [
        .Package(url: "https://github.com/WeirdMath/Scrape.git", majorVersion: 1)
    ]
)	
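The manifest above uses the Swift 3 package format. Newer toolchains use a different manifest syntax; a rough equivalent might look like the sketch below. It assumes that a tagged 1.x release exists, that the library product is named `Scrape`, and that the package builds with your toolchain. None of this is confirmed by this README.

```swift
// swift-tools-version:5.0
import PackageDescription

let package = Package(
    name: "YourPackageName",
    dependencies: [
        // Assumption: a 1.0.0 (or later 1.x) tag is available.
        .package(url: "https://github.com/WeirdMath/Scrape.git", from: "1.0.0")
    ],
    targets: [
        .target(
            name: "YourPackageName",
            // Assumption: the library product is called "Scrape".
            dependencies: ["Scrape"]
        )
    ]
)
```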

Usage

Consider the following HTML fragment:

    <div class="single-column">
      <ul class="boxed-group-inner mini-repo-list">
        <li class="public source ">
          <a href="http://github.com/jessesquires/JSQCoreDataKit" class="mini-repo-list-item css-truncate">
            <span class="repo" title="JSQCoreDataKit">JSQCoreDataKit</span>
            <span class="stars">
              357
              <span class="oction oction-star"></span>
            </span>
            <span class="repo-description css-truncate-target">A swifter Core Data stack</span>
          </a>
        </li>
      </ul>
    </div>

We can initialize an HTML document from a string:

let document = HTMLDocument(html: htmlString, encoding: .utf8)

We can then access its contents by querying it with XPath:

let element = document.element(atXPath: "//span[@class='stars']")

print(element.text)

// Prints "357".

print(element.html)

// Prints:
// <span class="stars">
//               357
//               <span class="oction oction-star"></span>
//             </span>

…or a CSS selector:

let element = document.element(atCSSSelector: "span.stars")

print(element.text)

// Prints "357".

print(element.html)

// Prints:
// <span class="stars">
//               357
//               <span class="oction oction-star"></span>
//             </span>
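The markup puts the star count on its own line, so depending on how the text is returned it may carry surrounding whitespace. Here is a small, library-independent sketch of normalizing such a value and converting it to an integer. The `rawText` literal and the `starCount` name are assumptions made for this example.

```swift
import Foundation

// Text content of the span as it appears in the markup, including the
// line breaks and indentation around the number (illustrative literal).
let rawText = "\n              357\n              \n            "

// Strip leading and trailing whitespace, then parse the number.
let trimmed = rawText.trimmingCharacters(in: .whitespacesAndNewlines)
let starCount = Int(trimmed)

print(trimmed)
// Prints: 357
```

`Int(_:)` returns an optional, so a value that is not a plain number yields `nil` instead of crashing.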

We can even change the contents:

let element = document.element(atXPath: "//span[@class='repo']")

element["title"] = "Scrape"
element.content = "Scrape"

print(element.html)

// Prints:
// <span class="repo" title="Scrape">Scrape</span>

It is also possible to reorder nodes, provide parsing options and convert CSS selectors to XPath queries.
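To give an idea of what converting a CSS selector to an XPath query involves, here is a minimal sketch that handles only selectors of the form `tag`, `.class` or `tag.class1.class2`. The function `xpath(fromSimpleSelector:)` is hypothetical and is not Scrape's API; the library's own conversion may work differently.

```swift
/// Converts a simple selector such as "span.stars" into an XPath query.
/// Hypothetical helper for illustration; not part of Scrape.
func xpath(fromSimpleSelector selector: String) -> String {
    // Split "span.stars" into ["span", "stars"]; empty pieces are kept
    // so that ".stars" yields an empty tag name.
    let parts = selector
        .split(separator: ".", omittingEmptySubsequences: false)
        .map { String($0) }
    let first = parts.first ?? ""
    // A missing tag name means "any element".
    let tag = first.isEmpty ? "*" : first
    // Each class must appear as a whole word inside the class attribute.
    let predicates = parts.dropFirst().filter { !$0.isEmpty }.map { name in
        "[contains(concat(' ', normalize-space(@class), ' '), ' \(name) ')]"
    }
    return "//" + tag + predicates.joined()
}

print(xpath(fromSimpleSelector: "span.stars"))
// Prints: //span[contains(concat(' ', normalize-space(@class), ' '), ' stars ')]
```

Note that this query is looser than `//span[@class='stars']`: it also matches elements that carry other classes in addition to `stars`, which is how class selectors behave in CSS.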

Documentation

Available here.

Why not Kanna?

Kanna can be used on Apple platforms only, while Scrape also runs on Linux. Scrape also provides a clearer, more consistent API and better test coverage. Most importantly, it comes with complete documentation, including examples!