m4rw3r/chomp

HTML extract link parser.

LeMoussel opened this issue · 4 comments

Do you think it is possible to extract the attributes of links (<a> tag elements) from an HTML document?
If yes, can you write/explain an example parser?

Could you explain in more detail what kind of data you have as input and what kind of output you expect? In general, if you have HTML and care about edge cases (e.g. not parsing links from comments or other places where things can look like <a> tags), you will need an HTML parser and then filter out all the interesting tags and attributes. Obviously, writing an HTML parser is pretty complex if all edge cases are to be covered.
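To illustrate the comment edge case mentioned above, here is a minimal sketch using only the Rust standard library. The function name `naive_anchor_offsets` is made up for this example; it is not part of chomp. It searches for the literal text `<a ` with no notion of context, so it also reports a tag that sits inside an HTML comment:

```rust
/// Naively returns the byte offset of every literal "<a " in `html`.
/// Hypothetical helper for illustration only: it has no knowledge of
/// comments, scripts, or any other context.
fn naive_anchor_offsets(html: &str) -> Vec<usize> {
    html.match_indices("<a ").map(|(pos, _)| pos).collect()
}
```

For the input `<!-- <a href="x">old</a> --><a href="y">new</a>` this reports two offsets, even though a browser would only treat the second tag as a link.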

Not an HTML parser for all edge cases, just one for the <a> tag element.
As described in the W3C document, just <A href="#section2" id="test" ... some other attributes>Some stuff</A>. And, if possible, find all attributes of the <a> element (href, id, name, rel, ...).

@LeMoussel Something like this should parse a subset of <a>-tags and their attributes:

#[macro_use]
extern crate chomp;

use std::collections::HashMap;
use std::hash::Hash;
use chomp::prelude::*;
use chomp::ascii::{skip_whitespace, is_whitespace};

#[derive(Debug, Default, PartialEq)]
pub struct Anchor<B: Buffer + Eq + Hash> {
    pub attributes: HashMap<B, Option<B>>,
}

pub fn anchor<'a, I: U8Input<Buffer=&'a [u8]>>(i: I)
  -> SimpleResult<I, Anchor<I::Buffer>> {
    parse!{i;
                token(b'<');
                token(b'a');
        // Utilize the fact that many is based on FromIterator
        let a = many(attr);
                skip_whitespace();
                token(b'>');

        ret Anchor { attributes: a }
    }
}

fn attr<I: U8Input>(i: I) -> SimpleResult<I, (I::Buffer, Option<I::Buffer>)> {
    parse!{i;
                    satisfy(is_whitespace);
        let key   = take_while1(|c| match c {
            b'='  => false,
            b' '  => false,
            b'\t' => false,
            b'>'  => false,
            _     => true,
        });
        let eq    = peek_next();
        let value = i -> if eq == b'=' {
            any(i).then(attr_value).map(Some)
        } else {
            i.ret(None)
        };

        ret (key, value)
    }
}

fn attr_value<I: U8Input>(i: I) -> SimpleResult<I, I::Buffer> {
    parse!{i;
        let quote = peek_next();
        i -> if quote == b'"' || quote == b'\'' {
            parse!{i; token(quote) >> take_while(|c| c != quote) <* token(quote) }
        } else {
            take_while(i, |c| match c {
                b' '  => false,
                b'\t' => false,
                b'>'  => false,
                _     => true,
            })
        }
    }
}

fn main() {
    let a = parse_only(anchor, b"<a href=\"http://www.example.com\">Test</a>");

    for (k, v) in a.unwrap().attributes.iter() {
        println!("{} = {:?}", String::from_utf8_lossy(k), v.map(String::from_utf8_lossy));
    }
}

Note that this ignores everything after the > character of the opening tag, and it requires the input to start at the beginning of a tag. The code above will also most likely need adjustments after a more thorough reading of the spec.
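Since the parser has to start exactly at the beginning of a tag, one possible way to use it on a whole document is to first locate candidate start offsets and then try the parser on the slice beginning at each one. The sketch below uses only the standard library; `anchor_tag_starts` is a hypothetical helper name, not a chomp function. It matches `<`, then `a` or `A`, then whitespace or `>`. Two caveats apply: the `anchor` parser as posted only accepts a lowercase `a`, so it would reject the uppercase candidates, and the scan is not aware of comments or scripts.

```rust
/// Returns the byte offset of every position where an anchor tag appears
/// to start: `<`, then `a` or `A`, then ASCII whitespace or `>`.
/// Hypothetical helper for illustration; it does not skip comments.
fn anchor_tag_starts(input: &[u8]) -> Vec<usize> {
    input
        .windows(3)
        .enumerate()
        .filter(|(_, w)| {
            w[0] == b'<'
                && w[1].eq_ignore_ascii_case(&b'a')
                && (w[2] == b'>' || w[2].is_ascii_whitespace())
        })
        .map(|(pos, _)| pos)
        .collect()
}
```

Each returned offset `pos` could then be tried with something like `parse_only(anchor, &input[pos..])`, keeping only the successful results. Tags such as `<abbr>` are not reported because the byte after the `a` must be whitespace or `>`.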

Thanks for your help. I will test it.