A Tag token parser and Tag specific parsing combinators, inspired by parsec-tagsoup and tagsoup-parsec. This library helps you build a megaparsec parser using TagSoup's Tag as tokens.
We can build a DOM parser using TagSoup's Tag as a token type in Megaparsec. Let's start the example with importing all the required modules.
import Data.Text ( Text )
import qualified Data.Text as T
import Data.HashMap.Strict ( HashMap )
import qualified Data.HashMap.Strict as HMS
import Text.HTML.TagSoup
import Text.Megaparsec
import Text.Megaparsec.ShowToken
import Text.Megaparsec.TagSoup
Here's the data types used to represent our DOM. Node
is either ElementNode
or TextNode
. TextNode
data constructor takes a Text
and ElementNode
data constructor takes an Element
whose fields consist of elementName
, elementAttrs
and elementChildren
.
type AttrName = Text
type AttrValue = Text
data Element = Element
{ elementName :: !Text
, elementAttrs :: !(HashMap AttrName AttrValue)
, elementChildren :: [Node]
} deriving (Eq, Show)
data Node =
ElementNode Element
| TextNode Text
deriving (Eq, Show)
Our Parser
is defined as a type synonym for TagParser Text
. TagParser
takes a type argument representing the string type and we chose Text
here. We can pass any of StringLike
types such as String
and ByteString
.
type Parser = TagParser Text
There is nothing new in defining a parser except that our token is Tag Text
instead of Char
. We can use any Megaparsec combinators we want as usual. Our node
parser is either element
or text
so we used the choice combinator (<|>)
.
node :: Parser Node
node = ElementNode <$> element
<|> TextNode <$> text
tagsoup-megaparsec library provides some Tag
specific combinators.
tagText
: parse a chunk of text.anyTagOpen
/anyTagClose
: parse any opening and closing tag.
text
and element
parsers are built using these combinators.
NOTE: We don't need to worry about the text blocks containing only whitespace characters because all the parsers provided by tagsoup-megaparsec are lexeme parsers.
text :: Parser Text
text = fromTagText <$> tagText
element :: Parser Element
element = do
t@(TagOpen tagName attrs) <- anyTagOpen
children <- many node
closeTag@(TagClose tagName') <- anyTagClose
if tagName == tagName'
then return $ Element tagName (HMS.fromList attrs) children
else fail $ "unexpected close tag" ++ showToken closeTag
Now it's time to define our driver. parseDOM
takes a Text
and returns either ParseError
or [Node]
. We used many
combinator to represent that there are zero or more occurences of node
. We used TagSoup's parseTags
to create tokens and passed it to Megaparsec's parse
function.
parseDOM :: Text -> Either ParseError [Node]
parseDOM html = parse (many node) "" tags
where tags = parseTags html