Are you ready to dive into the world of effortless HTML parsing? Look no further! 🚀
html-parser is not your ordinary HTML parsing library; it's a game-changer that combines speed, simplicity, and versatility in one extraordinary package. Say goodbye to the days of wrestling with clunky parsing tools and complex setups. With html-parser, you're in control!
- No External Dependencies: We believe in keeping things simple. That's why html-parser has zero external dependencies. Just grab it, build it, and you're good to go!
- Limitless Manipulation: With a plethora of functions at your disposal, you can manipulate HTML documents like never before. Extract data, modify elements, or traverse the DOM tree with ease.
- Robust & Reliable: html-parser is built with robustness in mind. Its lenient parser gracefully handles even loosely written or malformed HTML documents, so you can focus on your project without worrying about parsing quirks.
- Fast: html-parser is designed for speed. Competitive benchmarks pending!
To see a complete example of how to install and use the html-parser library in a project, you can explore the html-parser-demo-c repository. This demo project provides a practical demonstration of integrating and utilizing the html-parser library in a C application.
In this section, you'll find a quick example to help you get started with the html-parser library. The example provides a clear overview of the library's essential features and how to use them.
Before you begin, ensure that you have the html-parser library installed. Follow these straightforward steps for installation:
- Clone the repository: Use the following command to clone this repository into your project (or favorite) folder:
  git clone https://github.com/florianmarkusse/html-parser.git
- Install CMake: Make sure you have CMake installed on your system. Installation instructions are available on the CMake website.
- Build the Project: Run the following commands from the root folder of the repository, depending on your platform:
  - For All Operating Systems:
    cmake -S . -B build/ -D CMAKE_BUILD_TYPE="Release" -D BUILD_SHARED_LIBS="false" -D BUILD_TESTS="false" -D BUILD_BENCHMARKS="false"
    cmake --build build/
  - For Linux or macOS: You can use the provided build.sh script instead. Run the script with the -h flag to view all available build options:
    ./build.sh
See the section on building and running the tests and benchmarks for more information.
Here's a comprehensive example showcasing how to use the html-parser library to parse and manipulate an HTML document using C:
#include <flo/html-parser.h>
#include <stdio.h>

int main() {
    // Initialize a text store to manage memory
    flo_html_TextStore textStore;
    if (flo_html_createTextStore(&textStore) != ELEMENT_SUCCESS) {
        fprintf(stderr, "Failed to create text store!\n");
        return 1;
    }

    // Initialize a DOM structure
    flo_html_Dom dom;
    if (flo_html_createDomFromFile("test-file.html", &dom, &textStore) != DOM_SUCCESS) {
        flo_html_destroyTextStore(&textStore);
        fprintf(stderr, "Failed to parse DOM from file!\n");
        return 1;
    }

    // Find the ID of the <body> element
    flo_html_node_id bodyNodeID = 0;
    if (flo_html_querySelector("body", &dom, &textStore, &bodyNodeID) != QUERY_SUCCESS) {
        flo_html_destroyDom(&dom);
        flo_html_destroyTextStore(&textStore);
        fprintf(stderr, "Failed to query DOM!\n");
        return 1;
    }

    // Check if the body element has a specific boolean property
    // In other words: "<body add-extra-p-element> ... </body>"
    if (flo_html_hasBoolProp(bodyNodeID, "add-extra-p-element", &dom, &textStore)) {
        // Append HTML content to the <body> element
        if (flo_html_appendHTMLFromStringWithQuery("body", "<p>I am appended</p>", &dom,
                                                   &textStore) != DOM_SUCCESS) {
            flo_html_destroyDom(&dom);
            flo_html_destroyTextStore(&textStore);
            fprintf(stderr, "Failed to append to DOM!\n");
            return 1;
        }
    }

    // Print the modified HTML
    flo_html_printHTML(&dom, &textStore);

    // Cleanup: Free memory and resources
    flo_html_destroyDom(&dom);
    flo_html_destroyTextStore(&textStore);
    return 0;
}
This example demonstrates how to use the html-parser library to parse and manipulate an HTML document using C. Let's break down the code step by step:
- Text Store Initialization: We initialize a TextStore to manage the text content of the HTML file.
- DOM Initialization: We create a flo_html_Dom structure and parse an HTML file named "test-file.html" using flo_html_createDomFromFile. The parsed DOM is stored in the dom structure.
- Querying for the <body> Element: We use flo_html_querySelector to find the <body> element in the parsed DOM and retrieve its node ID.
- Checking a Boolean Property: With flo_html_hasBoolProp, we check if the <body> element has a specific boolean property called "add-extra-p-element".
- Appending HTML: If the boolean property exists, we append the HTML string <p>I am appended</p> to the <body> element using flo_html_appendHTMLFromStringWithQuery.
- Printing the Modified HTML: We print the modified HTML, showing the changes made to the document.
- Cleanup: To prevent memory leaks, we destroy the dom structure and the TextStore to free up resources.
This example provides a practical demonstration of the html-parser library's capabilities for parsing and manipulating HTML documents in a C program.
- Reading an HTML file or string into a Dom: The html-parser library provides a straightforward way to parse HTML content, whether it's stored in a file or a string. This feature allows you to create a structured representation of the HTML document.
- Querying over the document with CSS queries: One of the key features of the library is its ability to query and traverse the parsed DOM using CSS queries. This means you can select specific elements or groups of elements within the HTML document based on their attributes, classes, or tags.
- Reading properties and text content of nodes: You can extract information from nodes, such as attributes and text content. This is invaluable when you need to retrieve specific data from an HTML document.
- Modifying nodes: The library allows you to modify the properties and content of individual nodes, giving you the ability to update the HTML dynamically. For example, you can change the attributes, text, or even add new elements to the document.
- Modifying the DOM structure: Beyond modifying individual nodes, you can also manipulate the entire DOM structure. This includes adding, removing, and replacing nodes to reshape the document as needed.
- Writing out a Document to a file in minified HTML: Once you've made your desired changes to the DOM, the library enables you to generate a new HTML file with the updated content. You can choose to format the output as minified HTML for production use.
These functionalities make the html-parser library a versatile tool for working with HTML documents programmatically. Whether you need to scrape data, automate web-related tasks, or transform HTML content, this library provides the tools you need to accomplish your goals efficiently.
The provided example is just the tip of the iceberg. You can extend the functionality of your applications by combining the html-parser library with other libraries and tools. Here are some ideas:
- Data Extraction: Use the library to extract specific data from web pages, such as prices, product details, or news articles.
- Web Automation: Combine html-parser with web automation frameworks like Selenium to create intelligent web scraping bots.
- Content Generation: Dynamically generate HTML content for websites or email templates by programmatically building and modifying the DOM structure.
- Integration with Other Languages: Explore ways to use the library in conjunction with other programming languages through interop mechanisms.
Feel free to experiment and innovate with the library to tailor it to your unique use cases. If you have questions or need assistance, you can engage with the community or reach out to the library's maintainers through GitHub issues or other communication channels. Your creativity is the limit when it comes to leveraging the html-parser library for web-related tasks.
If you encounter any challenges or have suggestions related to the functionalities provided by this library, please do not hesitate to:
- Open an issue, or
- Open a PR.
We value your input and are committed to improving the project based on your feedback. Moreover, I would be absolutely delighted to see someone using my library :).
In this section, we'll delve into how the parser works under the hood to give you a better overview.
First, the HTML string has to be parsed before it can be manipulated. This is accomplished by a state machine that processes tokens one by one. Unlike a strict parser, this process is lenient, meaning it doesn't strictly adhere to the HTML specification. Instead, it does its best to interpret the input and make sense of it.
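To make the idea of a lenient, state-machine-based pass concrete, here is a minimal sketch in C. It is not the library's actual parser: the names SketchState, SketchCounts, and sketch_countNodes are hypothetical, and the sketch only counts tags and text runs rather than building a DOM. What it illustrates is the lenient behaviour described above: input such as a tag that is never closed is tolerated instead of rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for illustration; not the library's real states. */
typedef enum { SKETCH_IN_TEXT, SKETCH_IN_TAG } SketchState;

typedef struct {
    size_t openingTags; /* e.g. <p> or <input /> */
    size_t closingTags; /* e.g. </p> */
    size_t textNodes;   /* runs of visible text between tags */
} SketchCounts;

/* Walks the input one character at a time. Malformed input, such as a tag
 * that never reaches its '>', is accepted rather than reported as an error. */
SketchCounts sketch_countNodes(const char *html) {
    assert(html != NULL);
    SketchCounts counts = {0, 0, 0};
    SketchState state = SKETCH_IN_TEXT;
    int textSeen = 0; /* did the current text run contain a visible character? */

    for (const char *p = html; *p != '\0'; p++) {
        if (state == SKETCH_IN_TEXT) {
            if (*p == '<') {
                if (textSeen) {
                    counts.textNodes++;
                    textSeen = 0;
                }
                if (p[1] == '/') {
                    counts.closingTags++;
                } else {
                    counts.openingTags++;
                }
                state = SKETCH_IN_TAG;
            } else if (!isspace((unsigned char)*p)) {
                textSeen = 1;
            }
        } else if (*p == '>') {
            state = SKETCH_IN_TEXT;
        }
    }
    if (textSeen) {
        counts.textNodes++; /* text that runs until the end of the input */
    }
    return counts;
}
```

Calling sketch_countNodes on a fragment whose last tag is cut off still returns counts for everything seen so far, which is the kind of best-effort interpretation a lenient parser aims for.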
The parser distinguishes between two main node types:
- Document Node: Document nodes can further be categorized into two subtypes:
  - Single Node: For example: <input />.
  - Paired Node: For example: <div></div>.
- Text Node: Text nodes simply contain text content, such as: This is a sentence. Text nodes are commonly found within paired nodes, as in <p>my paragraph</p>.
The parser differentiates between two types of properties:
- Boolean Property: This is a value on a tag that is either present or not. For example: <input required />. In this example, required is a boolean property on the input document node.
- Property: This is a key-value pair on a tag. For example: <p id="a special id"></p>. In this example, id="a special id" is a property on the p document node.
The HTML string is parsed into a flo_html_Dom. Instead of a traditional tree structure of nodes, the flo_html_Dom follows a data-oriented pattern. The flo_html_Dom comprises several tables, each serving a specific purpose:
- nodeID | nodeType | tagID/text
  Notes: Whether the third column contains a tagID or text is based on the nodeType of the node. If it is a document node, it contains a tagID. If it is a text node, it contains the pointer to the text.
- parentID | childID
  Notes: Represents parent-child relationships.
- parentID | childID
  Notes: Represents parent-child relationships (alternative table).
- currentNodeID | nextNodeID
  Notes: Tracks the sequence of nodes.
- nodeID | propID
  Notes: Records boolean properties of nodes.
- tagID | hashElement | isPaired
  Notes: The hashElement contains the values necessary to look up the tag in the tag hash table so we can find the tag even if the hash table reallocated in the meantime.
- indexID | hashElement
  Notes: The hashElement contains the values necessary to look up the tag in the tag hash table so we can find the tag even if the hash table reallocated in the meantime.
- indexID | hashElement
  Notes: The hashElement contains the values necessary to look up the tag in the tag hash table so we can find the tag even if the hash table reallocated in the meantime.
- indexID | hashElement
  Notes: The hashElement contains the values necessary to look up the tag in the tag hash table so we can find the tag even if the hash table reallocated in the meantime.
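As a rough illustration of the table-based layout listed above, the following C sketch models two of the tables as flat arrays of rows and shows why such a layout makes filtering cheap: finding all nodes with a given tag is an integer comparison per row, with no string comparison and no pointer chasing. All type and function names here are hypothetical and chosen for this example; the library's real structures may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_NODE_DOCUMENT, SKETCH_NODE_TEXT } SketchNodeType;

/* One row of a node table: nodeID | nodeType | tagID (or a text reference). */
typedef struct {
    unsigned nodeID;
    SketchNodeType nodeType;
    unsigned tagID; /* only meaningful for document nodes in this sketch */
} SketchNodeRow;

/* One row of a parent-child table: parentID | childID. */
typedef struct {
    unsigned parentID;
    unsigned childID;
} SketchParentChildRow;

/* Counts the document nodes that share the given tag ID. */
size_t sketch_countNodesWithTag(const SketchNodeRow *nodes, size_t len,
                                unsigned tagID) {
    assert(nodes != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (nodes[i].nodeType == SKETCH_NODE_DOCUMENT &&
            nodes[i].tagID == tagID) {
            count++;
        }
    }
    return count;
}

/* Counts the rows that list the given node as parent. */
size_t sketch_countChildRows(const SketchParentChildRow *rows, size_t len,
                             unsigned parentID) {
    assert(rows != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (rows[i].parentID == parentID) {
            count++;
        }
    }
    return count;
}
```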
As you can observe, the flo_html_Dom does not directly store text content but rather references IDs and HashElements. To retrieve textual content, you use the ID to look up the corresponding flo_html_HashElement, which, in turn, is used to locate the text in a hash table. The flo_html_TextStore struct holds all textual content from the parsed HTML.
For instance, suppose the node table contains the entry { 4, NODE_TYPE_DOCUMENT, 5 } and we want to find its text representation. We first look up the flo_html_HashElement of 5 (the tagID) in the tag-registry table, which yields { 194893, 0 }. To find the actual text, we perform a lookup in the tag hash table at index (hash + offset) % hash table length, which in this case is (194893 + 0) % tagHash.len.
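The index computation in this lookup can be written out as a small C function. This is a sketch under assumptions: SketchHashElement and sketch_slotIndex are hypothetical names, and the struct simply mirrors the { hash, offset } pair from the example. Because the slot is recomputed from the stored hash and offset each time, the same element can be resolved against whatever the table length currently is.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the { hash, offset } pair from the example; hypothetical type. */
typedef struct {
    size_t hash;
    size_t offset;
} SketchHashElement;

/* Computes (hash + offset) % table length. */
size_t sketch_slotIndex(SketchHashElement element, size_t tableLen) {
    assert(tableLen > 0); /* a modulo by zero would be undefined */
    return (element.hash + element.offset) % tableLen;
}
```

For the element { 194893, 0 } and an assumed table length of 1024, sketch_slotIndex yields slot 333.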
Why was this decision made instead of storing all data directly in the flo_html_Dom?
- It allows nodes with the same tag, boolean property, or property to share the same ID.
- It facilitates faster node filtering.
- The library's maintainer, Florian Markusse, wanted to experiment with data-oriented programming and found this approach both challenging and educational.
This design choice enhances performance, reduces memory overhead, and optimizes node filtering. It also provides an opportunity for developers to explore data-oriented programming concepts.
After parsing, we can query the flo_html_Dom for the information we are looking for. This section is split into two parts: querying the flo_html_Dom as a whole and querying the contents of an individual node. Together, these functions let you first locate nodes in the flo_html_Dom and then inspect them.
Querying the flo_html_Dom is possible with convenience methods similar to those used for querying a web DOM. These methods include:
- querySelector: Retrieves the first element in the flo_html_Dom that matches a specified CSS selector. It returns a single flo_html_node_id.
- querySelectorAll: Retrieves all elements in the flo_html_Dom that match a specified CSS selector. It returns an array of flo_html_node_ids. Remember to free this array.
- getElementsByTagName: Retrieves all elements in the flo_html_Dom that have a specified tag name. It returns an array of flo_html_node_ids. Remember to free this array.
- getElementById: Retrieves an element in the flo_html_Dom by its unique ID. It returns a single flo_html_node_id.
- getElementsByClassName: Retrieves all elements in the flo_html_Dom that have a specified class name. It returns an array of flo_html_node_ids. Remember to free this array.
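The ownership rule for the array-returning queries (the callee allocates, the caller frees) can be illustrated with a self-contained C sketch. It does not call the library: sketch_collectByTag is a hypothetical function that filters a plain array of tag IDs, assuming node IDs start at 1 so that 0 can mean "no node".

```c
#include <assert.h>
#include <stdlib.h>

/* tagIDs[i] holds the tag ID of the node with ID i + 1.
 * Returns a heap-allocated array with the IDs of all matching nodes and
 * stores its length in *resultLen. Returns NULL when nothing matches or
 * when allocation fails. The caller owns the array and must free() it. */
unsigned *sketch_collectByTag(const unsigned *tagIDs, size_t len,
                              unsigned wantedTagID, size_t *resultLen) {
    assert(tagIDs != NULL && resultLen != NULL);
    *resultLen = 0;

    /* First pass: count matches so that exactly enough memory is allocated. */
    size_t matches = 0;
    for (size_t i = 0; i < len; i++) {
        if (tagIDs[i] == wantedTagID) {
            matches++;
        }
    }
    if (matches == 0) {
        return NULL;
    }

    unsigned *result = malloc(matches * sizeof *result);
    if (result == NULL) {
        return NULL;
    }

    /* Second pass: store the node IDs of the matching rows. */
    for (size_t i = 0; i < len; i++) {
        if (tagIDs[i] == wantedTagID) {
            result[(*resultLen)++] = (unsigned)(i + 1);
        }
    }
    return result;
}
```

A caller would iterate over the first *resultLen entries and then pass the pointer to free(); calling free() on a NULL result is harmless.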
The html-parser library provides a set of convenient functions to query and retrieve properties and content from individual nodes within the flo_html_Dom. These functions allow you to inspect and work with specific attributes and text content of nodes. Here's a brief overview of the available functions:
- flo_html_getNodeType: Retrieves the type of a given node, such as whether it's a document node or a text node.
- flo_html_hasBoolProp: Checks if a node has a specified boolean property and returns true if the property is present.
- flo_html_hasPropKey: Checks if a node has a property with a specific key.
- flo_html_hasPropValue: Checks if a node has a property with a specific value.
- flo_html_hasProperty: Checks if a node has a property with both a specific key and value.
- flo_html_getValue: Retrieves the value of a property associated with a node.
- flo_html_getTextContent: Retrieves the text content of a node, storing results in an array of strings. Remember to free this array.
After querying, you may want to traverse the flo_html_Dom to find, for example, the first child or the parent of the queried node. Here are some functions to do exactly that!
- flo_html_getFirstChild: Retrieves the ID of the first child node of a given node. Returns 0 if there are no child nodes.
- flo_html_getFirstChildNode: Returns a pointer to the flo_html_ParentChild structure. Returns NULL if there are no child nodes.
- flo_html_getNext: Retrieves the ID of the next sibling node of a given node. Returns 0 if there are no more sibling nodes.
- flo_html_getNextNode: Returns a pointer to the flo_html_NextNode structure. Returns NULL if there are no more sibling nodes.
- flo_html_getPrevious: Retrieves the ID of the previous sibling node of a given node. Returns 0 if there are no previous sibling nodes.
- flo_html_getPreviousNode: Returns a pointer to the flo_html_NextNode structure. Returns NULL if there are no previous sibling nodes.
- flo_html_getParent: Retrieves the ID of the parent node of a given node. Returns 0 if there is no parent node.
- flo_html_getParentNode: Returns a pointer to the flo_html_ParentChild structure. Returns NULL if there is no parent node.
- flo_html_traverseDom: Traverses the DOM structure from the specified node and returns the ID of the next node. Returns 0 if there are no more nodes to traverse.
- flo_html_traverseNode: Traverses the DOM structure inside the node with the given ID and returns the ID of the next node inside that node. Returns 0 if there are no more nodes inside that node.
- flo_html_getLastNext: Retrieves the ID of the last next sibling node starting from a given node. Returns 0 if there are no more sibling nodes.
- flo_html_getLastNextNode: Returns a pointer to the flo_html_NextNode structure representing the last next sibling node starting from a given node. Returns NULL if there are no more sibling nodes.
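To show how traversal can work on a table-based DOM, here is a small C sketch that walks siblings using a "first child" table and a "next node" table. It is an illustration, not the library's implementation: both tables are simplified to arrays indexed by node ID, 0 means "no node" (matching the return convention above), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Both arrays are indexed by node ID; the value 0 means "no node".
 * len is the number of entries in each array. */

/* Counts the children of parentID by following first-child and next links. */
size_t sketch_childCount(const unsigned *firstChildOf, const unsigned *nextOf,
                         size_t len, unsigned parentID) {
    assert(parentID < len);
    size_t count = 0;
    for (unsigned child = firstChildOf[parentID]; child != 0;
         child = nextOf[child]) {
        assert(child < len);
        count++;
    }
    return count;
}

/* Returns the ID of the last sibling that follows nodeID, or 0 when nodeID
 * has no next sibling at all. */
unsigned sketch_lastNext(const unsigned *nextOf, size_t len, unsigned nodeID) {
    assert(nodeID < len);
    unsigned last = 0;
    for (unsigned next = nextOf[nodeID]; next != 0; next = nextOf[next]) {
        assert(next < len);
        last = next;
    }
    return last;
}
```

With node 1 as parent of nodes 2, 3, and 4 (in that order), sketch_childCount returns 3 for node 1, and sketch_lastNext returns 4 for node 2 and 0 for node 4.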
Now that we have some flo_html_node_ids after querying and traversing the flo_html_Dom, we can modify the flo_html_Dom to our heart's content. Again, these functions are split into two levels: "dom-based" and "node-based". All operations modify the flo_html_Dom in place.
The append functions are listed below. They append a new child to the provided parent node. The library provides the same functionality for prepend and replaceWith: prepending adds the new child as the first child of the provided parent node, and replaceWith replaces a node completely, including all of its children.
- flo_html_appendDocumentNodeWithQuery: Append a flo_html_DocumentNode to the DOM using a CSS query. This function appends the flo_html_DocumentNode specified by docNode to the DOM using the provided CSS query cssQuery.
- flo_html_appendTextNodeWithQuery: Append a text node to the DOM using a CSS query. This function appends a text node with the specified text to the DOM using the provided CSS query cssQuery.
- flo_html_appendHTMLFromStringWithQuery: Append HTML content from a string to the DOM using a CSS query. This function appends the HTML content specified by htmlString to the DOM using the provided CSS query cssQuery.
- flo_html_appendHTMLFromFileWithQuery: Append HTML content from a file to the DOM using a CSS query. This function appends the HTML content from the specified fileLocation to the DOM using the provided CSS query cssQuery.
- flo_html_appendDocumentNode: Append a flo_html_DocumentNode to the DOM. This function appends the flo_html_DocumentNode specified by docNode to the DOM.
- flo_html_appendTextNode: Append a text node to the DOM. This function appends a text node with the specified text to the DOM.
- flo_html_appendHTMLFromString: Append HTML content from a string to the DOM. This function appends the HTML content specified by htmlString to the DOM.
For the sake of brevity, the prepend... and replaceWith... functions are left out but are present in the library. Simply replace append with your desired operation.
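The difference between appending and prepending a child can be sketched on the same kind of simplified tables: a "first child" array and a "next node" array, both indexed by node ID, with 0 meaning "no node". The functions below are hypothetical and only illustrate the two operations in place; they are not the library's implementation and assume the child is not yet attached anywhere.

```c
#include <assert.h>

/* Appends childID as the last child of parentID. */
void sketch_appendChild(unsigned *firstChildOf, unsigned *nextOf,
                        unsigned parentID, unsigned childID) {
    assert(parentID != 0 && childID != 0);
    nextOf[childID] = 0; /* the appended child becomes the last sibling */
    if (firstChildOf[parentID] == 0) {
        firstChildOf[parentID] = childID; /* parent had no children yet */
        return;
    }
    /* Walk to the current last child and link the new child after it. */
    unsigned last = firstChildOf[parentID];
    while (nextOf[last] != 0) {
        last = nextOf[last];
    }
    nextOf[last] = childID;
}

/* Prepends childID as the first child of parentID. */
void sketch_prependChild(unsigned *firstChildOf, unsigned *nextOf,
                         unsigned parentID, unsigned childID) {
    assert(parentID != 0 && childID != 0);
    nextOf[childID] = firstChildOf[parentID]; /* old first child follows */
    firstChildOf[parentID] = childID;
}
```

In this representation, prepending is a constant-time update of two entries, while appending has to walk the sibling chain unless the last child is tracked separately.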
To make changes to specific nodes within the DOM, this library provides a set of functions for adding and updating properties, text content, and tags. These functions allow you to manipulate the HTML elements identified by their unique flo_html_node_id within the DOM structure. Below are some key node modification functions:
- flo_html_addPropertyToNodeStringsWithLength: Add a property with a specified key and value to an HTML element. This function takes the flo_html_node_id of the target element, the property key, property value, and other necessary parameters.
- flo_html_addPropertyToNodeStrings: A simplified version of the above function for adding a single property to an HTML element.
- flo_html_addBooleanPropertyToNodeStringWithLength: Add a boolean property to an HTML element, specifying the property key, property value, and length.
- flo_html_addBooleanPropertyToNodeString: A simplified version of the above function for adding a single boolean property to an HTML element.
- flo_html_setPropertyValue: Set the value of an HTML element's property by specifying the flo_html_node_id, property key, and the new value.
- flo_html_setTextContent: Set the text content of an HTML element identified by flo_html_node_id to the specified text. This function allows you to update the content of an element. Note that this function will remove any child elements this node may have.
- flo_html_addTextToTextNode: Add text content to a text node within an HTML element. You can specify whether to append or prepend the text content.
- flo_html_setTagOnDocumentNode: Set the tag for a DocumentNode within the DOM structure. You can specify the tag's start, length, and whether it is paired or not.
These functions provide a comprehensive set of tools for making precise modifications to the HTML elements within the DOM. You can use them to customize your parsed HTML content to suit your specific needs.
Lastly, after making modifications to the parsed HTML content, you may want to output or print the resulting HTML. This library offers a set of functions to help you achieve this:
- flo_html_printHTML: Use this function to print the minified HTML representation. It displays all the elements, tags, and text content in a compact format. This is particularly helpful for inspecting the parsed HTML document directly within your program.
- flo_html_writeHTMLToFile: If you wish to save the parsed HTML document to a file, this function is your solution. It writes the minified HTML representation to the specified filePath. The function returns a status code to indicate the success or failure of the file-writing operation, making it easy to handle file I/O errors.
- flo_html_printDomStatus: This function allows you to print the status of the flo_html_Dom and flo_html_TextStore. It provides information about node counts, registrations, and other relevant details. It can be a valuable tool for debugging and gaining insights into the structure of the parsed DOM.
These printing and writing functions provide essential utilities for interacting with and exporting the parsed HTML content, whether you need to debug, inspect, or save the modified DOM structure to a file for further use.
It would be amazing if you are willing to contribute to the project. Please look at any issues if they are present or reach out to the maintainer to collaborate!
This repository comes with tests and a simple benchmarking tool included. If you want to run these programs, please follow these steps:
Use the following commands to build the project based on your platform:
- For All Operating Systems:
  cmake -S . -B build/ -D CMAKE_BUILD_TYPE="Release" -D BUILD_SHARED_LIBS="false" -D BUILD_TESTS="true" -D BUILD_BENCHMARKS="true"
  cmake --build build/
- For Linux or macOS: You can use the provided build.sh script instead. Run the script with the -h flag to view all available build options. To build with the tests and benchmarks:
  ./build.sh -t -b
After building, run the tests:
  build/tests/html-parser-tests-Release
And run the benchmarks:
  build/benchmarks/html-parser-benchmarks-Release
This project is licensed under the MIT License. See the LICENSE file for details.
NB: Since this parser is lenient, it can probably also be used to parse XML, or similar markup languages. Be advised, this has not been tested and is not the goal of this project.