Often, search engines only return a webpage's URL along with some snippets. However, sometimes it is necessary to retrieve the complete webpage content. To address this, the playwright-go-server project was developed. It leverages browser automation technology to fetch the full HTML content of a webpage and supports converting it to Markdown format, which is more convenient for subsequent processing by large language models.
- Webpage Content Fetching: Uses a browser pool (based on Playwright) to fetch the full HTML content of a given URL.
- Markdown Conversion: Converts the fetched HTML content into Markdown format for easier text processing and inference by large models.
- Efficient and Stable: Lazily initializes a global session pool and reuses browser instances across requests, avoiding the cost of launching a new browser for every fetch.
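The lazy pool initialization described in the last bullet can be sketched as follows. This is a minimal illustration, not the project's actual code: `Session`, `Pool`, `GetPool`, and the pool size of 2 are all hypothetical names and values, and `Session` merely stands in for a real Playwright browser instance.

```go
package main

import (
	"fmt"
	"sync"
)

// Session stands in for a browser instance. The type is hypothetical; a real
// implementation would wrap a Playwright browser or page here.
type Session struct {
	ID int
}

// Pool hands out a fixed number of reusable sessions.
type Pool struct {
	sessions chan *Session
}

// NewPool creates size sessions and parks them in a buffered channel.
func NewPool(size int) *Pool {
	p := &Pool{sessions: make(chan *Session, size)}
	for i := 0; i < size; i++ {
		p.sessions <- &Session{ID: i}
	}
	return p
}

// Acquire blocks until a session is free.
func (p *Pool) Acquire() *Session { return <-p.sessions }

// Release returns a session to the pool so a later request can reuse it.
func (p *Pool) Release(s *Session) { p.sessions <- s }

var (
	globalPool *Pool
	poolOnce   sync.Once
	initCount  int // how many times the pool was built (for illustration only)
)

// GetPool builds the global pool on first use and returns the same pool
// on every later call, even when called from several goroutines.
func GetPool() *Pool {
	poolOnce.Do(func() {
		initCount++
		globalPool = NewPool(2) // assumed pool size
	})
	return globalPool
}

func main() {
	p := GetPool()
	s := p.Acquire()
	fmt.Println("using session", s.ID)
	p.Release(s)
	fmt.Println("same pool on second call:", GetPool() == p)
}
```

Using `sync.Once` means nothing is launched until the first request arrives, and concurrent first requests cannot create two pools.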
- Clone the repository:
  git clone https://github.com/litongjava/playwright-go-server.git
  cd playwright-go-server
- Install Go dependencies:
  go mod tidy
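- Install the HTML-to-Markdown conversion library. When the library is already listed in go.mod, the go mod tidy step above downloads it.
- Build the binary:
  go build
Alternatively, build and run the service with Docker:
docker build -t litongjava/playwright-go-server:1.0.0 .
docker run -dit --name playwright-go-server --net=host litongjava/playwright-go-server:1.0.0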
The project provides an HTTP service with an endpoint to fetch webpage content and convert it based on the provided format.
- Endpoint: /fetch
- Query Parameters:
  - url: The URL of the webpage to fetch (required).
  - format: The format of the returned content (optional). When set to markdown, the content is returned in Markdown format; otherwise the raw HTML is returned.
Fetching Markdown-formatted content:
GET /fetch?url=https://example.com&format=markdown
curl "http://localhost/fetch?url=https://www.kapiolani.hawaii.edu/&format=markdown"
Start the service using the following command:
go run main.go
Once the server is running, you can make HTTP requests to the endpoint.
Contributions are welcome! Please feel free to open issues or submit pull requests to improve the project.
This project is licensed under the MIT License.