This Rust application downloads HTML content from a URL, extracts HTML tables, and converts them to CSV format.
Features
- Downloads HTML content from any URL
- Parses HTML and extracts table data
- Converts table data to CSV format with customizable options
- Command-line interface for easy usage
- Configurable CSV delimiter
- Flexible field quoting options
- Header removal option
- Column-specific quoting
- Column filtering (show only specified columns)
The application uses the following Rust crates:
clap- Command line argument parsingureq- HTTP client for downloading HTML with rustls TLS support
The current implementation includes:
- URL argument parsing - Uses clap to handle command line arguments with extensive options
- HTML downloading - Uses ureq HTTP client with rustls TLS support
- HTML table parsing - Robust HTML table parsing without external HTML parsing dependencies
- CSV conversion - Converts extracted table data to CSV format with customizable options:
- Configurable delimiter (comma, pipe, semicolon, etc.)
- Flexible quoting modes (never, always, as-needed)
- Header removal option
- Column-specific quoting
- Column filtering to show only specified fields
- Error handling - Proper error handling and user feedback
To build the project:
cargo build --releaseTo run tests:
cargo testTo run example:
cargo run -- https://gist.githubusercontent.com/bella92/4184664/raw/82982ace341d5a579ad53b53a47bcf58c7dea5ee/1.%2520Fresh-fruits# Basic usage - download and convert tables from a URL
$ html-table-csv-converter "https://example.com/page-with-tables"
# Use pipe delimiter instead of comma
$ html-table-csv-converter -d "|" "https://example.com/page-with-tables"
# Always quote all fields
$ html-table-csv-converter --quote-fields always "https://example.com/page-with-tables"
# Remove headers from output
$ html-table-csv-converter --no-header "https://example.com/page-with-tables"
# Quote only specific columns (1-based indexing)
$ html-table-csv-converter --quote-columns "1,3" "https://example.com/page-with-tables"
# Show only specific columns
$ html-table-csv-converter --show-fields "1,3" "https://example.com/page-with-tables"
# Combine options
$ html-table-csv-converter -d ";" --quote-fields never --no-header --show-fields "2,4" "https://example.com/page-with-tables"
-d, --delimiter <CHAR>: Change delimiter character (default: ',')-q, --quote-fields <MODE>: Quote fields mode: never, always, asneeded (default: asneeded)--no-header: Remove table headers from output--quote-columns <COLUMNS>: Only quote specific columns (comma-separated indices, e.g., '1,3')--show-fields <COLUMNS>: Show only specified columns (comma-separated indices, e.g., '1,3,5')
For a table like:
<table>
<tr><th>Name</th><th>Age</th><th>City</th></tr>
<tr><td>John Doe</td><td>25</td><td>New York</td></tr>
<tr><td>Jane Smith</td><td>30</td><td>Los Angeles</td></tr>
</table>The output will be:
Name,Age,City
John Doe,25,New York
Jane Smith,30,Los Angeles
With different options:
# Using pipe delimiter
Name|Age|City
John Doe|25|New York
Jane Smith|30|Los Angeles
# With always quote
"Name","Age","City"
"John Doe","25","New York"
"Jane Smith","30","Los Angeles"
# Without header
John Doe,25,New York
Jane Smith,30,Los Angeles
# Show only columns 1 and 3 (Name and City)
Name,City
John Doe,New York
Jane Smith,Los Angeles