I wrote this as a tool to generate a sitemap from a webtozip.com download.
It should probably work with other website export tools like HTTrack, SiteSucker, or Archivarix, but I haven't tested it with those tools.
The download is a zip file that contains the entire website. The root directory contains the site's index.html file, and the rest of the website is organized in subdirectories.
Each directory contains an index.html file that represents the page. The script scans the directories recursively to build a nested structure of the website.
The script extracts the title of each page from the <title> tag in the HTML files and uses the directory structure to create a hierarchical JSON object representing the site's structure.
folder2sitemap is a Node.js script designed to generate a JSON representation of a website's structure based on its directory and file organization.
It scans a specified directory for HTML files, particularly looking for index.html files in each directory to determine the structure of the website.
Note that index.html files ARE REQUIRED for the script to work properly.
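The traversal and title extraction described above can be sketched roughly as follows. This is an illustration only, not the actual folder2sitemap source: the function names `extractTitle` and `buildTree` are hypothetical, and the sketch assumes that a directory without an index.html is skipped along with everything beneath it, which may differ from what the real script does.

```javascript
// Hypothetical sketch; not the actual folder2sitemap source.
const fs = require('node:fs');
const path = require('node:path');

// Pull the text of the first <title> element out of an HTML string.
// Returns an empty string when no <title> is present.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : '';
}

// Build a { slug, title, children? } node for `dir`.
// Assumption: a directory without index.html yields null and is skipped.
function buildTree(dir, slug = '/') {
  const indexPath = path.join(dir, 'index.html');
  if (!fs.existsSync(indexPath)) return null;

  const node = {
    slug,
    title: extractTitle(fs.readFileSync(indexPath, 'utf8')),
  };

  // Recurse into subdirectories in a stable, alphabetical order.
  const children = fs
    .readdirSync(dir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .sort((a, b) => a.name.localeCompare(b.name))
    .map((entry) => buildTree(path.join(dir, entry.name), `${slug}${entry.name}/`))
    .filter(Boolean);

  // Leaf pages carry no "children" key, matching the JSON shape in this README.
  if (children.length > 0) node.children = children;
  return node;
}
```

A regular expression is used for the title only because the script advertises no dependencies; a full HTML parser would be more robust against unusual markup.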
- Title Extraction: Extracts titles directly from the HTML files to accurately represent each page.
- Recursive Directory Traversal: Scans directories recursively to build a nested structure of the website.
- JSON Output: Outputs the website structure in a readable JSON format.
- CSV Output: Outputs the website structure in CSV format.
- Exclusion of Directories: Allows you to exclude specific directories from the sitemap generation.
- Custom Output File: Option to save the output directly to a file.
- No Dependencies: Requires only Node.js to run.
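As an illustration of the CSV output feature listed above, the nested structure can be flattened into one row per page. The sketch below is an assumption about how that might be done, not the script's real code: `csvQuote` and `toCsv` are hypothetical names, and the quoting style (every field quoted, header row `slug,title`) is inferred from the CSV example shown in this README.

```javascript
// Hypothetical helper: quote a value for CSV, doubling any embedded quotes.
function csvQuote(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Flatten a { slug, title, children? } tree into CSV text.
// Pages are emitted depth-first, parents before their children.
function toCsv(root) {
  const rows = ['slug,title'];
  const visit = (node) => {
    rows.push(`${csvQuote(node.slug)},${csvQuote(node.title)}`);
    (node.children || []).forEach(visit);
  };
  visit(root);
  return rows.join('\n');
}
```

Doubling embedded quotes keeps page titles that contain a `"` character from breaking the row.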
- Ensure you have Node.js installed on your system.
- Clone this repository or download the script to your local machine.
To use folder2sitemap, run the script from the command line, passing the path to the root directory of your website as an argument:
node folder2sitemap ./example.com
The script will output the structure of your website in JSON format to the console. You can redirect this output to a file if needed:
node folder2sitemap ./example.com > site_structure.json
To save the output directly to a file, use the --output flag followed by the file name:
node folder2sitemap ./example.com --output site_structure.json
By default, the script outputs the website structure in JSON format. If you prefer to output the structure in CSV format, use the --format=csv flag:
node folder2sitemap ./example.com --format=csv
This prints the website structure in CSV format to the console (you can redirect this output to a file as well), for example:
slug,title
"/","Home"
"/about/","About"
"/blog/","Blog"
"/blog/post1/","Post 1"
"/blog/post2/","Post 2"
To exclude specific directories from the sitemap generation, use the --exclude flag followed by the directory name relative to the site root. You can specify multiple directories to exclude by using multiple --exclude flags. For example:
node folder2sitemap ./example.com --exclude=contentassets --exclude=zh-cn
This command will generate the sitemap without including the directories /contentassets and /zh-cn.
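One way repeated --exclude flags could be collected and applied is sketched below. This is a guess at the behavior for illustration, not the script's actual argument handling: `parseExcludes` and `isExcluded` are hypothetical names, and the sketch assumes only the `--exclude=<dir>` form shown in the command above.

```javascript
// Collect every --exclude=<dir> value from an argv-style array.
// Leading and trailing slashes are stripped so "zh-cn" and "/zh-cn/" match alike.
function parseExcludes(argv) {
  const prefix = '--exclude=';
  return argv
    .filter((arg) => arg.startsWith(prefix))
    .map((arg) => arg.slice(prefix.length).replace(/^\/+|\/+$/g, ''))
    .filter(Boolean);
}

// True when `slug` is an excluded directory or lies inside one.
// The trailing slash in the comparison stops "zh-cn" from matching "/zh-cn-old/".
function isExcluded(slug, excludes) {
  return excludes.some((dir) => slug.startsWith(`/${dir}/`));
}
```

In a real run the first function would typically be fed `process.argv.slice(2)`.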
Given a website with a simple structure, the output might look like this:
{
  "slug": "/",
  "title": "Home",
  "children": [
    {
      "slug": "/about/",
      "title": "About"
    },
    {
      "slug": "/blog/",
      "title": "Blog",
      "children": [
        {
          "slug": "/blog/post1/",
          "title": "Post 1"
        },
        {
          "slug": "/blog/post2/",
          "title": "Post 2"
        }
      ]
    }
  ]
}
You can use online tools like JSON Crack to visualize the JSON output in a more structured format. Simply paste the JSON output into the tool to see a visual representation of the website structure.
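For a quick look without leaving the terminal, the JSON output can also be rendered as an indented outline. The helper below is a small sketch that relies only on the `slug`, `title`, and `children` keys shown in the example output; the name `outline` is hypothetical and not part of folder2sitemap.

```javascript
// Render a { slug, title, children? } tree as an indented text outline,
// two spaces of indentation per nesting level.
function outline(node, depth = 0) {
  const line = `${'  '.repeat(depth)}${node.title} (${node.slug})`;
  const childLines = (node.children || []).map((child) => outline(child, depth + 1));
  return [line, ...childLines].join('\n');
}
```

Assuming the output was saved as site_structure.json, the outline could be printed with `console.log(outline(JSON.parse(require('node:fs').readFileSync('site_structure.json', 'utf8'))))`.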