fedeya/remix-sitemap

XML Special characters aren't being escaped properly

AjaniBilby opened this issue · 5 comments

Describe the bug
The library is not correctly sanitising inputs before putting them in an XML body

To Reproduce

import type { SitemapFunction } from 'remix-sitemap';
import { json } from "@remix-run/node";

export const sitemap: SitemapFunction = async () => {
	return [{
		lastmod: new Date().toISOString().split('T')[0],
		loc: `/news/0`,
		news: [{
			title: "Something with an & symbol",
			publication: {
				name: 'Website',
				language: 'en'
			},
			date: new Date().toISOString().split('T')[0]
		}]
	}]
};

export async function loader() { return json({}); }

Invalid XML output

<url>
	<loc>https://example.com/news/0</loc>
	<lastmod>2023-10-15</lastmod>
	<changefreq>daily</changefreq>
	<priority>0.7</priority>
	<news:news>
		<news:publication>
			<news:name>Signplanet</news:name>
			<news:language>en</news:language>
		</news:publication>
		<news:publication_date>2023-10-15</news:publication_date>
		<news:title>Something with an & symbol</news:title>
	</news:news>
</url>

Expected behavior

<url>
	<loc>https://example.com/news/0</loc>
	<lastmod>2023-10-15</lastmod>
	<changefreq>daily</changefreq>
	<priority>0.7</priority>
	<news:news>
		<news:publication>
			<news:name>Signplanet</news:name>
			<news:language>en</news:language>
		</news:publication>
		<news:publication_date>2023-10-15</news:publication_date>
		<news:title>Something with an &amp; symbol</news:title>
	</news:news>
</url>

Environment (please complete the following information):

  • remix-sitemap version 3.0.0

This could be a potetial patch fix, though it's not perfect:

"Some text & stuff <things> here".replaceAll(/[^A-z0-9 ]/g, (c)=>`&#${c.charCodeAt(0).toString()};`)
'Some text &#38; stuff &#60;things&#62; here'
fedeya commented

I guess this can be resolved by simply changing this to true, but performance may be worse

processEntities: false,

It would indeed slow down performance, but I doubt that it will impact the entire tool meaningfully for it's purpose based on the request per seconds claim.

At least for me the biggest issue is that this library for some reason creates a new database connection for every single route it's generating a sitemap for, which is a much larger bottle neck than any xml processing would be

Ironically it might be faster to use the regex method I provided earlier, because then you can apply it only the fields which might have special characters such as names and titles, and can ignore dates and urls

fedeya commented

Yes, but your other comment makes sense, it's not expected to have a bunch of requests in a sitemap. The processEntities can be true

At least for me the biggest issue is that this library for some reason creates a new database connection for every single route it's generating a sitemap for, which is a much larger bottle neck than any xml processing would be

For this, if you are creating the connection to db in each sitemap function, i assume this is normal.
In a node environment, this should be resolved by connecting outside of the function, sharing the instance. i guess.
In serverless/edge idk maybe solving #6 can help some. I need to test how it works.

Please create another issue to investigate more about this