internetarchive/heritrix3

ExtractorHTML matches srcset attribute case-sensitively

ato opened this issue · 0 comments

ato commented

Links of the form <source srcSet="1.jpg 1x, 2.jpg 2x"> are being extracted as a single URL like 1.jpg%201x,%202.jpg%202x. It appears that the srcset parser is not invoked unless the srcset attribute name is fully lowercase.

This appears to be because ExtractorHTML.elementContext() does not lowercase the attribute name, and processEmbed() then tests the resulting context with a case-sensitive comparison when deciding whether to invoke the srcset parser:

        if (context.equals(HTMLLinkContext.IMG_SRCSET.toString())
                || context.equals(HTMLLinkContext.SOURCE_SRCSET.toString())
                || context.equals(HTMLLinkContext.IMG_DATA_SRCSET.toString())
                || context.equals(HTMLLinkContext.IMG_DATA_ORIGINAL_SET.toString())
                || context.equals(HTMLLinkContext.SOURCE_DATA_ORIGINAL_SET.toString())) {
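
One possible fix, sketched here under assumptions, is to lowercase the attribute name at the point where the context string is built, so that the existing equals() checks keep working. This is a standalone illustration, not the Heritrix code: the class SrcsetContextSketch, its helper methods, and the context strings "img/@srcset" and "source/@srcset" are hypothetical stand-ins for the real HTMLLinkContext constants and for elementContext().

```java
import java.util.Locale;

public class SrcsetContextSketch {
    // Assumed string forms of two srcset contexts; stand-ins for
    // HTMLLinkContext.IMG_SRCSET and HTMLLinkContext.SOURCE_SRCSET.
    static final String IMG_SRCSET = "img/@srcset";
    static final String SOURCE_SRCSET = "source/@srcset";

    // Mimics the reported behaviour: the attribute name is used as written.
    static String elementContext(String element, String attribute) {
        return element + "/@" + attribute;
    }

    // Hypothetical fix: normalize the attribute name before building the context.
    // Locale.ROOT avoids locale-specific case mappings (e.g. the Turkish dotless i).
    static String elementContextLowercased(String element, String attribute) {
        return element + "/@" + attribute.toLowerCase(Locale.ROOT);
    }

    // Same shape as the case-sensitive check quoted above, reduced to two contexts.
    static boolean isSrcsetContext(String context) {
        return context.equals(IMG_SRCSET) || context.equals(SOURCE_SRCSET);
    }

    public static void main(String[] args) {
        String raw = elementContext("source", "srcSet");
        String normalized = elementContextLowercased("source", "srcSet");
        System.out.println(raw + " -> " + isSrcsetContext(raw));               // source/@srcSet -> false
        System.out.println(normalized + " -> " + isSrcsetContext(normalized)); // source/@srcset -> true
    }
}
```

Comparing with equalsIgnoreCase() inside processEmbed() would be an alternative, but normalizing once where the context is built would also cover any other case-sensitive checks on the same string.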