HtmlScrubber breaks links with params following a '?'
afred opened this issue · 1 comments
the bug
At the time of this writing, this record contains a "Segment Description", which has an AAPB link to a media segment, which uses URL params.
The HtmlScrubber.scrub method wipes out all params after the '?' (but keeps the '?' for some reason).
Done when
Links with parameters are preserved after "scrubbing".
if dirtay =~ /\/\w+/
  # Angle-brackets stripped, so be more aggressive
  dirtay = dirtay.gsub(/\w+=\S+/, ' ')
end
This code from HtmlScrubber removes attributes such as class="whatever", after an earlier step gsubs away < and >. However, HtmlScrubber is only used to display PBCore descriptions, not to ingest them. So attempting to ingest PBCore with elements in the description will already fail validation before the record is ingested, and before HtmlScrubber is ever run.
So I'm removing this line from HtmlScrubber, since it currently only scrubs elements that cannot appear in the description given the current PBCoreIngester.
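For reference, here is a minimal sketch of how the quoted lines appear to produce the symptom from the issue. It reproduces only the snippet above, not the full HtmlScrubber.scrub method; the helper name aggressive_strip and the example.org URL are hypothetical and used only for illustration. Any URL contains a "/" followed by word characters, so it passes the guard, and the gsub then treats the query string as if it were an attribute.

```ruby
# Hypothetical helper wrapping only the lines quoted above.
# This is not the real HtmlScrubber.scrub, just the suspect step in isolation.
def aggressive_strip(text)
  # The guard looks for "/" followed by word characters (e.g. a leftover "/p"
  # from a closing tag), but "//example" in a URL matches it as well.
  if text =~ /\/\w+/
    # Intended to drop attributes like class="whatever"; it also matches
    # "start=10&end=20", because \S+ runs to the next whitespace.
    text = text.gsub(/\w+=\S+/, ' ')
  end
  text
end

# Made-up URL standing in for a media segment link with params.
link = 'http://example.org/catalog/some-id?start=10&end=20'
puts aggressive_strip(link).inspect
# → "http://example.org/catalog/some-id? "
```

The '?' survives because it comes before the first key=value match, which lines up with the behavior described in the issue. With the gsub line removed, the link would pass through this step unchanged.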