WGBH-MLA/AAPB2

HtmlScrubber breaks links with params following a '?'

afred opened this issue · 1 comments

afred commented

The bug

At the time of this writing, this record contains a "Segment Description", which has an AAPB link to a media segment, which uses URL params.

The HtmlScrubber.scrub method wipes out all params after the ? (but keeps the ? for some reason).

Done when

Links with parameters are preserved after "scrubbing".
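
A rough way to check this condition, as a sketch only: the helper `links_preserved?`, the `URL_PATTERN` constant, and the example.org URL below are all hypothetical and not part of the AAPB2 codebase. The idea is that every URL found in the original text should still appear, unchanged, in the scrubbed text.

```ruby
# Hypothetical check for the "Done when" condition: every URL in the
# original text must appear unchanged in the scrubbed text.
# URL_PATTERN is a deliberately simple assumption, not a full URL parser.
URL_PATTERN = %r{https?://\S+}

def links_preserved?(original, scrubbed)
  original.scan(URL_PATTERN).all? { |url| scrubbed.include?(url) }
end

# Made-up description text with a link that carries URL params.
original = 'See http://example.org/catalog/123?start=10&end=20 for the segment'
# What the text looks like once the params have been wiped out.
broken = 'See http://example.org/catalog/123?  for the segment'

puts links_preserved?(original, original) # prints "true"
puts links_preserved?(original, broken)   # prints "false"
```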

    if dirtay =~ /\/\w+/
      # Angle-brackets stripped, so be more aggressive
      dirtay = dirtay.gsub(/\w+=\S+/, ' ')
    end

This code from HtmlScrubber removes attributes such as class="whatever", which are left behind after an earlier step gsubs away < and >. However, HtmlScrubber is only used to display PBCore descriptions, not to ingest them. So an attempt to ingest PBCore with elements in the description will already fail validation before the record is ingested, and before HtmlScrubber is ever run.
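
The same pattern also matches URL params, which is what produces the bug. Here is a minimal reproduction, assuming only the snippet quoted above: `aggressive_scrub` is a hypothetical stand-in for that one step, not the full HtmlScrubber.scrub method, and the example.org URL is made up for illustration.

```ruby
# Hypothetical helper that reproduces only the snippet quoted above;
# it is not the full HtmlScrubber.scrub implementation.
def aggressive_scrub(text)
  dirty = text.dup
  if dirty =~ /\/\w+/
    # Matches any key=value run, so it also swallows "start=10&end=20".
    dirty = dirty.gsub(/\w+=\S+/, ' ')
  end
  dirty
end

# Made-up description text with a link that carries URL params.
description = 'See http://example.org/catalog/123?start=10&end=20 for the segment'
scrubbed = aggressive_scrub(description)

puts scrubbed
# prints "See http://example.org/catalog/123?  for the segment"
```

The `?` survives because `\w+` cannot match it, so the match starts at `start` and `\S+` then runs to the next whitespace, taking every remaining param with it.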

So I'm removing this line from HtmlScrubber, since it currently only scrubs elements that cannot appear in the description given the current PBCoreIngester.