WGBH-MLA/AAPB2

Conducting searches with quotes yields results that don't include transcript snippets

caseyedavis12 opened this issue · 5 comments

Describe the bug
When a user conducts a search with quotes around the search term, retrieved results do not show where those keywords appear in transcripts. If you do the same search without quotes, search results show transcript snippets.

Expected behavior
Searches for terms in quotes should include results that show transcript snippets.

Looking into this:

When searching for "ice cream":
https://americanarchive.org/catalog?f%5Bspecial_collections%5D%5B%5D=backstory&per_page=10&q=%22ice+cream%22&utf8=%E2%9C%93&f[access_types][]=online

GET /snippets.json?

The request is sent with the following form data:

Form Data

ids[]: cpb-aacip-532-3775t3h512
ids[]: cpb-aacip-5a4ab819222
ids[]: cpb-aacip-532-t43hx1758w
ids[]: cpb-aacip-af60440fe9c
ids[]: cpb-aacip-cca5af185c1
ids[]: cpb-aacip-532-c53dz0493w
ids[]: cpb-aacip-532-h41jh3fc4q
ids[]: cpb-aacip-532-n29p26rf2k
ids[]: cpb-aacip-532-jq0sq8rs78
ids[]: cpb-aacip-41541e17829
query: ""

So the quoted search term is not sent properly from the search form: the query arrives as an empty pair of quotes.

JS / ERB

$(document).ready(function() {
  <% if @query.present? && @snippets && @snippets.keys.present? %>
    var guids = <%= raw(@snippets.keys).to_s %>
    var q = "<%= @query %>"
    getSnippets(guids, q)
  <% end %>
});

The double quotes in @query break when the ERB template is rendered.
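
One common way to avoid hand-quoting a Ruby string inside JavaScript is to emit it as a JSON literal, so the serializer supplies the surrounding quotes and the escapes. The sketch below uses only the standard library's `erb` and `json`; the local variable `query` stands in for `@query`. In a Rails view the equivalent would presumably also need `raw` (or similar) so the JSON is not HTML-escaped a second time. This is an assumption that has not been tried against this codebase, not a confirmed fix.

```ruby
require "erb"
require "json"

# Stand-in for @query as received from params[:q].
query = %("ice cream")

# Let to_json produce the JavaScript string literal, including its quotes,
# instead of wrapping the value in hand-written double quotes.
template = ERB.new('var q = <%= query.to_json %>;')
rendered = template.result_with_hash(query: query)

puts rendered
# prints: var q = "\"ice cream\"";
```

The rendered line is a valid JavaScript string literal whose value is the original query, inner quotes included.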

Tests

ice cream => "ice cream"
"ice cream" => "&quot;&quot;"
"ice cream => 500 error
'ice cream => "&#39;ice cream"
'ice cream' => "&#39;ice cream&#39;"
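
Most of these outputs match what plain HTML escaping would produce. The check below uses the standard library's `ERB::Util.html_escape` as a stand-in for the escaping that `<%= %>` applies in a Rails view (treating the two as equivalent is an assumption). It suggests the single-quote rows are simply the escaped input, while the `&quot;&quot;` row looks like the escape of a query that was already reduced to `""` before it reached the template.

```ruby
require "erb"

# Stand-in for the HTML escaping applied by <%= %> in a Rails view.
escape = ->(s) { ERB::Util.html_escape(s) }

puts escape.call("ice cream")      # ice cream
puts escape.call("'ice cream")     # &#39;ice cream
puts escape.call("'ice cream'")    # &#39;ice cream&#39;
puts escape.call(%("ice cream"))   # &quot;ice cream&quot;
puts escape.call(%(""))            # &quot;&quot;
```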

Escaping

Tried many combinations, including:

  • Escaping with single quotes '
  • Escaping with backticks `
  • Unescaping
  • Using raw()
  • Using html_escape()
  • Multiple of the above
  • Several others I've forgotten

So far, none yield the expected results.

catalog_controller

# pull this out because we're going to mutate it inside terms_array method
@query = params[:q].dup
@terms_array = query_to_terms_array(@query)

With pry:

> params
=> {"utf8"=>"✓", "f"=>{"access_types"=>["online"]}, "per_page"=>"100", "q"=>"\"ice cream\"", "controller"=>"catalog", "action"=>"index"}
> params[:q]
=> "\"ice cream\""
> params[:q].dup
=> "\"ice cream\""
> @query
=> "\"\""
> @terms_array
=> [["ICE", "CREAM"]]

Questions

  1. How is @query empty if params[:q].dup => "\"ice cream\""?
  2. How did @terms_array get [["ICE", "CREAM"]] from that?

Some answers

1+2. @query is "\"ice cream\"" before line 197, and "\"\"" afterwards.
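
This is what one would expect if query_to_terms_array mutates its argument. `params[:q].dup` protects `params`, but `@query` and the method's `query` parameter are still the same String object, and `remove!` edits that object in place. The sketch below replays the quoted-branch steps in plain Ruby. `gsub!` stands in for ActiveSupport's `remove!` (an assumption made to keep the example dependency-free), stopword filtering is left out, and the variable names are illustrative.

```ruby
param = %("ice cream")
query = param.dup                  # what the controller stores in @query

arg = query                        # the method receives the same object
quoteds = arg.scan(/"([^"]*)"/)    # [["ice cream"]]
quoteds.each { |q| arg.gsub!(q.first, "") }  # in-place edit, like remove!

# The rest of the branch works on what is left of the string.
remaining = arg.gsub(/[[:punct:]]/, "").upcase
terms = quoteds.flatten.map(&:upcase) + remaining.split(" ")
terms_array = terms.map { |t| t.upcase.strip.gsub(/[^\w\s]/, "").split(" ") }

p param        # "\"ice cream\""
p query        # "\"\""
p terms_array  # [["ICE", "CREAM"]]
```

The quoted phrase survives in `quoteds`, which is why the terms array is still populated even though the original string has been emptied.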

More questions

  1. ??

query_to_terms_array

After pairing with Drew, we decided this function is confusing.

def query_to_terms_array(query)
  return [] if !query || query.empty?
  stopwords = Rails.cache.fetch("stopwords") do
    sw = []
    File.read(Rails.root.join('jetty', 'solr', 'blacklight-core', 'conf', 'stopwords.txt')).each_line do |line|
      next if line.start_with?('#') || line.empty?
      sw << line.upcase.strip
    end
    sw
  end
  terms_array = if query.include?(%("))
    # pull out double quoted terms!
    quoteds = query.scan(/"([^"]*)"/)
    # now remove them from the remaining query
    quoteds.each { |q| query.remove!(q.first) }
    query = query.gsub(/[[:punct:]]/, '').upcase
    # put it all together (removing any term that's just a stopword)
    # and remove punctuation now that we've used our ""
    quoteds.flatten.map(&:upcase) + (query.split(" ").delete_if { |term| stopwords.any? { |stopword| stopword == term } })
  else
    query.split(" ").delete_if { |term| stopwords.any? { |stopword| stopword == term } }
  end
  # remove extra spaces and turn each term into word array
  terms_array.map { |term| term.upcase.strip.gsub(/[^\w\s]/, "").split(" ") }
end

Current steps:

  1. Read the list of stopwords
  2. If query contains quotes:
    1. Get quoted terms
    2. Remove quoted terms from query
    3. Remove punctuation from remaining query
    4. Uppercase remaining query
    5. Remove stopwords from remaining query
    6. Flatten quoted terms
    7. Add remaining cleaned query terms to quoted terms
  3. Else:
    1. Split query by space
    2. Remove stopwords
  4. For each term:
    1. Uppercase
    2. Strip whitespace
    3. Remove all non-word, non-space characters
  5. Return query terms array
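
As one possible simplification of the steps above, the sketch below builds new strings instead of editing the query in place, so the caller's value is left untouched. It takes the stopword list as an argument rather than reading it from the Solr config. The method name `terms_array_for`, the `stopwords` parameter, uppercasing unquoted terms before the stopword comparison, and dropping empty terms are all assumptions for illustration, not the project's code.

```ruby
# Hypothetical, non-mutating variant of query_to_terms_array.
def terms_array_for(query, stopwords = [])
  return [] if query.nil? || query.empty?

  # Quoted phrases are kept whole; scan returns one-element arrays.
  quoted = query.scan(/"([^"]*)"/).flatten

  # gsub returns a new string, so the caller's query is not modified.
  remaining = query.gsub(/"[^"]*"/, " ").gsub(/[[:punct:]]/, "").upcase
  unquoted = remaining.split(" ").reject { |term| stopwords.include?(term) }

  # Uppercase, drop non-word characters, and split each term into words.
  # Empty terms (for example from "") are dropped; this is an assumption.
  (quoted + unquoted)
    .map { |term| term.upcase.gsub(/[^\w\s]/, "").split(" ") }
    .reject(&:empty?)
end

query = %("ice cream" and the sundae)
p terms_array_for(query, %w[AND THE])  # [["ICE", "CREAM"], ["SUNDAE"]]
p query                                # "\"ice cream\" and the sundae"
```

Because the query string is never modified, `@query` would still hold the user's original input when the view renders it.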