ualbertalib/jupiter

Duplicate ERA content found by Google Search Console

jefferya opened this issue · 3 comments

Related to #3289

When the sitemap filter is applied to Google Search Console "Duplicate without user-selected canonical", three items appeared where Google thinks the content is similar to another item in the sitemap. Upon investigating the Google Search Console URL inspection, the "User-declared canonical" and "Google-selected canonical" appear very similar. E-mail sent to the erahelp team for advice (Jan 31; ref. #3289 (comment)).

The next week, the Google Search Console reported additional items. These items seem to be older (i.e., not added in the last week).

Question: is there a way to test for duplicates more efficiently than Google?

Attempt 1.: use the Active Storage database table active_storage_blob column checksum to verify each attachment is unique therefore finding any duplicate items.

irb(main):016:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_attachments")[0]
=> {"count"=>351758}
irb(main):017:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_blobs")[0]
=> {"count"=>388052}

irb(main):018:0> ActiveRecord::Base.connection.execute("select count(*) from theses")[0]
=> {"count"=>34452}
irb(main):019:0> ActiveRecord::Base.connection.execute("select count(*) from items")[0]
=> {"count"=>44781}

The number of blobs and attachments seems high relative to the number of items and thesis. Could this be related to #3248?

Let's test, each active_storage_blob should appear only once for each unique attachment, right?

irb(main):014:0> ActiveRecord::Base.connection.execute("SELECT a_blob.checksum, count(*) FROM active_storage_blobs as a_blob GROUP BY a_blob.checksum HAVING count(a_blob.checksum)>1").count()
=> 26520

Why are there so many blobs with the same checksum? Let's filter the attachment count by record_type

ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd5fcc98 status=PGRES_TUPLES_OK ntuples=374 nfields=1 cmd_tuples=374>

Why the decrease in numbers?

irb(main):065:0>  results = ActiveRecord::Base.connection.execute("SELECT record_type FROM  active_storage_attachments group by record_type")
=> #<PG::Result:0x00007f97dd922fd0 status=PGRES_TUPLES_OK ntuples=7 nfields=1 cmd_tuples=7>
irb(main):067:0> results.values
=>
[["ActiveStorage::Blob"],
 ["ActiveStorage::VariantRecord"],
 ["Community"],
 ["DraftItem"],
 ["DraftThesis"],
 ["Item"],
 ["Thesis"]]

Maybe due to the filter on the record_type? The answer seems like "yes" from the below

ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97de6f34e8 status=PGRES_TUPLES_OK ntuples=34829 nfields=1 cmd_tuples=34829>

In the list of duplicated checksums, let's find all the record_ids that have attachments to a duplicated checksum (Item or Thesis record types with attachment name = "file". This output will return draft items if they are attached to a duplicate checksum.

irb(main):055:0>  results = ActiveRecord::Base.connection.execute("SELECT * FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97e5553780 status=PGRES_TUPLES_OK ntuples=1118 nfields=16 cmd_tuples=1118>
 {"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "name"=>"files",
  "created_at"=>"2020-08-07T16:41:30.790Z",
  "record_type"=>"Item",
  "fileset_uuid"=>"819ebccf-476e-43e7-8712-17dcf5c0ddf8",
  "record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
  "blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
  "filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "metadata"=>"{\"identified\":true,\"analyzed\":true}",
  "byte_size"=>63687,
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "service_name"=>"local"},
 {"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "name"=>"files",
  "created_at"=>"2020-08-07T16:41:30.790Z",
  "record_type"=>"DraftItem",
  "fileset_uuid"=>nil,
  "record_id"=>"f4244872-c911-46ee-a56c-f013accec73d",
  "blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
  "filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "metadata"=>"{\"identified\":true,\"analyzed\":true}",
  "byte_size"=>63687,
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "service_name"=>"local"},
 {"id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
  "name"=>"files",
  "created_at"=>"2020-12-23T17:44:09.337Z",
  "record_type"=>"DraftItem",
  "fileset_uuid"=>nil,
  "record_id"=>"ad7db058-22e2-44f7-a71f-53f2097cbd4d",
  "blob_id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
  "key"=>"lujere8kg5fpoad2cpslxec22q7o",
  "filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "metadata"=>"{\"identified\":true,\"analyzed\":true}",
  "byte_size"=>63687,
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "service_name"=>"local"}]

Let's filter out the DraftItems and DraftThesis

irb(main):058:0>  results = ActiveRecord::Base.connection.execute("SELECT record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND (a.record_type = 'Item' OR a.record_type='Thesis') and a.name='files' and b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97dd710710 status=PGRES_TUPLES_OK ntuples=835 nfields=16 cmd_tuples=835>
irb(main):059:0> results.values
...
 {"record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
  "record_type"=>"Item",
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "created_at"=>"2020-08-07T16:41:30.790Z",
  "filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
 {"record_id"=>"20121c56-ed6e-4b1c-986e-7014401e83c5",
  "record_type"=>"Item",
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "created_at"=>"2020-12-23T17:44:09.337Z",
  "filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}

Let's write this to a CSV file

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end

Let's check if there are records (Item & Thesis) with multiple attachments with the same checksum (i.e., a file attached to a record multiple times):

irb(main):050:0> results = ActiveRecord::Base.connection.execute("SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=b`lob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd75b5d0 status=PGRES_TUPLES_OK ntuples=12 nfields=2 cmd_tuples=12>
irb(main):051:0> results.as_json
=>
[{"checksum"=>"Zl2l9EGyAmAksQ5tS+B/iA==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
 {"checksum"=>"CgOhJouY1r4jDj0HQAXm2w==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
 {"checksum"=>"1rjlByqkT2HV70S4rQsqTw==", "record_id"=>"ebef783e-da83-478b-8467-d5cc99709272"},
 {"checksum"=>"HljKqB6+a4z7yjfqllx4kQ==", "record_id"=>"6e08f261-6dd1-4498-a4be-c3cfb1e46393"},
 {"checksum"=>"qtt8obM5FdBOZKgmhECXFg==", "record_id"=>"9b515e88-e783-44b4-89ef-7ea4d3dc02fc"},
 {"checksum"=>"MxCGonF7giYkVKN4dx+rKg==", "record_id"=>"4653e6af-12d3-4767-be78-2713dec4d1f6"},
 {"checksum"=>"BY3gLzkSPTF18A1JX67kBg==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
 {"checksum"=>"xzbqWsiVx+4hVQiY64SAVA==", "record_id"=>"25520184-11ad-4465-a039-18695e42a92d"},
 {"checksum"=>"aL/ITdTrXnTksq0IU5ABcA==", "record_id"=>"fae6573a-934c-4b60-93e6-38a7ea82651c"},
 {"checksum"=>"KBSFkcK7+prvJwAqpJTAaQ==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
 {"checksum"=>"36r3b8SWaEEHSDmUDWYUZw==", "record_id"=>"fc272f44-17bb-4980-bb31-3b86097abab0"},
 {"checksum"=>"wXGvEO+P2YiZJuaB6CPwYA==", "record_id"=>"0b800874-9bc7-49e1-9f08-aeba43cfde77"}]

Are these intentional?

Let's output nicely in a similar format to the duplicate records finder

results = ActiveRecord::Base.connection.execute("SELECT  record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id and (b.checksum,a.record_id) IN (SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1)")

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end

Google sheet shared: https://docs.google.com/spreadsheets/d/1khOWEk2XusG98vafWBgzACmbM-a5TR7K4Xzy1VcZl6M/edit#gid=1219983193

New items found in the last two weeks:

Rough query to find items without an attached file (some may be on purpose

results = ActiveRecord::Base.connection.execute("SELECT i.id FROM items AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Item' and i.id = a.record_id) order by i.id")

Rough query to find theses without an attached file (some may be on purpose

results = ActiveRecord::Base.connection.execute("SELECT i.id FROM theses AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Thesis' and i.id = a.record_id) order by i.id")

Find Items with the same title as an record without and attachment

ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, 'Item', depositor, member_of_paths, ingest_batch, created_at, updated_at FROM items as i2 WHERE i2.title in (SELECT i.title FROM items AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Item' and i.id = a.record_id)) order by i2.title")

Find theses with the same title as an record without and attachment

ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, 'Thesis', depositor, member_of_paths, ingest_batch, created_at, updated_at FROM theses as i2 WHERE i2.title in (SELECT i.title FROM theses AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS
a WHERE record_type = 'Thesis' and i.id = a.record_id)) order by i2.title")

Find Items with the same title as an record without and attachment: add sometimes present name of the attached file to help

results = ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, as_b.filename, 'Item', i2.depositor, i2.ingest_batch, i2.created_at, i2.updated_at, i2.member_of_paths FROM items as i2 LEFT OUTER JOIN active_storage_attachments as as_a on i2.id=as_a.record_id and as_a.record_type='Item' LEFT OUTER JOIN active_storage_blobs as as_b ON as_b.id=as_a.blob_id WHERE i2.title in (SELECT i.title FROM items AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Item' and i.id = a.record_id)) order by i2.title, as_b.filename")

CSV.open('/era_tmp/delete_me_items.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end

Find theses with the same title as an record without and attachment: add sometimes present name of the attached file to help

results = ActiveRecord::Base.connection.execute("SELECT concat('https://era.library.ualberta.ca/items/', i2.id), i2.id, i2.title, as_b.filename, 'Thesis', i2.depositor, i2.ingest_batch, i2.created_at, i2.updated_at, i2.member_of_paths FROM theses as i2 LEFT OUTER JOIN active_storage_attachments as as_a on i2.id=as_a.record_id and as_a.record_type='Thesis' LEFT OUTER JOIN active_storage_blobs as as_b ON as_b.id=as_a.blob_id WHERE i2.title in (SELECT i.title FROM theses AS i WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE record_type = 'Thesis' and i.id = a.record_id)) order by i2.title, as_b.filename")

CSV.open('/era_tmp/delete_me_theses.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end

Notes:

  • If the filename column is empty, there should be not be an attached file
  • If the filename column has a value, the row is included because it is naively similar to a resource without an attached file and the row is present to help reconcile duplicates
  • There will be false positives in these two Google Sheets as the work assumes all ERA items/theses should have an attachment and picking "title" for naively indicating similarity is imperfect

Examples like the follow will likely need additional metadata to avoid Google labeling resources with similar metadata but different file attachments as "Duplicate, Google chose different canonical than user". Some examples

Updated list sent to the ERA service team for review