chimbori/crux

Crux replaces page title with site title.

ciferkey opened this issue · 1 comments

I've been running crux over several sites and noticed the following bug.

Problem

Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816

Test based off the README example to verify the problem:

  @Test
  fun broken() {
    val crux = Crux()

    val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()

    val document = Jsoup.connect(httpUrl.toString()).get()

    val resource = runBlocking {
      crux.extractFrom(httpUrl, document)
    }

    assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
  }

The sequence of events is:

  • HtmlMetadataExtractor correctly extracts the right title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"
  • WebAppManifestParser extracts the title "BBC"
  • The fold operation in Crux.extractFrom uses Resource.plus to merge the resources overwriting the title with "BBC"
    fields = if (anotherResource?.fields == null) fields else fields + anotherResource.fields,

Possible solutions

If you update Crux.createDefaultPlugins to place WebAppManifestParser before HtmlMetadataExtractor like this:

public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
  // Static redirectors go first, to avoid getting stuck into CAPTCHAs.
  GoogleUrlRewriter(),
  FacebookUrlRewriter(),
  // Remove any tracking parameters remaining.
  TrackingParameterRemover(),
  // Prefer canonical URLs over AMP URLs.
  AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
  // Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
  WebAppManifestParser(okHttpClient),
  // Parses many standard HTML metadata attributes.
  HtmlMetadataExtractor(okHttpClient),
  // Extracts the best possible favicon from all the markup available on the page itself.
  FaviconExtractor(),
  // Parses the content of the page to remove ads, navigation, and all the other fluff.
  ArticleExtractor(okHttpClient),
)

It will produce the correct results.

This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser after HtmlMetadataExtractor or can we reorder it?

If that is not possible then we might need to consider a new way to handle merging the fields.

That sounds perfect: solving via reordering the plugins is the best solution.

I didn't envision this exact scenario when writing it up, so this is a good bug that you reported.