Crux replaces page title with site title.
ciferkey opened this issue · 1 comments
I've been running crux over several sites and noticed the following bug.
Problem
Here is an example URL that displays the problem: https://www.bbc.com/news/world-europe-61691816
Test based off the README example to verify the problem:
@Test
fun broken() {
val crux = Crux()
val httpUrl = "https://www.bbc.com/news/world-europe-61691816".toHttpUrl()
val document = Jsoup.connect(httpUrl.toString()).get()
val resource = runBlocking {
crux.extractFrom(httpUrl, document)
}
assertEquals("Ukraine anger as Macron says 'Don't humiliate Russia'", resource.fields[Fields.TITLE])
}
The sequence of events is:
- HtmlMetadataExtractor correctly extracts the right title "Ukraine anger as Macron says 'Don't humiliate Russia' - BBC News"
- WebAppManifestParser extracts the title "BBC"
- The fold operation in
Crux.extractFrom
usesResource.plus
to merge the resources overwriting the title with "BBC"
Possible solutions
If you update Crux.createDefaultPlugins
to place WebAppManifestParser
before HtmlMetadataExtractor
like this:
public fun createDefaultPlugins(okHttpClient: OkHttpClient): List<Plugin> = listOf(
// Static redirectors go first, to avoid getting stuck into CAPTCHAs.
GoogleUrlRewriter(),
FacebookUrlRewriter(),
// Remove any tracking parameters remaining.
TrackingParameterRemover(),
// Prefer canonical URLs over AMP URLs.
AmpRedirector(refetchContentFromCanonicalUrl = true, okHttpClient),
// Fetches and parses the Web Manifest. May replace existing favicon URL with one from the manifest.json.
WebAppManifestParser(okHttpClient),
// Parses many standard HTML metadata attributes.
HtmlMetadataExtractor(okHttpClient),
// Extracts the best possible favicon from all the markup available on the page itself.
FaviconExtractor(),
// Parses the content of the page to remove ads, navigation, and all the other fluff.
ArticleExtractor(okHttpClient),
)
It will produce the correct results.
This is the simplest way we can resolve it. Is there a specific reason to have WebAppManifestParser
after HtmlMetadataExtractor
or can we reorder it?
If that is not possible then we might need to consider a new way to handle merging the fields
.
That sounds perfect: solving via reordering the plugins is the best solution.
I didn't envision this exact scenario when writing it up, so this is a good bug that you reported.