fimad/scalpel

Scalpel completely fails for some sites

Closed this issue · 3 comments

jezen commented

Some sites return no markup at all, or just fail.

I made a small test case to reproduce the issue.

Change the first argument to scrapeURL to one of the other URLs to test.

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.4.0

import Text.HTML.Scalpel

-- SUCCESS: Prints all the HTML
reed = "http://www.reed.co.uk/jobs/london?keywords=javascript"

-- SUCCESS: Prints all the HTML
indeed = "http://www.indeed.co.uk/jobs?q=javascript&l=london"

-- FAILED: Prints the string "Failed"
jobsite = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"

-- FAILED: Doesn't print anything at all, which I think translates to a result
-- of `Just []`
monster = "http://www.monster.co.uk/jobs/search/?q=javascript&where=London"

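-- Fetch the chosen URL and print each scraped HTML chunk, or "Failed" when scraping returns Nothing.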
main :: IO ()
main = do
  html <- scrapeURL monster $ htmls anySelector
  maybe printError printHtml html
  where
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn

jezen commented

Subsequent digging has shown me that there is nothing at all wrong with Scalpel or Network.Curl.

It seems that some sites won't return anything unless you provide extra options, such as a user agent string. I managed to reproduce this with the following small test:

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package curl

import qualified Network.Curl as Curl

get :: String -> [Curl.CurlOption] -> IO (Curl.CurlResponse_ [(String, String)] String)
get url opts = Curl.curlGetResponse_ url opts

main :: IO ()
main = do
    response <- get url opts
    print $ Curl.respBody response
  where
    url = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"
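    -- Without the user agent option below, the response body comes back empty.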
    opts =
      [ Curl.CurlUserAgent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17"
      ]

Without adding the UA string, curl (or the scraping target) just returns an empty string.
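
For reference, the same option can presumably be passed through Scalpel itself rather than calling curl directly. This is only a sketch, assuming the curl-backed scalpel of this era exposes scrapeURLWithOpts and that it takes the list of CurlOptions before the URL:

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.4.0 --package curl

import qualified Network.Curl as Curl
import           Text.HTML.Scalpel

main :: IO ()
main = do
  -- Assumption: scrapeURLWithOpts takes the CurlOptions before the URL and
  -- threads them through to the underlying curl request.
  html <- scrapeURLWithOpts opts jobsite $ htmls anySelector
  maybe (putStrLn "Failed") (mapM_ putStrLn) html
  where
    jobsite = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"
    opts    = [ Curl.CurlUserAgent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17" ]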

This is perhaps more indicative of my lack of experience with scraping.

@fimad Do you think this should be closed as is? Or do you think it would be helpful for me to add a note about this in the documentation? Or perhaps the library should send a UA string by default?

fimad commented

Does any UA work? I'd be open to having Scalpel default to something like "scalpel/", but wouldn't want to masquerade as an existing web browser by default.

I think it would be worth adding a note to the documentation, perhaps with a code sample, mentioning that some sites will alter their responses based on the UA.

jezen commented

@fimad I'm not sure how different websites implement this, and there's probably no way of knowing either.

I'll send a PR with a bit of updated documentation.