Scalpel completely fails for some sites
Closed this issue · 3 comments
Some sites return no markup at all, while others fail outright. I made a small test case to reproduce the issue; change the first argument to scrapeURL to one of the other URLs to test each case.
#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.4.0

import Text.HTML.Scalpel

-- SUCCESS: Prints all the HTML
reed = "http://www.reed.co.uk/jobs/london?keywords=javascript"

-- SUCCESS: Prints all the HTML
indeed = "http://www.indeed.co.uk/jobs?q=javascript&l=london"

-- FAILED: Prints the string "Failed"
jobsite = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"

-- FAILED: Doesn't print anything at all, which I think translates to a result
-- of `Just []`
monster = "http://www.monster.co.uk/jobs/search/?q=javascript&where=London"

main :: IO ()
main = do
    html <- scrapeURL monster $ htmls anySelector
    maybe printError printHtml html
  where
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn
Subsequent digging has shown me that there is nothing wrong with Scalpel or Network.Curl at all. It seems that some sites won't return anything unless you provide extra options, such as a user-agent string. I managed to reproduce this with the following small test:
#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package curl

import qualified Network.Curl as Curl

get :: String -> [Curl.CurlOption] -> IO (Curl.CurlResponse_ [(String, String)] String)
get url opts = Curl.curlGetResponse_ url opts

main :: IO ()
main = do
    response <- get url opts
    print $ Curl.respBody response
  where
    url = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"
    opts =
        [ Curl.CurlUserAgent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17"
        ]
Without adding the UA string, curl (or the scraping target) just returns an empty string.
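If I'm reading the scalpel-0.4.0 API correctly, the same workaround can be applied through Scalpel itself: scrapeURLWithOpts accepts a list of curl options to pass along with the request. A minimal sketch, reusing the failing jobsite URL:

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.4.0 --package curl

import Text.HTML.Scalpel
import qualified Network.Curl as Curl

-- Assumes scrapeURLWithOpts :: [Curl.CurlOption] -> URL -> Scraper String a -> IO (Maybe a)
main :: IO ()
main = do
    -- Same scrape as before, but with a UA string handed to curl.
    html <- scrapeURLWithOpts opts jobsite $ htmls anySelector
    maybe (putStrLn "Failed") (mapM_ putStrLn) html
  where
    jobsite = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"
    opts = [Curl.CurlUserAgent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17"]

Since the raw curl test above returns markup once the UA option is set, the jobsite URL should behave the same way here.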
This is perhaps more indicative of my lack of experience with scraping than of a bug in Scalpel.
@fimad Do you think this should be closed as-is? Would it be helpful for me to add a note about this to the documentation? Or should the library send a UA string by default?
Does any UA work? I'd be open to having Scalpel default to something like "scalpel/<version>", but I wouldn't want to masquerade as an existing web browser by default.
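A quick way to check would be to fetch the page with a few candidate UAs and compare how much body comes back. A rough sketch along the lines of the curl test above (the "scalpel/0.4" string is just an illustrative placeholder, not an official UA):

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package curl

import qualified Network.Curl as Curl

-- Fetch the same page once per candidate UA and report the body size.
main :: IO ()
main = mapM_ check candidates
  where
    url = "http://www.jobsite.co.uk/vacancies?search_type=quick&query=javascript&location=london&jobTitle-input=&location-input=&radius=20"
    candidates =
        [ ("no UA", [])
        , ("scalpel UA", [Curl.CurlUserAgent "scalpel/0.4"])  -- placeholder UA
        , ("browser UA", [Curl.CurlUserAgent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17"])
        ]
    check (name, opts) = do
        response <- Curl.curlGetResponse_ url opts
                        :: IO (Curl.CurlResponse_ [(String, String)] String)
        putStrLn $ name ++ ": " ++ show (length (Curl.respBody response)) ++ " bytes"

If the "scalpel UA" row comes back non-empty, a generic default UA would be enough.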
I think it would be worth adding a note to the documentation, perhaps with a code sample, mentioning that some sites will alter their responses based on the UA.