fcavallarin/htcap

"Extra data" error when crawling

barhaterahul opened this issue · 15 comments

I was trying to crawl a website with -m active -v and I am getting these errors. Could you please look into it:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 69 - 317)

Exception in thread Thread-5:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
self.crawl()
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
probe = self.send_probe(request, errors)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
probeArray = self.load_probe_json(jsn)
File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
return json.loads(jsn)
File "/usr/lib/python2.7/json/init.py", line 339, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 341 - 589)

I had the same error…

Here is the content of the problematic json:

[
    ["cookies",[]],
    {"status":"ok","redirect":"http://example.com","time":0}
]
Blocked a frame with origin "file://" from accessing a frame with origin "null".  The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "about". Protocols must match.{"status":"ok", "partialcontent":true}]

There is clearly some garbage in it…

After investigation, it turns out the stdout is polluted by PhantomJS errors.

The best practice would be to use system.stdout.write('my json') (see example here) and to override console.log() to get some control over the console output. But I am not sure that is really the root cause here…
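For illustration, here is a minimal sketch of that pattern in a PhantomJS script (the probeResult value is made up for the example):

var system = require('system');

// Send anything logged via console.log() to stderr so it cannot
// pollute the JSON on stdout.
console.log = function () {
    system.stderr.writeLine(Array.prototype.slice.call(arguments).join(' '));
};

// Hypothetical probe result, for illustration only.
var probeResult = {"status": "ok", "redirect": "", "time": 0};

// stdout now carries the JSON result and nothing else.
system.stdout.write(JSON.stringify(probeResult));
phantom.exit(0);

Note that this only controls output going through console.log(); messages printed by the PhantomJS engine itself (like the "Blocked a frame" error above) can still land on stdout, which is why it may not be the root cause.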

Thanks! It's clearly some garbage generated by phantomjs.
Could you please provide steps to reproduce the problem?

I got the error while crawling one of our clients' websites. I tried to reproduce it in a more stable environment, without success. Sorry…

I'll try again next week

Finally, I found a way to reproduce:

  • run analyze.js on a local path: $ phantomjs core/crawl/probe/analyze.js /

  • It returns the same type of garbage:

[
{"status":"error","code":"load","time":0}
]
Blocked a frame with origin "file://" from accessing a frame with origin "null".  The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "about". Protocols must match.

thanks!!

It looks like the error happens every time PhantomJS hits a redirect…
It became a blocker for us here, so I'm starting to work on a fix.

After some research, it's because PhantomJS uses stdout to provide feedback and does not offer an option to deactivate it. On top of that, we can't rely on PhantomJS using stdout or stderr in the right cases (PhantomJS sends output to stdout even when it should have been sent to stderr).

So a solution would be to use a temporary file shared between the CrawlerThreads and PhantomJS (written with fs.write(), more here) and to read the file content afterward; a rough sketch follows the list of benefits below.

Benefits of this approach:

  • it increases the reliability of the PhantomJS output by guaranteeing 100% conformant JSON
  • it cleans up the JS code, where output currently has to go through console.log() calls
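
A rough sketch of the temporary-file idea, assuming the crawler passes the shared file path as a command-line argument (the argument handling and the payload are illustrative, not htcap's actual code):

var fs = require('fs');
var system = require('system');

// Hypothetical: the CrawlerThread passes the shared temp file path as the
// first script argument.
var outPath = system.args[1];

// Illustrative probe result.
var result = [["cookies", []], {"status": "ok", "redirect": "", "time": 0}];

// Mode 'w' truncates the file, so the reader always gets a single clean
// JSON document, no matter what PhantomJS prints on stdout/stderr.
fs.write(outPath, JSON.stringify(result), 'w');
phantom.exit(0);

The CrawlerThread would then json.loads() the file content once the PhantomJS process exits.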

Another solution would be some kind of local HTTP stream to share info between the two processes… but it seems to be a bit overkill for this matter.

@segment-srl, What do you think?

I'm still unable to reproduce this issue, even with "phantomjs core/crawl/probe/analyze.js /". What version of phantomjs are you using, and on which OS?

$ phantomjs --version
2.1.1

linux?

Yes, linux…
This is interesting: I don't get the same result with the binary provided by the ubuntu repo and with the one downloaded from the project page!
With the one from the project, I don't get any error…

interesting, yes.. so it's an issue related to the phantomjs build.. one solution is to write the analyze.js output to a file instead of stdout..

I checked the differences between the two builds (project vs ubuntu repo) and it seems that ubuntu does not use the same process for building PhantomJS.
I asked them why here: https://answers.launchpad.net/ubuntu/+source/phantomjs/+question/462517

@barhaterahul, what version of PhantomJS do you run? Is it the version provided by Ubuntu too?

Finally, my question at launchpad regarding the difference in the build process was closed without a straight answer…
So, I updated the readme: #20

This issue is related to the phantomjs build on some linux distros. Using the binary from the official website should fix the problem.
Since phantomjs is no longer supported, htcap is now moving to headless chrome, so issues similar to this one won't be fixed.