Crawl always starts from server root

Question

Closed this issue 9 years ago · 3 comments

GoogleCodeExporter commented 9 years ago

When calling skipfish as:

./skipfish -o ../out https://test.com/foo/bar/baz.html

The crawler always starts from https://test.com/, ignoring path and parameters 
(and from looking at the code in database.c, it seems it does this every time a 
link points to a new host).

I'd like to submit a patch to change this behavior (through a command-line 
switch), but before I do that I'd like to know the rationale for the current 
code, so that I don't break any useful use case.

Best regards,
Mattia

Original issue reported on code.google.com by mattiaba...@gmail.com on 1 Jul 2013 at 8:45

Answer 1 · 2016-02-05T08:47:40.000Z

There is a separate command-line parameter to limit the scan to a specific path 
(or to exclude specific paths). Without it, the scanner accepts any number of 
"seed" URLs on the command line but still brute-forces the entire site. All of 
the seed URLs should still get crawled, just not right away.

Original comment by lcam...@google.com on 1 Jul 2013 at 8:56

Answer 2 · 2016-02-05T08:47:40.000Z


To expand on what Michal said: using -I /foo/bar/ for explicit inclusion limits 
active testing to /foo/bar/*.
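
Applied to the invocation from the question, that would look roughly like the sketch below. The -I value and the target URL come from this thread. The in_scope helper is hypothetical and only illustrates substring-style inclusion, which is an assumption about how the filter behaves rather than skipfish's actual code.

```shell
# Hypothetical helper: succeeds when the URL contains the scope string.
# Illustrates substring-style inclusion; not taken from skipfish itself.
in_scope() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

scope='/foo/bar/'
target='https://test.com/foo/bar/baz.html'

# Command line from the question with the inclusion filter added.
cmd="./skipfish -I $scope -o ../out $target"
echo "$cmd"

# Show which URLs the illustrative filter would keep.
in_scope "$target" "$scope" && echo "in scope: $target"
in_scope 'https://test.com/' "$scope" || echo "out of scope: https://test.com/"
```

The command is only echoed here, so the sketch can be tried without starting a scan; run the printed line directly once the scope looks right.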

Are you concerned about / or /foo/ being actively tested? This should not 
happen with -I. Or is there a different problem?

Original comment by niels.he...@gmail.com on 2 Jul 2013 at 8:18

Answer 3 · 2016-02-05T08:47:40.000Z

Original comment by niels.he...@gmail.com on 17 Nov 2013 at 8:16

Changed state: Invalid