This is a small project to check and ensure the behaviour of a regex-urlfilter file as used by the Nutch injector and db updater.
Such a regex-urlfilter file consists of a list of entries like:

```
[+-]<REGEX_1>
...
[+-]<REGEX_N>
```
For each URL, Nutch checks the regexes in the order in which they appear in the file, and the sign of the first matching regex determines whether the URL is ignored (for `-`) or fetched (for `+`).
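For example, a minimal filter file (with made-up rules for an imaginary site) could look like this. Because the `-` rule comes first, URLs under `/private/` are ignored even though they would also match the `+` rule below:

```
# ignore everything under /private/ on example.org
-^https?://example\.org/private/
# fetch everything else on example.org
+^https?://example\.org/
```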
The main goal of this project is to provide a test-driven way to create such a regex-urlfilter file. The approach is:
- Divide the content that you want to crawl into small parts
- For each part, you will
  - give some example URLs that must be fetched
  - give some example URLs that must be ignored
  - give regex rules for those examples
With that, this project will do the following:
- Put together all rules from the examples into the final regex-urlfilter file
- Check that this global regex-urlfilter file ignores each of the ignore-examples
- Check that this global regex-urlfilter file fetches each of the fetch-examples
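As a concrete (hypothetical) illustration: with the two rules from the snippet above, the URL `https://example.org/private/press/index.html` can only serve as an ignore-example; using it as a fetch-example would make the check fail, because the `-` rule matches it first.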
# Clone the source code

```
git clone https://github.com/mam10eks/check-nutch-regex-urlfilter.git
cd check-nutch-regex-urlfilter
```
# Compile the code

```
mvn clean install
```
For each example you create a directory within `regex_base_dir` following the pattern `<NUMBER>_<ARBITRARY_NAME>`, where `<NUMBER>` is a unique integer within `regex_base_dir`. The `<NUMBER>` determines the order in which the examples are concatenated in the final regex-urlfilter file (i.e. regexes from 1 come before those from 2 and so on). Each such `<NUMBER>_<ARBITRARY_NAME>` directory consists of three files:
- `black.txt`: URLs which should be ignored
- `white.txt`: URLs which should be crawled
- `url-regex.txt`: Nutch URL regexes to fulfill the examples provided in `black.txt` and `white.txt`
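A hypothetical example directory (the name, URLs, and regexes are invented for illustration) could look like this:

```
regex_base_dir/
└── 1_example_org/
    ├── black.txt       # contains: https://example.org/private/login
    ├── white.txt       # contains: https://example.org/articles/42
    └── url-regex.txt   # contains: -^https?://example\.org/private/
                        #           +^https?://example\.org/
```

With these rules the checks pass: the URL in `black.txt` matches the `-` rule first, while the URL in `white.txt` only matches the `+` rule.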
If you only want to check/crawl your own examples, remove all examples for other domains in your local copy, and commit only your new examples.
Simply execute

```
./build_regex_url_filter.sh
```
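This concatenates the rules from all example directories into the final regex-urlfilter file and verifies it against every fetch- and ignore-example, as described above.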