web crawler - Apache Nutch: skip the 'parse' stage


I'm using Apache Nutch 1.10, and I've changed the sources to save the raw HTML, CSS, and JS files to a directory on the local disk. That works fine, but after the fetching step comes the slow parse stage. How can I skip parsing? I run the crawl with this command:

$ bin/crawl  urls/  data/ 10 

You're using the bin/crawl script, which loops over the generate-fetch-parse-... steps for the given number of rounds. Check out the Nutch tutorial: you can issue each command yourself (and build your own script) using bin/nutch.
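
As a rough sketch of such a script, assuming the Nutch 1.x command names from the tutorial (inject, generate, fetch), a single round without the parse step might look like the following. The function name `fetch_only_round` and the variables `NUTCH`, `SEED_DIR`, `CRAWL_DIR` and `TOP_N` are made up for illustration, and the values just mirror the `urls/` and `data/` arguments from the question. Keep in mind that, as far as I know, outlinks are only extracted during parsing, so without it later rounds won't discover new URLs.

```shell
#!/usr/bin/env bash
# Sketch only: one generate-fetch round with the parse step left out.
# NUTCH, SEED_DIR, CRAWL_DIR and TOP_N are illustrative names, not Nutch settings.
NUTCH="${NUTCH:-bin/nutch}"
SEED_DIR="${SEED_DIR:-urls}"
CRAWL_DIR="${CRAWL_DIR:-data}"
TOP_N="${TOP_N:-1000}"

fetch_only_round() {
  # Seed the crawl database with the URLs from the seed directory.
  "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR" || return 1

  # Select up to TOP_N URLs and create a new timestamped segment for them.
  "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOP_N" || return 1

  # Segment names are timestamps, so the last one in sorted order is the newest.
  local segment
  segment="$(ls -d "$CRAWL_DIR"/segments/2* | tail -n 1)"
  [ -n "$segment" ] || return 1

  # Fetch the segment; "parse" is deliberately never called.
  "$NUTCH" fetch "$segment"
}
```

You'd call `fetch_only_round` from the Nutch runtime directory, the same place you run bin/crawl from.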

However, if I understand what you're doing correctly, meaning indexing the HTML/CSS/JS to the local filesystem, then instead of changing the sources you can create your own plugins (you'll need a parse plugin and an index plugin, I think) and apply them to the standard Nutch process.

