I'm using Apache Nutch 1.10. I changed the source so that it saves the raw HTML, CSS, and JS files to a directory on local disk. That works fine, but after the fetching step comes a slow parse stage. How can I skip parsing? I run the crawl with this command:
$ bin/crawl urls/ data/ 10
You're using the bin/crawl script, which loops over the generate-fetch-parse-... steps once per round, for the number of rounds you pass in. Check out the Nutch tutorial: you can issue each command yourself (and build your own script) using bin/nutch.
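As a rough sketch of what such a script might look like, here is one round driven by bin/nutch directly, with the parse step left out. The data/crawldb and data/segments paths assume the layout bin/crawl creates under data/, and the NUTCH, DRY_RUN and SKIP_PARSE switches are names I made up for illustration. Two caveats: as far as I know Nutch extracts outlinks during parsing, so without it the crawl will probably not grow beyond the URLs already in the crawldb, and I have not verified how updatedb treats an unparsed segment in 1.10, so the sketch only runs it after a parse.

```shell
#!/usr/bin/env bash
# Sketch of one fetch round driven by bin/nutch, with parsing optional.
# Paths and the NUTCH / DRY_RUN / SKIP_PARSE switches are illustrative assumptions.

NUTCH="${NUTCH:-bin/nutch}"
DRY_RUN="${DRY_RUN:-1}"        # 1 = print the commands only, do not run Nutch
SKIP_PARSE="${SKIP_PARSE:-1}"  # 1 = leave out the parse step

# Print a command and, outside dry-run mode, execute it.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Newest segment directory; a placeholder name is used in dry-run mode.
latest_segment() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$1/SEGMENT"
  else
    ls -d "$1"/* | sort | tail -n 1
  fi
}

# One generate-fetch round; parse and updatedb run only when SKIP_PARSE is not 1.
crawl_round() {
  local crawldb="$1" segments="$2" segment
  run "$NUTCH" generate "$crawldb" "$segments" || return 1
  segment="$(latest_segment "$segments")"
  run "$NUTCH" fetch "$segment" || return 1
  if [ "$SKIP_PARSE" != "1" ]; then
    run "$NUTCH" parse "$segment" || return 1
    run "$NUTCH" updatedb "$crawldb" "$segment" || return 1
  fi
}

# Seed the crawldb once, then run a single round.
run "$NUTCH" inject data/crawldb urls/
crawl_round data/crawldb data/segments
```

By default this only prints the commands it would run; set DRY_RUN=0 from the Nutch runtime directory to actually execute them, and SKIP_PARSE=0 to put the parse and updatedb steps back.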
However, if I understand correctly what you're doing, namely indexing HTML/CSS/JS to the local filesystem, then instead of changing the source you can create your own plugins (you'll need a parse plugin and an index plugin, I think) and apply them to the standard Nutch process.