I need to share one common object instance among all the crawlers / spiders running on scrapyd. In the best scenario I would hook the object's methods onto each spider's signals, like this:

    ext = commonobject()
    crawler.signals.connect(ext.onspideropen, signal=signals.spider_opened)
    crawler.signals.connect(ext.onspiderclose, signal=signals.spider_closed)

etc., where commonobject would be instantiated and initialized only once and would expose its methods to all running crawling processes / spiders (I don't mind using a singleton for this purpose).
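To make the wiring above concrete, here is a minimal sketch of how such an object could be written as a Scrapy extension backed by a per-process singleton. The class name `CommonObject` (standing in for commonobject), the `instance()` helper, the `open_spiders` attribute and the handler names are all hypothetical; only the `from_crawler` hook, `crawler.signals.connect` and the `spider_opened` / `spider_closed` signals come from Scrapy itself.

```python
class CommonObject:
    """Hypothetical shared object: at most one instance per Python process."""

    _instance = None

    def __init__(self):
        # Names of the spiders that are currently open in this process.
        self.open_spiders = set()

    @classmethod
    def instance(cls):
        # Create the object on first use, then always return the same one.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook once per crawler when the class is listed
        # in the EXTENSIONS setting. The import is local so that the class
        # itself can be imported without Scrapy installed.
        from scrapy import signals

        ext = cls.instance()
        crawler.signals.connect(ext.on_spider_open, signal=signals.spider_opened)
        crawler.signals.connect(ext.on_spider_close, signal=signals.spider_closed)
        return ext

    def on_spider_open(self, spider):
        self.open_spiders.add(spider.name)

    def on_spider_close(self, spider, reason):
        self.open_spiders.discard(spider.name)
```

With this sketch, every crawler created in the same Python process would receive the same object. The singleton does not cross process boundaries, though: as long as scrapyd starts a separate process per job, each job would still get its own instance, which is why the two options below matter.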
Based on my research, I understand that I have two options:
- Run all spiders / crawlers within one CrawlerProcess, where commonobject would also be instantiated.
- Run one spider / crawler per CrawlerProcess (the default scrapy(d) behavior), instantiate commonobject somewhere in the reactor, and perhaps access it remotely using twisted.spread.pb.
Questions:
- Are there any CPU utilization penalties (is the CPU utilized less effectively) when using the first option compared to letting scrapyd manage the processes (which is the second option)? Is CrawlerProcess capable of running more crawlers in parallel (not sequentially)? How can I schedule further spiders at run time within the same CrawlerProcess? (I understand that CrawlerProcess.start() is blocking.)
- I am not advanced enough to implement the second option (actually, I am not sure whether it's a viable option at all). Could someone draw up a sample implementation?
- Perhaps I am missing something, and there is another way of doing this?