Regex and Scrapy in web crawling

I am using Scrapy for web crawling. Currently it can extract the start URL, but it does not crawl any further.

    start_urls = ['https://cloud.cubecontentgovernance.com/retention/document_types.aspx']
    allowed_domains = ['cubecontentgovernance.com']
    rules = (
        Rule(LinkExtractor(allow=("document_type_retention.aspx?dtid=1054456",)),
             callback='parse_item', follow=True),
    )

The link I want to extract, as shown in the browser's developer tools, is: <a id="ctl00_body_listview1_ctrl0_hypernamelink" href="document_type_retention.aspx?dtid=1054456"> pricing </a>

The corresponding URL is https://cloud.cubecontentgovernance.com/retention/document_type_retention.aspx?dtid=1054456

So what should the allow field be? Thanks a lot.
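One thing worth checking regardless of the login issue: Scrapy's LinkExtractor treats each allow entry as a regular expression and searches for it in the absolute URL, so the `.` and `?` in the pattern above are regex metacharacters rather than literal characters. The following is a minimal sketch using only the standard `re` module to show the difference; the variable names and the `\d+` variant are illustrative assumptions, not part of the original spider.

```python
import re

# The absolute URL that the link resolves to (taken from the question).
url = ("https://cloud.cubecontentgovernance.com/retention/"
       "document_type_retention.aspx?dtid=1054456")

# Pattern as written in the question: "." matches any character and
# "x?" means "an optional x", so no literal "?" is ever matched.
unescaped = r"document_type_retention.aspx?dtid=1054456"

# Same pattern with the metacharacters escaped.
escaped = r"document_type_retention\.aspx\?dtid=1054456"

# Assumed generalisation: accept any numeric dtid value.
generic = r"document_type_retention\.aspx\?dtid=\d+"

unescaped_hit = re.search(unescaped, url) is not None
escaped_hit = re.search(escaped, url) is not None
generic_hit = re.search(generic, url) is not None

print(unescaped_hit, escaped_hit, generic_hit)
```

If this is the problem, the escaped string (or the more general one) would be the value to pass in the allow tuple. It will still extract nothing if the downloaded page does not contain the link in the first place.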

When I try to open the start URL in a browser, I get a login window.

Did you try printing response.body in a simple parse method for the start URL? I guess your Scrapy instance gets the same login window, which does not contain the URL you want to extract with the LinkExtractor.
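To make that check less manual than reading the printed body, here is a rough sketch that inspects an HTML string for a password field and lists the link targets it contains. It uses only the standard library; `PageProbe`, `probe`, and the two sample HTML strings are hypothetical names and data made up for illustration. The real login page of the site may look quite different.

```python
from html.parser import HTMLParser


class PageProbe(HTMLParser):
    """Collects <a href> targets and notes whether a password input exists."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        # A password input is a strong hint that this is a login page.
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True


def probe(html):
    """Parse an HTML string and return the populated PageProbe."""
    parser = PageProbe()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical response bodies, for illustration only.
login_html = ('<form><input type="text" name="user">'
              '<input type="password" name="pw"></form>')
listing_html = ('<a id="ctl00_body_listview1_ctrl0_hypernamelink" '
                'href="document_type_retention.aspx?dtid=1054456"> pricing </a>')

for name, body in (("login", login_html), ("listing", listing_html)):
    result = probe(body)
    print(name, result.has_password_field, result.hrefs)
```

Inside a Scrapy callback, the string to pass would be `response.text`. If the probe reports a password field and no matching href, the spider is receiving the login page and would need to authenticate before any link extraction rule can match.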
