I am crawling a website with Scrapy. Currently it can extract the start URL, but it does not crawl any further from there.
My spider settings are:

    start_urls = ['https://cloud.cubecontentgovernance.com/retention/document_types.aspx']
    allowed_domains = ['cubecontentgovernance.com']
    rules = (
        Rule(LinkExtractor(allow=("document_type_retention.aspx?dtid=1054456",)), callback='parse_item', follow=True),
    )

The link I want to extract, as shown in the browser's developer tools, is:

    <a id="ctl00_body_listview1_ctrl0_hypernamelink" href="document_type_retention.aspx?dtid=1054456"> pricing </a>
The corresponding URL is https://cloud.cubecontentgovernance.com/retention/document_type_retention.aspx?dtid=1054456
So what should the allow field be? Thanks a lot.
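A note on the allow field itself: as far as I know, LinkExtractor treats each allow entry as a regular expression and searches for it in the absolute URL. In a regex, "." matches any character and "?" makes the preceding character optional, so the pattern as written would not match the literal "?dtid=..." part. A minimal sketch using only the standard re module (the variable names are mine, not from the spider) shows the difference:

```python
import re

# Absolute URL of the link, as given in the question.
url = ("https://cloud.cubecontentgovernance.com/retention/"
       "document_type_retention.aspx?dtid=1054456")

# Pattern exactly as written in the rule: "x?" means "optional x",
# so the literal "?" in the URL is never matched.
unescaped = r"document_type_retention.aspx?dtid=1054456"

# Same pattern with "." and "?" escaped so they match literally.
escaped = r"document_type_retention\.aspx\?dtid=1054456"

unescaped_hit = re.search(unescaped, url) is not None
escaped_hit = re.search(escaped, url) is not None

# re.escape builds an escaped pattern from the raw href automatically.
auto_escaped_hit = re.search(
    re.escape("document_type_retention.aspx?dtid=1054456"), url
) is not None

print(unescaped_hit, escaped_hit, auto_escaped_hit)
```

If the goal is to follow every document type rather than this single one, a looser pattern such as r"document_type_retention\.aspx\?dtid=\d+" would be one option. That is an assumption about what you want, not something stated in the question.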
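When I try to open the start URL in a browser, I get a login window.
Did you try printing response.body in a simple parse method for the start URL? My guess is that the Scrapy instance gets the same login window, so the page it downloads does not contain the URL you want the LinkExtractor to extract.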
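Rather than reading the whole body by eye, one way to run that check is a small helper that lists the links in the downloaded HTML and flags a password field. This is only a sketch using the standard html.parser module. PageProbe, probe_page and the two sample HTML strings are hypothetical names and data made up for illustration. In a spider you would pass response.text from a parse callback instead of the sample strings.

```python
from html.parser import HTMLParser


class PageProbe(HTMLParser):
    """Collects <a href> targets and notes whether a password input exists."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password_field = True


def probe_page(html_text):
    """Return (list of hrefs, True if the page looks like a login form)."""
    probe = PageProbe()
    probe.feed(html_text)
    return probe.hrefs, probe.has_password_field


# Hypothetical sample bodies standing in for what the spider might receive.
login_html = (
    '<form><input type="text" name="user">'
    '<input type="password" name="pw"></form>'
)
listing_html = (
    '<a id="ctl00_body_listview1_ctrl0_hypernamelink" '
    'href="document_type_retention.aspx?dtid=1054456"> pricing </a>'
)

login_hrefs, login_flag = probe_page(login_html)
listing_hrefs, listing_flag = probe_page(listing_html)
print(login_hrefs, login_flag)
print(listing_hrefs, listing_flag)
```

If the start URL comes back with a password field and no document_type_retention.aspx links, the rule has nothing to match, and the spider would need to log in before the CrawlSpider rules can do anything.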