I am having trouble specifying the correct CSS path when scraping multiple HTML elements. The problem is that the pages have different set-ups, so the element specified by nth-child(#) is out by 1 between different pages. The element I am interested in is the 'unit code', which is at nth-child(20) on some pages and nth-child(21) on others.
I am running this on hundreds of sites, so I need to figure out how to deal with the change in location. The code below runs with nth-child(21) and predictably returns the incorrect text for the second URL.
I am using the package rvest.
    library(rvest)

    # Two example unit guide pages; the unit code sits at a different
    # nth-child position on each.
    urls <- data.frame('site' = 1:2,
                       'urls' = c('https://www.deakin.edu.au/current-students/unitguides/unitguide.php?year=2015&semester=tri-1&unit=sle010',
                                  'https://www.deakin.edu.au/current-students/unitguides/unitguide.php?year=2015&semester=tri-2&unit=sle339'))
    urls$urls <- as.character(urls$urls)

    # Position-based selector: only correct where the unit code is the 21st child.
    ucode <- sapply(seq_len(nrow(urls)), function(x)
      read_html(urls[x, 2]) %>%
        html_nodes(css = '#wmt_content > div:nth-child(21) > p.standard') %>%
        html_text())

The HTML for each of the pages is quite large; the pages can be found at the first and second URLs in the code above. The HTML containing the unit code, plus a couple of other divs, looks like this:
    <div class="unitguideelementitem">
      <a name="0-unit-code"></a>
      <p style="font-size: 100%;" class="bold">
        "unit code"
        <br>
        " "
        <br>
      <p style="font-size: 100%" class="standard">
        "sle334"
        <br>
      </p>
    </div>
    <div class="unitguideelementitem">
      <a name="0-unit-title"></a>
      <p style="font-size: 100%;" class="bold">
        "unit title"
        <br>
        " "
        <br>
      <p style="font-size: 100%" class="standard">
        "medical microbiology and immunology"
        <br>
    </div>
    <div class="unitguideelementitem">
      <a name="0-contact-hours"></a>
      <p style="font-size: 100%;" class="bold">
        "contact hours"
        <br>
        " "
        <br>
      <p style="font-size: 100%" class="standard">
        "3 x 1 hour class per week, 5 x 3 hour practicals per trimester."
        <br>
    </div>

There is nothing unique in this section of HTML compared to the other sections except the 0-unit-code in the <a> tag. Looking at the w3schools page I was able to get the <a> tag, but I can't figure out how to specify the <p> siblings within this node. Getting the <a> tag:
    # Selects the <a> anchors whose name attribute ends in "code".
    ucode <- sapply(seq_len(nrow(urls)), function(x)
      read_html(urls[x, 2]) %>%
        html_nodes(css = '[name$=code]') %>%
        html_text())

Does anyone know how I might select the 'same' element, e.g. the siblings of name="0-unit-code", from the HTML file when the element's location changes from page to page? Or, how to return information on the tags so that I can locate a different tag type with the same parent?
Edit: Included the package name. Included links to the sites and included more of the HTML for clarification.
You can use XPath's "following-sibling" axis: "find the <p class=standard> that is a sibling of, and following, <a name=0-unit-code>".

    ucode <- sapply(seq_len(nrow(urls)), function(x)
      read_html(urls[x, 2]) %>%
        html_nodes(xpath = "//a[@name='0-unit-code']/following-sibling::p[@class='standard']") %>%
        html_text())

- //a[@name='0-unit-code'] finds the <a> with name="0-unit-code" (note: //a[local-name()='0-unit-code'] does not work here, because local-name() returns the element's tag name, a, rather than the value of its name attribute).
- /following-sibling::p[@class='standard'] selects the following siblings of that <a> that are <p> elements with class standard.
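
To see that the anchor-based selector does not depend on position, here is a minimal sketch that runs without network access. The two inline HTML strings are hypothetical, simplified stand-ins for the real pages (the second has an extra div that shifts the unit-code block down by one), the unit codes in them are placeholder values, and extract_unit_code is a helper name made up for this illustration. It uses read_html(), the current rvest name for the parsing function. The last lines also try a CSS equivalent using the general sibling combinator (~), on the assumption that rvest's CSS-to-XPath translation handles it the same way.

```r
library(rvest)

# Hypothetical page fragments (not the real Deakin pages).
page_a <- '<div id="wmt_content">
  <div class="unitguideelementitem"><a name="0-unit-code"></a>
    <p class="bold">Unit code</p><p class="standard">SLE010</p></div>
  <div class="unitguideelementitem"><a name="0-unit-title"></a>
    <p class="bold">Unit title</p><p class="standard">Title A</p></div>
</div>'

# Same layout, but an extra div comes first, so the unit-code div moves
# from the first child position to the second.
page_b <- '<div id="wmt_content">
  <div class="unitguideelementitem"><a name="0-extra-notice"></a>
    <p class="bold">Notice</p><p class="standard">Some notice</p></div>
  <div class="unitguideelementitem"><a name="0-unit-code"></a>
    <p class="bold">Unit code</p><p class="standard">SLE339</p></div>
  <div class="unitguideelementitem"><a name="0-unit-title"></a>
    <p class="bold">Unit title</p><p class="standard">Title B</p></div>
</div>'

# Anchor on the named <a>, then step to its following <p class="standard"> sibling.
unit_code_xpath <- "//a[@name='0-unit-code']/following-sibling::p[@class='standard']"

# Hypothetical helper: parse an HTML string and return the trimmed unit code text.
extract_unit_code <- function(html_string, xpath = unit_code_xpath) {
  read_html(html_string) %>%
    html_nodes(xpath = xpath) %>%
    html_text() %>%
    trimws()
}

codes <- vapply(c(page_a, page_b), extract_unit_code, character(1),
                USE.NAMES = FALSE)
print(codes)

# CSS alternative: "~" selects later siblings that share the same parent.
codes_css <- vapply(c(page_a, page_b), function(h)
  read_html(h) %>%
    html_nodes(css = "a[name='0-unit-code'] ~ p.standard") %>%
    html_text() %>%
    trimws(),
  character(1), USE.NAMES = FALSE)
print(codes_css)
```

For the real pages, the inline strings would be replaced by the URLs. vapply(..., character(1)) assumes exactly one match per page, so a page without a unit-code anchor raises an error instead of silently returning misaligned results.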