html - Specify CSS when nth-child() changes between pages

I am having trouble specifying the correct CSS path when scraping multiple HTML elements. The problem is that the pages have different set-ups, so the element specified by nth-child(#) is out by 1 between different pages. The element I am interested in is the 'unit code', which is at nth-child(20) on some pages and nth-child(21) on others.

I am running this on hundreds of sites, so I need to figure out how to deal with the change in location. The code below runs nth-child(21) and, predictably, returns the incorrect text for the second URL.

I am using the rvest package.

    library(rvest)

    urls <- data.frame('site' = 1:2,
                       'urls' = c('https://www.deakin.edu.au/current-students/unitguides/unitguide.php?year=2015&semester=tri-1&unit=sle010',
                                  'https://www.deakin.edu.au/current-students/unitguides/unitguide.php?year=2015&semester=tri-2&unit=sle339'))

    urls$urls <- as.character(urls$urls)

    ucode <- sapply(1:length(urls[,1]), function(x)
                    html(urls[x,2]) %>%
                      html_nodes(css='#wmt_content > div:nth-child(21) > p.standard') %>%
                      html_text())

The HTML for each of the pages is quite large; the pages can be found at the first and second URLs in the code above. The HTML containing the unit code, plus a couple of the following divs, looks like this:

    <div class="unitguideelementitem">
        <a name="0-unit-code"></a>
        <p style="font-size: 100%;" class="bold">
            "unit code"
            <br>
            "&nbsp;"
            <br>
        <p style="font-size: 100%" class="standard">
            "sle334"
            <br>
        </p>
    </div>
    <div class="unitguideelementitem">
        <a name="0-unit-title"></a>
        <p style="font-size: 100%;" class="bold">
            "unit title"
            <br>
            "&nbsp;"
            <br>
        <p style="font-size: 100%" class="standard">
            "medical microbiology , immunology"
            <br>
    </div>
    <div class="unitguideelementitem">
        <a name="0-contact-hours"></a>
        <p style="font-size: 100%;" class="bold">
            "contact hours"
            <br>
            "&nbsp;"
            <br>
        <p style="font-size: 100%" class="standard">
            "3 x 1 hour class per week, 5 x 3 hour practicals per trimester."
            <br>
    </div>

There is nothing unique about this section of the HTML compared with the other sections except the 0-unit-code in the <a> tag. Looking at the w3schools page I was able to get the <a> tag, but I can't figure out how to specify the <p> siblings within that node. This gets the <a> tag:

    ucode <- sapply(1:length(urls[,1]), function(x)
                    html(urls[x,2]) %>%
                      html_nodes(css='[name$=code]') %>%
                      html_text())

Does anyone know how I might select the 'same' element, e.g. the siblings of name="0-unit-code", from the HTML file when the element's location changes from page to page? Or, how can I return information from tags that I can only locate through a different tag type with the same parent?

Edit: Included the package name. Included links to the sites and included more of the HTML for clarification.

You can use XPath's "following-sibling": "find the <p class=standard> that is a sibling of, and following, the <a name=0-unit-code>".

    ucode <- sapply(1:length(urls[,1]), function(x)
                    html(urls[x,2]) %>%
                      html_nodes(xpath="//a[@name='0-unit-code']/following-sibling::p[@class='standard']") %>%
                      html_text())

//a[@name='0-unit-code'] finds the <a> with name="0-unit-code" (note: I think the XPath //a[local-name()='0-unit-code'] syntax doesn't seem to be understood in this function?)
The /following-sibling::p[@class='standard'] part then selects the following siblings of that <a> that are <p> elements with class "standard".
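
To check the idea without hitting the live pages, here is a minimal sketch on two made-up fragments. It assumes a recent rvest, where read_html() replaces the older html(). The fragments, the helper names (unit_div, get_unit_code_xpath, get_unit_code_css) and the values "SLE010" and "SLE339" are placeholders, not taken from the real pages. The second fragment has an extra leading div, so the unit-code div sits at a different nth-child position. The sketch also tries the CSS general sibling combinator (~), which should express the same relationship for those who prefer to stay with CSS.

```r
library(rvest)

# Hypothetical helper: builds one "unitguideelementitem" div like those in the question.
unit_div <- function(anchor, label, value) {
  paste0('<div class="unitguideelementitem"><a name="', anchor, '"></a>',
         '<p class="bold">', label, '</p>',
         '<p class="standard">', value, '</p></div>')
}

# Two made-up pages; page_b has an extra div before the unit code,
# so a fixed nth-child() index would point at the wrong div on one of them.
page_a <- paste0('<div id="wmt_content">',
                 unit_div("0-unit-code", "Unit code", "SLE010"),
                 unit_div("0-unit-title", "Unit title", "Placeholder title A"),
                 '</div>')
page_b <- paste0('<div id="wmt_content"><div>Extra notice</div>',
                 unit_div("0-unit-code", "Unit code", "SLE339"),
                 unit_div("0-unit-title", "Unit title", "Placeholder title B"),
                 '</div>')

# XPath version: anchor on the <a name="0-unit-code">, then take its <p class="standard"> sibling.
get_unit_code_xpath <- function(doc) {
  doc %>%
    html_nodes(xpath = "//a[@name='0-unit-code']/following-sibling::p[@class='standard']") %>%
    html_text(trim = TRUE)
}

# CSS version: "~" selects later siblings that share the same parent.
get_unit_code_css <- function(doc) {
  doc %>%
    html_nodes(css = "a[name='0-unit-code'] ~ p.standard") %>%
    html_text(trim = TRUE)
}

pages <- list(read_html(page_a), read_html(page_b))
codes_xpath <- sapply(pages, get_unit_code_xpath)
codes_css <- sapply(pages, get_unit_code_css)
print(codes_xpath)
print(codes_css)
```

Both selectors depend only on the name attribute of the <a> tag and not on the position of the div, which is why they should keep working when the div moves from nth-child(20) to nth-child(21).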
