Wednesday 15 June 2011

r - Using XPath 1.0 and httr, why does the substring-before function only return the first of several elements?



I need to understand how to have substring-before or substring-after apply to multiple nodes.

The code below returns not only the city I want but also additional unwanted details.

require(XML)
require(httr)
doc <- htmlTreeParse("http://www.cpmy.com/contact.asp", useInternal = TRUE)
> (string <- xpathSApply(doc, "//div[@id = 'leftcol']//p", xmlValue, trim = TRUE))
[1] "philadelphia office1880 jfk boulevard10th floorphiladelphia, pa 19103tel: 215-587-1600fax: 215-587-1699map , directions"
[2] "westmont office216 haddon avenuesentry office plaza, suite 703westmont, nj 08108tel: 856-946-0400fax: 856-946-0399map , directions"
[3] "boston office50 congress streetsuite 430boston, ma 02109tel: 617-854-8315fax: 617-854-8311map , directions"
[4] "new york office5 penn plaza23rd floornew york, ny 10001tel: 646-378-2192fax: 646-378-2001map , directions"

I added substring-before(), but it returns only the first element, correctly shortened, and not the remaining three:

> (string <- xpathSApply(doc, "substring-before(//div[@id = 'leftcol']//p, 'office')", xmlValue, trim = TRUE))
[1] "philadelphia "

How should I revise the XPath expression so that it extracts the shortened form (the text before "office") for all 4 elements?

Thank you.

If you must do the processing in XPath, a two-step process can be used. The nodes are selected first, and then the substring processing is done relative to each current node:

require(XML)
doc <- htmlParse("http://www.cpmy.com/contact.asp")
sapply(doc["//div[@id = 'leftcol']//p"], getNodeSet, "substring-before(./b/text(), 'office')")
[1] "philadelphia " "westmont " "boston " "new york "
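Since the page above may no longer be reachable, the same two-step idea can be tried on an inline document. This is only a minimal sketch: the HTML string is a made-up stand-in that assumes each city name sits in a <b> element inside its <p>, as the XPath in the answer implies, and the final trimws() call is an extra not present in the original answer.

```r
library(XML)

# Hypothetical stand-in for the contact page (not the real markup)
html <- "<html><body><div id='leftcol'>
<p><b>philadelphia office</b><br/>1880 jfk boulevard</p>
<p><b>boston office</b><br/>50 congress street</p>
</div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Step 1: select every <p> node under the div
nodes <- getNodeSet(doc, "//div[@id = 'leftcol']//p")

# Step 2: evaluate the string function once per node, relative to that node
raw <- sapply(nodes, getNodeSet, "substring-before(./b/text(), 'office')")

# Strip the trailing space left in front of 'office'
cities <- trimws(raw)
print(cities)
```

Because the string function is evaluated once per selected node, each call sees a node-set with a single <b> text node, so nothing is discarded.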

From http://www.w3.org/tr/xpath/#section-string-functions, in XPath 1.0:

A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.

So only one result is returned, hence the need for a two-step process. In XPath 2.0 you could use a string function within the XPath:
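The conversion rule quoted above can be checked on a small inline document. This is a sketch under assumptions: the HTML is a made-up stand-in for the contact page, and the variable names are illustrative only.

```r
library(XML)

# Hypothetical stand-in for the contact page (not the real markup)
html <- "<html><body><div id='leftcol'>
<p><b>philadelphia office</b></p>
<p><b>boston office</b></p>
</div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# count() shows that the path really does match both <b> nodes
n_nodes <- getNodeSet(doc, "count(//div[@id = 'leftcol']//p/b)")

# string() converts the node-set to a string, keeping only the first node
first_only <- getNodeSet(doc, "string(//div[@id = 'leftcol']//p/b)")

print(n_nodes)
print(first_only)
```

The path matches two nodes, yet any XPath 1.0 string function applied to that node-set sees only the first one, which is why substring-before() returned a single city.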

"//div[@id = 'leftcol']//p/b/text()[substring-before(. , 'office')]"

or something similar to return what you are looking for.
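If the substring step does not have to happen inside XPath at all, one alternative (not part of the original answer) is to extract the full text of every node with XPath 1.0 and shorten it in R afterwards. A minimal sketch, again using a made-up inline document in place of the real page:

```r
library(XML)

# Hypothetical stand-in for the contact page (not the real markup)
html <- "<html><body><div id='leftcol'>
<p><b>philadelphia office</b><br/>1880 jfk boulevard</p>
<p><b>westmont office</b><br/>216 haddon avenue<br/>sentry office plaza, suite 703</p>
</div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# XPath only selects the nodes; xmlValue returns the text of each one
full_text <- xpathSApply(doc, "//div[@id = 'leftcol']//p", xmlValue)

# Drop everything from the first occurrence of 'office' onwards, then trim
cities <- trimws(sub("office.*$", "", full_text))
print(cities)
```

sub() matches at the leftmost occurrence of "office", so a later "office" in the address (as in the second entry) does not affect the result.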

r xpath xml-parsing
