Tuesday 15 January 2013

xml - Handling nested elements with Python lxml




Given the simple XML data below:

    <book>
      <title>my first book</title>
      <abstract>
        <para>first paragraph of abstract</para>
        <para>second paragraph of abstract</para>
      </abstract>
      <keywordset>
        <keyword>first keyword</keyword>
        <keyword>second keyword</keyword>
        <keyword>third keyword</keyword>
      </keywordset>
    </book>

How can I traverse the tree, using lxml, and get all the paragraphs in the "abstract" element, as well as all the keywords in the "keywordset" element?

The code snippet below returns only the first line of text in each element:

    from lxml import objectify

    root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
    print(root.title)  # prints the book title

    for line in root.abstract:
        print(line.para)  # prints only the first paragraph

    for word in root.keywordset:
        print(word.keyword)  # prints only the first keyword in the set

I tried to follow this example, but the code above doesn't work as expected.

On a different tack, it would be even better to be able to read the entire XML tree into a Python dictionary, with each element as a key and each text as the element's item(s). I found out that this might be possible using lxml objectify, but I couldn't figure out how to accomplish it.
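One possible approach to the dictionary idea, as a rough sketch rather than anything built into lxml: walk the tree recursively and collect children by tag. The helper name element_to_dict and the XML_STRING constant are made up for this illustration, and the fallback to the standard library's ElementTree is an assumption that only holds for the basic calls used here.

```python
try:
    from lxml import etree  # the library discussed in this post
except ImportError:
    # Assumption: the stdlib module is API-compatible for the calls used below.
    import xml.etree.ElementTree as etree

XML_STRING = """<book>
  <title>my first book</title>
  <abstract>
    <para>first paragraph of abstract</para>
    <para>second paragraph of abstract</para>
  </abstract>
  <keywordset>
    <keyword>first keyword</keyword>
    <keyword>second keyword</keyword>
    <keyword>third keyword</keyword>
  </keywordset>
</book>"""


def element_to_dict(element):
    """Hypothetical helper: turn an element into nested dicts, lists and strings."""
    children = list(element)
    if not children:
        # Leaf element: its value is simply its text content.
        return (element.text or "").strip()
    grouped = {}
    for child in children:
        # Group children by tag so repeated tags end up in one list.
        grouped.setdefault(child.tag, []).append(element_to_dict(child))
    # A tag that occurs once maps straight to its value instead of a 1-item list.
    return {tag: vals[0] if len(vals) == 1 else vals for tag, vals in grouped.items()}


root = etree.fromstring(XML_STRING)
book = {root.tag: element_to_dict(root)}
print(book["book"]["abstract"]["para"])
print(book["book"]["keywordset"]["keyword"])
```

Note that this sketch ignores attributes and mixed content, and a tag that happens to occur only once yields a plain value rather than a list, so callers need to handle both shapes.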

One big problem I keep running into when attempting to write XML parsing code in Python is that most of the "examples" provided are too simple and exclusively fictitious to be of much help, or else they go to the opposite extreme and use overly complicated, automatically generated XML data!

Could anybody give me a hint?

Thanks in advance!

Edit: After posting this question, I found a simple solution here.

So, the updated code becomes:

    from lxml import objectify

    root = objectify.fromstring(xml_string)  # xml_string contains the XML data above
    print(root.title)  # prints the book title

    for para in root.abstract.iterchildren():
        print(para)  # prints the text of each paragraph

    for keyword in root.keywordset.iterchildren():
        print(keyword)  # prints each keyword in the set

This is pretty simple using XPath:

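    from lxml import etree

    tree = etree.parse('data.xml')  # data.xml holds the XML data from the question

    # Absolute paths must start at the root element, <book>
    paragraphs = tree.xpath('/book/abstract/para/text()')
    keywords = tree.xpath('/book/keywordset/keyword/text()')

    print(paragraphs)
    print(keywords)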

Output:

    ['first paragraph of abstract', 'second paragraph of abstract']
    ['first keyword', 'second keyword', 'third keyword']

See the XPath tutorial at w3schools for details on XPath syntax.

In particular, the expressions above use:

the / selector, which selects the root node at the start of a path and the immediate children of an element within it;
the text() operator, which selects the text node (the "textual content") of the respective elements.
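To make the role of / and text() a bit more concrete, here is a small self-contained sketch that parses the XML from a string instead of a file. The XML_STRING constant and the variable names are chosen for this illustration only.

```python
from lxml import etree

XML_STRING = """<book>
  <title>my first book</title>
  <abstract>
    <para>first paragraph of abstract</para>
    <para>second paragraph of abstract</para>
  </abstract>
  <keywordset>
    <keyword>first keyword</keyword>
    <keyword>second keyword</keyword>
    <keyword>third keyword</keyword>
  </keywordset>
</book>"""

root = etree.fromstring(XML_STRING)

# A leading / starts at the document root; each further / steps to direct children.
paragraphs = root.xpath('/book/abstract/para/text()')

# Without text(), the same path yields the <para> elements, not their strings.
para_elements = root.xpath('/book/abstract/para')

# A path without a leading / is relative to the element it is called on.
keywords = root.xpath('keywordset/keyword/text()')

print(paragraphs)
print([el.tag for el in para_elements])
print(keywords)
```

The relative form can be handy when only a subtree is at hand, since it does not depend on the name of the root element.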

Here's how it could be done using the objectify API:

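    from lxml import objectify

    root = objectify.fromstring(xml_string)  # xml_string contains the XML data from the question

    # root.abstract.para iterates over all <para> siblings, not just the first
    paras = [p.text for p in root.abstract.para]
    keywords = [k.text for k in root.keywordset.keyword]

    print(paras)
    print(keywords)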

It seems that root.abstract.para is actually shorthand for root.abstract.para[0], so you need to explicitly use element.iterchildren() to access the child elements.

That's not true, though; we both misunderstood the objectify API: in order to iterate over the paras in abstract, you need to iterate over root.abstract.para, not over root.abstract itself. It's weird, because intuitively you'd think of abstract as a collection or container for its nodes, and expect that container to be represented by a Python iterable. But it's actually the .para selector that represents the sequence.
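The difference between iterating over root.abstract and over root.abstract.para can be checked with a short experiment. This is only a sketch against the sample document; XML_STRING and the variable names are assumptions made for the illustration.

```python
from lxml import objectify

XML_STRING = """<book>
  <title>my first book</title>
  <abstract>
    <para>first paragraph of abstract</para>
    <para>second paragraph of abstract</para>
  </abstract>
  <keywordset>
    <keyword>first keyword</keyword>
    <keyword>second keyword</keyword>
    <keyword>third keyword</keyword>
  </keywordset>
</book>"""

root = objectify.fromstring(XML_STRING)

# Iterating an objectify element walks that element and its same-tag siblings.
# There is a single <abstract>, so this visits just the one element.
abstract_items = [el.tag for el in root.abstract]

# root.abstract.para stands for the whole run of <para> siblings.
para_count = len(root.abstract.para)
paras = [p.text for p in root.abstract.para]

# Plain attribute access resolves to the first sibling, same as index 0.
first_para = root.abstract.para[0].text

# iterchildren() on the parent reaches the same elements from the other side.
via_children = [child.text for child in root.abstract.iterchildren()]

print(abstract_items, para_count)
print(paras)
print(via_children)
```

Both routes reach the same paragraphs here; iterchildren() would also pick up children with other tags, whereas root.abstract.para is limited to the para elements.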

python xml lxml
