xml - Handling nested elements with Python lxml
Given the simple XML below:

    <book>
      <title>my first book</title>
      <abstract>
        <para>first paragraph of abstract</para>
        <para>second paragraph of abstract</para>
      </abstract>
      <keywordset>
        <keyword>first keyword</keyword>
        <keyword>second keyword</keyword>
        <keyword>third keyword</keyword>
      </keywordset>
    </book>

How can I traverse the tree, using lxml, to get all the paragraphs in the "abstract" element, as well as all the keywords in the "keywordset" element?

The code snippet below returns only the first line of text in each element:
    from lxml import objectify

    root = objectify.fromstring(xml_string)  # xml_string contains the XML above
    print(root.title)  # returns the book title
    for line in root.abstract:
        print(line.para)  # returns only the first paragraph
    for word in root.keywordset:
        print(word.keyword)  # returns only the first keyword in the set
I tried to follow this example, but the code above doesn't work as expected.

On a different tack, I would still like to be able to read an entire XML tree into a Python dictionary, with each element as a key and each text as the element's item(s). I found out that this might be possible using lxml objectify, but I couldn't figure out how to accomplish it.

One big problem I keep running into when trying to write XML parsing code in Python is that most of the "examples" provided are either too simple and entirely fictitious to be of much help, or else just the opposite, using overly complicated automatically generated XML data!

Could anybody give me a hint?

Thanks in advance!

Edit: After posting this question, I found a simple solution here.

So, my updated code becomes:
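On the dictionary part of the question, a minimal sketch of one possible approach is a recursive walk over the tree. The helper name element_to_dict and the XML_STRING constant are hypothetical, and the layout (each tag maps to a list of values, so repeated tags such as para are kept) is only one of several reasonable choices. The import fallback to the standard library is a convenience, not something the question requires.

```python
from collections import defaultdict

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib API is compatible enough for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical constant holding the XML from the question.
XML_STRING = """<book>
  <title>my first book</title>
  <abstract>
    <para>first paragraph of abstract</para>
    <para>second paragraph of abstract</para>
  </abstract>
  <keywordset>
    <keyword>first keyword</keyword>
    <keyword>second keyword</keyword>
    <keyword>third keyword</keyword>
  </keywordset>
</book>"""


def element_to_dict(element):
    """Hypothetical helper: leaf -> stripped text, otherwise tag -> list of values."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = defaultdict(list)
    for child in children:
        # Repeated tags (para, keyword) accumulate in the same list.
        result[child.tag].append(element_to_dict(child))
    return dict(result)


root = etree.fromstring(XML_STRING)
book = {root.tag: element_to_dict(root)}

print(book["book"]["title"])                  # ['my first book']
print(book["book"]["abstract"][0]["para"])    # both paragraphs, in document order
```

Always wrapping values in a list keeps the structure predictable: callers never have to check whether a tag occurred once or several times. Attributes and mixed content are ignored in this sketch.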
    from lxml import objectify

    root = objectify.fromstring(xml_string)  # xml_string contains the XML above
    print(root.title)  # returns the book title
    for para in root.abstract.iterchildren():
        print(para)  # returns the text of all the paragraphs
    for keyword in root.keywordset.iterchildren():
        print(keyword)  # returns all the keywords in the set
This would be pretty simple using XPath:
    from lxml import etree

    tree = etree.parse('data.xml')

    # The root element is <book>, so absolute paths must start there.
    paragraphs = tree.xpath('/book/abstract/para/text()')
    keywords = tree.xpath('/book/keywordset/keyword/text()')

    print(paragraphs)
    print(keywords)
Output:

    ['first paragraph of abstract', 'second paragraph of abstract']
    ['first keyword', 'second keyword', 'third keyword']
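To try such queries without a data.xml file on disk, the XML from the question can be parsed from a string instead. The sketch below assumes the XML is held in a hypothetical XML_STRING constant. It also illustrates that an absolute path has to start at the <book> root element, that a relative path is evaluated from the element xpath() is called on, and that // matches at any depth.

```python
from lxml import etree

# Hypothetical constant holding the XML from the question.
XML_STRING = """<book>
  <title>my first book</title>
  <abstract>
    <para>first paragraph of abstract</para>
    <para>second paragraph of abstract</para>
  </abstract>
  <keywordset>
    <keyword>first keyword</keyword>
    <keyword>second keyword</keyword>
    <keyword>third keyword</keyword>
  </keywordset>
</book>"""

root = etree.fromstring(XML_STRING)  # root is the <book> element

# Absolute path: starts at the document root, so <book> must be named.
paragraphs = root.xpath('/book/abstract/para/text()')

# Relative path: evaluated from <book>, the element xpath() is called on.
keywords = root.xpath('keywordset/keyword/text()')

# '//' selects matching nodes at any depth in the document.
all_paras = root.xpath('//para/text()')

# An absolute path that skips the root element matches nothing.
missing = root.xpath('/abstract/para/text()')

print(paragraphs)
print(keywords)
print(missing)  # []
```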
See the XPath tutorial at w3schools for details on XPath syntax.
In particular, the expressions above use the / selector to select the root node and its immediate children, and the text() operator to select the text node (the "textual content") of the respective elements.

Here's how it could be done using the objectify API:
    from lxml import objectify

    root = objectify.fromstring(xml_string)

    # Iterate over the .para / .keyword selectors, not over their parents.
    paras = [p.text for p in root.abstract.para]
    keywords = [k.text for k in root.keywordset.keyword]

    print(paras)
    print(keywords)
It seems that root.abstract.para is shorthand for root.abstract.para[0], so you need to explicitly use element.iterchildren() to access all the child elements.
That's not true though, we both misunderstood the objectify API: in order to iterate over the paras in abstract, you need to iterate over root.abstract.para, not root.abstract itself. It's weird, because you'd intuitively think of abstract as a collection or container for its nodes, and that container would be represented by a Python iterable. But it's actually the .para selector that represents the sequence.
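The behaviour described in the last comment can be checked directly. This is a small sketch, assuming the XML from the question is held in a hypothetical XML_STRING constant: iterating over root.abstract walks the <abstract> element and any same-tag siblings (there is only one), whereas root.abstract.para behaves like a sequence of all <para> siblings.

```python
from lxml import objectify

# Hypothetical constant holding the XML from the question.
XML_STRING = """<book>
  <title>my first book</title>
  <abstract>
    <para>first paragraph of abstract</para>
    <para>second paragraph of abstract</para>
  </abstract>
  <keywordset>
    <keyword>first keyword</keyword>
    <keyword>second keyword</keyword>
    <keyword>third keyword</keyword>
  </keywordset>
</book>"""

root = objectify.fromstring(XML_STRING)

# Iterating an element yields itself plus siblings with the same tag,
# so this loop sees the single <abstract>, not its <para> children.
abstract_tags = [el.tag for el in root.abstract]

# len() and indexing on the .para selector cover all <para> siblings.
para_count = len(root.abstract.para)
first_para = root.abstract.para.text       # same element as para[0]
second_para = root.abstract.para[1].text

# iterchildren() is the explicit way to walk the children of <abstract>.
child_tags = [child.tag for child in root.abstract.iterchildren()]

print(abstract_tags)  # ['abstract']
print(para_count)     # 2
print(child_tags)     # ['para', 'para']
```

This also explains why the loop in the original question printed only one paragraph: it ran once, over <abstract> itself, and line.para then resolved to the first <para>.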
python xml lxml