Monday 15 July 2013

Python - xml.etree writes XML to file in an unexpected manner




I'm using xml.etree.ElementTree to parse and alter a UTF-8 XML file. Two of my issues arise because the file is written in Unix file format instead of Windows. Issue 1 is the obvious one: line endings are \n instead of \r\n. Issue 2 is that UTF-8 strings are being rendered differently, because of the different file formats (I assume). How can I force the write() function to save in Windows file format? I use write() like this:

# -*- coding: utf-8 -*-
import sys
import xml.etree.ElementTree as ET

altspellingtree = ET.parse(sys.argv[2])
altspellingroot = altspellingtree.getroot()
recordlist = altspellingroot.findall("record")  # grab all <record> elements and iterate

for record in recordlist:
    # check for the existence of an <alternative_spelling> element
    alt_spelling_node = record.find("person").find("names").find("alternative_spelling")
    if alt_spelling_node is None:
        continue
    else:
        # check if the <alternative_spelling> element text is solely ","
        if alt_spelling_node.text == ",":
            alt_spelling_node.text = None  # remove lone comma

altspellingtree.write(sys.argv[2], encoding="utf-8", xml_declaration=True)

The third issue is that the file output uses self-closing tags where there used to be an opening and a closing tag (e.g. <country></country> becomes <country />). Is there a way to keep this from happening?

-------EDIT--------
Here are two samples of how the XML looks before the program is run:

<country></country>
<category_type></category_type>
<standard></standard>
<names>
  <first_name>fernando</first_name>
  <last_name>romero avila</last_name>
  <aliases>
    <alias xsi:nil="true" />
  </aliases>
  <low_quality_aliases>
    <alias xsi:nil="true" />
  </low_quality_aliases>
  <alternative_spelling>romero Ávila,fernando</alternative_spelling>
</names>

And the same two samples after the program is run:

<country />
<category_type />
<standard />
<names>
  <first_name>fernando</first_name>
  <last_name>romero avila</last_name>
  <aliases>
    <alias xsi:nil="true" />
  </aliases>
  <low_quality_aliases>
    <alias xsi:nil="true" />
  </low_quality_aliases>
  <alternative_spelling>romero Ãvila,fernando</alternative_spelling>
</names>
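One possible reading of the "Ãvila" rendering, offered only as an assumption since the thread does not confirm it: the file may still be valid UTF-8, with the viewer decoding it as a single-byte Windows/Latin code page. The short sketch below shows that misreading the UTF-8 bytes of "Á" as Latin-1 produces "Ã" followed by an invisible control character, which matches the sample.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the "Ãvila" effect by decoding UTF-8 bytes with the wrong codec.
original = u"romero \u00c1vila,fernando"      # \u00c1 is "Á"

# In UTF-8, "Á" is stored as the two bytes 0xC3 0x81.
utf8_bytes = original.encode("utf-8")

# A viewer that assumes Latin-1 shows each byte as its own character:
# 0xC3 becomes "Ã" and 0x81 becomes an invisible control character.
misread = utf8_bytes.decode("latin-1")

# Reversing the wrong step recovers the original text, so no data was lost.
repaired = misread.encode("latin-1").decode("utf-8")
```

If this is what is happening, the fix lies in the program used to view the file (telling it the file is UTF-8) rather than in write().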

I haven't tested your code for bugs, but to avoid the self-closing tags, change this:

altspellingtree.write(sys.argv[2], encoding="utf-8", xml_declaration=True)

to

altspellingtree.write(sys.argv[2], encoding="utf-8", xml_declaration=True, method="html")

That should do the trick.
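As an alternative to method="html", which switches to a different serializer (in current Python versions it does not emit the XML declaration, for instance), here is a minimal sketch using the short_empty_elements keyword of write(). That keyword only exists in Python 3.4 and later, so it does not apply to the Python 2.7 setup in the question. The helper name serialize_without_self_closing and the sample XML are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def serialize_without_self_closing(root):
    """Return the element as UTF-8 bytes, writing empty elements as <tag></tag>."""
    buffer = io.BytesIO()
    ET.ElementTree(root).write(
        buffer,
        encoding="utf-8",
        xml_declaration=True,
        short_empty_elements=False,  # Python 3.4+: never emit <tag />
    )
    return buffer.getvalue()


# Hypothetical sample: one element already expanded, one self-closing.
sample = ET.fromstring("<record><country></country><standard /></record>")
output = serialize_without_self_closing(sample)
```

Note that this setting applies to every empty element, so tags that were self-closing in the input (such as <alias xsi:nil="true" />) would also be written with separate opening and closing tags.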

To simplify the code considerably, you can use iter() to search the tree like this:

import xml.etree.ElementTree as ET

tree = ET.parse('your.xml')
for el in tree.iter('alternative_spelling'):
    # check el.text or whatever
    if el.text == u",":
        el.text = ""
    print(el.text)
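The line-ending part of the question is not covered above. One way to get \r\n endings, sketched here under the assumption that Python 3 is available (encoding="unicode" in tostring() is not supported on 2.7), is to serialize to a string first and let the text layer translate the newlines on write. The function name write_with_crlf is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def write_with_crlf(tree, path):
    """Save `tree` as UTF-8 with Windows (CRLF) line endings."""
    # Serialize to a text string rather than bytes (Python 3 only).
    xml_text = ET.tostring(tree.getroot(), encoding="unicode")
    # newline="\r\n" translates every "\n" written into "\r\n".
    with io.open(path, "w", encoding="utf-8", newline="\r\n") as handle:
        handle.write(u'<?xml version="1.0" encoding="utf-8"?>\n')
        handle.write(xml_text)
        handle.write(u"\n")
```

It would be called as write_with_crlf(tree, sys.argv[2]) in place of tree.write(...). The XML declaration is written by hand here because tostring() with encoding="unicode" does not add one.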

python xml python-2.7 utf-8 xml.etree
