Saturday 15 January 2011

python - pyparsing whitespace match issues




I tried to use pyparsing to parse Robot Framework, which is a text-based DSL. The syntax is like the following (sorry, but I think it's a little hard for me to describe it in BNF). A single line in Robot Framework may look like:

    library\tsshclient    with name\tnode

\t is a tab, and in Robot Framework it is transparently converted to 2 spaces (in fact, it calls str.replace('\t', '  ') to replace the tab, which changes the length of each line, since len('\t') is 1 but len('  ') is 2). In Robot, 2 or more whitespace characters, or a '\t', are used to split tokens. If there is only 1 whitespace character between words, the words are considered one token group.

    library\tsshclient    with name\tnode

is split into the following tokens if parsed correctly:

['library', 'sshclient', 'with name', 'node']

As there is only 1 whitespace character between "with" and "name", the parser considers them to belong to one group token.
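For comparison, the splitting rule described above can be sketched with the standard library's re module instead of pyparsing. This is only a minimal illustration of the rule as stated in this question, not Robot Framework's actual implementation; the names CELL_SEPARATOR and split_cells are made up for the example, and the sample line is assumed to have four spaces between "sshclient" and "with".

```python
import re

# Separator: a single tab, or any run of two or more spaces/tabs.
# A lone space is NOT a separator, so "with name" stays one cell.
CELL_SEPARATOR = re.compile(r"[ \t]{2,}|\t")


def split_cells(line):
    """Split one line into cells using the rule described above."""
    return [cell for cell in CELL_SEPARATOR.split(line) if cell]


print(split_cells("library\tsshclient    with name\tnode"))
# ['library', 'sshclient', 'with name', 'node']
```

A regex like this gives a quick reference result to compare a pyparsing grammar against, but it does not report locations or handle any other Robot Framework syntax.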

Here is my code:

    from pyparsing import *

    ParserElement.setDefaultWhitespaceChars('\r\n\t ')
    source = "library\tsshclient    with name\tnode"
    each_line = Optional(Word(" ")).leaveWhitespace().suppress() + \
                CaselessKeyword("library").suppress() + \
                OneOrMore((Word(alphas)) + White(max=1).setResultsName('myvalue')) + \
                SkipTo(LineEnd())

    res = each_line.parseString(source)
    print(res.myvalue)

Questions:

1) How do I set the whitespace? If I want to match 2 or more spaces or 1 or more tabs, I thought code like White(ws=' ', min=2) | White(ws='\t', min=1) would work, but it fails. Is it not possible to specify the whitespace value?

2) Is there a way to get the index of a matched result? I tried setParseAction, but it seems I cannot get the index from the callback. I need both the start and the end index to highlight the word.

3) What do LineStart and LineEnd mean? When I print these values, they seem to be normal strings. Do I have to write them at the front and end of a line, like LineStart() + balabala... + LineEnd()?

Thanks. However, there is a restriction: I cannot replace '\t' with spaces.

    from pyparsing import *

    source = "library\tsshclient\t\t\twith name  s1"
    # whitespace here has been set to ' ', so why does the result still match '\t'?
    value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))
    linedefn = OneOrMore(value)
    res = linedefn.parseString(source)
    print(res)

I got

['library sshclient', 'with name', 's1']

but I expected ['library', 'sshclient', 'with name', 's1']

I always flinch when whitespace creeps into parsed tokens, but with your constraint that only single spaces are allowed, this should be workable. I used the following expression to define values that can have embedded single spaces:

    # each value consists of printable words separated by at most a
    # single space (a space that is not followed by another space)
    value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))

With this done, a line is just one or more of these values:

    linedefn = OneOrMore(value)

Following your example, including calling str.replace to replace tabs with pairs of spaces, the code looks like:

    data = "library\tsshclient    with name\tnode"
    # replace tabs with 2 spaces
    data = data.replace('\t', '  ')
    print(linedefn.parseString(data))

Giving:

['library', 'sshclient', 'with name', 'node']

To get the start and end locations of the values in the original string, wrap the expression in the new pyparsing helper method locatedExpr:

    # use the new locatedExpr to get the value, start, and end location
    # for each value
    linedefn = OneOrMore(locatedExpr(value))('values')

If we parse and dump the results:

    print(linedefn.parseString(data).dump())

We get:

    - values:
      [0]:
        [0, 'library', 7]
        - locn_end: 7
        - locn_start: 0
        - value: library
      [1]:
        [9, 'sshclient', 18]
        - locn_end: 18
        - locn_start: 9
        - value: sshclient
      [2]:
        [22, 'with name', 31]
        - locn_end: 31
        - locn_start: 22
        - value: with name
      [3]:
        [33, 'node', 37]
        - locn_end: 37
        - locn_start: 33
        - value: node
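As a cross-check on these locations, here is a small standard-library sketch that finds the same spans with re.finditer rather than pyparsing. The regex only approximates the value expression (words joined by single spaces), the names VALUE and locate_values are made up for the example, and the sample line is assumed to contain four spaces between "sshclient" and "with", which is what the reported offsets imply.

```python
import re

# One value: non-whitespace words joined by single spaces
# (roughly the same shape as the `value` expression, written as a regex).
VALUE = re.compile(r"\S+(?: \S+)*")


def locate_values(text):
    """Return (start, value, end) for each value found in text."""
    return [(m.start(), m.group(), m.end()) for m in VALUE.finditer(text)]


# Same preprocessing as in the answer: every tab becomes two spaces.
data = "library\tsshclient    with name\tnode".replace("\t", "  ")
for located in locate_values(data):
    print(located)
# (0, 'library', 7)
# (9, 'sshclient', 18)
# (22, 'with name', 31)
# (33, 'node', 37)
```

Note that these offsets refer to the string after the tabs were replaced, so they are shifted relative to the original tab-separated line.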

LineStart and LineEnd are pyparsing expression classes whose instances should match at the start and end of a line. LineStart has always been difficult to work with, but LineEnd is fairly predictable. In your case, if you just read and parse one line at a time, you shouldn't need them; just define the contents of the line that you expect. If you want to ensure that the parser has processed the entire string (and has not stopped short of the end because of a non-matching character), add + LineEnd() or + StringEnd() to the end of your parser, or add the argument parseAll=True to your call to parseString().

EDIT:

It is easy to forget that pyparsing calls str.expandtabs by default; you have to disable this by calling parseWithTabs. Doing that, and explicitly disallowing tabs between value words, resolves your problem and keeps the values at the right character counts. See the changes below:

    from pyparsing import *

    tab = White('\t')

    # each value consists of printable words separated by at most a
    # single space (a space that is not followed by another space)
    value = Combine(OneOrMore(~tab + (Word(printables) | White(' ', max=1) + ~White())))

    # each line has one or more of these values
    linedefn = OneOrMore(value)
    # do not expand tabs before parsing
    linedefn.parseWithTabs()

    data = "library\tsshclient    with name\tnode"
    # replace tabs with 2 spaces
    #data = data.replace('\t', '  ')

    print(linedefn.parseString(data))

    linedefn = OneOrMore(locatedExpr(value))('values')
    # do not expand tabs before parsing
    linedefn.parseWithTabs()
    print(linedefn.parseString(data).dump())
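To see why tab expansion matters here, the following standard-library sketch compares the untouched line with str.expandtabs() and with str.replace(). It uses no pyparsing at all and only illustrates how each preprocessing step changes lengths and offsets; the sample line is assumed to have four spaces between "sshclient" and "with".

```python
line = "library\tsshclient    with name\tnode"

expanded = line.expandtabs()          # tab stops every 8 columns
replaced = line.replace("\t", "  ")   # every tab becomes two spaces

# "library" is 7 characters wide, so the first tab expands to ONE space,
# which a single-space rule then reads as part of the same value.
print(repr(expanded))
# 'library sshclient    with name  node'

# Each preprocessing step changes the length of the line...
print(len(line), len(expanded), len(replaced))
# 35 36 37

# ...and therefore the offset of every later word. Only the untouched
# line gives positions that are valid in the original text.
print(line.index("node"), expanded.index("node"), replaced.index("node"))
# 31 32 33
```

This is consistent with the earlier result in which 'library sshclient' came back as a single value: after expansion, only one space separated the two words.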

python pyparsing
