Sunday, 15 January 2012

Parsing a huge XML file using Go



We need to parse a huge XML file using Go. We'd like to use a SAX-like, event-based approach built on the xml.NewDecoder() and decoder.Token() library calls. We've created the appropriate struct types with XML annotations. Easy peasy so far.

Now, as we go through the file, we can observe the xml.StartElement tokens. And here comes the problem. We need to decode only the attributes of the starting token and then go on with the content. If we call decoder.DecodeElement(), the whole content is "decoded" or skipped in our scenario.

How can we decode just the attributes of a specific StartElement and then go on with the element's body?
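For what it's worth, the attributes already sit on the token itself: xml.StartElement has an Attr field, so they can be read without calling DecodeElement at all, and the next Token() call simply continues into the element's body. Here is a minimal sketch of that idea. The sample document (library/book/title) and the walk and walkString helpers are made up for illustration and are not from the original question.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sample is a made-up document; its element and attribute names are hypothetical.
const sample = `<library><book id="42" lang="en"><title>Go</title></book></library>`

// walk reads r token by token. For every start element it records the name
// and attributes straight from the token, then keeps going into the body.
func walk(r io.Reader) ([]string, error) {
	d := xml.NewDecoder(r)
	var events []string
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return events, nil
		}
		if err != nil {
			return events, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			line := "start " + t.Name.Local
			for _, a := range t.Attr {
				line += " " + a.Name.Local + "=" + a.Value
			}
			events = append(events, line)
		case xml.CharData:
			// Token's bytes are only valid until the next call, so copy via string().
			if s := strings.TrimSpace(string(t)); s != "" {
				events = append(events, "text "+s)
			}
		}
	}
}

// walkString is a convenience wrapper for in-memory documents.
func walkString(doc string) ([]string, error) {
	return walk(strings.NewReader(doc))
}

func main() {
	events, err := walkString(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range events {
		fmt.Println(e)
	}
}
```

Nothing is buffered beyond the current token, so memory use should stay flat regardless of file size.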

I parse Wikipedia XML dumps (~50GB XML files) in go-wikiparse using plain struct/reflect decoding. It's super simple.

The strategy is this:

First, read the envelope token:

    d := xml.NewDecoder(r)
    _, err := d.Token()
    if err != nil {
        return nil, err
    }

E.g., <somedocument><billions-of-other-things/></somedocument> will give you somedocument.
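One caveat: if the file begins with an XML declaration (<?xml ...?>), a comment, or whitespace, the first Token() call returns that instead of the envelope. A hedged sketch of skipping ahead to the first start element follows; openEnvelope and envelopeName are hypothetical helper names, not part of encoding/xml.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// openEnvelope advances d to the first start element and returns it,
// skipping any XML declaration, comments, or whitespace that precede it.
func openEnvelope(d *xml.Decoder) (xml.StartElement, error) {
	for {
		tok, err := d.Token()
		if err != nil {
			return xml.StartElement{}, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			return se, nil
		}
	}
}

// envelopeName is a convenience wrapper for in-memory documents.
func envelopeName(doc string) (string, error) {
	se, err := openEnvelope(xml.NewDecoder(strings.NewReader(doc)))
	if err != nil {
		return "", err
	}
	return se.Name.Local, nil
}

func main() {
	doc := `<?xml version="1.0"?>
<somedocument><billions-of-other-things/></somedocument>`
	name, err := envelopeName(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```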

Then, you can struct decode the next things in a loop:

    var i Item
    err := d.Decode(&i)

It doesn't use much RAM, and it's super easy to parse.
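Putting the two steps together, a minimal sketch might look like the following. The Item type, its fields, the sample document, and the readItems and collectItems helpers are all made up for illustration; they are not go-wikiparse's actual types. Attributes are picked up with the ",attr" struct tag option, and the loop relies on Decode returning io.EOF once the input is exhausted.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Item is a hypothetical record type. The ",attr" option maps an XML
// attribute to a field; plain names map child elements.
type Item struct {
	ID   string `xml:"id,attr"`
	Name string `xml:"name"`
}

// sample is a made-up document standing in for a huge file.
const sample = `<somedocument>
  <item id="1"><name>first</name></item>
  <item id="2"><name>second</name></item>
</somedocument>`

// readItems consumes the envelope's start tag, then decodes one child
// element at a time, so only a single Item is held in memory at once.
func readItems(r io.Reader, handle func(Item)) error {
	d := xml.NewDecoder(r)
	// Assumes the envelope is the very first token (no XML declaration).
	if _, err := d.Token(); err != nil {
		return err
	}
	for {
		var it Item
		err := d.Decode(&it)
		if err == io.EOF {
			return nil // input exhausted
		}
		if err != nil {
			return err
		}
		handle(it)
	}
}

// collectItems is a convenience wrapper for small in-memory documents.
func collectItems(doc string) ([]Item, error) {
	var items []Item
	err := readItems(strings.NewReader(doc), func(it Item) {
		items = append(items, it)
	})
	return items, err
}

func main() {
	items, err := collectItems(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, it := range items {
		fmt.Println(it.ID, it.Name)
	}
}
```

For a real multi-gigabyte file you would pass an *os.File (or a decompressing reader) to readItems and process each Item in the callback rather than collecting them in a slice.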

xml go sax
