Sunday 15 August 2010

python - Regular Expression Processing HTML -




I need to strip HTML tags (e.g. <p>, <img>, etc.) from a web page's source code, but I want to keep <br> and <br/>. I have tried:

re.sub(r'<[^>]+?>', u'', html, flags=re.I)

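This achieves the first goal, but it cannot keep <br> or <br/>. r'<[^>br]+?>' won't accomplish the goal either, since a character class excludes the single characters b and r rather than the sequence "br".

What is the right regular expression?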

<((?!\bbr\b).)*?>

This should work for your case. The negative lookahead ensures that <br> is not matched.
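
As a rough illustration of how this pattern could be used with re.sub, here is a minimal sketch. The helper name strip_tags_keep_br and the sample string are made up for this example and are not part of the original answer.

```python
import re

# Pattern from the answer above: a tag is matched only if the whole word
# "br" never appears anywhere between "<" and ">".
TAG_EXCEPT_BR = re.compile(r'<((?!\bbr\b).)*?>', flags=re.I)


def strip_tags_keep_br(html):
    """Remove every tag matched by the pattern; <br> and <br/> are left alone."""
    return TAG_EXCEPT_BR.sub('', html)


# Hypothetical sample input, for illustration only.
sample = '<p>Hello<br>World<br/><img src="a.png"></p>'
print(strip_tags_keep_br(sample))  # prints: Hello<br>World<br/>
```

Because of re.I, upper-case tags such as <BR> are kept as well.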

Edit:

<(?:(?!\bbr\/?(?=>)).)*?>

Try this if you have absurd things such as <a href="http://host.domain.tld/br">, where "br" appears inside a tag that is not a <br> tag.

See the demo:

http://regex101.com/r/su3fa2/57
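
To see why the edited pattern is needed, the following sketch runs both patterns on a link whose URL ends in /br. The names FIRST, SECOND and link are assumptions made for this example only.

```python
import re

# Both patterns are taken from the answer above.
FIRST = re.compile(r'<((?!\bbr\b).)*?>', flags=re.I)
SECOND = re.compile(r'<(?:(?!\bbr\/?(?=>)).)*?>', flags=re.I)

# Hypothetical input: the word "br" occurs inside the href, not as a tag name.
link = '<a href="http://host.domain.tld/br">link</a><br>'

# The first pattern refuses to match any tag that contains the word "br",
# so the opening <a ...> tag is left in place.
print(FIRST.sub('', link))   # prints: <a href="http://host.domain.tld/br">link<br>

# The edited pattern only rejects "br" or "br/" when it is immediately
# followed by ">", so the <a ...> tag is removed and <br> is kept.
print(SECOND.sub('', link))  # prints: link<br>
```

One caveat worth checking against your own input: as written, the edited pattern seems to treat <br /> (with a space before the slash) as an ordinary tag and removes it, because "br" is then followed by neither ">" nor "/>".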

python html regex
