python - Regular Expression Processing HTML -
i need replace html tags (e.g. <p>, <img>, etc.) in web page source code, want maintain <br> , <br/>. have tried:
re.sub(r'<[^>]+?>', u'', html, flags=re.i) this achieves first goal, cannot maintain <br> or <br/>. r'<[^>br]+?>' wont accomplish goal either.
what right regular expression?
<((?!\bbr\b).)*?>
this should work case.the negative lookahead ensure <br> not picked.
edit:
<(?:(?!\bbr\/?(?=>)).)*?> try if have such absurd things. <a href="http://host.domain.tld/br">
see demo.
http://regex101.com/r/su3fa2/57
python html regex
No comments:
Post a Comment