Thursday 15 July 2010

Python 3 unable to scrape -




I am trying to translate Indonesian into English using Google Translate (because I play a game that has a lot of Indonesian players).

import re
from urllib.request import Request, urlopen

lang = "id"  # source language code (Indonesian)
inp = input("enter translate: \n").replace(" ", "%20")
htmlfile = Request("https://translate.google.co.in/#" + lang + "/en/" + inp, headers={'User-Agent': 'Mozilla/5.0'})
htmltext = urlopen(htmlfile).read().decode('utf-8')
regex = '<span id="result_box" class="short_text" lang="en">(.+?)</span>'
pattern = re.compile(regex)
trans = re.findall(pattern, htmltext)
print(trans)

When I give it input, the output is []. Here is what Inspect Element shows:

<span id="result_box" class="short_text" lang="en"> <span class="hps"> greeting </span>

I need the "greeting" part.

It's not a problem with urllib; the problem is the regex. By default, . in a regex matches any character except a newline. You need to enable DOTALL mode with (?s) to make . match newline characters as well.

regex = r'(?s)<span id="result_box" class="short_text" lang="en">(.+?)</span>'

Example:

>>> import re
>>> s = """<span id="result_box" class="short_text" lang="en">
... 
...  <span class="hps">
... 
...  greeting
... 
...  </span>"""
>>> re.findall(r'(?s)<span id="result_box" class="short_text" lang="en">(.+?)</span>', s)
['\n\n <span class="hps">\n\n greeting\n\n ']
>>> re.findall(r'(?s)<span id="result_box" class="short_text" lang="en">(?:(?!</).)*?(\w+)\s*</span>', s)
['greeting']
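The same behaviour can also be requested with the re.DOTALL flag instead of the inline (?s) modifier. The following is a minimal sketch, not the original poster's code: SAMPLE_HTML is a hypothetical stand-in for the fetched page, modelled on the markup shown in the question, and the tag-stripping step is just one simple way to clean up the captured text.

```python
import re

# Hypothetical sample markup, modelled on the snippet from the question.
SAMPLE_HTML = """<span id="result_box" class="short_text" lang="en">

 <span class="hps">

 greeting

 </span>"""

OPEN_TAG = r'<span id="result_box" class="short_text" lang="en">'

# Without DOTALL, "." cannot cross the newline right after the opening tag,
# so the pattern finds nothing.
default_matches = re.findall(OPEN_TAG + r'(.+?)</span>', SAMPLE_HTML)

# re.DOTALL is the flag form of the inline (?s) modifier.
dotall_matches = re.findall(OPEN_TAG + r'(.+?)</span>', SAMPLE_HTML, flags=re.DOTALL)

# Remove any leftover tags, then trim surrounding whitespace.
cleaned = [re.sub(r'<[^>]+>', '', m).strip() for m in dotall_matches]

print(default_matches)  # []
print(cleaned)          # ['greeting']
```

Passing the flag keeps the pattern string unchanged, which may be easier to read when the same regex is reused with different options.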

