Python 3 - Unable to scrape
I am trying to translate Indonesian to English using Google Translate (because I play a game that has a lot of Indonesian players).
    import re
    from urllib.request import Request, urlopen

    lang = "id"
    inp = input("enter translate: \n").replace(" ", "%20")
    htmlfile = Request("https://translate.google.co.in/#" + lang + "/en/" + inp, headers={'User-Agent': 'Mozilla/5.0'})
    htmltext = urlopen(htmlfile).read().decode('utf-8')
    regex = '<span id="result_box" class="short_text" lang="en">(.+?)</span>'
    pattern = re.compile(regex)
    trans = re.findall(pattern, htmltext)
    print(trans)
Whatever input I give, it prints [].
Here is what Inspect Element shows:
<span id="result_box" class="short_text" lang="en"> <span class="hps"> greeting </span>
I need the "greeting" part.
It's not a problem with urllib; the problem is in your regex. By default, . in a Python regex matches any character except a newline. You need to enable DOTALL mode with (?s) to make . match newline characters as well.
regex = r'(?s)<span id="result_box" class="short_text" lang="en">(.+?)</span>'
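To make the difference concrete, here is a minimal sketch comparing the default behaviour of . with DOTALL mode. The sample string and the shortened pattern are made up for illustration; passing re.DOTALL as a flag is assumed here as an equivalent alternative to the inline (?s) prefix.

```python
import re

# Hypothetical sample: the tag's content spans several lines.
html = '<span id="result_box">\n greeting\n</span>'
pattern = r'<span id="result_box">(.+?)</span>'

# Default mode: "." does not match "\n", so the group cannot reach </span>.
default_matches = re.findall(pattern, html)

# DOTALL can be enabled with the flags argument or with the inline (?s) prefix.
flag_matches = re.findall(pattern, html, flags=re.DOTALL)
inline_matches = re.findall('(?s)' + pattern, html)

print(default_matches)  # []
print(flag_matches)     # ['\n greeting\n']
print(inline_matches)   # ['\n greeting\n']
```

Either spelling works; the inline form is convenient when the pattern is stored as a plain string and compiled elsewhere.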
Example:
>>> import re
>>> s = """<span id="result_box" class="short_text" lang="en">
... 
...  <span class="hps">
... 
...  greeting
... 
...  </span>"""
>>> re.findall(r'(?s)<span id="result_box" class="short_text" lang="en">(.+?)</span>', s)
['\n\n <span class="hps">\n\n greeting\n\n ']
>>> re.findall(r'(?s)<span id="result_box" class="short_text" lang="en">(?:(?!</).)*?(\w+)\s*</span>', s)
['greeting']
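If the second pattern feels too dense, one possible alternative is to capture the inner markup with the simple DOTALL pattern and then strip tags and whitespace in a second step. This is only a sketch: extract_result is a hypothetical helper name, and it assumes the page contains the markup shown in the question, with the translation inside the first nested span.

```python
import re

def extract_result(html):
    """Return the text inside the result_box span, or None if it is absent."""
    # (?s) lets "." cross newlines; the lazy group stops at the first </span>.
    match = re.search(r'(?s)<span id="result_box"[^>]*>(.*?)</span>', html)
    if match is None:
        return None
    # Drop any nested tags, then trim the surrounding whitespace.
    return re.sub(r'<[^>]+>', '', match.group(1)).strip()

sample = '''<span id="result_box" class="short_text" lang="en">

 <span class="hps">

 greeting

 </span>'''

print(extract_result(sample))  # greeting
```

Because the lazy group stops at the first closing tag, this only returns the text of the first nested span. For anything more involved, an HTML parser is usually a safer choice than regular expressions.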