Updating Code - Python Automation -
i'm new site , new python--as in few days course. @ work, have inherited sized project involves matching 9 digit zip codes in excel file congressional districts (from website). i've noticed through investigation of code (what little know) author might using website allows 5 digit zip codes, not 9 digits. since districts share zip codes, 9 digit codes more precise. here's code i'm working with:
import urllib import re import csv import datetime print datetime.datetime.now() input_file_name = 'zip1.csv' output_file_name = 'legislator_output_%s-%0*d%0*d.csv' % ((datetime.date.today(), 2, datetime.datetime.now().hour, 2, datetime.datetime.now().minute)) print 'file name:', output_file_name input_file_handler = open(input_file_name, 'rb') input_reader = csv.reader(input_file_handler) output_file_handler = open(output_file_name, 'wb', 1) output_writer = csv.writer(output_file_handler) output_writer.writerow(['unique id', 'zip', 'plus 4', 'member url', 'member name', 'member district']) fail_list = [] counter = 0 input_line in input_reader: zip_entry = '%s-%s' % (input_line[1], input_line[2]) unique_id = input_line[0] counter += 1 #if counter > 25: go on zip_part = zip_entry.split('-')[0] plus_four_part = zip_entry.split('-')[1] params = urllib.urlencode({'zip':zip_part, '%2b4':plus_four_part}) f = urllib.urlopen('http://www.house.gov/htbin/zipfind', params) page_source = f.read() #print page_source relevant_section = re.findall(r'templatelanding(.*?)contentmain', page_source, re.dotall) rep_info = re.findall('<a href="(.*?)">(.*?)</a>', relevant_section[0]) rep_district_info = re.findall('is located in (.*?)\.', relevant_section[0]) try: member_url = rep_info[0][0] member_name = rep_info[0][1] member_district = rep_district_info[0] #member_district = rep_info[0][2] except: fail_list += [zip_entry] member_url = '' member_name = '' member_district = '' row_to_write = [unique_id, zip_part, plus_four_part, member_url, member_name, member_district, datetime.datetime.now()] output_writer.writerow(row_to_write) if counter % 50 == 0: print counter, row_to_write output_file_handler.close() print output_file_name, 'closed at', datetime.datetime.now() print len(fail_list), 'entries failed lookup' print counter, 'rows done at', datetime.datetime.now()
so, author used site allows 5 digits (the code couple of years old site). have no thought how replace correctly on new site.
if knows of solution or can point me in direction of resources might help, much appreciate it. @ moment i'm lost!
for can see, can query, example, http://www.house.gov/htbin/findrep?zip=63333-1211
replace urllib
phone call for
urllib.urlopen('http://www.house.gov/htbin/findrep', zip_entry)
python automation
No comments:
Post a Comment