Sunday, 15 June 2014

Updating Code - Python Automation -



Updating Code - Python Automation -

i'm new site , new python--as in few days course. @ work, have inherited sized project involves matching 9 digit zip codes in excel file congressional districts (from website). i've noticed through investigation of code (what little know) author might using website allows 5 digit zip codes, not 9 digits. since districts share zip codes, 9 digit codes more precise. here's code i'm working with:

import urllib import re import csv import datetime print datetime.datetime.now() input_file_name = 'zip1.csv' output_file_name = 'legislator_output_%s-%0*d%0*d.csv' % ((datetime.date.today(), 2, datetime.datetime.now().hour, 2, datetime.datetime.now().minute)) print 'file name:', output_file_name input_file_handler = open(input_file_name, 'rb') input_reader = csv.reader(input_file_handler) output_file_handler = open(output_file_name, 'wb', 1) output_writer = csv.writer(output_file_handler) output_writer.writerow(['unique id', 'zip', 'plus 4', 'member url', 'member name', 'member district']) fail_list = [] counter = 0 input_line in input_reader: zip_entry = '%s-%s' % (input_line[1], input_line[2]) unique_id = input_line[0] counter += 1 #if counter > 25: go on zip_part = zip_entry.split('-')[0] plus_four_part = zip_entry.split('-')[1] params = urllib.urlencode({'zip':zip_part, '%2b4':plus_four_part}) f = urllib.urlopen('http://www.house.gov/htbin/zipfind', params) page_source = f.read() #print page_source relevant_section = re.findall(r'templatelanding(.*?)contentmain', page_source, re.dotall) rep_info = re.findall('<a href="(.*?)">(.*?)</a>', relevant_section[0]) rep_district_info = re.findall('is located in (.*?)\.', relevant_section[0]) try: member_url = rep_info[0][0] member_name = rep_info[0][1] member_district = rep_district_info[0] #member_district = rep_info[0][2] except: fail_list += [zip_entry] member_url = '' member_name = '' member_district = '' row_to_write = [unique_id, zip_part, plus_four_part, member_url, member_name, member_district, datetime.datetime.now()] output_writer.writerow(row_to_write) if counter % 50 == 0: print counter, row_to_write output_file_handler.close() print output_file_name, 'closed at', datetime.datetime.now() print len(fail_list), 'entries failed lookup' print counter, 'rows done at', datetime.datetime.now()

so, author used site allows 5 digits (the code couple of years old site). have no thought how replace correctly on new site.

if knows of solution or can point me in direction of resources might help, much appreciate it. @ moment i'm lost!

for can see, can query, example, http://www.house.gov/htbin/findrep?zip=63333-1211 replace urllib phone call for

urllib.urlopen('http://www.house.gov/htbin/findrep', zip_entry)

python automation

No comments:

Post a Comment