[Cadre-politics] MVHub.com ZIP code sort
John Miller
jmiller at thecsl.org
Thu Sep 6 19:51:11 EDT 2007
The ZIP code sort is now live on MVHub.com, so feel free to surf out
there and have a look. The following is some background material on how
ZIP code sorting works; it's a bit longwinded, so read or skip it at
your leisure.
--John
When I started this a month ago, I had assumed that ZIP code information
was in the public domain, and that ZIP codes corresponded roughly to
geographical areas. Given that, we could download the public-domain ZIP
info, calculate the center of each ZIP code, then do a little trig to
calculate distances between ZIP codes. This is _roughly_ how things work.
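
For anyone curious what the "little trig" looks like, here is a minimal Python sketch of the haversine great-circle formula. It is only an illustration, not the code MVHub runs; the function name, the Earth radius constant, and the choice of miles are assumptions made for the example.

```python
import math

# Mean Earth radius in miles (assumed constant for this sketch).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


if __name__ == "__main__":
    # Two points one degree of latitude apart, roughly 69 miles.
    print(round(haversine_miles(42.0, -71.0, 43.0, -71.0), 1))
```

Feeding it the center points of two ZIP codes gives the straight-line distance between them, which is all a "sort by distance" feature needs.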
The USPS created the Zone Improvement Plan (ZIP) codes back in the 60s
to make mail delivery more efficient. ZIP codes are assigned based on a
few things. The country is divided into ZIP code regions, with each
region having a unique first digit (New England = 0, West Coast = 9,
etc.). Inside each region, each state gets a range of ZIPs (MA is
01000-02799), not all of which are used. Pretty obvious so far.
Each individual ZIP code, however, is defined not by a geographical
area, but by its carrier routes. This makes sense for the postmen, who
can say "My route goes to the end of Westford Street," but when you have
a set of streets that might look like:
/-------/
/ /
/ \
| \
| /
------------------
it's tough to define a unique geographic area, especially if not all the
streets have addresses, or if there's a body of water involved. The
Census Bureau took on this task back in 2000, and defined ZIP Code
Tabulation Areas (ZCTAs). This information is seven years old, though,
and covers only regular (multi-address, non-P.O. Box) ZIP codes.
The USPS has also defined areas for ZIP codes and sells this information
for $50/state. Commercial companies have licensed the USPS data and
sell it at much more reasonable rates (approx. $50 for the entire US).
Most websites these days use this commercial data.
It's not too shocking that the Census Bureau and the USPS data don't
quite match (pretty close, though), but it was news to me that Google's
data doesn't always match the USPS's. For example, Google calculates the
center of the Highlands neighborhood to be just southeast of Drum Hill,
while Yahoo, the Census Bureau, and the USPS all put the center of the
Highlands at about Stevens and Westford streets. The difference in the
two locations is about a mile. Google Maps had a few other anomalous
ZIP codes as well.
To do MVHub's ZIP code sorting, I had initially hoped to query Google
Maps for a distance, then cache the distance in our database so we
didn't have to query twice. Google changed their Maps API, however, so
the Perl module I was using (Geo::Google) to query Google Maps broke.
The Geo::Google developers (conscientious folks that they are) sent me a
patch within an hour of my bug report, but I felt a bit uneasy about
relying on Geo::Google (in this case, its dependency JSON::Parser) not
to break on future Google API changes. Combining that with the
suggestion of perlmonks.org users that we have our own ZIP code
database, we purchased ($40) a list of ZIP codes, towns, latitudes, and
longitudes from zipcodedownload.com.
Once we had the latitude and longitude for each ZIP code, finding the
distance between two ZIPs took only a few lines of code using the
Geo::Distance module. It was pretty straightforward to load the ZIP
codes and distances into a database table, then for each MVHub program
result, query the database for the corresponding distance.
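
The table-and-lookup step can be sketched roughly as follows, using Python's built-in sqlite3 module. This is an assumption-laden illustration, not MVHub's actual schema: the table names, column names, program names, and the distances themselves are all made up for the example. ZIP codes are stored as text so that leading zeros survive.

```python
import sqlite3


def build_db():
    """Create an in-memory database with hypothetical tables and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE zip_distance (
            zip_from TEXT NOT NULL,   -- TEXT keeps leading zeros, e.g. '01851'
            zip_to   TEXT NOT NULL,
            miles    REAL NOT NULL,
            PRIMARY KEY (zip_from, zip_to)
        );
        CREATE TABLE program (
            name TEXT NOT NULL,
            zip  TEXT NOT NULL
        );
    """)
    # Precomputed distances; the numbers are invented for illustration.
    conn.executemany(
        "INSERT INTO zip_distance VALUES (?, ?, ?)",
        [("01851", "01851", 0.0),
         ("01851", "01852", 2.0),
         ("01851", "01824", 4.5)],
    )
    conn.executemany(
        "INSERT INTO program VALUES (?, ?)",
        [("Program A", "01824"),
         ("Program B", "01851"),
         ("Program C", "01852")],
    )
    return conn


def programs_by_distance(conn, user_zip):
    """Return (name, zip, miles) rows, nearest first, for the given ZIP."""
    return conn.execute(
        """
        SELECT p.name, p.zip, d.miles
        FROM program AS p
        JOIN zip_distance AS d ON d.zip_to = p.zip
        WHERE d.zip_from = ?
        ORDER BY d.miles
        """,
        (user_zip,),
    ).fetchall()


if __name__ == "__main__":
    for name, zip_code, miles in programs_by_distance(build_db(), "01851"):
        print(f"{miles:4.1f} mi  {zip_code}  {name}")
```

Precomputing the distances trades disk space for speed: the sort becomes a single indexed join instead of a trig calculation per result.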
To sum up, I thought we could use Google to find ZIP information; this
unexpectedly broke. Using an existing ZIP code -> latitude/longitude
database was a far better choice, and of these databases, the USPS had
the best data.