brian has been a Perl user since 1994. He is founder of the first Perl Users Group, NY.pm, and Perl Mongers, the Perl advocacy organization. He has been teaching Perl through Stonehenge Consulting for the past five years, and has been a featured speaker at The Perl Conference, Perl University, YAPC, COMDEX, and Builder.com. Contact brian at [email protected].
Many web server log analyzers now support "geolocation," meaning that they can turn a host name or IP address into a point on the globe. With geolocation, instead of looking at a bunch of numbers, I can look at maps. Using the Geo::IP module and the databases from MaxMind (http://www .maxmind.com/), both of which are freely available, I can add this feature to my programs, too.
Turning IP addresses into locations is not perfect, but as long as I understand how IP numbers are assigned and split up, I can put my geolocation results into perspective. I need to know a little about how IP numbers are assigned, so I can interpret the results and judge the accuracy.
Starting from the top, the Internet Corporation for Assigned Names and Numbers, better known as ICANN (http://www .icann.org/), is responsible for IP address allocation. It gives out large chunks for addresses to Regional Internet Registries (RIR) that cover certain parts of the globe. So far, these are the American Registry of Internet Addresses (ARIN), the Asia Pacific Network Information Centre (APNIC), Latin American and Caribbean Internet Addresses Registry (LACNIC), and RIPE Network Coordination Centre (RIPENIC). Another registry, African Network Information Center (AfriNIC), is on the way, too.
Each of the RIRs hand out chunks of their own IP space to major organizations, which then do the same to smaller organizations, and so on until one of those IP numbers gets to my cable modem in my apartment.
Knowing that, to figure out where any IP address is on the globe means I just have to track it through that process until I get the resolution I want. I may want to stop at the country level, or I might want to go even further. For this article, I'm going to stop at the country level.
There is a problem, though: Organizations don't have to respect these lines that we've overlaid on the globe. For example, America Online (AOL) has a very large IP allocation, but it's a multinational company. It can use its allocation as it likes. Indeed, that's one of the major differences in the free and paid versions of the MaxMind GeoIP database: The paid version can figure out which AOL addresses are not in the United States. In the free version of the database, all AOL addresses show up as the United States.
Now that I know why my results won't be perfect, I can move on to the task. Before I install the Geo::IP module, I need to get a database for it. The free version of the MaxMind database identifies the country of the IP address with 93 percent accuracy. Various levels of the paid version can get me to 98 percent country accuracy for $50, or for $370, down to the city with longitude and latitude. Considering how much work I would have to do to compile all this myself, and how much the local burger joint pays me to make fries, I think these prices are a bargain.
I'm going to stick with the free version of the database for this article, though. I get the free database from the MaxMind developer section (http://www.maxmind.com/app/geoip_country). Once I gunzip the archive, I end up with a single file, GeoIP.dat, which I move to its default location, /usr/local/share/GeoIP/GeoIP.dat.
Once I have the database in place, I can install the Geo::IP module either by installing it with CPAN.pm or downloading it and installing it by hand. It's at http://search.cpan.org/dist/Geo-IP. I give it a whirl with a simple script using my own IP address (at least, the one I have today):
% perl -MGeo::IP -le \ 'print Geo::IP->new->country_name_by_addr( "24.12.70.45" )' United States
Now I want to apply this to a bunch of IP addresses, and web-server access logs are a great source of those. I'm going to use a web access log in the Apache format, although that doesn't really matterI just need the IP address. Each line shows the start of a line from one of my logs. It starts with an IP number, followed by some whitespace, then some other fields that I don't need for this task. In the real world, I'll probably have something else that completely parses the log.
#!/usr/bin/perl while( <> ) { my( $ip ) = split; # just the first field $Seen{ $ip }++; } foreach my $ip ( sort keys %Seen ) { printf "%6d %s\n", $Seen{ $ip }, $ip; }
I get as output a long list of IP numbers and the count of how many times my web server responded to a request from that address. Those aren't the numbers of unique visitors or any other massaged number, they are just the number of requests my web server had to respond to from that IP address:
4 128.112.148.25 5 130.160.141.180 4 168.122.194.108 6 194.7.234.34 2 203.144.187.62
That's not all that interesting, though. I don't really care about IP numbers, and I want to see which countries people are in. I'll never be able to really figure out which country the pair of eyes is in because nothing stops me from logging into a computer halfway around the world and browsing the Web from there. Knowing that, I'm willing to accept the results.
#!/usr/bin/perl use Geo::IP; my $gip = Geo::IP->new(); while( <> ) { my( $ip ) = split; my $country = $gip->country_name_by_addr( $ip ); $Countries{ $country }++; } foreach my $ip ( sort keys %Countries ) { printf "%6d %s\n", $Countries{ $ip }, $ip; }
I get a columnar listing of the number of times an IP address from a country hit my web server:
346 Belgium 51 Canada 22 China 446 France 302 Germany 21 Taiwan 157 Thailand 1233 United States
That's still not good enough for me. Why should I settle for text when I can see a picture? Douwe Osinga provides a little service that can make a map that colors each country that I've visited (http://douweosinga.com/projects/visitedcountries/). I can just as easily turn that into a colored map of each country that has visited me. He uses a long URL with a query string that has a two-digit country code.
The Geo::IP module can handle this, too. I need the country code instead of the country name, so I use the Geo::IP's country_code_by_addr() method instead of country_name_by_addr(). Once I have all the country codes, I don't print them in a reportI join them without any characters in between them and use that as the URL query string. Because I want to redirect to a web page, I have the script output a CGI header that redirects to the external URL:
#!/usr/bin/perl use Geo::IP; my $gip = Geo::IP->new(); while( <> ) { my( $ip ) = split; my $country = $gip->country_code_by_addr( $ip ); $Countries{ $country }++; } my $query_string = join "", keys %Countries; my $url = "http://douweosinga.com/projects/ visitedcountries/colormap" . "?visited=" . $query_string; print "Status: 302 Moved Temporarily\nLocation: $url\n\n";
The URL returns a GIF image that is a map of the world with the visiting countries colored red (see Figure 1).
I started with IP addresses and ended up with a map of the world, using only a freely available Perl module, a free geolocation database, and a free mapping service. I could get much more fancy than this, too. With finer-resolution databases, I can get down to the city level and combine that with longitude and latitude data to make dots on a map. I don't have to stick to web access logs either: I could use geolocation to identify users as they come into my web site, or any other service I might provide. However I decide to use it, Geo::IP makes it easy.
TPJ