Thursday, February 3, 2011

GeoIP on AppEngine: How to map an IP address to a country

Let's say you would like to know which country a user is from by looking at their IPv4 address,
and you want the lookup to be quick (so you don't want to send a request to another service).

IPv6 is not supported: for anything that is not a dotted IPv4 address the lookup returns (None, None).

Quick installation:

  • Download the zip archive containing my geoip library and a hostip.info data file.
  • Unpack it into your appengine app directory.
  • Use the library like this:
    import geoip
    from google.appengine.ext import webapp

    class YourHandler(webapp.RequestHandler):
      def get(self):
        (country_code, country_name) = geoip.query(self.request.remote_addr)
        self.response.out.write("You are from %s(%s)" %
                                (country_name, country_code))
I will probably not update the data files, because I'm a lazy bastard. But you can create new data files yourself. See below.

Debug interface
If you want to activate the debug interface, just add the following lines to the handlers section of your app.yaml:

   - url: /geoip.*
     script: /geoip.py
     login: admin 

And now the details of my geoip library (if you are interested)

The easiest way to do quick lookups is to have a local data file containing a mapping from IP address to country code.
Of course this file should be as small as possible. I came up with a solution where all data from hostip.info fits into a little more than 1 MB and all free data from MaxMind fits into less than 300 KB.

On the first request I read one of these files into RAM, and after that I can look up the country for an IP address in less than 1 ms.

The structure of the data files is the following:
  • The first line maps the country ids to 2-character country codes
  • The second line maps the country ids to the English country names
  • The rest is binary data, sorted by IP, in 4-byte records with the following structure
    • 1st byte is the A level part of the IP
    • 2nd byte is the B level part of the IP
    • 3rd byte is the C level part of the IP
    • 4th byte is the country id (luckily there are fewer than 256 countries)
Each record marks the first C level block of a range; the range extends until the next record starts.
A lookup is performed by doing a binary search over the binary data (one element is 4 bytes).
The last record at or below the requested IP (the lower bound of the search) holds the country id in its 4th byte. I map that id to the country code and the country name and return both. Quite simple and straightforward.
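
To make the layout concrete, here is a minimal sketch of the same idea in Python 3 (the library in this post targets Python 2). The names parse_datafile, lookup and SAMPLE are made up for this illustration, the sample records are invented data, and decoding the header lines as latin-1 is an assumption. It uses the standard bisect module instead of a hand-written binary search.

```python
import bisect

# Hypothetical sample file: two header lines, then four 4-byte records.
SAMPLE = (b"1:CH|2:DE|\n"
          b"1:Switzerland|2:Germany|\n"
          + bytes([0, 0, 0, 0,      # 0.0.0.x and up: unknown (id 0)
                   80, 0, 0, 1,     # 80.0.0.x and up: id 1
                   80, 10, 0, 2,    # 80.10.0.x and up: id 2
                   81, 0, 0, 0]))   # 81.0.0.x and up: unknown again


def parse_datafile(blob):
    """Split a data blob into (codes, names, records)."""
    # maxsplit=2 keeps newline bytes inside the binary part intact.
    codes_line, names_line, records = blob.split(b"\n", 2)

    def parse_map(line):
        # Entries look like "1:CH|2:DE|"; the trailing "|" leaves an empty item.
        text = line.decode("latin-1")  # assumption: header encoding
        pairs = (item.split(":", 1) for item in text.split("|") if item)
        return {int(key): value for key, value in pairs}

    return parse_map(codes_line), parse_map(names_line), records


def lookup(blob, ip):
    """Return (country_code, country_name) for a dotted IPv4 string."""
    codes, names, records = parse_datafile(blob)
    a, b, c = (int(part) for part in ip.split(".")[:3])
    key = bytes([a, b, c])
    # The sorted 3-byte range starts, one per 4-byte record.
    starts = [records[i:i + 3] for i in range(0, len(records), 4)]
    # Index of the last record whose start is <= key.
    index = bisect.bisect_right(starts, key) - 1
    country_id = records[index * 4 + 3]
    return codes.get(country_id), names.get(country_id)
```

With the sample above, lookup(SAMPLE, "80.10.5.7") falls into the third record and lookup(SAMPLE, "200.1.1.1") into the last one, which has no country. The sketch re-parses the blob on every call to stay short; real code would parse once and keep the result in memory.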

The "geoip.py" library allows you to query the data files:


#!/usr/bin/env python

"""
 Created by Andrin von Rechenberg, 2011.
 
 This library is free software: you can redistribute it
 and/or modify it under the terms of the GNU General Public License
 as published by the Free Software Foundation, either version 3 of
 the License, or (at your option) any later version.

 Example usage on Appengine:
 
   class YourHandler(webapp.RequestHandler):
     def get(self): 
       (country_code, country_name) = geoip.query(self.request.remote_addr)
       self.response.out.write("You are from %s(%s)" %
                               (country_name, country_code))
  
 To add the stats UI just add these lines to your app.yaml:

   - url: /geoip.*
     script: /geoip.py
     login: admin 

 Cheers,
 -Andrin

"""

import os
import re

countries = {}
ip_ranges = ""
country_names = {}

def query(ip, raiseError=False):
  global countries, country_names, ip_ranges
  if not ip_ranges:
    setup()
  try:
    parts = ip.split(".")
    find = "%c%c%c" % (int(parts[0]), int(parts[1]), int(parts[2]))
  except Exception:
    # Not a dotted IPv4 address (IPv6 is not supported)
    return (None, None)
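  # Binary search over the 4-byte records for the 3-byte key.
  lo = 0
  hi = len(ip_ranges) // 4
  while lo < hi:
    mid = (lo + hi) // 2
    midval = ip_ranges[mid * 4:mid * 4 + 3]
    if midval < find:
      lo = mid + 1
    elif midval > find:
      hi = mid
    else:
      # Exact match: make sure "hi - 1" below points at this record.
      hi = mid + 1
      break
  # hi - 1 is now the last record whose start is <= find.
  found = hi * 4 - 4
  if ip_ranges[found:found + 3] > find:
    found -= 4
  id = ord(ip_ranges[found + 3:found + 4])
  return (countries.get(id), country_names.get(id))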

def setup():
  global countries, country_names, ip_ranges
  data = open(os.path.join(os.path.dirname(__file__),
                           "geoip.bin"), "rb").read()
  country_data = data[:data.find("\n")]
  pattern = re.compile("(\d+)\:([^\|]+)\|")
  for country in pattern.findall(country_data):
    countries[int(country[0])] = country[1]
  data = data[data.find("\n") + 1:]
  country_names_data = data[:data.find("\n")]
  for name in pattern.findall(country_names_data):
    country_names[int(name[0])] = name[1]
  ip_ranges = data[data.find("\n") + 1:]

""" ***************  ALL THE CODE BELOW IS OPTIONAL *************** """

from google.appengine.ext import webapp
from google.appengine.ext.webapp import util

class GeoIPStats(webapp.RequestHandler):
  def stats(self):
    global countries, country_names, ip_ranges
    if not ip_ranges:
      setup()
    count = {}
    pos = 0
    this = 0
    while pos < len(ip_ranges):
      # A range ends where the next record starts; the last range runs
      # to the end of the IPv4 space.
      next = 256 * 256 * 256
      if pos + 4 < len(ip_ranges):
        next = (ord(ip_ranges[pos + 4]) * 256 * 256 +
                ord(ip_ranges[pos + 5]) * 256 +
                ord(ip_ranges[pos + 6]))
      id = ord(ip_ranges[pos + 3])
      if id not in count:
        count[id] = 0
      # Each C level block holds 256 addresses.
      count[id] += 256 * (next - this)
      this = next
      pos += 4
    result = [(countries.get(x),
               country_names.get(x),
               count[x]) for x in count]
    result.sort()
    return result

  def get(self):
    self.response.out.write(
        "<html><body><center><b>Appengine GeoIP by N-Dream</b><br><br>")
    ip = self.request.get("ip")
    show_stats = self.request.get("stats")
    if ip:
      (cc, name) = query(ip)
      self.response.out.write("IP is from: %s (%s)" % (cc, name))
    elif show_stats:
      self.response.out.write("<table>")
      for stat in self.stats():
        cc = stat[0]
        name = stat[1]
        count = str(stat[2])
        if len(count) > 3:
          for i in range(len(count) - 3, 0, -3):
            count = count[:i] + "'" + count[i:]
        self.response.out.write(
            "<tr><td><b>%s</b></td><td>%s</td><td align=right>%s</td></tr>" %
            (cc, name, count))
      self.response.out.write("</table>")
    else:
      self.response.out.write(
          "<form>IP:<input type=text name=ip><input type=submit></form>")
      self.response.out.write("<a href=?stats=1>Statistics</a>")
    self.response.out.write("</center></body></html>")

application = webapp.WSGIApplication([
  ('.*', GeoIPStats),
], debug=True)


def main():
  util.run_wsgi_app(application)

if __name__ == "__main__":
  main()
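
The statistics page boils down to adding up the size of every range per country id. As a rough illustration, here is a small Python 3 sketch of that arithmetic; addresses_per_country and format_count are names invented for this example, and it assumes the records are sorted and that the first record starts at 0.0.0, as in the data files described in this post.

```python
def addresses_per_country(records):
    """Count IPv4 addresses per country id from sorted 4-byte records."""
    total_blocks = 256 ** 3  # number of C level blocks in the IPv4 space
    counts = {}
    for pos in range(0, len(records), 4):
        a, b, c, country_id = records[pos:pos + 4]
        start = (a << 16) | (b << 8) | c
        if pos + 4 < len(records):
            # The range ends where the next record starts.
            na, nb, nc = records[pos + 4:pos + 7]
            end = (na << 16) | (nb << 8) | nc
        else:
            # The last range runs to the end of the address space.
            end = total_blocks
        # Every C level block contains 256 addresses.
        counts[country_id] = counts.get(country_id, 0) + 256 * (end - start)
    return counts


def format_count(n):
    """Group digits with apostrophes, like the statistics table does."""
    return "{:,}".format(n).replace(",", "'")
```

A handy sanity check is that the counts of all ids, including the unknown id 0, always add up to 2**32.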

Creating new compressed datafiles:
If you want to create new data files, just run one of the following Python scripts. Each one writes a fresh geoip.bin into the current directory.

Create a datafile from hostip.info: 

#!/usr/bin/env python

"""
 Created by Andrin von Rechenberg, 2011.
 
 This library is free software: you can redistribute it
 and/or modify it under the terms of the GNU General Public License
 as published by the Free Software Foundation, either version 3 of
 the License, or (at your option) any later version.

 Example usage:
   python geoip_hostipinfo.py
 
 Cheers,
 -Andrin

"""

import gzip
import sys
import re
import urllib
import cStringIO

out = open("geoip.bin","wb")
country_names = {}
ip_ranges = {}

print "Downloading... (might take a while)"
zipped = urllib.urlopen("http://db.hostip.info/mirror/" +
                      "hostip_current.sql.gz").read()
data = gzip.GzipFile(fileobj=cStringIO.StringIO(zipped)).read()

for line in data.split("\n"):
  if line.startswith("INSERT INTO `countries`"):
    p = re.compile("\((\d+),'(.*?)','([A-Z]+)'\)")
    for x in p.findall(line):
      out.write(x[0] + ":" + x[2].replace("|", "") + "|")
      country_names[x[0]] = " ".join([c[0] + c[1:].lower()
                                      for c in
                                      x[1].replace("\\", "").split(" ")])
    out.write("\n")
    for x in country_names:
      out.write(x + ":" + country_names[x].replace("|", "") + "|")
    out.write("\n")
    
  
  if line.startswith("INSERT INTO `ip4_"):
    a = line[line.find("ip4_") + 4:]
    a = int(a[:a.find("`")])
    print "Processing A Level IP address block " + str(a) + "."
    for block in line.split(")"):
      if block.strip() == ";":
        continue
      (b, c, country, city, time) = block.split("(")[1].split(",")
      ip_ranges[a * 256 * 256 + int(b) * 256 + int(c)] = int(country)

if not ip_ranges or not country_names:
  print "Countries or IP ranges are missing"
  sys.exit()
print "Writing file..."
last_country = None
for a in range(256):
  for b in range(256):
    for c in range(256):
      country = ip_ranges.get(a * 256 * 256 + b * 256 + c, 0)
      if country != last_country:
        out.write("%c%c%c%c"  % (a, b, c, country))
        last_country = country
out.close()
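
The final loop of the script is what keeps the file small: it walks over all C level blocks in order and only writes a record when the country id changes. A minimal Python 3 sketch of that step, assuming a dict that maps block numbers to country ids; compress_blocks and its total_blocks parameter are made up for this illustration (the parameter only exists so the idea can be tried on a tiny address space).

```python
def compress_blocks(block_to_country, total_blocks=256 ** 3):
    """Write one 4-byte record each time the country id changes."""
    out = bytearray()
    last_country = None
    for block in range(total_blocks):
        country = block_to_country.get(block, 0)  # 0 means unknown
        if country != last_country:
            # Split the block number back into its A, B and C level parts.
            out += bytes([block >> 16, (block >> 8) & 0xFF, block & 0xFF,
                          country])
            last_country = country
    return bytes(out)
```

For example, three neighbouring blocks with the same country id collapse into a single record, followed by one more record where the unknown range begins again.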

... or create a datafile from maxmind.com:

#!/usr/bin/env python

"""
 Created by Andrin von Rechenberg, 2011.
 
 This library is free software: you can redistribute it
 and/or modify it under the terms of the GNU General Public License
 as published by the Free Software Foundation, either version 3 of
 the License, or (at your option) any later version.

 Example usage:
   python geoip_maxmind.py
 
 Cheers,
 -Andrin

"""

import cStringIO
import sys
import re
import urllib
import zipfile

out = open("geoip.bin","wb")
countries = {}
country_names = {}
ip_ranges = {}
print "Downloading... (might take a while)"
zipped = urllib.urlopen("http://geolite.maxmind.com/download/geoip/database/" +
                        "GeoIPCountryCSV.zip").read()
print "Processing..."
zip = zipfile.ZipFile(cStringIO.StringIO(zipped))
csv_filename = None
for file in zip.filelist:
  if file.filename.endswith(".csv"):
    csv_filename = file.filename
    break
if not csv_filename:
  print "csv file not found in archive"
  sys.exit()
for line in zip.read(csv_filename).split("\n"):
  if not line:
    continue
  parts = [x.replace('"', "") for x in line.split(",")]
  if parts[4] not in countries:
    countries[parts[4]] = len(countries) + 1
    country_name = ",".join(parts[5:]).replace('"', "")
    country_names[country_name] = len(countries)
  for i in range(int(parts[2]) // 256, int(parts[3]) // 256 + 1):
    ip_ranges[i] = countries[parts[4]]
for country in countries:
  out.write(str(countries[country]) + ":" + country.replace("|", "") + "|")
out.write("\n")
for country in country_names:
  out.write(str(country_names[country]) + ":" + country.replace("|", "") + "|")
out.write("\n")

if not ip_ranges or not countries:
  print "Countries or IP ranges are missing"
  sys.exit()
print "Writing file..."
last_country = None
for a in range(256):
  for b in range(256):
    for c in range(256):
      country = ip_ranges.get(a * 256 * 256 + b * 256 + c, 0)
      if country != last_country:
        out.write("%c%c%c%c"  % (a, b, c, country))
        last_country = country
out.close()
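
The core of the MaxMind conversion is turning a numeric IP range into C level block numbers by dividing both ends by 256. Here is a small Python 3 sketch of that step; blocks_from_csv is a hypothetical helper, the column order (numeric start in column 3, numeric end in column 4, country code in column 5, country name in column 6) is taken from how the script above indexes each line, and the rows in the usage note are illustrative only. It uses the csv module so that quoted country names containing commas stay in one field.

```python
import csv
import io


def blocks_from_csv(text):
    """Yield (block, country_code, country_name) per C level block."""
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        start_num, end_num = int(row[2]), int(row[3])
        code, name = row[4], row[5]
        # Dropping the last byte of the numeric IP gives the block number.
        for block in range(start_num // 256, end_num // 256 + 1):
            yield block, code, name
```

A row such as "2.0.0.0","2.0.1.255","33554432","33554943","KR","Korea, Republic of" covers two C level blocks, so it yields two entries with consecutive block numbers.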


Cheers,
-Andrin

PS: Code was colorized using pygments.org
