Freitag, 1. Juni 2012

Storing JSON efficiently in Python on GAE - Serialization performance tests


Our system is running on Google Appengine.

If you want to store megabytes of JSON in the Google AppEngine
datastore and get it back parsed, this post is for you.

I ran a couple of performance tests where I want to store
a 4 MB json object in the datastore and then get it back at
a later point and process it.

There are several ways to do this.

Challenge 1) Serialization
You need to serialize your data.
For this you can use several different libraries.
JSON objects can be serialized using:
json.dumps, cPickle.dumps or marshal.dumps.
(these are the libraries I'm aware of atm)

Challenge 2) Compression
If your serialized data doesn't fit into 1mb you need
to shard your data over multiple datastore entities and
manually build it together when loading the entities back.
If you compress your serialized data and store it then,
you have the cost of compression and decompression,
but you have to fetch fewer datastore entities when you
want to load your data and you have to write fewer
datastore entities if you want to update your data if it
sharded.

Solution for 1) Serialization:
cPickle is very slow. It's meant to serialize real
objects and not just json. JSON is much faster,
but compared to marshal it has no chance.
The python marshal library is definitely the
way to serialize JSON. It has the best performance

Solution for 2) Compression:
For my use-case it makes absolutely sense to
compress the data the marshal lib produces
before storing it in datastore. I have gigabytes
of JSON data. Compressing the data makes
it about 5x smaller. Doing 5x fewer datastore
operations definitely pays for the the time it
takes to compress and decompress the data.
There are several compression levels you
can use to when using python's zlib.
From 1 (lowest compression, but fastest)
to 9 (highest compression but slowest).
During my tests I figured that the optimum
is to compress your serialized data using
zlib with level 1 compression. Higher
compression takes to much CPU and
the result is only marginally smaller.

Here are my test results:
cPickle ziplvl: 0

dump: 1.671010s
load: 0.764567s
size: 3297275
cPickle ziplvl: 1

dump: 2.033570s
load: 0.874783s
size: 935327
json ziplvl: 0

dump: 0.595903s
load: 0.698307s
size: 2321719
json ziplvl: 1

dump: 0.667103s
load: 0.795470s
size: 458030
marshal ziplvl: 0

dump: 0.118067s
load: 0.314645s
size: 2311342
marshal ziplvl: 1

dump: 0.315362s
load: 0.335677s
size: 470956
marshal ziplvl: 2

dump: 0.318787s
load: 0.380117s
size: 457196
marshal ziplvl: 3

dump: 0.350247s
load: 0.364908s
size: 446085
marshal ziplvl: 4

dump: 0.414658s
load: 0.318973s
size: 437764
marshal ziplvl: 5

dump: 0.448890s
load: 0.350013s
size: 418712
marshal ziplvl: 6

dump: 0.516882s
load: 0.367595s
size: 409947
marshal ziplvl: 7

dump: 0.617210s
load: 0.315827s
size: 398354
marshal ziplvl: 8

dump: 1.117032s
load: 0.346452s
size: 392332
marshal ziplvl: 9

dump: 1.366547s
load: 0.368925s
size: 391921
The results do not include datastore operations,
it's just about creating a blob that can be stored
in the datastore and getting the parsed data back.
The times of "dump" and "load" are seconds it takes
to do this on a Google AppEngine F1 instances
(600Mhz, 128mb RAM).

Here is the library i created an use:

#!/usr/bin/env python
#
# Copyright 2012 MiuMeet AG
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.appengine.api import datastore_types
from google.appengine.ext import db

import zlib
import marshal

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class JsonMarshalZipProperty(db.BlobProperty):
  """Stores a JSON serializable object using zlib and marshal in a db.Blob"""

  def default_value(self):
    return None
  
  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                                 COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return marshal.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob
  
  def validate(self, value):
    return value

Montag, 19. September 2011

ProdEagle - Rich Server Monitoring made easy

There are hundreds of services that monitor your system by checking if your servers return a HTTP 200 status code. The only problem is that HTTP 200 doesn't mean at all that your system is fine.

Let's look at the monitoring of a social network for example. You should monitor the ratio between male and female signups. If 100% of your signups are dudes, your webserver will still happily return HTTP 200 and your monitoring system will say everything is fine - but something is really wrong, e.g. the gender selection in the signup process is broken.
Your monitoring system also has to send you an alert if the number of messages that are sent per second is only half of what they were 5 minutes ago or if half the notification emails that you send to your users bounce.

ProdEagle.com let's you monitor whatever you want with almost no engineering work to be done.
All you need to do is increment a counter on the server side whenever something interesting happens, for example a female user signs up. With the ProdEagle library (available in PHP, Python, Java) this can be done in one line:

increment_counter("Signup.Female");
increment_counter("Signup.Total");


The counters get automatically exported every minute to prodeagle.com (the webapp that is your dashboard). On the dashboard you get beautiful graphs of your counters and can setup alerts if some counters drop bellow or exceed a value. With the two counters above you could already setup very interesting graphs and alerts. Examples of what ProdEagle would support out of the box are:


This can all be done with a few mouse clicks. ProdEagle is free of charge, you can just give it a try on prodeagle.com.

Dienstag, 2. August 2011

Much more efficient implementation of db.ListProperty(int)

So you want to store a "very long list of numbers" in your AppEngine Model which doesn't need to be indexed.

Well, you could just do this:



class Foo(db.Model):
  numbers = db.ListProperty(int, indexed=False)


However, you will quickly notice that the performance and the memory consumption of this sucks if you add thousands of integers to this list.

Instead you should use the native python "array.array" type as the list and store it in a BlobProperty.

Here is an appstats comparison of the same code running twice with the two different implementations (the code gets a Foo entity and appends a couple of integers to the existing numbers):

array.array and BlobProperty:
db.ListProperty(int)




What a win! Not only is the array.array implementation about 10x faster with the RPCs (over 30x faster without transactions!) but also the the time it actually takes to append a couple of integers is much better - notice the gap between datastore_v3.Get and datastore_v3.Put.

So I guess you should use this instead:


class Foo(db.Model):
  numbers = ArrayProperty()

The implementation bellow also works with floats, shorts and doubles you just need to pass item_type='type' to the ArrayProperty constructor, where type is one of the types found here:


The implementation:

#!/usr/bin/env python
#
# Copyright 2011 MiuMeet AG
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.appengine.api import datastore_types
from google.appengine.ext import db

import array

class ArrayProperty(db.BlobProperty):
  """A numeric array, which is more efficient than db.ListProperty(int)"""

  def __init__(self, verbose_name=None, item_type="L", **kwds):
    """Construct array property.

    Args:
      verbose_name: Verbose name is always first parameter.
      item_type: The type of the array.
        See http://docs.python.org/library/array.html
    """
    super(db.BlobProperty, self).__init__(verbose_name, **kwds)
    self.item_type = item_type


  def default_value(self):
    """Default value for unassigned values.

    Returns:
      A copy of the array or None as provided by __init__(default).
    """
    if self.default != None:    
      return array.array(self.item_type, self.default)


  def get_value_for_datastore(self, model_instance):
    """Get value from property to send to datastore.

    We retrieve the array from the model instance and return a
    the string representation of it.

    See base class method documentation for details.
    """
    value = self.__get__(model_instance, model_instance.__class__)
    if value is not None:
      value = db.Blob(value.tostring())
    return value

  def make_value_from_datastore(self, value):
    """Native representation of this property.

    We receive a blob from the entity and return an array of item_type.

    See base class method documentation for details.
    """
    if value is not None:
      arr = array.array(self.item_type)
      arr.fromstring(value)
      return arr
    return value

  data_type = datastore_types.Blob
  
  def validate(self, value):
    """Validate property.

    Returns:
      A valid value.

    Raises:
      BadValueError if property is not an instance of array.array(item_type).
    """
    if isinstance(value, array.array) and value.typecode == self.item_type:
      return value
    elif value is not None:
      raise db.BadValueError(
          'Property %s must be an array.array("%s") instance' %
          (self.name, self.item_type))



Enjoy,
-Andrin

Donnerstag, 17. Februar 2011

Programmatically upload images to blobstore on AppEngine

Update: There is now a native API call to do this:



from __future__ import with_statement
from google.appengine.api import files
from google.appengine.ext import blobstore

   def get_blob_key(self, data, _type):
       # Create the file
       file_name = files.blobstore.create(mime_type = _type)

       # Open the file and write to it
       with files.open(file_name, 'a') as f:
           f.write(data)

       # Finalize the file. Do this before attempting to read it.
       files.finalize(file_name)

       # Get the file's blob key
       blob_key = files.blobstore.get_blob_key(file_name)
       return blob_key


Old way:

Let's assume you would like to programmatically store an image into blobstore on Google AppEngine in Python like this:

key = blobstore_image.uploadImage(open("test.png", "rb").read())

and then serve it like this:

self.response.out.write("<img src='" + blobstore_image.getServingUrl(key) + "'>")

Well, with the normal API, that's not easily possible.

However, I created exactly this API. :)

Prerequisites:

Libraries:
You need to have a copy of Poster's encode.py in the same directory as my library.

App.yaml:
Add this to your app.yaml:
- url: /blobstore_image(/.*)?
  script: /tools/blobstore_image.py
Special dev_appserver.py!
This is very important:
You need to run an additional dev_appserver.py instance if you want to use this library locally!

dev_appserver.py --port 8080 .
dev_appserver.py --port 8081 .

My library does a post request to your app, while handling the current request. Since the dev_appserver.py is single threaded this would not work. Therefor you have to run a second instance locally that handles all blobstore related operations. But don't worry you wont notice and in production the extra code is not even used :)


The library (blobstore_image.py):


#!/usr/bin/env python
#

DEV_APP_SERVER_BLOBSTORE_ENTITY_PORT = 8081

import os
import urllib
from tools.poster import multipart_encode, MultipartParam
from google.appengine.api import images
from google.appengine.api import urlfetch
from google.appengine.ext import blobstore
from google.appengine.ext import webapp
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import template
from google.appengine.ext.webapp.util import run_wsgi_app

def getServingUrl(key, size=None, crop=False):
  global DEV_APP_SERVER_BLOBSTORE_ENTITY_PORT
  url = images.get_serving_url(key)
  if development():
    url = url[7:]
    url = ("http://localhost:%d%s" %
           (DEV_APP_SERVER_BLOBSTORE_ENTITY_PORT,
            url[url.find("/"):]))
  if size:
    url += "=s%d"
    if crop:
      url += "-c"
  return url

def uploadImage(data):
  global DEV_APP_SERVER_BLOBSTORE_ENTITY_PORT        
  params = []
  params.append(MultipartParam(
      "file",
      filename='file',
      value=data))
  payloadgen, headers = multipart_encode(params)
  payload = str().join(payloadgen)
  url = None
  if development():
    url = urlfetch.fetch(
       url=("http://localhost:%d/blobstore_image/geturl" %
             DEV_APP_SERVER_BLOBSTORE_ENTITY_PORT)).content
  else:
    url = urlfetch.fetch(
           "http://%s.latest.%s.appspot.com/blobstore_image/geturl" %
           (os.environ["CURRENT_VERSION_ID"].split(".")[0],
            os.environ["APPLICATION_ID"].replace("s~"""))).content
try:
    result = urlfetch.fetch(
        url=url,
        payload=payload,
        method=urlfetch.POST,
        headers=headers,
        deadline=10,
        follow_redirects=False)
    if "location" in result.headers:
      location = result.headers["location"]
      key = location[location.rfind("/") + 1:]
      return key
    else:
      return None
  except:
    return None

def development():
  return os.environ['SERVER_SOFTWARE'].find('Development') == 0 

class GetUrl(webapp.RequestHandler):
    def get(self):
      self.response.out.write(blobstore.create_upload_url('/blobstore_image'))

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
      upload_files = self.get_uploads('file')
      blob_info = upload_files[0]
      self.redirect('%s' % blob_info.key())

def main():
    application = webapp.WSGIApplication(
          [('/blobstore_image/geturl', GetUrl),
           ('/blobstore_image', UploadHandler),
          ], debug=True)
    run_wsgi_app(application)

if __name__ == '__main__':
  main()

Donnerstag, 10. Februar 2011

Schedule MapReduce daily on AppEngine with Cron.yaml in Python

In order to create rich daily statistic for MiuMeet I need to run MapReduces on a daily basis.

With the standard MapReduce library and a little helper class this becomes very easy to do.

cron.yaml
The cron.yaml lets you define tasks that should be executed daily on your AppEngine app.


cron:
  - description: DailyStats MapReduce
  url: /cron_mapreduce?name=Stats&reader_spec=mymr.map&entity_kind=model.MyModel
  schedule: every day 00:00

The cron_mapreduce.py takes a couple of cgi-arguments:
  • name: The name of your MapReduce
  • reader_spec: The Mapper function of your MapReduce
  • entity_kind: The Datastore entity kind you want to process
  • reader_parameters (optional): The input reader class (default is datastore input reader)
  • processing_rate (optional): The processing rate of the input reader (default: 100)
  • done_callback (optional): A URL that should be called after the MapReduce finishes (default: None)

app.yaml
Add my library to the your app.yaml


handlers:
- url: /cron_mapreduce.*
   script: /cron_mapreduce.py
   login: admin


My cron_mapreduce.py library


#!/usr/bin/env python

"""
 Created by Andrin von Rechenberg, 2011.
 
 This library is free software: you can redistribute it
 and/or modify it under the terms of the GNU General Public License
 as published by the Free Software Foundation, either version 3 of
 the License, or (at your option) any later version.

 Example usage:
   http://devblog.miumeet.com/2011/02/schedule-mapreduce-daily-on-appengine.html    
 
Cheers,
 -Andrin

"""

from google.appengine.ext import webapp
from google.appengine.ext.webapp import util

from mapreduce import control as mr_control

class ScheduleMapReduce(webapp.RequestHandler):
  def get(self):
    mr_control.start_map(
     self.request.get("name"),
     self.request.get("reader_spec", "your_mapreduce.map"),
     self.request.get("reader_parameters",
                      "mapreduce.input_readers.DatastoreInputReader"),
     { "entity_kind": self.request.get("entity_kind", "models.YourModel"),
       "processing_rate": int(self.request.get("processing_rate", 100)) },
     mapreduce_parameters={"done_callback": self.request.get("done_callback",
                                                             None) } )
    self.response.out.write("MapReduce scheduled");

application = webapp.WSGIApplication([
  ('/.*', ScheduleMapReduce),
], debug=True)


def main():
  util.run_wsgi_app(application)
if __name__ == "__main__":
  main() 


Cheers,
-Andrin

Donnerstag, 3. Februar 2011

Facebook JS API & YouTube API & AppEngine and you have CrackUp.tv

Today I wanted to look into Facebook integration for MiuMeet.

I decided to play with the Facebook Graph Javascript SDK and build a small application.

So what should I build? How'bout this:

  • A website that shows YouTube videos that your friends thought were funny on facebook.
  • Whenever a clip ends, the user needs to rate how funny the video is in order to see the next video.
  • Depending on his rating we show different videos to the users in the future.
  • If the user liked the video, "Like" it on facebook, if he loved it, post it on his facebook wall.
I was able to build and launch a version within 48h it was that easy.

See the result here: www.crackup.tv




Core parts of the code

Initialize Facebook GraphAPI Javascript:

FB.init({ appId: '186752184677345', status: true,
            cookie: true, xfbml: true });
The "cookie" part is quite important, because that's how I authenticate the user afterwards on the AppEngine server-side.

Once the user is logged in I request from facebook all the friends that also used this app.
For this you need to do a legacy call, the Graph API doesn't support it directly, but it's still very easy:

FB.api( { method: 'friends.getAppUsers' }, callback});
After this, I send a request to the AppEngine server to load all videos that the users friends rated high.
I can limit the number of datastore lookups drastically, by only looking up the friends of the user that have installed this facebook app too by using the legacy call above instead of loading /friends/. This is quite nice. If you would just use the normal friend list I would have to lookup tons of entries in the datastore that wont exist.

In AppEngine I authenticate the user by just looking at the facebook cookie:


def getAuthProfileUID(request):
  if "fbs_186752184677345" not in request.cookies:
    return None
  try:
    cookie = request.cookies["fbs_186752184677345"]
    uid = cookie[cookie.find("uid=") + 4:]
    if "&" in uid:
      uid = uid[:uid.find("&")]
    if "\"" in uid:
      uid = uid[:uid.find("\"")]
    return uid
  except:
    return None
Then I magically create the playlist for this user and hope that he posts many of them on his wall :)

Simple.

Key take-aways
  • The Facebook JS API is very simple to use and quite nicely structured
  • You don't need to talk to facebook from the server-side to authenticate the user.
  • If you want to load all the friends of a user that you have in your datastore, use the Javascript call friends.getAppUsers instead of /friends/ because you can limit the numbers of friends you need to look up drastically.
Cheers,
-Andrin