Python Serialization Performance 2016

As projects reach moderate complexity, they must represent objects in many forms: as a json record from an API endpoint, cache, or database; as a thrift object for RPC; or as an in-memory object.
At Uber, we've begun to use the schematics library for our in-memory representations. It provides a canonical form for an object to take, which can then be serialized into thrift, sql, or various json forms. The native serialization works great for the base case.
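For context, a schematics model declares an entity's fields once, in canonical form. A minimal sketch, using a hypothetical Document entity that mirrors the example below:

from schematics.models import Model
from schematics.types import DateTimeType, StringType

# Canonical in-memory form: declared once, serializable many ways.
class Document(Model):
    created_at = DateTimeType()
    url = StringType()

doc = Document(raw_data={'created_at': '2016-03-01T12:30:45',
                         'url': 'http://example.com/doc'})
doc.to_primitive()  # back to plain, json-ready types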

Too Many Representations

However, you quickly run into scenarios where the same field goes by a different attribute name on different objects. Take our Document entity. It has createdAt in thrift, but created_at in memory. Moreover, in thrift it is an epoch integer, while mysql stores a DateTime. It has s3url in our legacy API and url in our new database.
Enter the Mapper class. Mappers define how to convert between representations of the same data. Without care, they can quickly explode your code complexity: you end up writing rule after rule to map nearly identical attribute names.
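To see why, here's roughly what a hand-written mapper looks like, with one rule per field (using the Document field names from above; everything else is illustrative):

# Hand-written mapping rules: one per field, repeated per representation.
def document_record_to_entity(record):
    return Document(raw_data={
        'created_at': record['createdAt'],  # snake_case vs. camelCase
        'url': record['s3url'],             # new name vs. legacy name
        # ...and so on, for every field of every entity
    })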
Also, because (de)serialization operations are common, they need to be fairly fast. Here are the tricks we used to make ours fast.

1. Pre-compute Attribute Maps

An early version of our Mappers computed the mapping between attribute names, like those on the Document entity above, on the fly. This is both slow and repetitive. We built a metaclass that creates a list of these attribute maps (e.g. from created_at to createdAt) at class definition time. Then, (de)serialization is simply a matter of iterating through the attribute pair list.
class Meta(type):
    def __new__(mcs, name, bases, dct):
        entity_cls = dct['entity_cls']

        # precompute the map once, at class definition time
        record_mapper = {}
        for field in entity_cls.fields():
            record_mapper[field] = ...  # the record-side name, e.g. createdAt
        dct['record_mapper'] = record_mapper

        return super(Meta, mcs).__new__(mcs, name, bases, dct)

class Mapper(object):
    __metaclass__ = Meta
    entity_cls = MyEntity

    @classmethod
    def record_to_entity(cls, record):
        # (de)serialization is now just a walk over the precomputed pairs
        entity_data = {}
        for entity_field, record_field in cls.record_mapper.iteritems():
            entity_data[entity_field] = record[record_field]
        return cls.entity_cls(raw_data=entity_data)
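A concrete mapper then just declares its entity, and the map is built when the class is defined, not on every call. A sketch of how the DocumentMapper from the benchmarks below might look:

class DocumentMapper(Mapper):
    entity_cls = Document

# Built once, at class definition:
# DocumentMapper.record_mapper == {'created_at': 'createdAt', 'url': 's3url', ...}
entity = DocumentMapper.record_to_entity({'createdAt': '2016-03-01T12:30:45',
                                          's3url': 'http://example.com/doc'})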

2. Fast Date Parsing

dateutil is the standard for iso8601 parsing: it supports many variants (milliseconds, timezones, prefixes of full iso8601). The problem is, it is slow as hell.
DocumentMapper.record_to_entity(record)
10000 loops, best of 5: 513.436079 usec per loop # REALLY slow
Replace it with a C library, ciso8601, and performance improves hugely.
DocumentMapper.record_to_entity(record)
10000 loops, best of 5: 62.783957 usec per loop  # 8.2x faster
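The swap itself is a one-liner, assuming the records carry iso8601 timestamp strings:

import ciso8601
import dateutil.parser

ts = '2016-03-01T12:30:45.123456'
dateutil.parser.parse(ts)    # flexible, pure Python, slow
ciso8601.parse_datetime(ts)  # iso8601 only, C extension, fast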

3. Disable Schematics Validation

schematics helpfully validates data in __init__. However, we already know the data is valid because it came through a Mapper, and schematics has no option to skip validation. There's a cool Python trick to create an object while skipping __init__:
instance = MyClass.__new__(MyClass)
will create an instance of MyClass without executing __init__.
Removing the schematics constructor and assigning attributes directly gave us another improvement.
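Combined with the precomputed map from trick #1, the hot path becomes something like this sketch (plain setattr is shown for illustration; the exact assignment depends on how the entity class stores its fields):

@classmethod
def record_to_entity(cls, record):
    # allocate the entity without running its validating __init__
    entity = cls.entity_cls.__new__(cls.entity_cls)
    # assign the already-validated data directly onto the instance
    for entity_field, record_field in cls.record_mapper.iteritems():
        setattr(entity, entity_field, record[record_field])
    return entity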
DocumentMapper.record_to_entity(record)
10000 loops, best of 5: 9.325492 usec per loop  # 55x faster than original
dict(**record)  # gives sense of baseline
10000 loops, best of 5: 0.469947 usec per loop
dict(**record) isn't a feasible option here, since it gives us no custom class and no validation. We're 20x off that baseline, which is a lot. But in absolute terms we're below 10us, which I'm happy with.
We haven't released the code publicly yet. If you use schematics and want to check out the library, reach out to me. Either way, I hope you find these optimizations useful in your work.
