As projects reach a scale of moderate complexity, they must represent objects in many forms: as a JSON record from an API endpoint, cache, or database; as a thrift object for RPC; or as an object in memory.
At Uber, we've begun to use the schematics library for our in-memory representations. It provides a canonical form for an object to take, which can then be serialized into thrift, SQL, or various JSON forms. The native serialization works great for the base case.

Too Many Representations
However, you quickly run into scenarios where a field is referenced by a different attribute on different objects. Take our Document entity. It has createdAt in thrift, but created_at in memory. Moreover, in thrift it is an epoch integer, but a DateTime in MySQL. It has s3url in our legacy API and url in our new database.
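To make the mismatch concrete, here is a hypothetical sketch of the same Document rendered two ways; the field values and the bucket path are invented for illustration:

```python
from datetime import datetime, timezone

ts = 1400000000  # hypothetical creation time, as a thrift-style epoch integer

# thrift / legacy API representation: camelCase names, epoch integer, s3url
thrift_doc = {"createdAt": ts, "s3url": "s3://bucket/doc.pdf"}

# in-memory representation: snake_case names, a datetime object, url
entity_data = {
    "created_at": datetime.fromtimestamp(ts, tz=timezone.utc),
    "url": thrift_doc["s3url"],
}
```

A Mapper's job is exactly this renaming and type conversion, in both directions.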
Enter the Mapper class. Mappers define how to convert between representations of the same data. Without care, this can quickly explode your code's complexity: you end up writing rules that map nearly identical attribute names. And because (de)serialization operations are common, they need to be fairly fast. Here are the tricks we used to speed ours up.
1. Pre-compute Attribute Maps
An early version of our Mappers computed the mapping between attribute names, like those on the Document entity above, on the fly. This is both slow and repetitive. We built a metaclass that creates a list of these attribute maps (e.g. from created_at to createdAt) at class definition. Then, (de)serialization is simply iterating through the attribute pair list.

class Meta(type):
    def __new__(mcs, name, bases, dct):
        entity_cls = dct['entity_cls']
        # precompute the map once, at class definition
        record_mapper = {}
        for field in entity_cls.fields():
            record_mapper[field] = ...
        dct['record_mapper'] = record_mapper
        return super(Meta, mcs).__new__(mcs, name, bases, dct)

class Mapper(object):
    __metaclass__ = Meta
    entity_cls = MyEntity

    @classmethod
    def record_to_entity(cls, record):
        entity_data = {}
        for entity_field, record_field in cls.record_mapper.iteritems():
            entity_data[entity_field] = ...
        return cls.entity_cls(raw_data=entity_data)
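The skeleton above elides the schematics-specific pieces. For a self-contained illustration, here is a Python 3 sketch of the same trick, with a hypothetical field_map dict standing in for entity_cls.fields():

```python
class MapperMeta(type):
    """Builds the attribute pair list once, when the Mapper subclass is defined."""
    def __new__(mcs, name, bases, dct):
        field_map = dct.get("field_map")  # hypothetical: record name -> entity name
        if field_map is not None:
            # Precompute as a list of pairs so (de)serialization is a plain loop.
            dct["record_mapper"] = list(field_map.items())
        return super().__new__(mcs, name, bases, dct)

class DocumentMapper(metaclass=MapperMeta):
    field_map = {"createdAt": "created_at", "s3url": "url"}

    @classmethod
    def record_to_entity(cls, record):
        # No name computation here -- just iterate the precomputed pairs.
        return {entity_field: record[record_field]
                for record_field, entity_field in cls.record_mapper}

entity = DocumentMapper.record_to_entity(
    {"createdAt": 1400000000, "s3url": "s3://bucket/doc.pdf"})
# entity == {"created_at": 1400000000, "url": "s3://bucket/doc.pdf"}
```

The per-call work is a dict lookup per field; all the name-matching logic runs exactly once, at class definition.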
2. Fast Date Parsing
dateutil is the standard for ISO 8601 parsing: it supports many variants (milliseconds, timezones, a prefix of full ISO 8601). The problem is, it is slow as hell.

DocumentMapper.record_to_entity(record)
10000 loops, best of 5: 513.436079 usec per loop  # REALLY slow

Replace it with a C library, ciso8601, and you get a huge performance boost.

DocumentMapper.record_to_entity(record)
10000 loops, best of 5: 62.783957 usec per loop  # 8.2x faster
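The swap is nearly drop-in. A sketch of the pattern, with a stdlib strptime fallback (the fixed-format fallback is an assumption of mine, not part of the original setup; it only accepts one format, unlike dateutil):

```python
from datetime import datetime

try:
    # ciso8601's C parser; parse_datetime is the library's entry point.
    from ciso8601 import parse_datetime
except ImportError:
    # Fallback: a fixed-format stdlib parser. Far less flexible than dateutil,
    # but avoiding dateutil's generality is exactly where the speed comes from.
    def parse_datetime(s):
        return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S%z")

dt = parse_datetime("2015-06-01T12:30:45+00:00")
```

If your records use a single known timestamp format, even plain strptime beats dateutil by a wide margin; ciso8601 gets you speed without giving up format flexibility.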
3. Disable Schematics Validation
schematics helpfully does validation upon an object's __init__. However, we know the data is valid because it comes from a Mapper, and schematics has no option to skip the validation. There's a neat Python trick to create an object while skipping the __init__ function:

instance = MyClass.__new__(MyClass)

will create an instance of MyClass without executing __init__.
Removing the schematics constructor and assigning attributes directly gave us another improvement.
DocumentMapper.record_to_entity(record)
10000 loops, best of 5: 9.325492 usec per loop # 55x faster than original
dict(**record)  # gives a sense of the baseline
10000 loops, best of 5: 0.469947 usec per loop

dict(**record) isn't a feasible option here, since it gives us no custom class, no data validation, and no field validation. We're 20x off that baseline, which is a lot. But in absolute terms, we're below 10us, which I'm happy with.
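For illustration, a minimal sketch of the __new__ trick with direct attribute assignment; Entity and its fields tuple are hypothetical stand-ins for a schematics model:

```python
class Entity(object):
    fields = ("created_at", "url")  # hypothetical field list

    def __init__(self, raw_data):
        # Stand-in for schematics' expensive validation work.
        raise RuntimeError("validation should have been skipped")

def fast_construct(cls, data):
    # Allocate the instance without calling __init__, then
    # assign the already-validated attributes directly.
    instance = cls.__new__(cls)
    for field in cls.fields:
        setattr(instance, field, data[field])
    return instance

doc = fast_construct(Entity, {"created_at": 1400000000, "url": "s3://bucket/doc.pdf"})
# doc is a real Entity, but Entity.__init__ never ran.
```

The safety argument rests entirely on the Mapper having produced valid data; constructing entities this way from untrusted input would silently skip validation.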
We haven't released the code publicly yet. If you use schematics and want to check out the library, reach out to me. Either way, I hope you find these optimizations useful in your work.