rocksdb/thrift/lib/cpp/protocol/neutronium
Dhruba Borthakur 80c663882a Create leveldb server via Thrift.
Summary:
First draft.
Unit tests pass.

Test Plan: unit tests attached

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D3969
2012-07-07 09:42:39 -07:00

Neutronium: a very dense encoding of Thrift objects

Neutronium is a Thrift encoding format optimized for space at the expense of
speed.  It achieves its density in several ways:

1. It does not encode type and tag information.  This is stored out of line in
a Schema object, which must be provided at both encoding and decoding time.
If encoding many Thrift objects, you can transmit / store the Schema only
once.  The Schema (in thrift/lib/thrift/reflection.thrift) is itself a
Thrift object, and can be serialized / deserialized in the usual way.

2. It encodes data very compactly.  Bytes use one byte each; larger numbers
(i16, i32, i64, double) use variable-length encoding (GroupVarint).  Booleans
use one bit each.  We also encode one bit for every optional field that exists
in the structure definition (indicating whether the field is set or not).
Strings can be encoded in a variety of formats; see Configuration below.

3. Aggregates (lists, maps, sets) are encoded efficiently -- they are encoded
like a structure with a variable number of fields.  So list<i32> takes
advantage of GroupVarint encoding among consecutive values.

4. Strings can be interned: when encoding multiple strings, we can detect
duplicates and store only an ID.  This requires using an InternTable and
passing the same InternTable to the encoder and decoder (the InternTable
can be easily serialized and deserialized).
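
The interning idea in point 4 can be sketched as follows.  This is a minimal
illustration, not the real InternTable from InternTable.h: the class name
SimpleInternTable and its methods add, get and size are hypothetical, and the
actual interface may differ.  It only shows how duplicate strings collapse to
one stored copy plus a small ID.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an intern table; NOT the real InternTable API.
class SimpleInternTable {
 public:
  // Returns the ID for `s`, adding it to the table the first time it is seen.
  uint32_t add(const std::string& s) {
    auto it = ids_.find(s);
    if (it != ids_.end()) {
      return it->second;  // duplicate: reuse the existing ID
    }
    uint32_t id = static_cast<uint32_t>(strings_.size());
    strings_.push_back(s);
    ids_.emplace(s, id);
    return id;
  }

  // Maps an ID back to its string (what a decoder would need).
  const std::string& get(uint32_t id) const {
    assert(id < strings_.size());
    return strings_[id];
  }

  // Number of distinct strings stored.
  std::size_t size() const { return strings_.size(); }

 private:
  std::vector<std::string> strings_;               // ID -> string
  std::unordered_map<std::string, uint32_t> ids_;  // string -> ID
};
```

With this sketch, adding "foo", "bar", "foo" yields the IDs 0, 1, 0 and the
table holds two strings.  The encoder would write the IDs, and the decoder
would need the same table contents to map them back.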

Neutronium is backwards compatible as long as the same Schema is used for
encoding and decoding, and the changes you made to the Thrift definition are
backwards compatible (that is, fields were added, removed, or renamed, but
field ids remained the same).

Configuration:

Neutronium can be configured by using field attributes in your Thrift
definition:

struct Foo {
  1: i32 a (neutronium.fixed = 1),
}

Attributes for number fields:
  neutronium.fixed = 1
    Encode the number as a fixed-length value (i16 takes 2 bytes, i32
    takes 4 bytes, i64 takes 8 bytes) instead of using variable-length
    (GroupVarint) encoding.
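
To see when a fixed-length encoding can beat a variable-length one, here is a
minimal sketch of a common group-varint layout for four 32-bit values: one
control byte holding four 2-bit lengths, followed by each value's significant
bytes in little-endian order.  The function names bytesNeeded and encodeGroup
are hypothetical, and this README does not specify Neutronium's exact byte
layout or how signed values and doubles are mapped, so treat this as an
assumption-laden illustration only.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Bytes needed to store v with leading zero bytes dropped (1 to 4).
inline int bytesNeeded(uint32_t v) {
  if (v < (1u << 8)) return 1;
  if (v < (1u << 16)) return 2;
  if (v < (1u << 24)) return 3;
  return 4;
}

// Encodes four values as: one control byte (2 bits per value, storing
// length - 1), then each value's bytes, least significant byte first.
// Assumed layout for illustration; not taken from Encoder.h.
inline std::vector<uint8_t> encodeGroup(const std::array<uint32_t, 4>& vals) {
  std::vector<uint8_t> out(1, 0);  // out[0] is the control byte
  for (int i = 0; i < 4; ++i) {
    int n = bytesNeeded(vals[i]);
    out[0] |= static_cast<uint8_t>((n - 1) << (2 * i));
    for (int b = 0; b < n; ++b) {
      out.push_back(static_cast<uint8_t>(vals[i] >> (8 * b)));
    }
  }
  return out;
}
```

Under this layout the values 1, 300, 70000, 0 take 8 bytes instead of the 16
that four fixed-length i32 fields would need.  Four values that each need all
four bytes take 17 bytes, one more than the fixed form, which is the kind of
field where neutronium.fixed is likely to help.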

Attributes for string fields:
  neutronium.fixed = <length>
  neutronium.pad = 'X'
    Do not encode the string length, assume that all strings have length
    <length>.  Strings longer than <length> are truncated; strings
    shorter than <length> are padded with 'X' (default: the null byte, '\0').
    Use this when you expect that all / most strings have a fixed length.

  neutronium.terminator = 'X'
    Do not encode the string length; store strings terminated with a
    terminator ('\0' will likely be a popular choice).  Encoding strings that
    contain the terminator is an error.

  neutronium.intern = 1
    Intern strings; requires a non-NULL InternTable.
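
The fixed-length and terminated string formats described above can be
sketched as follows.  The helper names encodeFixed, appendTerminated and
readTerminated are hypothetical and are not part of the Neutronium headers;
the sketch only mirrors the rules stated above (truncate or pad to <length>;
reject strings that contain the terminator).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// neutronium.fixed = <length>, neutronium.pad = 'X' (sketch):
// truncate strings longer than `length`, pad shorter ones with `pad`.
inline std::string encodeFixed(const std::string& s, std::size_t length,
                               char pad = '\0') {
  std::string out = s.substr(0, length);  // truncates if s is too long
  out.resize(length, pad);                // pads if s is too short
  return out;
}

// neutronium.terminator = 'X' (sketch): append s followed by the
// terminator; a string containing the terminator is an error.
inline void appendTerminated(std::string& out, const std::string& s,
                             char terminator) {
  if (s.find(terminator) != std::string::npos) {
    throw std::invalid_argument("string contains the terminator");
  }
  out += s;
  out.push_back(terminator);
}

// Reads one terminated string starting at `pos` and advances `pos`
// past the terminator.
inline std::string readTerminated(const std::string& in, std::size_t& pos,
                                  char terminator) {
  std::size_t end = in.find(terminator, pos);
  if (end == std::string::npos) {
    throw std::runtime_error("missing terminator");
  }
  std::string s = in.substr(pos, end - pos);
  pos = end + 1;
  return s;
}
```

For example, encodeFixed("abc", 5, 'X') gives "abcXX" and
encodeFixed("abcdefg", 5, 'X') gives "abcde".  A terminated string costs one
extra byte per value regardless of its length.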

Attributes for enum fields:
  neutronium.strict = 1
    Encode the enum using as few bits as necessary to encode all possible
    values; note that it becomes an error to encode an enum value that is not
    specified in the Thrift definition.
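
One way to picture the strict enum encoding is sketched below.  It assumes
that each declared value is replaced by its index in the sorted list of
declared values, so an enum with N values needs ceil(log2(N)) bits.  That
mapping is an assumption made for illustration; the README does not say how
Neutronium assigns the bits.  The names bitsForCount and strictEnumIndex are
hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Bits needed to distinguish `count` distinct values (0 if count <= 1).
inline int bitsForCount(std::size_t count) {
  int bits = 0;
  while ((std::size_t{1} << bits) < count) {
    ++bits;
  }
  return bits;
}

// Returns the index of `value` within `declared`, which must be sorted in
// ascending order.  Throws if the value is not declared, mirroring the
// rule that encoding an unspecified enum value is an error.
inline std::size_t strictEnumIndex(const std::vector<int32_t>& declared,
                                   int32_t value) {
  auto it = std::lower_bound(declared.begin(), declared.end(), value);
  if (it == declared.end() || *it != value) {
    throw std::out_of_range("enum value not in the Thrift definition");
  }
  return static_cast<std::size_t>(it - declared.begin());
}
```

Under this assumption, an enum declaring the values 1, 5 and 100 would need
2 bits per field, and the value 100 would be written as index 2.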