Sunday, April 3, 2016

Data encoding, schema evolution and a case study

Data Encoding

When sending data over the wire or storing it on disk, we need a way to encode the data into bytes. There are different ways to do this:

1. Use programming-language-specific serialization, such as Java serialization or Python's pickle, which makes developers in other languages frown.

2. Use language-agnostic formats such as XML or JSON. We are particularly excited by JSON's simplicity; XML is too old school.

However, unlike XML, which has XML Schema, JSON is not a strictly defined format, and even with JSON Schema it has a few drawbacks that will make you bite your nails. First, JSON doesn't differentiate integers from floating-point numbers. Second, JSON documents are still verbose, because field names and type information have to be explicitly represented in the serialized format. Lastly, the lack of documentation and structure makes consuming data in these formats more challenging, because fields can be arbitrarily added or removed.
Updated: JSON Schema defines seven primitive types for JSON values:
  • array: a JSON array.
  • boolean: a JSON boolean.
  • integer: a JSON number without a fraction or exponent part.
  • number: any JSON number; number includes integer.
  • null: the JSON null value.
  • object: a JSON object.
  • string: a JSON string.

3. Use a binary format.

For example, RPG on the IBM iSeries (AS/400) mainframe uses binary data in a record-based, column-like format, with predefined data-type sizes and record sizes. But fixed sizes may waste a lot of space, since every field and record really needs a different size.

For example, [3, 27] is a JSON array containing the items 3 and 27. In a compact binary encoding (this is in fact how Avro encodes an array of longs), it could be encoded as the long value 2 (the item count, hex 04), followed by the long values 3 and 27 (hex 06 36), terminated by zero: 04 06 36 00. A short sketch of this encoding appears after this list.

4. Use a binary format with a schema. To describe all of the bytes in the above binary format, we need a document that acts as a contract between producer and consumer: a schema. This is where libraries such as Thrift, Protocol Buffers, and Avro come in.
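
The array example above (04 06 36 00) uses a variable-length zig-zag encoding for the longs. Here is a minimal, hand-rolled sketch of that encoding; it mirrors how Avro encodes longs, but the class and helper names are my own, for illustration only.

import java.io.ByteArrayOutputStream;

public class ZigZagDemo {
    // Zig-zag maps signed longs to unsigned ones so that small magnitudes
    // (positive or negative) stay small, then 7 bits are emitted per byte.
    static void writeLong(long n, ByteArrayOutputStream out) {
        long z = (n << 1) ^ (n >> 63);            // zig-zag encode
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // set the continuation bit
            z >>>= 7;
        }
        out.write((int) z);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(2, out);   // item count   -> 04
        writeLong(3, out);   // first item   -> 06
        writeLong(27, out);  // second item  -> 36
        writeLong(0, out);   // end of array -> 00
        for (byte b : out.toByteArray()) {
            System.out.printf("%02x ", b);        // prints: 04 06 36 00
        }
    }
}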

The diagram below illustrates the byte encoding of the Avro format. [Figure: avro.png]

The schema plays a pivotal role here and is required during both serialization and deserialization. Because the schema is provided at decoding time, metadata such as field names doesn't have to be explicitly encoded in the data. This makes the binary encoding of Avro data very compact.
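
As a concrete illustration, here is a minimal sketch of an Avro round trip using the GenericRecord API; the 'User' schema and its fields are made up for this example.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class AvroRoundTrip {
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    public static void main(String[] args) throws IOException {
        GenericRecord user = new GenericData.Record(SCHEMA);
        user.put("name", "alice");
        user.put("age", 30);

        // Serialize: only the field values end up in the bytes;
        // field names and types live in the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(user, encoder);
        encoder.flush();
        byte[] bytes = out.toByteArray();

        // Deserialize: the same schema tells the reader how to
        // interpret the compact byte stream.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        System.out.println(decoded);  // {"name": "alice", "age": 30}
    }
}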

Another benefit the Avro schema brings is that for statically typed languages like Java, the serialization and deserialization code can be auto-generated from the schema.

Schema Evolution

The one thing that never changes in software development is CHANGE itself; in this case, the schema changes. Schema evolution describes how an application should behave when the Avro schema changes after data has already been written to the store using older versions.

Schema evolution is about data transformation. The Avro specification describes in detail how data is handled when the downstream (reader) schema differs from the upstream (writer) schema: when that happens, the data is transformed from one schema to the other.

Schema evolution is applied only during deserialization. If the reader schema is different from the value's writer schema, then the value is automatically modified during deserialization to conform to the reader schema. To do this, default values are used.

There are three common aspects of schema evolution: backward compatibility, forward compatibility, and full compatibility.

Let's consider two cases: adding a field and removing a field.

To add a field while supporting schema evolution, a default value must be defined for it:

{"name" : "gender", "type" : "string" , "default" : "null"}]

When a reader using the old schema encounters this new field, the transformation automatically drops it; this is called forward compatibility.

When a reader using the new schema encounters data written without this new field, the transformation automatically adds the field and sets its value to the default, null in this case. This is called backward compatibility.

The same logic applies to the remove case, which I leave as an exercise to the reader.
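
To make the add-a-field case concrete, here is a minimal sketch, assuming a hypothetical 'User' record: data written with the old schema is read back with the new schema, and the missing gender field is filled in from its default during deserialization.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class AddFieldDemo {
    // Old (writer) schema: just a name.
    static final Schema OLD = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // New (reader) schema: adds a nullable gender field with a default.
    static final Schema NEW = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"gender\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    public static void main(String[] args) throws IOException {
        // Write a record with the old schema.
        GenericRecord user = new GenericData.Record(OLD);
        user.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(OLD).write(user, enc);
        enc.flush();

        // Read it back with the new schema; the missing gender field is
        // filled in with its default (null) during deserialization.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord evolved =
            new GenericDatumReader<GenericRecord>(OLD, NEW).read(null, dec);
        System.out.println(evolved);  // {"name": "alice", "gender": null}
    }
}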

Enum evolution

The intriguing part of schema evolution is enum evolution. The following line is from the Avro specification:
  • if the writer's symbol is not present in the reader's enum, then an error is signalled.
This makes schema evolution for enums practically impossible, because we often need to add new values to an enum field. It has turned out to be a long-standing issue on the Avro JIRA board: https://issues.apache.org/jira/browse/AVRO-1340. The proposal is actually simple:

"Use the default option to refer to one of enum values so that when a old reader encounters a enum ordinal it does not recognize, it can default to the optional schema provided one."

Basically, always define a default value for the enum, like below:
{"type":"enum", "name":"color", "symbols":["UNKNOWN","RED","GREEN","BLUE"], "default":"UNKNOWN"}
  
Lost in translation


One interesting side effect of schema evolution is shown in the example below:

Say we have two versions of a schema, v1 and v2, where v2 has a new field called 'color' with a default value of 'UNKNOWN'.

A v2 writer creates a doc with color = RED.
A v1 reader parses this data; the color field is dropped, and the reader updates the data store.
A v2 reader parses the updated data; the color field is restored, but its value is now 'UNKNOWN'!
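
Here is a minimal sketch of that round trip. The v1/v2 'Doc' schemas are hypothetical, and a plain string field 'color' with default 'UNKNOWN' stands in for the enum to keep the code short.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class LostInTranslation {
    static final Schema V1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"}]}");
    static final Schema V2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Doc\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"color\",\"type\":\"string\",\"default\":\"UNKNOWN\"}]}");

    static byte[] write(GenericRecord rec, Schema writer) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();
        return out.toByteArray();
    }

    static GenericRecord read(byte[] bytes, Schema writer, Schema reader) throws IOException {
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(bytes, null);
        return new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    }

    public static void main(String[] args) throws IOException {
        // 1. A v2 writer creates a doc with color = RED.
        GenericRecord docV2 = new GenericData.Record(V2);
        docV2.put("id", "doc-1");
        docV2.put("color", "RED");
        byte[] stored = write(docV2, V2);

        // 2. A v1 reader parses it (color is dropped) and writes it back.
        GenericRecord docV1 = read(stored, V2, V1);
        stored = write(docV1, V1);

        // 3. A v2 reader parses it again: color is back, but as the default.
        GenericRecord restored = read(stored, V1, V2);
        System.out.println(restored.get("color"));  // prints UNKNOWN, not RED
    }
}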
 
Avro naming convention

Records, enums and fixed are named types. The name portion of a fullname, record field names, and enum symbols must:
  • start with [A-Za-z_]
  • subsequently contain only [A-Za-z0-9_]
The reason is that when the code-generation tool generates code from the schema, record and enum names become class and variable names, and in most languages, including Java, identifiers cannot contain a dash.

A Case Study

Armed with the above knowledge, let's do a case study. Say we have a data store consisting of Apache Gora and a Kafka schema registry, which allows us to store data sets and register schemas to support schema evolution; the data store sends any new or updated data set to a Kafka pipe so that downstream consumers can consume it. Team A works on app A, which collects data and deposits it into the data store. Team B works on app B, which listens to the pipe, fetches the data from the data store, transforms it, and puts it back into the data store. Team A's app A also needs to read this data back, make changes, and put it back again. However, team B's changes should be preserved as much as possible, since app B is the more downstream application in this pipeline.

Approach 1:

App A has its own schema a, and app B has its own schema b.

At the beginning, app A has readers/writers for schemas a.1 and b.1, and app B has readers/writers for schemas a.1 and b.1.

Consider this case: team B adds new fields to its schema b.1, updates the schema to b.2, and updates its reader/writer to b.2. App A still reads this data with its reader/writer for b.1, so the new fields are dropped, and it writes the result to a.1. App B then reads the updated data from a.1, now without the new fields, into b.2; app B either cannot write this data back as b.2 or has to write its own data-transformation logic to set the missing values to their defaults.

The case of dropping fields is similar. App A and app B are loosely coupled, but we lose the out-of-the-box schema evolution the platform provides. Often the two teams end up coordinating with each other to update both schemas just to avoid this 'lost in translation' case.

Approach 2:

App A has its own schema, but it defines the payload as a JSON string that actually conforms to schema b, along with that schema's version number, which app B can retrieve and deposit to the data store.

When app B reads the data from app A, it not only reads the payload but also uses the correct writer, with the right version number b.1, to deposit the data to the store without losing any of the original data. So this approach lets the two applications evolve around the same schema; to avoid the lost-in-translation case, team A can always use the latest version of schema b.
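
As a rough sketch of app B's side of this approach (the envelope handling, the version lookup, and the schema here are all hypothetical), the JSON payload can be decoded against the right version of schema b using Avro's JSON decoder, after which app B can write it back with a writer for that same version:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class EnvelopeConsumer {
    // Hypothetical: look up the registered version of schema b.
    // In practice this would come from the schema registry.
    static Schema lookupSchemaB(String version) {
        return new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"B\",\"fields\":["
            + "{\"name\":\"value\",\"type\":\"string\"}]}");
    }

    // App B receives app A's envelope: the JSON payload plus the
    // version of schema b it conforms to.
    static GenericRecord decodePayload(String payloadJson, String schemaVersion)
            throws IOException {
        Schema schemaB = lookupSchemaB(schemaVersion);
        JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schemaB, payloadJson);
        return new GenericDatumReader<GenericRecord>(schemaB).read(null, decoder);
    }

    public static void main(String[] args) throws IOException {
        GenericRecord rec = decodePayload("{\"value\":\"hello\"}", "b.1");
        System.out.println(rec);  // {"value": "hello"}
    }
}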

The drawback of this approach is that we need to make sure the JSON payload is validated against schema b.1, and we now store the data as JSON, which loses the benefit of Avro's compact binary serialization.


References:
 
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
