Wednesday, April 27, 2016

[LeetCode] Patching Array

From leetcode
Given a sorted positive integer array nums and an integer n, add/patch elements to the array such that any number in range [1, n] inclusive can be formed by the sum of some elements in the array. Return the minimum number of patches required.
Example 1:
nums = [1, 3], n = 6
Return 1.
Combinations of nums are [1], [3], [1,3], which form possible sums of: 1, 3, 4.
Now if we add/patch 2 to nums, the combinations are: [1], [2], [3], [1,3], [2,3], [1,2,3].
Possible sums are 1, 2, 3, 4, 5, 6, which now covers the range [1, 6].
So we only need 1 patch.
Example 2:
nums = [1, 5, 10], n = 20
Return 2.
The two patches can be [2, 4].

Let's walk through example 1. Let curr be the current number in the array, and cover be the largest number such that every number in [1, cover] can be formed.
When curr = 1, cover = 1.
When curr = 3, there is a gap between cover and curr (we cannot form 2), so we insert 2 and update cover to 1 + 2 + 3 = 6.
6 >= 6, done.

Example 2:
curr = 1, cover = 1.
curr = 5, there is a gap between cover = 1 and 5, so insert cover+1 = 2, cover = 1+2 = 3. The gap between 3 and 5 is still there, so insert cover+1 = 4, cover = 3+4 = 7. Now there is no gap (5 <= 7+1), so update cover = 7+5 = 12.
curr = 10, no gap (10 <= 12+1), update cover = 12+10 = 22.
22 > 20, done.

So the two most important observations are:
1) when we insert a number, the coverage increases by that number;
2) the next number to insert should be cover+1, since that is the smallest number we cannot form and inserting it maximizes the new coverage.

So our algorithm is the following:
let cover = 0, insert = 0
1. go through the numbers (curr) in the list until cover >= n:
     while curr > cover+1 and cover < n:
       insert cover+1
       update cover = cover + (cover+1)
       increase insert by 1
     update cover = cover + curr
2. while cover < n:
     insert cover+1
     update cover = cover + (cover+1)
     increase insert by 1
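
A compact Java sketch of the two steps above; the two loops collapse into one, and cover is a long to avoid overflow when n is close to Integer.MAX_VALUE:

public int minPatches(int[] nums, int n) {
    long cover = 0;     // every number in [1, cover] can currently be formed
    int patches = 0, i = 0;
    while (cover < n) {
        if (i < nums.length && nums[i] <= cover + 1) {
            cover += nums[i++];   // no gap: extend coverage with an existing number
        } else {
            cover += cover + 1;   // gap: patch with cover+1
            patches++;
        }
    }
    return patches;
}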







Sunday, April 24, 2016

[LeetCode] Verify Preorder Serialization of a Binary Tree

One way to serialize a binary tree is to use pre-order traversal. When we encounter a non-null node, we record the node's value. If it is a null node, we record using a sentinel value such as #.
     _9_
    /   \
   3     2
  / \   / \
 4   1  #  6
/ \ / \   / \
# # # #   # #
For example, the above binary tree can be serialized to the string "9,3,4,#,#,1,#,#,2,#,6,#,#", where # represents a null node.
Given a string of comma separated values, verify whether it is a correct preorder traversal serialization of a binary tree. Find an algorithm without reconstructing the tree.
Each comma separated value in the string must be either an integer or a character '#' representing null pointer.
You may assume that the input format is always valid, for example it could never contain two consecutive commas such as "1,,3".

Imagine we walk through the input string. If we see a number, we know we have to keep going down the tree until we encounter a "#"; if we see "#", we stop going down the tree and return. If we use a recursive function to process the input string, a valid preorder string will be consumed successfully in its entirety, while an invalid one will either terminate the recursion prematurely or run past the end of the string.
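
A minimal Java sketch of this recursive check (the helper name is mine; the entry point matches LeetCode's signature):

private int idx;

public boolean isValidSerialization(String preorder) {
    String[] nodes = preorder.split(",");
    idx = 0;
    // valid iff exactly one whole subtree consumes all of the tokens
    return consume(nodes) && idx == nodes.length;
}

// Consume one subtree; return false if we run out of tokens prematurely.
private boolean consume(String[] nodes) {
    if (idx == nodes.length) return false;      // ran past the end of the string
    if (nodes[idx++].equals("#")) return true;  // null node: stop going down
    return consume(nodes) && consume(nodes);    // descend into left, then right subtree
}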

[LeetCode] LRU Cache

A straightforward implementation of LRUCache is to use the JDK's LinkedHashMap. However, to implement it correctly, overriding its removeEldestEntry() method is not enough.

We need to use the correct LinkedHashMap constructor with the correct arguments to make sure it behaves as an LRU cache. Below is the doc from the JDK:

"A special constructor is provided to create a linked hash map whose order of iteration is the order in which its entries were last accessed, from least-recently accessed to most-recently (access-order). "

And here is the code, which passed the LeetCode tests and beat 90% of Java implementations in terms of runtime.
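
A minimal sketch of that approach, assuming a get/put interface:

import java.util.LinkedHashMap;
import java.util.Map;

class LRUCache {
    private final int capacity;
    private final Map<Integer, Integer> map;

    public LRUCache(int capacity) {
        this.capacity = capacity;
        // accessOrder = true: iteration order is least-recently accessed first
        this.map = new LinkedHashMap<Integer, Integer>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Integer> eldest) {
                return size() > LRUCache.this.capacity;
            }
        };
    }

    public int get(int key) {
        Integer v = map.get(key);   // get() counts as an access in access-order mode
        return v == null ? -1 : v;
    }

    public void put(int key, int value) {
        map.put(key, value);        // eldest entry is evicted automatically
    }
}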

Saturday, April 16, 2016

[LeetCode] House Robber III

Problem:
The thief has found himself a new place for his thievery again. There is only one entrance to this area, called the "root". Besides the root, each house has one and only one parent house. After a tour, the smart thief realized that all houses in this place form a binary tree. It will automatically contact the police if two directly-linked houses are broken into on the same night.
Determine the maximum amount of money the thief can rob tonight without alerting the police.

This is the Maximum Weight Independent Set problem in disguise, which is NP-hard on general graphs. On a tree, however, the maximum weight independent set can be found in linear time.

Given a node, let maxWith(node) denote the max weight with node in the set, maxWithout(node) denote the max weight without node in the set, and max(node) denote the max weight of the subtree rooted at node.

1. if node is in the solution, then maxWith(node) = node.val + sum of maxWithout(child) over node's children
2. if node is not in the solution, then maxWithout(node) = sum of max(child) over node's children

Then max(node) = max of maxWith(node) and maxWithout(node).

This can be done with dynamic programming, starting at the leaves and working up the tree to compute maxWith(), maxWithout(), and max() for each node. To achieve this we use a recursive post-order traversal.
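
A short Java sketch, assuming LeetCode's TreeNode class; the helper runs a post-order traversal and returns the pair {maxWithout(node), maxWith(node)}:

public int rob(TreeNode root) {
    int[] best = dfs(root);
    return Math.max(best[0], best[1]);
}

// post-order traversal: children are computed before their parent
private int[] dfs(TreeNode node) {
    if (node == null) return new int[]{0, 0};
    int[] left = dfs(node.left);
    int[] right = dfs(node.right);
    // node excluded: each child subtree contributes its best, with or without the child
    int without = Math.max(left[0], left[1]) + Math.max(right[0], right[1]);
    // node included: both children must be excluded
    int with = node.val + left[0] + right[0];
    return new int[]{without, with};
}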





Sunday, April 3, 2016

Data encoding, schema evolution and a case study

Data Encoding

When sending data over the wire or storing it on disk, we need a way to encode the data into bytes. There are several ways to do this:

1. Use programming-language-specific serialization such as Java serialization or Python's pickle, which makes developers in other languages frown.

2. Use language-agnostic formats such as XML or JSON; we are particularly drawn to JSON's simplicity. XML is too old school.

However, unlike XML, which has XML Schema, JSON is not a strictly defined format, even with JSON Schema, and it has a few drawbacks that will make you bite your nails. First, it doesn't differentiate integers from floating-point numbers. Second, it is still verbose, because field names and type information have to be explicitly represented in the serialized format. Lastly, the lack of documentation and structure makes consuming data in these formats more challenging, because fields can be arbitrarily added or removed.
Updated: JSON Schema defines seven primitive types for JSON values:
  • array: a JSON array.
  • boolean: a JSON boolean.
  • integer: a JSON number without a fraction or exponent part.
  • number: any JSON number (number includes integer).
  • null: the JSON null value.
  • object: a JSON object.
  • string: a JSON string.

3. Use a binary format.

The IBM mainframe iSeries (AS/400) RPG, for one, uses binary data in a record-based, columnar-like format with predefined data-type and record sizes. But it may waste a lot of space, since every field and record is allocated its predefined size regardless of the actual content.

For example, [3, 27] is a JSON array containing the items 3 and 27. It could be encoded as the long value 2 (the item count, encoded as hex 04), followed by the long values 3 and 27 (encoded as hex 06 and 36), and terminated by a zero byte: 04 06 36 00.
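
Those hex values come from the variable-length zig-zag coding that Avro uses for longs: a signed value n is first mapped to the unsigned value (n << 1) ^ (n >> 63), then written seven bits at a time with the high bit marking continuation. A small illustrative helper (the method name is mine):

import java.io.ByteArrayOutputStream;

static byte[] encodeLong(long n) {
    n = (n << 1) ^ (n >> 63);                  // zig-zag: 2 -> 0x04, 3 -> 0x06, 27 -> 0x36
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((n & ~0x7FL) != 0) {                // more than 7 bits left:
        out.write((int) ((n & 0x7F) | 0x80));  // emit 7 bits with the continuation bit set
        n >>>= 7;
    }
    out.write((int) n);                        // final byte, continuation bit clear
    return out.toByteArray();
}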

4. To describe all of the bytes in the above binary format, we need a document and a contract between producer and consumer, in other words a schema. This is where libraries such as Thrift, Protocol Buffers, and Avro come in.

[Diagram (avro.png): the byte encoding of a record in the Avro format]

The schema plays a pivotal role here and is required during both serialization and deserialization. Because the schema is provided at decoding time, metadata such as field names doesn't have to be explicitly encoded in the data. This makes the binary encoding of Avro data very compact.

Another benefit the Avro schema brings is that for statically typed languages like Java, the serialization and deserialization code can be auto-generated from the schema.

Schema Evolution

The one thing that never changes in software development is CHANGE itself; in this case, the schema changes. Schema evolution defines how the application should behave when the Avro schema changes after data has been written to the store using older versions of the schema.

Schema evolution is about data transformation. The Avro specification has detailed rules for how data is handled when the downstream (reader) schema differs from the upstream (writer) schema: when that happens, the data is transformed from one schema to the other.

Schema evolution is applied only during deserialization. If the reader schema is different from the value's writer schema, then the value is automatically modified during deserialization to conform to the reader schema. To do this, default values are used.

There are three common aspects of schema evolution: backward compatibility, forward compatibility, and full compatibility.

Let's consider two cases: add a field and remove a field.

To add a field and still support schema evolution, a default must be defined:

{"name": "gender", "type": "string", "default": "null"}

When a reader using the old schema encounters this new field, the transformation automatically drops it. This is called forward compatibility.

When a reader using the new schema encounters data written without this new field, the transformation automatically adds the field and sets its value to the declared default ("null" here). This is called backward compatibility.
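
A sketch of the round trip with Avro's generic API; the V1_JSON and V2_JSON schema strings and the field names are placeholders for illustration:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

Schema v1 = new Schema.Parser().parse(V1_JSON);   // old schema, no "gender" field
Schema v2 = new Schema.Parser().parse(V2_JSON);   // new schema, "gender" with a default

// write a record using the old (v1) writer schema
GenericRecord rec = new GenericData.Record(v1);
rec.put("name", "alice");                         // placeholder field
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
new GenericDatumWriter<GenericRecord>(v1).write(rec, enc);
enc.flush();

// read it back with v1 as the writer schema and v2 as the reader schema:
// the missing "gender" field is filled in with the default declared in v2
BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord decoded = new GenericDatumReader<GenericRecord>(v1, v2).read(null, dec);
System.out.println(decoded.get("gender"));        // prints the default, "null"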

The same thing applies to remove case, which I leave it as an exercise to the readers.

Enum evolution

The intriguing part of schema evolution is enum evolution. The following line is from the Avro specification:
  • if the writer's symbol is not present in the reader's enum, then an error is signalled.
This makes schema evolution for enums practically impossible, because we often need to add new values to an enum field. It turned out to be a long-standing issue on the Avro JIRA board: https://issues.apache.org/jira/browse/AVRO-1340. The proposal is actually simple:

"Use the default option to refer to one of enum values so that when a old reader encounters a enum ordinal it does not recognize, it can default to the optional schema provided one."

Basically, always define a default value for an enum, like below:

{"type": "enum", "name": "color", "symbols": ["UNKNOWN", "RED", "GREEN", "BLUE"], "default": "UNKNOWN"}
Lost in translation


One interesting side effect of schema evolution is shown in the example below.

Say we have two versions of a schema, v1 and v2, where v2 has a new field called 'color' with a default value of 'UNKNOWN'.

A v2 writer creates a doc with color = RED.
A v1 reader parses this data; the color field is dropped, and the reader updates the data store.
A v2 reader parses that data; the color field is restored, with its value set to 'UNKNOWN'!
 
Avro naming convention

Records, enums and fixed are named types. The name portion of a fullname, record field names, and enum symbols must:
  • start with [A-Za-z_]
  • subsequently contain only [A-Za-z0-9_]
The reason is that when the code-generation tool generates code from a schema, record and enum names become identifiers, and in most languages, including Java, identifiers cannot contain dashes.

A Case Study

Armed with the above knowledge, let's do a case study. Say we have a data store consisting of Apache Gora and a Kafka schema registry, which lets us store data sets and register schemas to support schema evolution; the data store sends every new or updated data set to a Kafka pipe, from which downstream consumers can consume it. Team A works on app a, which collects data and deposits it into the data store. Team B works on app b, which listens to the pipe, fetches the data from the data store, transforms it, and puts it back into the data store. Team A's app a also needs to read this data back, make changes, and put it back again. However, Team B's changes should be preserved as much as possible, since app b is the more downstream application in this pipeline.

Approach 1:

App A has its own schema a; app B has its own schema b.

At the beginning, app A has readers/writers for schemas a.1 and b.1, and app B has readers/writers for schemas a.1 and b.1.

Consider this case: Team B adds new fields to its schema b.1, updating the schema to b.2 and its reader/writer to b.2. App A still reads the data with its reader/writer for b.1, so the new fields are dropped, and it writes the result back with a.1. App B then reads the updated data from a.1, now without the new fields, into b.2; app B either won't be able to write this data back with b.2 or has to implement data transformation logic itself to set the missing values to their defaults.

The case of dropping fields is similar. App A and app B are loosely coupled, but we lose the out-of-the-box schema evolution benefits of the platform. Oftentimes the teams end up coordinating with each other to update the two schemas just to avoid this 'lost in translation' case.

Approach 2:

App A has its own schema but defines the payload as a JSON string that actually conforms to schema b, carried along with schema b's version number, which app B can retrieve and deposit to the data store.

When app B reads the data from app A, it reads not only the payload but also the version number, so it can use the correct writer (with the right version, b.1) to deposit the data to the store without losing any original data. This approach lets the two applications evolve around the same schema; to avoid the 'lost in translation' case, team A can always use the latest version of schema b.

The drawback of this approach is that we need to make sure the JSON payload validates against schema b.1, and we now store the data as JSON, which loses the benefits of Avro binary serialization.


References:
 
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Friday, April 1, 2016

[LeetCode] Palindrome Linked List

This is a classic LeetCode problem: https://leetcode.com/problems/palindrome-linked-list/

Given a singly linked list, determine if it is a palindrome.
A simple solution is to go through the linked list and push every node's value onto a stack, then loop through the stack and the linked list at the same time, comparing values.

public boolean isPalindrome(ListNode head) {
    // push every value onto the stack
    Stack<Integer> s = new Stack<>();
    ListNode n = head;
    while (n != null) {
        s.push(n.val);
        n = n.next;
    }
    // popping yields the values in reverse order; compare with a forward pass
    while (head != null) {
        if (head.val != s.pop())
            return false;
        head = head.next;
    }
    return true;
}



Follow-up:
Could you do it in O(n) time and O(1) space?
This can be done by reversing half of the linked list: first find the middle of the list, then reverse the second half, and finally compare the first half from the head with the reversed second half from the tail.
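
A Java sketch of this follow-up; it finds the middle with slow/fast pointers and reverses the second half in place (restoring the list afterwards is omitted):

public boolean isPalindrome(ListNode head) {
    // find the middle: slow ends at the start of the second half
    ListNode slow = head, fast = head;
    while (fast != null && fast.next != null) {
        slow = slow.next;
        fast = fast.next.next;
    }
    // reverse the second half
    ListNode prev = null;
    while (slow != null) {
        ListNode next = slow.next;
        slow.next = prev;
        prev = slow;
        slow = next;
    }
    // compare the first half (from head) with the reversed second half
    while (prev != null) {
        if (head.val != prev.val) return false;
        head = head.next;
        prev = prev.next;
    }
    return true;
}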