Asymmetric Numeral System

The source code with a demo is available: ANSCoder

Asymmetric Numeral System was introduced by Jarek Duda for several different purposes; here we consider only the data compression aspect. Those who want more information can read Jarek's latest article. The first demonstration of an ANS coder was published by the author at Wolfram in March 2008, which means it is not likely that this method will ever be patented: the time for filing a patent is running out, and the author has confirmed several times that he has no intent to patent this method.

According to online news, this coding approach is used in the ZSTANDARD compressor developed at Facebook.

Encoding/decoding table

state    A    B    C
    1    2    3    5
    2    4    6   10
    3    7    8   15
    4    9   11   20
    5   12   14   25
    6   13   17   30
    7   16   21
    8   18   22
    9   19   26
   10   23   28
   11   24
   12   27
   13   29
   14   31


According to the method we need to create a single lookup table, as shown above. Encoding may start from any row and from any symbol of the alphabet (A, B, C). If the first symbol in the message is B and the first row is 1, we take the new row number from the crossing of row 1 and column B, which is 3, and move to that row. If the next symbol is A, we read the next row number from the crossing of column A and row 3. This number is 7, and we can go on until we reach the end of the table.
Decoding is as elementary as encoding. Having any number, let's say 16, we can extract symbol A and the previous row 7. Moving back to the 7th row of the table we extract another symbol A and another row, 3. Finally, from row 3 we extract B and row 1, which is the top of the table and the end of the message (a small sketch of this table walk follows the list below). Successful encoding and decoding are possible if
  • The left column of the table contains all row numbers in sequence.
  • All numbers in columns A, B, C are unique (no repetitions).
  • Every number in a row is larger than that row's number.
  • The numbers in each column are sorted in increasing order.
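
As a rough illustration, here is a minimal Python sketch of this table walk, with no bit output yet. The dictionary columns holds the three columns of the table above; the names walk_encode and walk_decode are ours, not taken from ANSCoder, and the sketch assumes encoding starts at row 1.

columns = {'A': [2, 4, 7, 9, 12, 13, 16, 18, 19, 23, 24, 27, 29, 31],
           'B': [3, 6, 8, 11, 14, 17, 21, 22, 26, 28],
           'C': [5, 10, 15, 20, 25, 30]}
# row -> value for encoding, value -> (symbol, row) for decoding
encode_table = {s: {r: v for r, v in enumerate(col, 1)} for s, col in columns.items()}
decode_table = {v: (s, r) for s, col in columns.items() for r, v in enumerate(col, 1)}

def walk_encode(message, start_row=1):
    state = start_row
    for symbol in message:
        state = encode_table[symbol][state]   # crossing of current row and symbol column
    return state

def walk_decode(state, stop_row=1):
    symbols = []
    while state != stop_row:
        symbol, state = decode_table[state]   # table value -> (symbol, previous row)
        symbols.append(symbol)
    return ''.join(reversed(symbols))         # symbols are recovered in reverse order

print(walk_encode("BAA"))   # 1 -> 3 -> 7 -> 16
print(walk_decode(16))      # BAA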
The row number obtained after every step is called the state. Having the state, we can restore the message. This encoder becomes an entropy encoder when the numbers in the columns are selected so that dividing the row number by the value in the column approximates the probability of occurrence of the corresponding symbol. In this table, if we divide the row number by any element in column A we get approximately 0.45; the ratio for B is 0.35 and for C is 0.2. The state grows on each step roughly according to the formula state[i] = state[i-1] / P{symbol}, which means that at the end of encoding we have a number of magnitude about 1 / (P{B} * P{A} * P{A} ...), whose length in bits is, obviously, in accordance with the entropy.
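As a rough numerical check, take the three-symbol message B, A, A from the example above: 1 / (0.35 * 0.45 * 0.45) is about 14, and the actual final state was 16. Its length, log2(16) = 4 bits, is close to the entropy of the message, -log2(0.35) - 2*log2(0.45), which is about 3.8 bits.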
The only part that is missing in this brief introduction is how to manage the state when we reach the bottom of the table. That can be read in the article, which also contains the ANSCoder example. In addition to what is already explained, we need to manage the output of bits during encoding in a way that is synchronized with decoding. Let us get back to our example. This concept is close or similar to the one used in the original sample of an ANS coder written by Jarek Duda. According to the concept, before every step the decoder reads a certain number of bits to bring the state buffer back to the same size. As you can see from the table, in decoding the state can become smaller on every step, so if the decoder always reads exactly the number of bits needed to restore that size, we never make a wrong choice and decoding becomes very simple: read bits, process a symbol, read bits, process a symbol, and so on.
Let us show the decoding process before we explain the encoding one. Presume we have the following binary expression to decode: 11000011110. According to the main property of the ANS coder, we decode backwards. Based on the size of our table we set the size of the state buffer to 5 bits. On the first step we read the last 5 bits of the encoded expression shown above, which is 11110, and address the table with this number, which is 30. We extract symbol C and obtain the new state 6, or 110 in binary. In order to bring our state buffer back to 5 bits we read another 2 bits from the expression and get 11000, which is 24. For state=24 we extract symbol A and make the new state 11, which is 1011 in binary. We read one more bit and get 10110, which is 22. Then we do the same thing: we extract B, get state=8, read one more bit, get 16, extract A, get 7, read the last two bits and get the end-of-message marker 11111. We know that this is the end of the message because there are no more bits to read and the state matches the default marker.
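
Below is a small Python sketch of this decoding walk, assuming a 5-bit state buffer and the end-of-message marker 11111; the table is rebuilt from the same column lists as in the earlier sketch. The name decode_bits is ours, and the sketch is tied to a well-formed stream that ends at the marker rather than being a general-purpose decoder.

columns = {'A': [2, 4, 7, 9, 12, 13, 16, 18, 19, 23, 24, 27, 29, 31],
           'B': [3, 6, 8, 11, 14, 17, 21, 22, 26, 28],
           'C': [5, 10, 15, 20, 25, 30]}
decode_table = {v: (s, r) for s, col in columns.items() for r, v in enumerate(col, 1)}

def decode_bits(bits, state_bits=5, marker=0b11111):
    state = int(bits[-state_bits:], 2)        # the last 5 bits form the initial state
    pos = len(bits) - state_bits              # number of bits still unread
    symbols = []
    while True:
        symbol, state = decode_table[state]   # extract a symbol, fall back to its row
        symbols.append(symbol)
        # refill the state buffer back to 5 bits, reading the stream backwards
        while state < (1 << (state_bits - 1)) and pos > 0:
            pos -= 1
            state = (state << 1) | int(bits[pos])
        if pos == 0 and state == marker:      # all bits consumed, marker reached
            break
    return ''.join(reversed(symbols))

print(decode_bits("11000011110"))   # ABAC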
Encoding process

state before output of bits    31   16   22   24
output bits                    11    0    0   00
state after bit output          7    8   11    6
processed symbol                A    B    A    C
state after processing         16   22   24   30
The decoding process is much simpler than the encoding. In encoding we need to output bits in a way that supports the simple rule we used in decoding. The encoding process is shown in the table above. We can start from any 5-bit number, but we prefer the distinguished end/start marker 31 (11111). The concept for encoding is a little bit more complicated: we output bits before processing a symbol. As you can see, we cannot process 31, because there is no row number 31, so we output as many bits as necessary in order to process the symbol. We can find that out because we know which symbol we are processing. It is A, and we need to output the two least significant bits, which are 11. The whole process is shown in the table: the number before we output bits is 31, below it are the two bits that we output to the stream, and the result after output is state=7. We process 7 with symbol A and get new state=16. The next symbol is B. According to the table we cannot process B with state=16, so we output one least significant bit, 0, and make the new state 8. After processing B at state=8 we get 22, and so on. We have to output least significant bits because we need to reduce the number. After we process our very last symbol C we have state=30 and we output it completely, so our output buffer matches the binary expression 11000011110 that we decoded above.
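
For comparison, here is a matching Python sketch of the encoding side under the same assumptions (5-bit state buffer, start marker 31); encode_bits is our illustrative name. It pushes least significant bits out until the state fits into the column of the current symbol, then steps through the table, and finally flushes the last state as five plain bits.

columns = {'A': [2, 4, 7, 9, 12, 13, 16, 18, 19, 23, 24, 27, 29, 31],
           'B': [3, 6, 8, 11, 14, 17, 21, 22, 26, 28],
           'C': [5, 10, 15, 20, 25, 30]}
encode_table = {s: {r: v for r, v in enumerate(col, 1)} for s, col in columns.items()}

def encode_bits(message, start_marker=0b11111, state_bits=5):
    state = start_marker
    out = []
    for symbol in message:
        max_row = len(columns[symbol])        # largest row with an entry for this symbol
        while state > max_row:                # output least significant bits until we can step
            out.append(str(state & 1))
            state >>= 1
        state = encode_table[symbol][state]   # step through the table
    out.append(format(state, '0{}b'.format(state_bits)))   # flush the final state
    return ''.join(out)

print(encode_bits("ABAC"))   # 11000011110

The bits emitted first end up at the front of the stream, which is why the decoder in the sketch above reads the stream from the end toward the beginning.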

The code was written in May 2008. This explanation was added and edited some time later. Andrew Polar.