Anatomy of ROLZ data archiver

The text of DEMO program	iRolz
The sample used for the test	iRolz2
C# version	iROLZ.cs
C# version with file type signature and CRC	iROLZCRC.cs

I appreciate work of John Coe for conducting exhaustive testing and providing feedback regarding improvement and optimization of code.

In the long list of different approaches to data compression one concept has proven to be a leading tool – it is combination of Lempel-Ziv algorithm with context predictive arithmetic coding. There is no shortage of sites explaining concept of LZ algorithm, however, fast and effective implementation is significantly different from it. One modification of LZ is called ROLZ (Reduced Offset Lempel-Ziv). It allows to achieve compression and performance near best compressors in the industry, while being only near 500 lines of portable C++ code valid for any operating system. The code can be easy converted to Java and used in mobile devices such as Blackberry.

Anatomy of ROLZ data archiver

Introduction

In 1977 Abraham Lempel and Jacob Ziv introduced algorithm for text compression where repeated fragments are replaced by couples OFFSET, LENGTH. For example, the fragment

constructors and destructors

may be presented as

constructors and de,16,9,

where numbers 16 and 9 instruct decoder to step back on 16 symbols and output 9 symbols, representing earlier occurred fragment.

In this example we can tell couple OFFSET, LENGTH from other symbols that are called LITERALS, but in generic case it represents the technical problem. This is easiest of all problems that developer may face implementing LZ and usually solved by introducing a flag that separates literals from OFFSET, LENGTH. It is either one bit flag 0 or 1 before LITERAL or OFFSET, or other format solution. There are some elementary programming tricks to avoid inconvenience of outputting individual bits such as introducing one additional byte as a format chart for next 8 bytes. For example, if this format byte is 00000100 it tells that next 5 bytes are literals and then we have OFFSET, LENGTH couple and another two literals after them. Other two technical problems are dramatically more complicated. One of them is how to find these repeated fragments. Simple scanning of processed data needs unacceptably long time. The other problem is long OFFSET values. The compression becomes effective when repeated fragments are searched within at least 65536 preceding bytes, that are called SEARCH BUFFER. Given the size of the SEARCH BUFFER the OFFSET numbers look like 47689, 32789, 11097, ... and so on, with very rare repetition and very low correlation, so they can not be compressed significantly. There are many modifications of LZ algorithm dedicated to solution of these two mentioned above problems. Some of them can be seen in this article but ROLZ is not there. In ROLZ offsets are presented as very small numbers such as 0,1,7,10,..., that are even smaller than literals by the price of insignificant increase in the number of literals.

Reduced Offset Lempel-Ziv

The first closest to ROLZ concept was proposed by Ross Williams in 1991. The presumed inventor of modern ROLZ concept is Malcolm Taylor. The evidence is on-line discussion, where Taylor answers questions regarding ROLZ. First presumed implementation is WinRK archiver, the code for which is not disclosed. First open source implementation of ROLZ is archiver named BALZ by Ilia Muraviev. His near 500 lines program provides compression close to 7-zip. The core part of 7-zip is called LZMA and implemented by near 5000 lines. The other known implementation of ROLZ is called RZM . Source code is not disclosed.

In this article both mine and Muraviev solutions are explained. They are close in compression ratio and performance but algorithmically different. Since many researchers were experimenting with LZ77 and some were developing commercial software, where they were bounded by non-disclosure agreement, I can not guarantee the novelty of my algorithm. I can only assure that I did not see my implementation so far in other sources. I called my implementation iROLZ.

iROLZ

The concept of iROLZ is to make self addressing array, each element of which contains location of occurrence of the previous identical word as it is shown in the table

#	word	previous location
0	0	0
1	0	0
2	con	0
3	ons	0
4	nst	0
5	str	0
6	tru	0
7	ruc	0
8	uct	0
9	cto	0
10	tor	0
11	ors	0
12	rs_	0
13	s_a	0
14	_an	0
15	and	0
16	nd_	0
17	d_d	0
18	_de	0
19	des	0
20	est	0
21	str	5
22	tru	6
23	ruc	7
24	uct	8
25	cto	9
26	tor	10
27	ors	11

The table is sort of chart shown for explanation only. The self-addressing array used in the program contains elements sitting in column named 'previous location'. For example, we can see that the word CTO is sitting on positions 25 and 9 in considered string. If we address to 25th element we find 9 - the address of the previous location of the same word. When element equals 0 it means that no more occurrences is expected. When we have array of integers similar to column named 'previous location', we can quickly find all occurrences of the current word in the recent history or SEARCH BUFFER. Each non zero entry in self-addressing array brings pointer to a new location of occurred word. To complete the search we need to find the length of matched fragment and output the pointer to longest match. This is where we can save space for offset. Since the SEARCH BUFFER must be identical for encoding and decoding we may output not the offset itself but the number of steps we need to go back in self-addressing array. For example, if the longest match for word CTO from position 25 is on position 9 we output 0, since we need to make only one step back, and not actual offset equal to 16.
The population of self-addressed array is quick and not algorithmically challenging. We populate array while encoding or decoding data. As it can be seen, when element 25 is being processed and the word CTO is found, we need to enter number 9, which is last location for the word CTO. That can be easy arranged over auxiliary array that is called LAST POSITION LOOKUP, that records data as function of 3 bytes L[CTO]=9 or 2 bytes L[CT]=9 or using elementary hash to reduce 3 bytes to a reasonable length. When we run on word CTO we address to array L[CTO] and find 9. This number 9 goes to element 25 and number 25 goes to L[CTO]=25.
For better understanding of this part I provide simple DEMO of the class Dictionary that performs most important part of the algorithm. In the DEMO the size of the prefix word can be chosen as 2 or 3 bytes. The self-addressing array can be presented as circular buffer with size limited to only 65536 elements. Same size can be used for last position lookup. In suggested dictionary DEMO user can find out that circular buffer provides correct search for SEARCH BUFFER of twice larger than self-addressing size. There is additional bonus in circular buffer. When data is written to the end of circular buffer and started from the beginning overwriting it, we may have different words overlapping, and we have to interrupt search when found that new step brings us to a different word. However, we do not need to interrupt the search because we may simply find the random match. For example, we are searching matches for word CTO and make several steps back. Due to overwriting the array we may step back to location of other word and unexpectedly find the good match. We may keep and output it, because in decoding this self-addressing array will be identical and we can use same match. When I observed this behavior I decided that this is a bug but later I found that it is simple bonus that adds a little-bit to compression. The core part of maintaining the dictionaly is amazingly elementary and shown in the code snippet below

In encoding we pass each new byte to the function updateDictionary, after which we can call function getNextPosition that returns positions of the previous occurrences of the searched 2 byte or 3 byte word. In decoding we pass each new byte in the same way to updateDictionary and, when OFFSET, LENGTH is found, call getNextPosition the same number of times to extract required position. Needless to say that we save number of steps and not the actual OFFSETS.

It may look like we need to output at least 2 additional literals, because in decoding we need to have this word in order to start looking it in a dictionary. For example, if the matched fragment is ABCDEFGHI, AB must be left as literals. That is not true for every couple. We may have OFFSET, LENGTH couples following each other. When expand one couple, we may have last two symbols in expanded expression as this prefix word that we use to look an expression in a dictionary. For executables we normally have number of literals 4 times larger than number of couples, for text files number of couples is typically 10 times larger than literals and number of literals between the couples is frequently less than 2.

iROLZ Implementation Details

For better understanding of iROLZ algorith I publish intermediate skeleton iROLZ application. It does not have entropy coder. It provides self test by converting original data to array of short integers with literals and offset, length couples and back.

ROLZ implemented by Ilia Muraviev

I introduced my iROLZ not because something is wrong with Muraviev ROLZ but for a simple diversity and because my combination of two arrays is more memory efficient, at least theoretically. In his implementation he introduced two dimensional array BUF[WORD][INDEX], where first index is used for the word and second index is used to hold occurrences of this word. The array element is the position of a particular occurrence. For example, after word STR occurred 2 times on positions 5 and 21 this BUF[][] will have values

BUF[STR][0] = 21; BUF[STR][1] = 5;

When word STR occur for the next time we address to BUF[STR][i], find previous positions 21 and 5 and check if long enough match available. This is so-called big picture or concept of algorithm implemented in his program. There are some details. The first index of BUF[][] is actually 2 bytes and not changeable in the program. In my implementation user has option to change it to 1,2 or 3. He also saves short (7 bit) hash of first 3 bytes in order to expedite execution. For example, along with number 21, first element of BUF[STR][0] will contain hash(data[22], data[23], data[24]). Presume we see next occurrence of STR on position 126, then we find hash(data[127], data[128], data[129]) and from comparing hashes we quickly decide if there is need for further processing of this couple of fragments. This simple programming trick reduce overall processing time almost twice. There are also some optimization details that we do not discuss here. It can be figured out from analyzing his code.

The difference from classic LZ

First difference is larger number of LITERALS. When we compare classic LZ encoding

constructors and de,16,9,

to ROLZ encoding

constructors and dest,0,7.

we can see these 2 (or 3) extra literals preceding OFFSET, LENGTH couple. However the expression

constructors and destructors and decomposition,

may be encoded by ROLZ as

constructors and dest,0,7,0,7,composition,

because, after expanding first couple, we have word ORS preceding expression _AND_DE. Indexes are 0 because we need to make only one step back in self-addressed array to find required position.

Using context binary coder for compression of ROLZ output

When data is processed by iROLZ we have 4 types of integers: LITERALS, OFFSET INDEXES, LENGTHS and FLAG. FLAG separates LITERALS from other data. First decision has to be made regarding format and usage of FLAG. In BALZ literals and lengths are mixed up together in 9 bit integers. Literals are all below 256 and lengths take values from minimum to 255 but output as values between 256 and 511. In this way they can be told from each other. I introduced one bit flag before literal as 0 and one bit flag before index as 1, so my data looks like follows:

0, LITERAL, 0, LITERAL, 1, INDEX, LENGTH, 0, LITERAL, ... and so on,

The convenient solution for encoding this mixed stream of data is binary arithmetic coder. At the current moment there are already a lot of open source adaptive binary coders, so there is no need of writing your own. There are context properties in LITERALS, in flags and in LENGTHs, each value of them statistically related to previous values. The context properties in INDEXes are not observed. For a quick DEMO I used adaptive binary coder of Matt Mahoney and bit predictor of Ilia Muraviev. For quick test of adaptive binary arithmetic coder I extracted this coder from BALZ by removing ROLZ from the code. It can be found in BALZnoROLZ application. I also added another optional 2 bytes model for contextual modeling of a bit that improved compression on approximately 10 percent.

Testing results

I compared iROLZ to several quick practical encoders: RAR, 7-zip, BALZ 1.15, bzip2, Stuffit. The encoding time and compression ratio varies insignificantly. The testing file sample is taken from Maximum Compression. The compression ratios are shown relatively to original size. The result slightly depends on the set of parameters on the top of the program file. To be concrete the result in the table is shown for the version named iROLZ2.

Encoder\File	A10.jpg	acrord32.exe	english.dic	FlashMX.pdf	fp.log	mso97.dll	ohs.doc	rafale.bmp	vcfiu.hlp	world95.txt
RAR	1.001	0.360	0.245	0.830	0.044	0.491	0.197	0.297	0.172	0.210
iRolz 2	0.991	0.398	0.221	0.835	0.038	0.523	0.202	0.263	0.177	0.223
zpaq v 4.02	0.985	0.438	0.310	0.830	0.029	0.534	0.207	0.212	0.177	0.170
Stuffit 14.0.0.18 option = optimal	0.761	0.377	0.271	0.745	0.030	0.517	0.208	0.131	0.159	0.180
bzip2	0.992	0.439	0.300	0.841	0.035	0.558	0.217	0.214	0.173	0.193
BALZ	0.992	0.378	0.209	0.831	0.032	0.495	0.199	0.253	0.169	0.207
LZMA 4.65 = 7-zip	1.004	0.363	0.209	0.818	0.040	0.480	0.189	0.234	0.147	0.190
7zip 9.16 beta	1.004	0.325	0.209	0.818	0.040	0.431	0.189	0.235	0.148	0.191

Code for RAR and Stuffit is not disclosed. UnRAR - RAR decoder for Windows is about 16,000 lines, but it includes archiving of directory tree and encryption. The file compression part is about 13,000 lines. Core part of LZMA or 7-zip is about 5000 lines, bzip2 is about 7000 lines, BALZ and iROLZ is about 500 lines. Stuffit actually specializes in re-compression of already compressed data. This explains better result for JPG and PDF. The execution speed can be compared on Large Text Compression Benchmark. I can also mention FlashziP. I did not find source code but algorithm is described by author as ROLZ. Testing showed size and performance near BALZ and RAR.

Room for optimization

Presume we have data fragment as shown below

12345...34567890...1234567890

If we process last third fragment we find couple (OFFSET1, 5) and then couple (OFFSET2, 5), however the better solution might be one couple (OFFSET3, 8) with symbols 12 left as literals. Big number of short couples reduce overall compression because some literals are compressed well enough. Like in this example, literals 12 are repeated and on that reason may be compressed well.

Generically the problem can be defined as dynamic programming and presented as a logical tree. In each node we have a data value, which we can output as literal or, in case of match, as couple OFFSET INDEX, LENGTH. Some literals may not have a match and be output as literals only. That makes each node to have either one or two branches. Following branches we can reach the end of the data sample and get the compression ratio. Obviously, we interested in finding the path to the best compression. Needless to say that testing all possible paths is computationally expensive and not possible, but some simple selection criteria may be considered. One particular improvement can be done using 2 dictionaries instead of one. When making instance of dictionary class we specify length of the SEARCH BUFFER and number of prefix symbols. Depending on these parameters different dictionary instances may find different matches. We may preserve the best match with adding flag indicating for decoder which dictionary to use. Experiment with 2 dictionaries showed stable 6 to 8 percent improvement with very small increase of processing time. The coding sample for two dictionary iROLZ is available.

Room for artificial intelligence

A really interesting opportunity is available due to a fundamental property of the dictionary. Any symbol passed to dictionary can be in the same way changed in encoding and decoding without destroying the data integrity. For example, you can pass 0 instead of each literal in both encoding and decoding and see that round trip is OK. That operation, unfortunately, does not do any good. The replacement of symbols passing to dictionary must be an intelligent process. I can show all benefits on example. Presume we have 2 fragments following in data on short distances from each other:
ABCD0FGHIJ... ABCDEFGHIJ...
If we pass symbol E to the dictionary instead of 0, while processing the first fragment, we achieve much better compression of the second fragment. In this matter we rather pass to the dictionary data that possibly will be seen than those that actually occurred. This is what intelligent process must accomplish. The class Dictionary should be slightly changed in this case. It should hold another circular buffer with data passed to dictionary and that buffer has to be used as SEARCH BUFFER. Such improvement is already done and this program is called iROLZStream. The core part is class DictionarySteam that process data in encoding and decoding in a streaming way, holding SEARCH_BUFFER as class member in form of limited in size circular buffer. It can process data continuously without hitting the limit in buffer size and stream data from source to receiver in a real-time client-server application. Besides obvious advantage of streaming processing of data it can be used as a tool for experimenting with artificial intelligence. It can be seen that any extra symbols can be added, altered or skipped when populating the dictionary. If these operations are done in the same way in encoding and decoding the round trip is successful. As it already mentioned, it is new unresearched area. In previously known LZ modifications SEARCH BUFFER represented past data, it was never altered to increase the luck in looking for the longest match. One of the possible strategies in modification of SEARCH BUFFER can be STATISTICAL RESTUFFING. The class member array m_search_buffer is scanned in function getLongestMatch , which allow to make histogram for every scanned symbol. When data in the SEARCH BUFFER are going to be replaced by new data the frequently scanned symbol chains are going to be repopulated into dictionary. Interesting that this STATISTICAL RESTUFFING does not require preprocessing of all data or moving backwards like in optimal parsing.

Andrew Polar, July, 2010.
Recent version Sept, 2010.