Primitive BWT DEMO in C# bwt.cs
MSufSort DEMO in C++ msufsort.cpp



Suffix sort demo

Burrows-Wheeler Transforms (BWT) opened a competition in writing fast suffix sort algorithms. We skip here the explanation of BWT because it can be easy found on-line and explain the matter of suffix sort.

In the suffix sort we need to sort indexes of all possible circular buffers in a given text (for example, ABRACADABRA)

ABRACADABRA
1234567891011


In this example the first circular buffer in sorted list will be AABRACADABR that starts at position 11, followed by ABRAABRACAD that starts at 8. I think the example sets the problem clear. The circular buffer is formed by reading given text to the end and continue from the beginning until all text is read. It is completely determined by starting index. In order to find which circular buffer precedes which, we need to make comparison of all symbols involved, in the same way how we determine position of the word in a dictionary.

QSORT technique may be applied with one stipulation that comparisson is made not for a single symbol but for all sequentual identical symbols. Such comparisson makes drammatical slow down in QSORT algorithm. In some text files it may take 50 sec. to sort 1 MB of data on regular lap-top computer. The example can be seen in the link above bwt.cs . To expedite QSORT some tricks can be applied:

  • the sorting applied within individual 2 byte context
  • when two suffixes are swapped, the suffixes of all followed by them identical symbols are swapped also


  • I think the first trick is clear. In the word ABRACADABRA we can see 2 combinations of BR. We sort separately suffixes for each individual combination, that means for BR we sort only indexes 4 and 11. For AB we sort 3 and 10 and so on.

    Second trick is called induced sorting. We can see that circular buffer BRAABRACADA (index 9) precedes BRACADABRAA (index 2) that means that 10 precedes 3 and 11 precedes 1. For this example final suffix sort will represent array of indexes 11,8,1,4,6,9,2,5,7,10,3.

    The example of fast suffix sort is shown in the second link msufsort.cpp . It is highly simplified version of original code, found on-line, for the purpose of better understanding but still preserving original algorithm. I advice to use it for educational purpose only and find complete version for practical usage. Although the main concept may be clear from my brief explanation, the details of the implementation can not be learned so quick. I can address reader to the article for more information.