#97051 - 2002-12-18 09:09 PM
KiXGolf: Burrows-Wheeler Transform
|
Sealeopard
KiX Master
Registered: 2001-04-25
Posts: 11164
Loc: Boston, MA, USA
|
Update: Code now has case-sensitivity enabled to eliminate potential problems in sorting algorithms.
============================= The Burrows-Wheeler Transform =============================
This month we are going to decode a string mangled with the Burrows Wheeler Transform. Please don't be scared by the somewhat long motivation that follows, even if you are a beginner. At the end you will be shown a few pretty simple algorithms.
Download the complete package at http://people.bu.edu/jenmeyer/kixtart/kixgolf_bwt.zip. You MUST use this package as it includes a UDF that will be used to score your entry automatically.It also includes a printable PDF document outlining the Burrows-Wheeler Transform. Please note the Special Recognition section further down in the desciption.
========== Motivation ==========
The Burrows-Wheeler Transform (BWT) is a method that takes a block of text as input and rearranges it using a sorting algorithm. The unexpected thing is that the transform is reversible (which you normally don't expect. "sort" rarely comes with an "unsort"). The result of the transform is a block of text where consecutive letters are very often the same, so it can be easily compressed with even very naive algorithms. The BZIP program is based on this. So how does it work ? It's pretty easy in fact. Let's assume the input text is "DRDOBBS". Write the string and all rotated versions under each other like this: quote:
D R D O B B S (row 0) R D O B B S D (row 1) D O B B S D R (row 2) O B B S D R D (row 3) B B S D R D O (row 4) B S D R D O B (row 5) S D R D O B B (row 6)
Next sort rows: quote:
B B S D R D O (row 4) B S D R D O B (row 5) D O B B S D R (row 2) D R D O B B S (row 0) O B B S D R D (row 3) R D O B B S D (row 1) S D R D O B B (row 6)
And finally, write down the last column: "OBRSDDB". Also note that the D in the last column of the second-last row (that's the new row 5, or the old row 1) is the D that starts the word "DRDOBBS". These two together are the result of the transform: "OBRSDDB" and 5. It's pretty unexpected that this transform is reversible. One way to do that is to recover the (row n) lines in the sorted table. Let's first try to recover the sorted table a bit. We obviously already know: quote:
. . . . . . O . . . . . . B . . . . . . R . . . . . . S . . . . . . D . . . . . . D (row 1) . . . . . . B
In fact, we also know the first column, because it's just the letters in the last column, but sorted: quote:
B . . . . . O B . . . . . B D . . . . . R D . . . . . S O . . . . . D R . . . . . D (row 1) S . . . . . B
But this helps a lot ! From the row labeled with "(row 1)" we now know that R follows that D (remember that the strings were rotated, so the letter in the first column follows the letter in the last column). So if we know that the R is the first letter in row 1, we also know it's the last letter in row 2. So we can now extend our table to: quote:
B . . . . . O B . . . . . B D . . . . . R (row 2) D . . . . . S O . . . . . D R . . . . . D (row 1) S . . . . . B
Now repeat the same reasoning for the row labeled "(row 2)" We see that the letter after R must be a D which corresponds to row 3. We already have one of the D's labeled, so it must be the other one. And let's repeat this process: quote:
B . . . . . O (row 4) B . . . . . B D . . . . . R (row 2) D . . . . . S O . . . . . D (row 3) R . . . . . D (row 1) S . . . . . B
Oops, here we run into a problem. We know from the row labeled "(row 4)" that the letter after O is a B, but there are two of them in the last column. After which one should we write "(row 5)" ? To answer that, consider the letters that follow each B. By looking at a B in the last column and looking at the corresponding letter in the first column, we know they are B and S respectively. And since all rows are sorted, we in fact know: quote:
B B . . . . O (row 4) B S . . . . B D . . . . . R (row 2) D . . . . . S O . . . . . D (row 3) R . . . . . D (row 1) S . . . . . B
This B and that S in the second column of course also occur in the first column somewhere (on the second and last row in fact), and since the first column is also sorted, they occur in the same order, B first (on the second row) and the S later (on the last row), and they each have a B in their last column). So after this subtle bit of reasoning we get to a very simple rule: if you have repeated letters in the first row (the B in BB and BS), they correspond to that same letter in the last row in exactly the same order.
So the lines in a picture like this will never cross:
The B in the first column of row one corresponds to the B in the last column of the second row, and the B in the first column of the second row corresponds to the B in the last column of the last row. Let's go back to our question. We had O which was followed by B due to the top row. This is the first B in column 1, so it corresponds to the first B in the last column, the one on the second row. Therefore that's the one we must label with "(row 5)". So we get: quote:
B . . . . . O (row 4) B . . . . . B (row 5) D . . . . . R (row 2) D . . . . . S O . . . . . D (row 3) R . . . . . D (row 1) S . . . . . B
The rest is simple (don't forget that after row 6 we get row 0, not row 7): quote:
B . . . . . O (row 4) B . . . . . B (row 5) D . . . . . R (row 2) D . . . . . S (row 0) O . . . . . D (row 3) R . . . . . D (row 1) S . . . . . B (row 6)
And finally we write down the first letters from the rows labeled "(row 0)", "(row 1)", ..., "(row 6)" and we get "DRDOBBS"
========== Algorithms ==========
=========== Algorithm 1 ===========
Wow, that was hard ! Or was it ? Let's redo the exercise using the special rule we found: "All occurrences of the same letter in the first column will correspond to all occurrences of that letter in the last column in the same order" (see the picture where this rule is demonstrated if the letter is B). Then we can start with the encoded string with below it the sorted string and position 5 marked: OBRSDDB BBDDORS In the sorted string at the bottom we see the first R marked. Now mark the first R in the top string (and the D below it): OBRSDDB BBDDORS So now we have the first D below marked. Now mark the first D at the top and the O below it (forget the old mark): OBRSDDB BBDDORS First O at the bottom marked, mark first O at the top: OBRSDDB BBDDORS First B at the bottom marked (this also happens to be the first time we are processing a B, but "first" here means it's the first B in the string "BBDDORS"), mark the first B at the top (that is, the first B in the string "OBRSDDB"): OBRSDDB BBDDORS Second B at the bottom marked, so mark the second B at the top: OBRSDDB BBDDORS First S at the bottom marked, mark first S at the top: OBRSDDB BBDDORS Second D at the bottom marked, but when we mark the second D at the top, we see we are back at the original: OBRSDDB BBDDORS When we look at each of the letters we marked at the top in order, we see: "DRDOBBS"
=========== Algorithm 2 ===========
Another way is to observe that if you know the first few columns in the sorted table, you can fill in the last column too. And since the letters in the last column are prefixes of the string you already have in a row, you can move that letter to the front, and you get a set of strings that must also be in the sorted table (but not sorted yet). Sort them, and we are in the start situation again, but all strings are one letter longer. This leads to the following method: First write down a column consisting of the encoded string: O B R S D D B Then sort it: B B D D O R S Now write the encoded string in front of that again: OB BB RD SD DO DR BS And sort it again: BB BS DO DR OB RD SD Repeat prefixing and sorting until you get: BBSDRDO BSDRDOB DOBBSDR DRDOBBS OBBSDRD RDOBBSD SDRDOBB Now take row 5, and move the D at the end to the front, giving: DRDOBBS Many more algorithms exist of course. Maybe they even lead to shorter programs...
============= Specification =============
You will have to write a KiXtart function called BWTDecode, which will decode a BWT-encoded string with a length between 0 and 999 characters. The valid character space of the encoded string is ASCII codes 32-126. The function will also require an integer offset. The offset number represents the character that will be the first character of the output (counting starts at zero, and the number will be less than the string length, you don't have to handle invalid input). The function must return the decoded string or an empty string in case the decoding fails. We use normal ASCII sorting. So if you get
code:
$encoded = ’OBRSDDB’ $offset = 5 ? 'encoded = '+$encoded ? 'offset = '+$offset $decoded = BWTDecode($encoded,$offset) ? 'decoded = '+$decoded
then the cosole output should be:
code:
encoded = OBRSDDB offset = 5 decoded = DRDOBBS
============= General rules =============
- The program must be written as one or more lines.
- The program is expected to finish in a reasonable time, e.g. if it gets started when the challenge opens, it's expected to be finished by the time the challenge closes even on a somewhat slow computer. The program has to be valid during the period that the challenge runs.
- You can submit scores as often as you want to until the deadline, no reason to wait until the last minute. In fact, other people want to see the score to beat. So don't be a spoilsport by hoarding your score. Submit early and often.
- The program may only use the KiXtart commands/functions/macros, no other code fragments are allowed.
- The program should be self-contained (except for any I/O mentioned in the challenge). In particular, you may not do things like fetching extra data from a remote site.
- You may assume ASCII as character set.
- You are not allowed to use the included BSort() function. All functions required for the decoding must be placed inside the appropriate section of the script.
When posting KiXtart Golf Scores, please include the kixgolf_bwt.txt file that is created in the script directory. It contains some basic information about the computer that the script is run on. ======== Deadline ======== The game starts December 18 (3:00pm EST). The private coding phase ends Sunday, December 29 at 3pm EST and the public coding phase ends January 5 (3:00pm EST).
============ Test program ============
A test program is provided to help screen entries and to provide the Golf Score. Any program that passes the test program can be submitted. If you are surprised that your solution passed the test program, please submit it anyway! That will help me identify bugs in the test program.
=================== Special Recognition ===================
Special recognition for the person who can identify the author of input array element 4.
================================================================ KiXtart GOLF - How To Play ================================================================
Most importantly, anybody can play, no age restrictions, no penalties, no handicap!
The object in "real" golf is to hit the ball in the hole in the fewest strokes. The object in KiXtart Golf is to get from input (tee) to target (hole) in the fewest keystrokes.
Example: How many positive elements are in array $a?
Array $a could be of structure $a=[1, 2 ,-3, 4, -5, -7, 8, 9]
One approach:
code:
for $b=0 to ubound($a) if $a[$b]>0 $c=$c+1 endif next
for a score of 45.
Another solution is:
code:
DO $b=$b+1 if $a[$b]>0 $c=$c+1 endif UNTIL $b>(UBOUND($a)+1)
for a score of 53.
Better approach: Code sample 1
================================================================ KiXtart GOLF - The Rules ================================================================
1) The goal of KiXtart Golf is to score the lowest strokes.
2) Strokes are all characters in a piece of code except whitespace characters, unless the whitespace character is necessary for the line of code to work. Therefore, carriage returns and line feeds do not count or spaces in between the '=' sign when assigning variables, e.g. '$a = $b' scores 5.
3) Code can be constructed any way you like, as long as it is syntactically correct with KiXtart.
4) The final solution MUST pass all test scripts that accompagny the KiXtart golf challenge.
5) The use of '$' as a variable is allowed.
6) In case of questions about a particular way to count the KiXtart Golf Challenge organizer has the last call.
7) During the private coding phase, no code is allowed to be posted. Violations result in disqualification of said player.
8) During the public coding phase, code should be posted, reused, and borrowed from other players.
9) The following script can be used to count the KiXtart Golf score: http://www.kixtart.org/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=2;t=003608
================================================================ KiXtart GOLF - The Duration of the Competition ================================================================
1) Private coding phase: From date/time of posting the tournament challenge to the following Sunday, 3pm EST (BBS+6 time)
2) Public coding phase: From Sunday, 3pm EST (BBS+6 time) to the following Sunday, 3pm EST (BBS+6 time)
3) Final results: Sunday, 3pm EST (BBS+6 time)
code:
; KiXtart Golf ; ; Burrows-Wheeler Transform Decoder ; ; UDF ExtGolfStats() will be included at a later time
dim $iRc, $sSource, $sEncoded, $sDecoded, $iOffset, $asSourceArray[5] dim $iRandomElement, $sShellCmd, $sKiXGolfScrEngVer, $iKiXGolfScore dim $sResult, $sOutput, $sOutputFile dim $sGolfStatsUDF, $bExtGolfStats dim $sGolfScoreUDF, $bGolfScore dim $sTitle
; set requried options $sTitle='KiXtart Golf: Burrows-Wheeler Transform' $iRC=SETTITLE($sTitle) $iRC=SETOPTION('Explicit','on') $iRC=SETOPTION('NoVarsInStrings','on') $iRC=SETOPTION('WrapAtEOL','on') $iRC=SETOPTION('CaseSensitivity','on')
; load UDFs $sGolfScoreUDF=@scriptdir+'\'+'kixgolf3.udf' if exist($sGolfScoreUDF) call $sGolfScoreUDF $bGolfScore=1 else $bGolfScore=0 endif $sGolfStatsUDF=@scriptdir+'\extgolfstats.udf' if exist($sGolfStatsUDF) call $sGolfStatsUDF $bExtGolfStats=1 else $bExtGolfStats=0 endif $sKiXGolfScrEngVer = '3.0.3'
; create a couple of test arrays $asSourceArray[0]='DRDOBBS' $asSourceArray[1]='I hope the decoding algorithms make sense.' $asSourceArray[2]='An optional run-length encoder (RLE) should be placed ' $asSourceArray[2]=$asSourceArray[2]+'right before and after the Burrows-Wheeler ' $asSourceArray[2]=$asSourceArray[2]+'Transform encoder to maximize compression.' $asSourceArray[3]='This algorithm has first been published in 1996 and is NOT ' $asSourceArray[3]=$asSourceArray[3]+'patented. This means, we could write a custom ' $asSourceArray[3]=$asSourceArray[3]+'UDF to compress/decompress text based on the ' $asSourceArray[3]=$asSourceArray[3]+'Burrows-Wheller Transform and an optional ' $asSourceArray[3]=$asSourceArray[3]+'run-lengh encoding engine.' $asSourceArray[4]="Marley was dead: to begin with. There is no doubt whatever about '" $asSourceArray[4]=$asSourceArray[4]+"that. The register of his burial was signed " $asSourceArray[4]=$asSourceArray[4]+"by the clergyman, the clerk, the undertaker, " $asSourceArray[4]=$asSourceArray[4]+"and the chief mourner. XXX signed it. And XXX's " $asSourceArray[4]=$asSourceArray[4]+"name was good upon 'Change, for anything he chose" $asSourceArray[4]=$asSourceArray[4]+" to put his hand to. Old Marley was as dead as " $asSourceArray[4]=$asSourceArray[4]+"a door-nail. Mind! I don't mean to say that I " $asSourceArray[4]=$asSourceArray[4]+"know, of my own knowledge, what there is " $asSourceArray[4]=$asSourceArray[4]+"particularly dead about a door-nail. I might " $asSourceArray[4]=$asSourceArray[4]+"have been inclined, myself, to regard a coffin-nail " $asSourceArray[4]=$asSourceArray[4]+"as the deadest piece of ironmongery in the trade. " $asSourceArray[4]=$asSourceArray[4]+"But the wisdom of our ancestors is in the simile; " $asSourceArray[4]=$asSourceArray[4]+"and my unhallowed hands shall not disturb it, or the " $asSourceArray[4]=$asSourceArray[4]+"Country's done for. You will therefore permit me to " $asSourceArray[4]=$asSourceArray[4]+"repeat, emphatically, that Marley was as dead as a " $asSourceArray[4]=$asSourceArray[4]+"door-nail." $asSourceArray[5]="I'm wishing the KiXtart community and all KiXtart Golfers Happy " $asSourceArray[5]=$asSourceArray[5]+"Holidays and a joyful New Year"
; select any of the test strings $iRC=SRND(@TICKS) $iRandomElement=RND(ubound($asSourceArray)) $sSource=$asSourceArray[$iRandomElement]
; display source string info ? 'Source string = '+$sSource ? 'String length = '+len($sSource)
; encode string $sEncoded=BWTEncode($sSource) $iOffset=$sEncoded[1] $sEncoded=$sEncoded[0]
; display encoded string info ? 'Encoded string = '+$sEncoded ? 'Offset = '+$iOffset
; start the collection of extended KiXtart Golf Statistics if $bExtGolfStats $iRc=StartExtGolfStats($sTitle) endif
; decode string $sDecoded=BWTDecode($sEncoded,$iOffset)
; stop the collection of extended KiXtart Golf Statistics if $bExtGolfStats $iRc=StopExtGolfStats($sTitle) endif
; display decoded string info ? 'Decoded string = '+$sDecoded
; check validity select case $sSource and not $sDecoded color r+/n ? 'Validation = failed (decoded string is empty)' color n/n $sResult='failed' case $sSource==$sDecoded ? 'Validation = passed (decoded string is identical to encoded string)' $sResult='passed' case 1 color r+/n ? 'Validation = failed (unknown error)' color n/n $sResult='failed' endselect
; generate KiXtart Golf Score if $bGolfScore $iKiXGolfScore=KixGolf(@scriptdir+'\'+@scriptname) else $iKiXGolfScore='Please download the KiXtart Golf package at ' $iKixGolfScore=$iKixGolfScore+'http://people.bu.edu/jenmeyer/kixtart/kixgolf_bwt.zip' endif
; generating official KiXGolf Score $sOutput=$sOutput+@CRLF+'KiXtart' $sOutput=$sOutput+@CRLF+'KiXtart Version = '+@KIX $sOutput=$sOutput+@CRLF+'KiXGolf Script = '+@SCRIPTNAME $sOutput=$sOutput+@CRLF+'' $sOutput=$sOutput+@CRLF+'Computer' $sOutput=$sOutput+@CRLF+'OS = '+@PRODUCTTYPE $sOutput=$sOutput+@CRLF+'CPU = '+@CPU $sOutput=$sOutput+@CRLF+'Speed = '+@MHZ+' MHz' $sOutput=$sOutput+@CRLF+'' $sOutput=$sOutput+@CRLF+'KiXGolf Scoring Engine' $sOutput=$sOutput+@CRLF+'Scoring Engine = '+$sKiXGolfScrEngVer $sOutput=$sOutput+@CRLF+'' $sOutput=$sOutput+@CRLF+'KiXtart Golf Score' $sOutput=$sOutput+@CRLF+'Tournament = '+$sTitle $sOutput=$sOutput+@CRLF+'Date & Time = '+@DATE+' '+@TIME $sOutput=$sOutput+@CRLF+'KiXGolf Result = '+$sResult $sOutput=$sOutput+@CRLF+'KiXGolf Score = '+$iKiXGolfScore $sOutput=$sOutput+@CRLF+' ' if $bExtGolfStats $sOutput=$sOutput+@CRLF+GetExtGolfStats() endif $sOutput=$sOutput+@CRLF+'Thank you for participating in KiXtart Golf!'
; saving and displaying official KiXtart Golf Score $sOutputFile=@scriptdir+'\'+left(@scriptname,instrrev(@scriptname,'.'))+'txt' $iRC=redirectoutput($sOutputFile,1) ? '[QUOTE]' ? $sOutput ? '[/QUOTE]' $iRC=redirectoutput('') ? $sOutput
exit 0
; begin Burrows-Wheeler Transfrom Encoder function BWTEncode($sSource) dim $aEncoded[1], $asShftSrc, $iIndex, $iSourceLength dim $sOffsetString
$aEncoded[0]='' $aEncoded[1]=0
if not $sSource $BWTEncode=$aEncoded exit 0 endif
$iSourceLength=len($sSource)
; create encoded matrix redim $asShftSrc[$iSourceLength-1] $asShftSrc[0]=''+$sSource for $iIndex=1 to $iSourceLength-1 $asShftSrc[$iIndex]=''+substr($asShftSrc[$iIndex-1],2)+left($asShftSrc[$iIndex-1],1) next $sOffsetString=$asShftSrc[1]
; sort encoded matrix $asShftSrc=bsort($asShftSrc)
; retrieve last column $aEncoded[0]='' for $iIndex=0 to $iSourceLength-1 $aEncoded[0]=$aEncoded[0]+''+right($asShftSrc[$iIndex],1) next
; determine offset $iIndex=0 while not $asShftSrc[$iIndex]==$sOffsetString and $iIndex<($iSourceLength-1) $iIndex=$iIndex+1 loop $aEncoded[1]=$iIndex
$BWTEncode=$aEncoded endfunction ; end Burrows-Wheeler Transfrom Encoder
; begin BSort sorting algorithm ;FUNCTION BSort() ; ;ACTION Bubble sort for kixtart arrays ; ;AUTHOR Shawn Tassie ; ;SYNTAX ARRAY = BSORT(ARRAY [, ORDER]) ; ;PARAMETERS ARRAY ; A single dimensional kixtart array ; ; ORDER ; The sort order (0/1). If omitted or set to zero, the array will be sorted in ; ascending order. If set to 1, the array will be sorted in descending order. ; ;RETURN The sorted array ; ;REMARKS To my knowledge, this is as good (fast) as bubble sorts get. Would love to compare ; benchmarks and/or get feedback in terms of performance improvements. ; ;DEPENDENCIES none ; ;EXAMPLE $array = Peach,Pumpkin,Orange,Grape,Lime,Apple,Rasberry,Cherry,Lemon ; $array = bsort($array) ; ;KIXTART BBS http://kixtart.org/cgi-bin/ultimatebb.cgi?ubb=get_topic&f=12&t=000070 ; function bsort($array, optional $order) dim $i,$j,$m
if ubound($array) for $i = 0 to ubound($array) $m = $array[$i] for $j = $i + 1 to ubound($array) if ($array[$j] < $m and not $order) or ($array[$j] > $m and $order) $m = $array[$j] $array[$j] = $array[$i] $array[$i] = $m endif next next $bsort = $array else $bsort = 0 endif endfunction ; end BSort sorting algorithm ; any modifications above this line will result in immediate disqualification
; begin Burrows-Wheeler Transfrom Decoder
;! function BWTDecode($sEncoded, $iOffset) endfunction ;! ;!
; end Burrows-Wheeler Transfrom Decoder
; any modifications below this line will result in immediate disqualification
[ 23. December 2002, 21:35: Message edited by: sealeopard ]
_________________________
There are two types of vessels, submarines and targets.
|
Top
|
|
|
|
#97070 - 2002-12-19 06:15 PM
Re: KiXGolf: Burrows-Wheeler Transform
|
Shawn
Administrator
Registered: 1999-08-13
Posts: 8611
|
oh - i wish Split() was more powerful
|
Top
|
|
|
|
Moderator: Arend_, Allen, Jochen, Radimus, Glenn Barnas, ShaneEP, Ruud van Velsen, Mart
|
0 registered
and 718 anonymous users online.
|
|
|