Tuesday, January 23, 2007

Burrows-Wheeler Transform in Haskell

I got curious about program transformation techniques, and went looking to see what kind of work had been done on round-trip parsing: i.e. using the same grammar to turn a string into an abstract syntax tree, and then to turn it back into a string.

Well, I got distracted by this very odd thing I found called the Burrows-Wheeler transform. It's a weird way of sorting a string that is reversible. (And happens to be likely more easily compressed with something like run-length encoding, hence its use in compression programs such as bzip2). Basically you sort the string, then replace each character with the one that used to appear to the left of it back in the original ordering (or an "end of string" marker for the character that came from the front of the string).

Anyway, as an exercise I wrote an encoder and decoder in Haskell. Here it is, for your amusement. Feel free to use it if you want, but it's probably not very efficient. I need to read up on when Haskell does and does not save results of function calls -- are they always memoized?

module Bwt3 where
import Ix
import Data.List

-- For BWT we want to compare strings with the special
-- rule that the end of string sorts as greater than any
-- other character
compHighEol [] [] = EQ
compHighEol lst1 [] = LT
compHighEol [] lst2 = GT
compHighEol (x:xs) (y:ys)
| x > y = GT
| x < y = LT
| x == y = compHighEol xs ys

-- Datatype for rotated string; the string starts where the
-- integer member says it does, and wraps around when it hits the
-- end, conceptually
-- This lets us store multiple rotations of the same string
-- without using storage for each one; they all point to the same
-- structure
data Rotation = Rot Int [Char] deriving Eq
instance Show Rotation where
show (Rot k l) = (drop k l) ++ ['@'] ++ (take k l)
instance Ord Rotation where
compare (Rot a as) (Rot b bs) = compHighEol (drop a as) (drop b bs)

-- List of all possible rotations of a string
allrots str = [ Rot i str | i <- range(0, length str) ]

-- The actual output of this algorithm is a string with a flag
-- embedded in the middle of it, which can't be a character. So
-- we introduce a type for this purpose that
-- allows for an EOF symbol. Which by the way, sorts after
-- all other characters, to help with decoding

data FlaggedChar = FC Char | FC_EOF deriving Eq

instance Show FlaggedChar where
show FC_EOF = "@"
show (FC x) = show x

instance Ord FlaggedChar where
compare FC_EOF FC_EOF = EQ
compare _ FC_EOF = LT
compare FC_EOF _ = GT
compare (FC x) (FC y) = compare x y

-- BWT encoding: make all rotations, sort them, and take the last
-- character of each.
encode str = map lastchar (sort (allrots str))

-- lastchar just pulls the last character out of the
-- rotated string, or FC_EOF if that's the last item
lastchar (Rot 0 xs) = FC_EOF
lastchar (Rot ix xs) = FC (head (drop (ix-1) xs))

-- The rest of this stuff is for decoding the [FlaggedChar]
-- list we made above

-- Given a list (of anything), return a list of Ints
-- showing where the elements of the list would end up
-- if sorted.
-- We do that by tagging each element with an integer, sorting
-- those tagged items, and
-- collecting the tags afterwards to see where they ended up
sortPermutation xs =
map snd ( sortBy compfst (zip xs [0..]))
where compfst x y = compare (fst x) (fst y)

-- Turn that into a cycle by cycling through it once
getCycle xs =
take (length xs)
((iterate ((sortPermutation xs) !!) . fromInteger ) 0)

-- apply the permutation to the encoded item
-- and rotate the result so the end of string flag comes at the end
applyCycle cycle encoded =
fCtoString $
tail $
dropWhile (/= FC_EOF) answer ++ takeWhile (/= FC_EOF) answer
where answer = map (encoded !!) cycle -- Oops: fixed typo 9/2008

fCtoString :: [FlaggedChar] -> [Char]
fCtoString [] = ""
fCtoString ((FC c):cs) = [c] ++ (fCtoString cs)
fCtoString ((FC_EOF):ds) = fCtoString ds -- ignore FS_EOF

-- Finally the decode function puts it all together
decode xs = applyCycle (getCycle xs) xs


Jim Stuttard said...

Hi Chris,
Just found this but ghc 6.8.3 tells me perm is out of scope. Do I need a compiler switch?

Chris Bogart said...

Hi Jim,
No, you shouldn't. But I don't see where perm is defined. I must have neglected to paste something into the post -- I'll look at it and post an update soon.

Jim Stuttard said...

Thought not - thanks for the confirmation and the correction :)

Chris Bogart said...

I just fixed it -- it had 'perm' where it should have had 'cycle'. I must have renamed some variables then not tested before posting. Sorry about that! Thanks for the catch!