Visualization browsing by tag


Followup to my earlier post on Hilbert curve timeseries plots

Monday, August 24th, 2009

Note that the Hilbert code is not mine.  Rather it is from a Haskell Golf Challenge post in a pastebin and appropriated because I wanted a quick Hilbert curve without having to think through a hazy minded Friday evening.  Can’t find the original post now, actually.  That’ll teach me to bookmark things that are important.  Here’s the other person’s code.  If anyone can help me claim it, I will post your name as author.  Again, apologies in advance for that :-(.

h :: [Bool] -> ([Bool], [Bool])
h = l

go :: Bool -> Bool -> ([Bool], [Bool]) -> ([Bool], [Bool])
go x y (xs,ys) = (x:xs, y:ys)

l, r :: [Bool] -> ([Bool], [Bool])
l (False:False:ns) = go False False $ right (r ns)
l (False:True:ns) = go False True $ l ns
l (True:False:ns) = go True True $ l ns
l (True:True:ns) = go True False $ left (r ns)
l _ = ([], [])

r (False:False:ns) = go False True $ left (l ns)
r (False:True:ns) = go False False $ r ns
r (True:False:ns) = go True False $ r ns
r (True:True:ns) = go True True $ right (l ns)
r _ = ([], [])

left, right :: ([Bool], [Bool]) -> ([Bool], [Bool])
left (False:xs, False:ys) = go False True $ left (xs,ys)
left (False:xs, True:ys) = go True True $ left (xs,ys)
left (True:xs, False:ys) = go False False $ left (xs,ys)
left (True:xs, True:ys) = go True False $ left (xs,ys)
left _ = ([], [])

right (False:xs, True:ys) = go False False $ right (xs,ys)
right (True:xs, True:ys) = go False True $ right (xs,ys)
right (False:xs, False:ys) = go True False $ right (xs,ys)
right (True:xs, False:ys) = go True True $ right (xs,ys)
right _ = ([], [])

-- Infrastructure for testing:
bits :: Int -> Int -> [Bool]
bits n k = go n k [] where
  go 0 k = id
  go n k = go (n-1) (k `div` 2) . (odd k:)

num :: [Bool] -> Double
num (False:xs) = num xs / 2
num (True:xs) = (num xs + 1) / 2
num [] = 0

hilbert :: Int -> Int -> (Double, Double)
hilbert n k = (\(x,y) -> (num x, num y)) (h (bits n k))


Here begins my own code.  To use the Data.Colour.blend function, I need to normalize all the values.  Here is that function, which could be made considerably more efficient with a minimax instead of independently calling minimum and maximum, but again, the point here is illustration of a technique, not the most beautiful code.

normalize :: [Double] -> [Double]
normalize values = [(val-minval) / (maxval-minval) | val <- values]
    where minval = minimum values
          maxval = maximum values

Following that, we have a function and its helper for creating a hilbert plot of the data.  Note the use of the constant 64.  The Hilbert code above keeps everything within a unit vector of the origin, so we scale out for the resolution.  The resolution should properly be ceiling . log2 of the number of items in the list, which could be calculated efficiently, but it would clutter the code.

vis :: Int -> [Double] -> BaseVisual
vis n values = concat $ zipWith vis'
                [(64*x,64*y) | (x,y) <- (map (hilbert n) [0..2^n-1])]
                (normalize values)

Finally here is the visualization of a single point whose x,y vector is now the Hilbert point for its position in the timeseries.  We blend between two colors, blue for 0 and orange for 1. This could just as easily be a more complicated colourmap, but this is the code that generated the colormap from the previous post.

vis' :: (Double,Double) -> Double -> BaseVisual
vis' (x,y) val = fillcolour (blend val c0 c1)
               . filled True
               . outlined False
               $ arc{ center = Point x y, radius=0.5 }
    where c0 = opaque blue
          c1 = opaque orange

And finally the main program.  All we do here is take the filename from the arguments, read in the lines as Double values, and map those values to colored hilbert points using our vis function.

main = do
  name <- (!!0) <$> getArgs
  k <- map read . tail . words <$> readFile name
  let visualization = vis degree k
      degree = (ceiling $ log (realToFrac . length $ k) / log 2)
  renderToSVG (name ++ ".svg") 128 128 visualization

Plotting timeseries in space filling curves

Monday, August 24th, 2009

Fitting many timeseries in the same area and formatting them for quick comparison is challenging and an important problem.  For example, say that you want to pick stocks in the S&P 500 that perform similarly or which respond similarly to certain numerical transformations (say they have the same trend lines or the same noise frequency).  Line graphs are inappropriate for this because they’re too visually noisy and because you can only compare easily up to down, not left to right, visually, limiting the number of graphs on the page at the same time.

For certain comparisons, we aren’t interested in the individual values of a timeseries nor its trend along time so much as we are how it relates to other timeseries in the same dataset.  For this purpose over the weekend, I created these:


These are timeseries simply plotted along a Hilbert curve.  The 8 you see here are the first 8 stocks in alphabetical order of the S&P 500, plotted with one segment per day for all trading days in the last decade.  Blues are lower values.  Oranges are higher values.  They have each been normalized internally for their minimum and maximum prices, so the shape of the curve is more important than the maxima or minima.

In this figure, we see two glyphs that are similar: the first and the next to last.  These correspond to Agilent and Adobe, respectively.  Let’s look at their 5 year charts to see how similar they in fact are:


Adobe (ADBE). Source: Google Finance

Source: Google Finance

Agilent (A). Source: Google Finance

Now, despite some surface differences, you’ll note that the shapes of the graphs are indeed similar, and that when Agilent goes up, Adobe tends to go up, and when Agilent goes down, Adobe tends to as well.  Now, these two companies are unrelated in terms of market sector, so can we really say that these similarities matter?  If it were a short trend, maybe not, but this is a 5 year graph that we’re comparing here, and they’re remarkably similar to each other.  It’s worth delving deeper into.  They could be majority held by the same people, they could track for reasons that aren’t immediately obvious, they could have the many of the same people on their board of directors, or it could in the end be complete coincidence (but paradoxically predictive coincidence).

Here, by the way, for comparison is Apple, the third glyph from the Hilbert graph, shown as an area chart on the same timescale as A and ADBE:


Apple (AAPL). Source: Google Finance

One might expect, considering that the majority of Adobe’s projects run on Apple computers and a majority of Apple computers run Adobe products, that when Apple does well, Adobe does well, and vice versa.  We can see that this is not the case.  Much as the two Hilbert graphs are unrelated, AAPL and ADBE’s performance are unrelated.

Noe also that I managed to pack 8 timeseries in less space total using the Hilbert glyphs than the Google finance graphs.

And yes, by the way, this was written in Haskell using Hieroglyph.  I don’t have the code on me right now, but I’ll post it later on today when I’m in front of my personal computer.

A quick buster example

Thursday, July 9th, 2009

Some people have asked of late for examples of Buster.  What follows is the (annotated) code that I used to generate the FDL video of the particles swarming around the lambda in the last post:

module Main where

import qualified Data.Array.IO as A
import Graphics.Rendering.Hieroglyph
import Graphics.Rendering.Hieroglyph.OpenGL
import App.EventBus
import IsoFDL
import Random
import Data.Colour
import Data.Colour.Names
import System.Random
import Control.Concurrent

dim = 480
minSeedCoord = fromIntegral $ quot dim 2 - 80
maxSeedCoord = fromIntegral $ quot dim 2 + 80
minIdealDist = (-100)
maxIdealDist = 100
npts = 100
basestrength = 4 

randomRIOs :: Double -> Double -> IO [Double]
randomRIOs a b = do
    g <- getStdGen
    return $ randomRs (a, b) g

This is the only behaviour we define, adjustLayout, and it runs a single layout iteration over the data.  We take the old locales of points, compute an FDL iteration, take the new locales of points, and produce the event, which is “Visible” (this is important, as Hieroglyph looks for all items in the Visible group, assumes they are o the class Visual, and attempts to render them.  We also produce a re-rendering request, and return the list of rerender and our layout as a future to be used at the appropriate time by the buster framework.  One final thing to note about our event is that it should live for 10 iterations.  Often geometry is persistent, but this was used to give the “contrail” effect to the particles as they move, effectively layering slightly transparent geometry over the top of older slightly transparent geometry to create the effect.

The Hieroglyph code is also fairly straightforward.  We apply linewidth and line color combinators (using the colours in Russell O’Connor’s excellent Data.Colour library) to a list of paths which are slightly exaggerated (that’s the +/-(x1-x0) and +/-(y1-y0) operations on the Points in the path) vectors along the path that was most recently traveled.

adjustLayout :: [Attractor] -> (Arr (Int,Int) Double) -> (Arr Int Double) -> (Arr Int Double) -> Behaviour [EData HieroglyphGLRuntime]
adjustLayout poles dists xs ys b = do
    pts <- externalizeLayout xs ys
    layoutIteration basestrength poles xs ys dists
    pts' <- externalizeLayout xs ys
    event' <- produce "Visible" "" (show . head $ pts') (Iterations 10)
                . (:[])
                . EOther
                . Geometry
                . linewidth 1
                . strokecolour (dissolve 0.3 (opaque darkblue))
                . name "points"
                $ zipWith (\(x0,y0) (x1,y1) -> path{ begin=Point (x0-(x1-x0)) (y0-(y1-y0)), segments=[Line $ Point (x1+(x1-x0)) (y1+(y1-y0))] }) pts pts'
    rerender <- produce "Hieroglyph" "" "Rerender" once []
    future b . return $ [event',rerender]

The next function, behaviour would normally be pure, but because we have some data that is internal to the adjustLayout behaviour, it’s got a do.  One of the things that I will change in the near future about Hieroglyph and OpenGL is to make the events it can respond to more polymorphic.  Currently I require that to use Hieroglyph in OpenGL that the event data type is [EData HieroglyphGLRuntime].  That’s pretty inane if I do say so myself, and in the future, there will be a constraint more along the lines of (HieroglyphInteractiveEventData a) => a rather than a static type to be determined at compile time.  However, with the current state of things, this is our behaviour, and it’s still pretty straightforward.  renderBehaviour depends on the result of adjustLayout.  That’s it.  The <~< establishes the dependency relationship and buster takes care of the rest.  The “poles” are a series of attractors (defined in IsoFDL.hs), which are static fields attached to a fixed point in space, attracting or repelling depending on the constant given them.

behaviour = do
    randomDistances <- randomRIOs minIdealDist maxIdealDist >>= A.newListArray ((0,0),(npts-1,npts-1)) . take (npts^2) . drop (2*npts)
    randomXs <- randomRIOs minSeedCoord maxSeedCoord >>= A.newListArray (0,npts-1) . take npts
    randomYs <- randomRIOs minSeedCoord maxSeedCoord >>= A.newListArray (0,npts-1) . take npts . drop npts

    poles <- mapM (\(x,y) -> compileAttractor . Attractor x y . take npts $ repeat 150)
        [(100,(440-100)), (170,(440-160)), (220,(440-220)), (160,(440-280)),
         (100,(440-320)), (280,(440-280)), (350,(440-320)), (380,(440-300))]

    return $ renderBehaviour <~< adjustLayout poles randomDistances randomXs randomYs

main = behaviour >>= boilerplateOpenGLMain  [initializeBus "Testing Force Directed layouts" dim dim]

Finally, our main function is incredibly simple.  We simply bind the behaviour to the boilerplate buster OpenGL main.  The other argument to boilerplateOpenGLMain is a list of “widgets” to be bound to the bus at the beginning of the application.

An ugly force-directed layout implementation

Thursday, July 9th, 2009

Updated 7.9.2009: I’ve added another video showing the effect of attractors on the layout

Unalbe to show flash video

I’ve been working on visualizations silently of late.  Today, though, I threw together a quick FDL (force-directed layout) algorithm based on POV-Ray’s static-field isosurface or “blob” equation, which is (sorry, but I’m using plaintext)

density = strength * (1 - (distance / radius)^2)^2

The documented equation has a problem, in that if distance > radius, the density approaches infinity fast, so I change it to be:

density = strength * (1 - (min(distance, radius) / radius)^2)^2

That solves the infinity problem, and now all there is to the algorithm is to apply it:

adjust :: Double -> Double -> Double -> Double -> Double -> Double -> (Double,Double)
adjust sx sy ss sr dx dy = if isNaN vx || isNaN vy then (dx,dy) else (dx+force * vx, dy+force * vy)
    where force = density ss sr dist
          dist = sqrt ((dx-sx)^^2 + (dy-sy)^^2)
          vx = (dx-sx) / dist
          vy = (dy-sy) / dist

Note that it is possible for dist to be NaN for a single point-to-point interaction, so we correct for that.  Basically, this function is a single adjustment from point set to a point set that fits our force constraint a little better.  In the following code, we take a static vector, <sx,sy> and a movable vector <dx,dy> and take the distance.  We apply the density function to the distance using ss as our strength and sr as the radius or distance at which the density function falls off to zero.  We return the translated point P’ which is translated along the normal vector <vx,vy> by the variable “force”. Simple and straightforward.

Now I know… by my title, you’re thinking, “What’s ugly about that?” Well, when I code fast, I use shortcuts.  The following code, which uses this adjustment function as a kernel for the actual force-directed layout algorithm is imperative.  I’ve tried FDL and MDS (multi-dimensional scaling) algorithms before using lists and tries and always ran into enough overhead that it significantly diminished the number of points that are viable to run a single iteration in real-time.  I’m sure there’s a way to do it, but I solicit the community’s help in suggesting a faster way than this.  Yes, I could have used STUArray and avoided IO, but that doesn’t really eliminate the imperative nature of things.

layoutIteration :: Double -> [Attractor] -> (Arr Int Double) -> (Arr Int Double) -> (Arr (Int,Int) Double) -> IO ()
layoutIteration alpha attractors xsArray ysArray radii = do
    bounds <- A.getBounds radii
    forM_ (A.range bounds) $ \(r,c) -> when (r /= c) $ do
        radius <- radii -| (r,c)
        x0 <- xsArray -| c
        y0 <- ysArray -| c
        x1 <- xsArray -| r
        y1 <- ysArray -| r
        let (x',y') = adjust x0 y0 alpha radius x1 y1

        xsArray =| r $ x'
        ysArray =| r $ y'

    forM_ attractors $ \(CAttractor ax ay arArray) -> do
        bounds <- A.getBounds arArray
        forM_ (A.range bounds) $ \ix -> do
            x0 <- xsArray -| ix
            y0 <- ysArray -| ix
            ar <- arArray -| ix
            let (x',y') = adjust ax ay (-alpha) ar x0 y0
            xsArray =| ix $ x'
            ysArray =| ix $ y'

As you can see, this is pretty ugly, even with the aliased (-|) readArray and (=|) writeArray binary operators cleaning things up a bit.  It’s straightforward, and random access to the points is O(1).  There are algorithms for which this is more important than this one, and I can see a list comprehension version coalescing even as I write this, but this kind of code is O(N^2). The constants are sometimes (depending on the adjustment function) quite high, so what I really need is the fastest version possible, probably taking a lot of advantage of fusion in the compiler.  The other problem, which is more subtle,  is that points are updated multiple times in one pass and we always want to use the latest point.  Any functional solution would have to take this into account (and storing all the update deltas and taking their centroid won’t work, because each update wouldn’t then be based on the point’s most current position).

This one seems to handle up to about 250 points quite well, which is decent for an FDL algorithm (keep in mind, that’s 10000 distance calculations per iteration, all of which have to happen before sending any points to the video card and other such matters — doing this 30-50 times a second along with other functionality is harder than it sounds).  Handling more than that would require one of two modifications to the algorithm: either select a random N moves to make per iteration or subdivide the point set into sqrt(N) size blocks, perform the FDL on each of those, and then on their centroid, translating them all by the delta.

So essentially, there were two purposes to this exercise.  One was coding an FDL algorithm in Haskell, but the other was trying out an FDL I hadn’t seen tried before.  This one in particular does not result in the usual circular pattern arising and also handles negative forces as easily as positive forces with a simple, low-overhead function. Oh, and of course the other was showing that I could get realtime performance of such an algorithm (that previously people insisted had to be done in C to be done properly) with a relatively naive implementation.

Force directed layout algorithms are useful for tasks like graph layouts and other dense datasets as well as for performing a sort of “visual” clustering.   Here’s an example showing the starting state, with all points clustered in a square in the middle of the screen:


And here is one after the layout has run for a few hundred iterations:

screenshot-testing-force-directed-layoutsThe evenness reflected here is because the ideal distances I selected are random.  If there is a system to the distances, there will be a system to the layout.  The neat thing about this algorithm is that given any set of input points and distances, the result of the algorithm at any point is deterministic. Follows is the code that I used to generate the above pictures and a short movie of the forces in action.

This is the code for the algorithm and a test program in Hieroglyph/Buster

A video of the algorithm in action

Unalbe to show flash video

Heiroglyph: Interactively graph a spreadsheet in 99 lines of code.

Thursday, February 5th, 2009


99 lines of code, even with imports. The app is simple, but it builds upon the sparkline example of the other day. In this app, we import an entire spreadsheet of tabular data and create sparklines for each column of data where the data is floating point. The user can sort the data by a column by clicking on the left side of a sparkline. This doesn’t do a lot of error checking, and only works on spreadsheets where the columnar data type is homogenous, but that’s a large class of datasets, and it simplifies our code considerably. First, our boring imports:

> module Main where
> import Graphics.Rendering.Hieroglyph
> import Graphics.Rendering.Hieroglyph.Scaffolds.Interactive
> import qualified Data.Map as Map
> import Data.Map (Map)
> import qualified Data.Set as Set
> import Data.Set (Set)
> import qualified Data.ByteString.Lazy.Char8 as C
> import Data.List (foldl',transpose,mapAccumL,sort)
> import System.Environment (getArgs)
> import Data.Maybe (fromJust)
> import qualified Data.IntMap as IntMap

Now we setup a bit of color for our visualization: attribute sets for describing thin black lines (the sparklines), thicker black lines (for the text), and the background color.

> blackstroke = plain{ strokeRGBA=black, linewidth=0.5 }
> blacktext = plain{ strokeRGBA=black }
> whitefill = plain{ fillRGBA=white, filled=True, outlined=False }

Then we setup our datatype that we’re visualizing. Note that we’re using the interactive scaffolding, so this will be an interactive application.

> data MyData = MyData {
>     interactiveScaffolding :: Scaffolding
>   , spreadsheet :: Map String [Double]
>   , ncolumns :: Int
>   }

Next we setup the data so that it’s responsive to the user input.

> instance Interactive MyData where
>     getInteractiveScaffolding = interactiveScaffolding
>     setInteractiveScaffolding a b = a{ interactiveScaffolding = b }
>     interactionFold dat geom

Yes, the 80 is arbitrary. It seems to work well, and to some extent, this is a draft application and isn’t meant to be perfect. Here, if the user clicks on the left side of the application, we check to see which sparkline the user is over and sort that column in ascending order.

>         | xcoord < 80 && getMouseLeftButtonDown dat = dat { spreadsheet = sortSpreadsheet dat }

Then we ignore all other inputs.

>         | otherwise = dat
>                 where Point xcoord ycoord = getMousePosition dat

Next is our familiar sparkline function from my blog post the other day. As a review, we remap all the values in the line to be inside the height of a sparkline and we take those and create a path out of them.

> sparkline width height (Point startx starty) values = path{ begin=point0 , segments=map Line points , attribs=blackstroke }
>    where (point0:points) = zipWith Point xvals yvals
>          xvals = iterate (+(width/n)) startx
>          yvals = map (remap mn mx starty (starty+height)) values
>          (mx,mn,_,_,_,n) = stats values

To visualize a whole spreadsheet, we start at the origin and work our way down using Map.mapAccum to create a visualization that is the same structure as our spreadsheet. We could simply pull the columns out of the spreadsheet and return a list of primitives instead of a map from name to primitives, but this way, we could filter the output from this function based on the structure of the original data. To me, this is closer to the intent of visualization; it’s less about the drawing aspect and more about the data, and manipulations on visualizations are based on the data, not the geometry or the screen.

> visSpreadsheet width height (Point startx starty) mp = snd . Map.mapAccum stackSparklines starty $ mp
>     where stackSparklines starty' values = (starty'+height, sparkline width height (Point startx starty') values)

This function places the names of the columns to the left of the sparklines.

> visSpreadsheetNames height starty names = snd . mapAccumL stackNames starty $ names
>     where stackNames starty' name = (starty'+height, text{ str=name, bottomleft = Point 5 starty', attribs=blacktext })

This merely defines our background rectangle

> background w h = rectangle{ width = w , height = h , attribs=whitefill }

And finally, vis calls our vis functions and also sets up the occlusion order, putting the names and sparklines on top of the background. The height of each sparkline is the number of sparklines divided by the height of the window.

> vis dat = visSpreadsheetNames (sy/ncf) 10 names
>           `over` visSpreadsheet (sx-offset) (sy/ncf) (Point offset 0) sheet
>           `over` background sx sy
>     where sx = getSizeX dat
>           sy = getSizeY dat
>           nc = ncolumns dat
>           ncf = fromIntegral nc
>           sheet = spreadsheet dat
>           Point xcoord ycoord = getMousePosition dat
>           offset = 6 * (fromIntegral . maximum . map length $ names)
>           names = Map.keys sheet

This is a simple function to read a spreadsheet and filter out columns that aren’t numeric. Note that it only checks the first item of each column, so it really doesn’t do sufficient error checking, however it’s faster than loading the entire spreadsheet to check to make sure each is numeric. Yes, we could have used Parsec here instead of defining our own isNumeric, but I wanted to keep down the number of libraries I was importing for tutorial’s sake.

> readSpreadsheet name = do
>      sheet <- (transpose . map (C.split '\t') . C.lines) `fmap` C.readFile name
>      return $ foldl' go Map.empty sheet
>   where go m (x:xs) | isNumeric xs = Map.insert (C.unpack x) (map (read . C.unpack) xs) m
>                     | otherwise = m
>         isNumeric = C.all isNumeral . head
>         isNumeral = (flip Set.member) nums
>         nums = Set.fromList "0123456789."

Remap a range of values to a different range. This is used in the sparkline function, and since our origin is at the top left and values go down, the seeming reversal of the first two arguments of the function is correct for our purposes.

> remap mx mn mn' mx' a = (mx'-mn') * (a-mn) / (mx-mn) + mn'

Once again, you should recognize this stats function from the blog post from the other day.

> stats (x:xs) = finish . foldl' stats' (x,x,x,x*x,1) $ xs
>     where stats' (mx,mn,s,ss,n) x = ( max x mx , min x mn , s + x , ss + x*x , n+1 )
> finish (mx,mn,s,ss,n) = (mx,mn,av,va,stdev,n)
>    where av = s/n
>          va = ss/(n-1) - n*av*av/(n-1)
>          stdev = sqrt va

This sorts the spreadsheet based on which sparkline the mouse pointer is on using the mapsort algorithm below.

> sortSpreadsheet dat = mapsort sheet (names !! item)
>     where sy = getSizeY dat
>           nc = ncolumns dat
>           ncf = fromIntegral nc
>           item = round $ (ycoord / sy) * ncf
>           Point _ ycoord = getMousePosition dat
>           names = Map.keys sheet
>           sheet = spreadsheet dat

This function is a little more elegant than it looks. It’s not terribly useful to do things this way here, but this is a lazy version of sorting the entire map. The more straightforward way to sort the map uses transpose to turn columns into rows, but also makes the whole thing head strict on the first element retrieved from the map. This is only head strict per column.

> mapsort mp key = reorder mp
>     where sort' vs = sort . zip vs $ indices
>           ordering = map snd . sort' $ (mp Map.! key)
>           nitems = (length ordering :: Int)
>           indices = iterate (+1) (0::Int)
>           reorder vs = (let arr = IntMap.fromList . zip indices $ vs in map (arr IntMap.!) ordering)

Now finally our very simple main. Take the argument from the command line that is our spreadsheet and create the visualization from it. Note that this GUI is not motion sensitive, as it really doesn’t need to be.

> main = do
>     [fname] <- getArgs
>     sheet <- readSpreadsheet fname
>     let dat = MyData scaffold sheet (Map.size sheet)
>     simpleGui dat vis ("Sparklines for " ++ fname)

Hieroglyph Haddock docs

Wednesday, February 4th, 2009

Since Hackage is building with GHC 6.10 now and we’re still waiting on a final release of Gtk for 6.10, it’s not generating the Haddock docs for Hieroglyph.  I’m posting them here for the time being. That’s all!

Introducing Hieroglyph. Purely functional 2D visualization on top of Cairo.

Tuesday, February 3rd, 2009

As a followup to my tutorial last year on Beautiful Code, Compelling Evidence, the tutorial on doing infovis and visual analytics in Haskell, I set about trying to do a pure-functional wrapper to Cairo.  After a month or so of banging my and Conal’s head against it, I am ready to (with many MANY thanks to Conal Elliott for his help in getting me to think about this in a semantic way) release a first pass at purely functional 2D drawing, called Hieroglyph.

My original goal was to come up with something that simply produced no IO when creating a scene and was fast enough to create interactive visualizations.  What I was eventually directed toward was semantic design and trying to figure out what a 2D visualization means. Hieroglyph is a first cut at that — a compromise between strict semantic design and staying close enough to Cairo to not make things easy in Hieroglyph that create performance problems in Cairo.  Future versions of Hieroglyph or forks will be less based on Cairo and more based on semantic design, but for now, I’m satisfied with what I have.

So the first question is: What is a visualization?  My answer is that it’s a function from data to a visual.  A visual is an unstructured collection (bag) of primitives.  The data can be any data that is interesting to the user doing the visualizing.

To encapsulate the Visual, we have:

type BaseVisual = [Primitive] 

class Visual a where
  primitives :: a -> BaseVisual

In Graphics.Rendering.Hieroglyph.Visual, I define instances on each of the basic collection types in the Haskell standard library, Visual v =>  Data.Set.Set v, Data.IntMap.IntMap v, Data.Map.Map a v,  [v], and v itself.  This allows the programmer to be unconcerned with how the collection of visual primitives is structured, allowing in many cases a straightforward map from data to primitives with the same structure.  In addition, there are two relationships between Visuals that are important to describe.  One is occlusion.  To describe the occlusion characteristics of two visuals, I have defined the combinator “a `over` b” to describe a composed visual in which a occludes b.  The other is specificity.  There is a concept in visualization of level-of-detail, and this is encapsulated by the combinator “a `moreSpecific` b”, which declares that a is more specific than b.  Both over and moreSpecific describe partial orderings over primitives, which can be used for filtering, or used by the renderer to order and constrain drawing.

Now for the primitives.  A primitive is nothing more than a declaration of attributed geometry.  Our primitives in Hieroglyph are Arc, Path, Rectangle, Text, Union, Image, and Hidden.  I chose to implement primitives as “labeled” functions using the Haskell record modification syntax.  While you can declare a primitive using the type constructor, it is easier and more expedient to use the lower case version of each of the primitive constructors and modify them using the record modification syntax.  The Haddock documentation describes what each record field is for every primitive.

Over the next few days, I’ll be giving examples of Hieroglyph and talking a little bit more about my design process as I do.  In the meantime, enjoy Hieroglyph and let me know about any bugs!

Ethan Heln’s galleries of networks and data visualizations

Wednesday, January 21st, 2009

Not his own stuff, but a sort of photo-blog on Flickr of things he’s found interesting.  They’re attributed and as an eclectic collection, it’s pretty interesting.  He’s got a good eye.

Review: Data Flow - Visualizing Information in Graphic Design

Tuesday, January 20th, 2009

My director handed “Data Flow: Visualizing Information in Graphic Design“  to me shortly before Christmas.  Being part of the visualization group, I’m always looking for new ideas in shiny, gelatin-covered books and magazines, hoping for inspiration in the form of informative, concise, and original visualization of interesting data.  I’m new to the field, relatively speaking, and so I’m still in a very active process of building my visual vocabulary.  I’ve got a strong eye for what I like, and what I like to imitate, though.  Tufte.  I love Tufte.

This book is about the same size, physically, as a Tufte book, and some, at least of the visualizations follow my favorite of his principles: trust the viewer with the data (present all the data you reasonably can), do not present information that is not data, and elegance and minimalism trump flashy.  Quite a number of the visualizations don’t follow these principles at all, but a number do, and for examples of those, I like this book.

What I don’t like about this book is that very little explanatory text goes with each visualization.  Now, if you’re unfamiliar with visualization, you might think that vis is supposed to be self-documenting.  That’s true up to a point in that things should be visually intuitive; you don’t use yellow to represent cold on the same graph where you use blue to represent hot, as a simple example.  But good visualizations of large datasets present so much data to the user that an extended legend is generally warranted.  This extended legend ought to include:

  • An explanation of what the dataset is.
  • An audience appropriate explanation of the domain.
  • How the dataset maps to the visualization.
  • Patterns that the researcher who commissioned the visualization has found or finds interesting.

Especially for this book, which includes hundreds of different visualizations from different domains, such legends would be helpful. I’m left as a reader without context, and while I can look at any page and think to myself, “That’s sure shiny,” I can’t look at visual elements and judge for myself whether a particular visual element was the “right” one for the data shown.  I’m not trying to review others’ work so much as garner for myself a visual vocabulary.  Here, I feel like I’m getting the words, but not the definitions.  A sort of Scrabble dictionary for data.

The other issue I have with the book is that it includes along with some very good visualizations, a number of infographics which seem to be designed to persuade rather than inform.  These visualizations feel more informative than they are, and lead the user to come to the conclusions that the graphic designer or her commissioner has already come to, rather than give them the chance to see things for themselves.

Yes, I think there is value in being visually striking, but for instance the graphic on page 155, “An Underground Disco Information Organization” gives me very little data with an excessive amount of context, which itself is presented like evidence.  Graphicdotal evidence, like anecdotal evidence, tells a convincing story, but it often occludes raw data, methods, and significant results.

So, my favorites from this book:

  1. Pie, pp. 38 - Graphing the mood of movies by analyzing frames for inherent color and plotting it on circles.  Ingenious.
  2. Data Visualization of a social network, pp. 60-64 - Different views of a social network show different connections.  I especially like the combination of the radial layout with the US map.
  3. War Report, pp 246 - Here, despite the sensitivity of the subject, the designer has chosen to leave his own presentation simple enough to let the data tell its own story.  The story that good raw data shown clearly tells is always more powerful than half-cocked data shown with graphicdotal markers.
  4. Sunlight Calendar, pp 202 - A complete calendar of the average sky hue at different times of day on each day of the year.
  5. Atlas of the North Sea, pp 105 -  Complex matrix visualization of the North Sea’s geography and significant environmental and geological factors.
  6. Silvertown Affect Map, pp 88 - Deceptively simple, and an elegant use of the data available to them.
  7. Cyberwar, pp 94-45 - Graphing a botnet attack
  8. Literary Organism, pp. 78 - A graph drawing of Jack Kerouac’s On the Road with a brilliant legend and visually appealing concept.
  9. The Shape of Globalization, pp 35 - Embedded circles to represent nested companies.
  10. Total Interaction, pp 48 - Combining pie charts with Florence Nightingale charts and anchored bubbles for a visually complex but informative graph.

So my review of this book is mixed.  Go to it for visually inspiring design and appreciation of how much data can be put onto a page at once without turning into noise.  Be wary of graphicdotal elements and techniques that are more persuasive than they are informative.  And write the authors and tell them to tell us more about the visualizations they collected!

The Docuverse.

Monday, January 19th, 2009

Diagram of the Docuverse

Diagram of the Docuverse

Updated link to source code and a win32 executable

1.5 million documents on the screen at once.  That, despite what it looks like, is what you’re looking at.   Click through for a larger picture, although I have to say that it’s at its awesomest on the Renci VisWall, our 9′x16′, 16 projector display wall.  Yes, this was also done in Haskell, and fortunately I still have the code for the prototype.  The idea here is that I use standard formulae from the text retrieval world’s vector space model to position all the documents in a collection in relation to a series of queries.

Each document is represented by a single star on the image, and each query makes up the core of one of the galaxies that form.  Text Retrieval often uses a linear combination of several factors in its model of “relevance” and these, instead of combined into a single score, are instead plotted along the axes of a cylindrical coordinate system.  In this docuverse diagram, the radius is what’s known as TFxIDF, or the dot product of the query vector and the document vector.  The angle around the core is dictated by the cosine of the query vector and document vector.  The height off the zero plane is the standard score of the length of the document (Z score).

In this diagram, the queries are positioned randomly, but in the future I’d like to have them positioned via multidimensional scaling with respect common document terms, so a user can immediately see how similar they are.  The docuverse can also be used to represent document clusters, where the centroid is the core of the galaxy instead of a query.

The wonderful thing about this is that it’s interactive.  You can zoom around in the universe and get up close to galaxies, although as of yet I haven’t implemented actually navigating to documents based on clicking on stars, or handling level of detail so that individual stars are named the same as their documents close up. This is achieved in Haskell using HOpenGL, Data.Array.UArray, and OpenGL Vertex Arrays.

As a useful visualization, it gives you two things.  One thing it gives you, given a set of queries, is a characterization of how relevant the collection is to your set of queries as a whole versus another set of queries (which may well be topics instead of standard web-queries).  The other thing it gives you is a sense of how similar documents are textually to each other.  If you have a small subset of documents that you know are relevant to you, you can (conceivably, although this is not implemented in the prototype) highlight them on the docuverse and then select subsets of documents to be output for analysis later.  In high-recall applications, this kind of collection browsing, I think, will be essential in larger document collections.

I’ve been able to spin this up with a 6 million document collection and have it retain some level of interactivity.  The collection you see pictured here is NIST’s wt10g collection, a 10 gigabyte web-crawl done several years ago for benchmarking search engines in the Text Retrieval Conference (TREC).  One day, I’ll have time to work on it again, and i’ll release more than just a prototype and be able to open source the code without shame.