Hieroglyph HOWTO, part I: Sparklines

Written by Jeff Heard on February 3rd, 2009
> module Main where
>
> import Data.List
> import Graphics.Rendering.Hieroglyph
> import System.Environment (getArgs)
> import qualified Data.ByteString.Lazy.Char8 as BStr
>

This is a simple demonstration of Hieroglyph for a non-interactive application.  The drawing is output directly to a PNG file of the specified width and height.

Formally, a visualization is a function from data to a Visual, where a Visual is an unstructured collection of Primitives.  These Primitives are documented in the module Graphics.Rendering.Hieroglyph.Primitives.  A purely functional framework such as this affords the programmer several advantages. First, the drawing closely resembles the data: it is a straightforward transformation, not unlike a stylesheet, and entirely without side effects or dependencies outside the drawing itself. Second, the drawing itself can be transformed and read by code; for example, the Visual can be indexed in an R-tree or P-tree, and collision detection (mouse detection) becomes straightforward to implement. Third, because the drawing algorithm is a straightforward breadth-first traversal without frills, it is easy to reason about why your drawing didn't come out the way you expected it to. Finally, you can be lazy about what you draw. Being declarative instead of imperative about drawing means that you don't have to restructure your code or write separate code to change level of detail or to do off-screen culling. These become just a series of data transformations, or a matter of how much of your Visual you pass to the renderer.

Alright. If I haven't convinced you by now that you ought to be doing your 2D drawing in a purely functional way, I'm not going to.

To create a visualization in Hieroglyph, we write a function or functions from our data to a Visual. Then, in main, we use a transfer operation to take the Visual and put it to the page.

Our first simple visualization will be a sparkline. Sparklines are small inline line graphs, drawn without axes or scales, meant to replace or accent a table in a block of text. Let's see how we do that. First of all, here's the type signature of our sparkline, so you can see that while we're defining a drawing, since there's no IO, we aren't actually doing the drawing yet.

> sparkline :: Point -> Double -> Double -> [Double] -> Primitive

Note that Point is a 2D point, defined by Hieroglyph as two Doubles, x and y. Primitive is the instance of Visual we're using, since a sparkline can be represented by a single path. Now, on to the code.

> sparkline (Point startx starty) width height values = path{ begin=point0, segments=map Line points }

That one line defines the drawing itself: a line beginning at the leftmost x, whose segments are compressed horizontally into the space we define by "width", and whose y values are compressed to height and scaled according to the values we passed in. A sparkline. It took me four lines of text, but it takes Hieroglyph and Haskell one line of code, plus a bit of addenda.

>    where (point0:points) = zipWith Point xvals yvals
>          xvals = iterate (+(width/n)) startx
>          yvals = map (remap mx mn starty (starty+height)) values
>          (mx,mn,_,_,_,n) = stats values

Here we define our points in terms of x values and y values. The x values are a list of Doubles starting at the leftmost point, startx, and continuing in increments of our defined width divided by the number of data points we're compressing into that space. The y values are defined by remapping the values from their natural range, bounded by mx and mn, to the geometry of our drawing, defined by starty and height. We define the stats function shortly and the remap function only a little further down. Note the underscores, though: they mean that stats returns more values than we're actually going to use, and we're ignoring the rest. Due to the magic of laziness in Haskell, those ignored values and their dependents are never actually calculated, potentially saving us a fair bit of work.

I’m going to go ahead and define main for you before I define anything else, because then you can see how we go from a Visual, which is an abstract and purely functional entity to drawing on paper, which is not. Here goes.

> main = do

This line grabs the program arguments in the order we expect them. Doing it this way isn't generally condoned, but it keeps the example free of bulky parameter-checking code and keeps the import list free of extra modules for you to find and compile.

>   [dataset,widthAsc,heightAsc,outputfile] <- getArgs

There's a lot going on in the next line. We use `fmap` to map the function on its left over the value on its right, which has type IO BStr.ByteString (appropriately, since it comes from disc). Reading the composed function from right to left: it splits the file into lines, unpacks each line from our lazy representation into an ordinary Haskell String, and then reads each String as a Double.

>   values <- (map (read . BStr.unpack) . BStr.lines) `fmap` BStr.readFile dataset

Now, we briefly read our PNG width and height from the arguments, and then progress to the visualization.

>   let width = read widthAsc
>       height = read heightAsc
>       visualization = (sparkline origin width height values){ attribs = whiteStroke }
>       whiteStroke = plain{ strokeRGBA=white }

These few lines create our sparkline and set its stroke to white for when we show it.

The next line is the main action of the program and transfers the drawing we’ve done to the physical drawing on disc.

>   renderToPNG outputfile (round width) (round height) visualization

And that’s it for main. Now to define the stats function as a function from a list of doubles to a tuple of crunched math:

> stats :: [Double] -> (Double,Double,Double,Double,Double,Double)

First we define the finished product. The function foldl' takes an initial value and folds the list into it, like one might fold egg whites into a cake batter. The (x,x,x,x*x,1) is the initial value, seeded from the first element of the list of data.

> stats (x:xs) = finish . foldl' stats' (x,x,x,x*x,1) $ xs

Now we define stats', our folding function, which calculates the global extrema, sum, sum of squares, and list length all at once. Why do this, when Haskell so conveniently provides maximum, minimum, sum, and length for us? Simple: this way makes one traversal of the list instead of several. One for loop as opposed to five.

> stats' (mx,mn,s,ss,n) x = ( max x mx
>                           , min x mn
>                           , s + x
>                           , ss + x*x
>                           , n+1 )

Now we get to the finish. We've calculated the sum and sum of squares, but not the mean or standard deviation. We compute those now, replacing the raw sums in the tuple, and the task of computing summary stats is finished.

> finish (mx,mn,s,ss,n) = (mx,mn,av,va,stdev,n)
>    where av = s/n
>          va = ss/(n-1) - n*av*av/(n-1)
>          stdev = sqrt va

One last function to define before we're done. remap takes a value in the range [mn,mx] and remaps it linearly to the range [mn',mx']; note that each range is passed as its maximum followed by its minimum.

> remap :: Double -> Double -> Double -> Double -> Double -> Double
> remap mx mn mx' mn' x = (x-mn) / (mx-mn) * (mx'-mn') + mn'
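
As a quick sanity check, a value halfway through its old range should land halfway through the new one. In a hypothetical GHCi session:

*Main> remap 10 0 0 100 5
50.0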

That simple: in a very few lines of code, and with little imported beyond Hieroglyph itself, we've created a sparkline!

Hieroglyph contains primitives for lines, splines, arcs, quads, text, and loading and placing images. These can be used as we’ve seen in this module, in static images, or they can be used in interactive, dynamic visualizations by adding the Interactive scaffolding module.
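
As a final sketch in the same spirit, here's a pure data transformation that stacks several sparklines vertically, using only the constructors shown above. The five-pixel gap is my own choice, and how a list of Primitives is handed to the renderer depends on the collection instances of Visual, which this post hasn't covered:

-- Stack one sparkline per dataset, each offset vertically by height + 5.
sparklines :: Double -> Double -> [[Double]] -> [Primitive]
sparklines width height datasets =
    [ (sparkline (Point 0 (i * (height + 5))) width height ds){ attribs = plain{ strokeRGBA = white } }
    | (i, ds) <- zip [0..] datasets ]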

Control.Monad.IfElse

Written by Jeff Heard on January 29th, 2009

There’s been some discussion on the mailing list about changing the type of Control.Monad.when.  I had a need a little while back for the old LISP anaphoric-if.  I also had a need for something that turned out to be in Control.Monad already — when (but I didn’t know it at the time).  Since then, I’ve been throwing together miscellaneous useful control flow mechanisms, and I’ve packaged them in Control.Monad.IfElse. In this post, I’ll talk a little bit more about how to use them and when.

First, there's the anaphoric if, when, and conditional: aif, awhen, and acond.  By anaphoric, we mean that the result of the conditional test is bound and made available to the clause.

awhen :: Monad m => Maybe a -> (a -> m ()) -> m ()
aif :: Monad m => Maybe a -> (a -> m b) -> m b -> m b
acond :: Monad m => [(Maybe a, a -> m ())] -> m ()

In all cases, if the value of the test is "Just a", then a gets passed into the clause.  In the case of awhen, the clause takes a and returns an m () (in my case, usually IO or Render).  In the case of aif, there is a parameterless else clause that is evaluated upon a Nothing.  And finally, acond takes a list of pairs where the first element is the value being tested and the second is the clause that is evaluated with a on a "Just a".  There are also lifted variants of these, awhenM, aifM, and acondM, where the Maybe a is encased in a monad.
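
Here's a minimal sketch of the anaphoric forms in action (the lookup table and the greet function are hypothetical, not part of the library):

import Control.Monad.IfElse (aif, awhen)

greet :: [(String, String)] -> String -> IO ()
greet db key =
    aif (lookup key db)
        (\name -> putStrLn ("Hello, " ++ name))  -- the Just value is bound to name
        (putStrLn "No such user")                -- parameterless else, run on Nothing

main :: IO ()
main = do
    let db = [("1","Ada"), ("2","Grace")]
    greet db "1"                    -- Hello, Ada
    greet db "3"                    -- No such user
    awhen (lookup "2" db) putStrLn  -- prints Grace; does nothing on Nothing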

In addition, there are a few operators that exist in the source but don't yet appear in the generated documentation.  I'll update the docs to reflect them, but in the meantime:

(>>?) is a synonym for when

and

(>>=?) is a synonym for awhen

Usage for these is:

test >>? clause

atest >>=? (\a -> clause)

In addition, there are liftM2-ed logical and and or: (&&^) and (||^).

The rest are pretty well documented in the Haddock.  There are cond and condM, which behave like an if-else-if chain (and ncond and ncondM are the unless-else counterparts).  There's a whenM counterpart to when, where the test is lifted into the same monad as the clause.  There are whileM and untilM, which repeat their clauses until the test is False (or True, in the case of untilM).  There's return', which is return, but strict in its argument.  And finally there's returning, which is a substitute for "clause >> return a".
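
A short sketch of a couple of these, assuming cond takes (Bool, clause) pairs analogously to acond above and runs the first clause whose test is True:

import Control.Monad.IfElse (cond, whenM)
import Data.IORef

classify :: Int -> IO ()
classify n = cond [ (n < 0,  putStrLn "negative")
                  , (n == 0, putStrLn "zero")
                  , (n > 0,  putStrLn "positive") ]

main :: IO ()
main = do
    classify 7                      -- positive
    ref <- newIORef (42 :: Int)
    -- whenM: the test lives in the same monad as the clause
    whenM ((> 10) `fmap` readIORef ref) $ putStrLn "big"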

That’s most of the library.  Enjoy.

Visualizing groundwater nitrate concentration

Written by Jeff Heard on January 21st, 2009

This one brings back memories.  This is a snapshot from my first ever Haskell visualization, which I sadly no longer have the full code for.  I actually used Haskell here to calculate the data points and output textures and height-fields that were then imported into POV-Ray and rendered to give you what you see here.  It was my first experience with Control.Parallel.Strategies and my first experience with arrays in Haskell.  The thing I'm particularly proud of is the "calligraphic" contour lines that I got.

The contour lines are created by first finding every point where a contour threshold is crossed, and then calculating the gradient of the heightfield at that point, perpendicular to the contour.  The flatter the slope, the wider the line; the more tilted the slope, the narrower the line.  The boundaries of the lines mark where the height has changed from the threshold by at least a meter.  The lighter side of the line shows the "upside" of the slope; the darker side of the line shows the "downside" of the slope.
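
I no longer have the original code, but a minimal sketch of that width rule, with invented names and an invented clamping range, might look like this:

-- Central-difference gradient of a heightfield sampled on a grid.
gradient :: (Int -> Int -> Double) -> Int -> Int -> (Double, Double)
gradient h x y = ( (h (x+1) y - h (x-1) y) / 2
                 , (h x (y+1) - h x (y-1)) / 2 )

-- Map slope to stroke width: flat ground gets a wide line, steep ground
-- a narrow one.  wMax at zero slope, wMin at gMax or steeper, where g is
-- the gradient magnitude, sqrt (gx*gx + gy*gy).
contourWidth :: Double -> Double -> Double -> Double -> Double
contourWidth wMin wMax gMax g = wMax - (min g gMax / gMax) * (wMax - wMin)

main :: IO ()
main = print (contourWidth 0.5 4 1 0.25)   -- 3.125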

The actual data itself is an inverse-square interpolated set of points on a regular grid; each point represents a sensor installed in a 200 square yard field in eastern North Carolina.  These sensors were measured twice a month for several years.  The sensors measure two things that are shown on this visualization:  groundwater depth and groundwater nitrate concentration.  Depth is shown by the actual surface.  Nitrate concentration is shown by the colormap.  The blues you see are where nitrates are within EPA regulations.  The yellows and reds are outside EPA regulations.  Nitrate concentrations above EPA regulations can cause “blue baby syndrome” among other health problems, and safe levels of the toxin are not known for wildlife.  Pollution is most often caused by leaky hog-waste cesspools.

Groundwater nitrate concentrations in a test field in eastern NC

Ethan Heln’s galleries of networks and data visualizations

Written by Jeff Heard on January 21st, 2009

Not his own stuff, but a sort of photo-blog on Flickr of things he's found interesting.  They're attributed, and as an eclectic collection it's pretty interesting.  He's got a good eye.

Review: Data Flow - Visualizing Information in Graphic Design

Written by Jeff Heard on January 20th, 2009

My director handed "Data Flow: Visualizing Information in Graphic Design" to me shortly before Christmas.  Being part of the visualization group, I'm always looking for new ideas in shiny, gelatin-covered books and magazines, hoping for inspiration in the form of informative, concise, and original visualization of interesting data.  I'm new to the field, relatively speaking, and so I'm still in a very active process of building my visual vocabulary.  I've got a strong eye for what I like, and what I like to imitate, though.  Tufte.  I love Tufte.

This book is about the same size, physically, as a Tufte book, and some, at least, of the visualizations follow my favorite of his principles: trust the viewer with the data (present all the data you reasonably can), do not present information that is not data, and elegance and minimalism trump flashy.  Quite a number of the visualizations don't follow these principles at all, but a number do, and for examples of those, I like this book.

What I don’t like about this book is that very little explanatory text goes with each visualization.  Now, if you’re unfamiliar with visualization, you might think that vis is supposed to be self-documenting.  That’s true up to a point in that things should be visually intuitive; you don’t use yellow to represent cold on the same graph where you use blue to represent hot, as a simple example.  But good visualizations of large datasets present so much data to the user that an extended legend is generally warranted.  This extended legend ought to include:

  • An explanation of what the dataset is.
  • An audience appropriate explanation of the domain.
  • How the dataset maps to the visualization.
  • Patterns that the researcher who commissioned the visualization has found or finds interesting.

Especially for this book, which includes hundreds of different visualizations from different domains, such legends would be helpful. I’m left as a reader without context, and while I can look at any page and think to myself, “That’s sure shiny,” I can’t look at visual elements and judge for myself whether a particular visual element was the “right” one for the data shown.  I’m not trying to review others’ work so much as garner for myself a visual vocabulary.  Here, I feel like I’m getting the words, but not the definitions.  A sort of Scrabble dictionary for data.

The other issue I have with the book is that it includes, along with some very good visualizations, a number of infographics which seem to be designed to persuade rather than inform.  These visualizations feel more informative than they are, and lead the reader to the conclusions that the graphic designer or her commissioner has already come to, rather than giving them the chance to see things for themselves.

Yes, I think there is value in being visually striking, but, for instance, the graphic on page 155, "An Underground Disco Information Organization", gives me very little data with an excessive amount of context, which is itself presented like evidence.  Graphicdotal evidence, like anecdotal evidence, tells a convincing story, but it often occludes raw data, methods, and significant results.

So, my favorites from this book:

  1. Pie, p. 38 - Graphing the mood of movies by analyzing frames for inherent color and plotting it on circles.  Ingenious.
  2. Data Visualization of a social network, pp. 60-64 - Different views of a social network show different connections.  I especially like the combination of the radial layout with the US map.
  3. War Report, p. 246 - Here, despite the sensitivity of the subject, the designer has chosen to leave his own presentation simple enough to let the data tell its own story.  The story that good raw data shown clearly tells is always more powerful than half-cocked data shown with graphicdotal markers.
  4. Sunlight Calendar, p. 202 - A complete calendar of the average sky hue at different times of day on each day of the year.
  5. Atlas of the North Sea, p. 105 - Complex matrix visualization of the North Sea's geography and significant environmental and geological factors.
  6. Silvertown Affect Map, p. 88 - Deceptively simple, and an elegant use of the data available to them.
  7. Cyberwar, pp. 94-95 - Graphing a botnet attack.
  8. Literary Organism, p. 78 - A graph drawing of Jack Kerouac's On the Road with a brilliant legend and visually appealing concept.
  9. The Shape of Globalization, p. 35 - Embedded circles to represent nested companies.
  10. Total Interaction, p. 48 - Combining pie charts with Florence Nightingale charts and anchored bubbles for a visually complex but informative graph.

So my review of this book is mixed.  Go to it for visually inspiring design and appreciation of how much data can be put onto a page at once without turning into noise.  Be wary of graphicdotal elements and techniques that are more persuasive than they are informative.  And write the authors and tell them to tell us more about the visualizations they collected!

A random note on programming with Gtk2Hs.

Written by Jeff Heard on January 19th, 2009

Gtk has quite a few conventions that are entirely unintuitive and make for some very interesting bugs in your app if you're not aware of them.  For instance, calling a redraw function inside an event handler is wrong.  You wouldn't know that, because no-one really says it anywhere, but say you have a Gtk window that holds an OpenGL scene or a Cairo scene.  If you call your render function at the end of a Gtk mouse button handler, you stand a small chance, especially if your app is multithreaded, of the whole thing coming crashing down with an unintelligible error.  Instead, you should do something like this:

-- Assumes Graphics.UI.Gtk imported qualified as Gtk and fromJust from
-- Data.Maybe; UIState, setMousePosition, drawing, sizeX, and sizeY come
-- from the surrounding application.
mouseMotionHandler :: UIState a => MVar a -> Gtk.Event -> IO Bool
mouseMotionHandler uistate_mvar event = do
    -- record the new mouse position in the application state
    state <- takeMVar uistate_mvar
    putMVar uistate_mvar
        $ setMousePosition (Point (Gtk.eventX event) (Gtk.eventY event)) state
    dwin <- Gtk.widgetGetDrawWindow . fromJust . drawing $ state
    Gtk.drawWindowGetPointer dwin
    -- don't render here: mark the window dirty and let Gtk send an expose event
    Gtk.drawWindowInvalidateRect dwin (Gtk.Rectangle 0 0 (round . sizeX $ state) (round . sizeY $ state)) False
    return False

The important code here is in the next to last line, Gtk.drawWindowInvalidateRect.  This code tells the Gtk main loop that the window area in dwin is dirty and that an “expose” event needs to occur.  The sizeX and sizeY of the rectangle should be the width and height of the window.  The dwin is obtained by calling Gtk.widgetGetDrawWindow on any widget with a drawing window (this is a Window, a DrawingArea, or a GLDrawingArea). Then your expose handler will look like this:

Gtk.onExpose canvas $ \evt -> renderer state scene >> return True

where renderer is the name of the function that actually draws your scene on the drawing window.  Once you've set up your event handlers like this, you can switch to a multithreaded app, and at least your event handlers won't cause your program to die.
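
For completeness, here's a rough sketch of how the pieces might be wired together with the classic Gtk2Hs connect functions; the render call and the motion-handler hookup are stand-ins for your own:

import qualified Graphics.UI.Gtk as Gtk

main :: IO ()
main = do
    _ <- Gtk.initGUI
    win    <- Gtk.windowNew
    canvas <- Gtk.drawingAreaNew
    Gtk.containerAdd win canvas
    Gtk.onDestroy win Gtk.mainQuit
    -- draw only in response to expose events
    Gtk.onExpose canvas $ \_ -> putStrLn "render here" >> return True
    -- handlers like mouseMotionHandler above would be connected here, e.g.:
    -- Gtk.onMotionNotify canvas True (mouseMotionHandler uistate)
    Gtk.widgetShowAll win
    Gtk.mainGUI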

The Docuverse.

Written by Jeff Heard on January 19th, 2009

Diagram of the Docuverse

Updated link to source code and a win32 executable

1.5 million documents on the screen at once.  That, despite what it looks like, is what you're looking at.   Click through for a larger picture, although I have to say that it's at its awesomest on the Renci VisWall, our 9′x16′, 16-projector display wall.  Yes, this was also done in Haskell, and fortunately I still have the code for the prototype.  The idea here is that I use standard formulae from the text retrieval world's vector space model to position all the documents in a collection in relation to a series of queries.

Each document is represented by a single star on the image, and each query makes up the core of one of the galaxies that form.  Text retrieval often uses a linear combination of several factors in its model of "relevance", and these, instead of being combined into a single score, are here plotted along the axes of a cylindrical coordinate system.  In this docuverse diagram, the radius is what's known as TFxIDF, the dot product of the query vector and the document vector.  The angle around the core is dictated by the cosine of the angle between the query vector and the document vector.  The height off the zero plane is the standard score of the length of the document (Z score).
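
The prototype's code isn't public, but a minimal sketch of that mapping, assuming the cosine is scaled onto a full turn, might look like:

-- Place one document star in the local frame of one query galaxy.
-- tfidf: dot product of query and document vectors (radius)
-- cosSim: cosine similarity, assumed to lie in [0,1] (angle)
-- zscore: standard score of the document length (height)
place :: Double -> Double -> Double -> (Double, Double, Double)
place tfidf cosSim zscore = (r * cos theta, r * sin theta, zscore)
  where r     = tfidf
        theta = 2 * pi * cosSim   -- assumption: spread the cosine around a full turn

main :: IO ()
main = print (place 0.8 0.25 1.5)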

In this diagram, the queries are positioned randomly, but in the future I'd like to have them positioned via multidimensional scaling with respect to common document terms, so a user can immediately see how similar they are.  The docuverse can also be used to represent document clusters, where a cluster centroid, rather than a query, is the core of the galaxy.

The wonderful thing about this is that it's interactive.  You can zoom around in the universe and get up close to the galaxies, although as of yet I haven't implemented navigating to documents by clicking on stars, or level-of-detail handling so that, up close, individual stars are labeled with the names of their documents.  This is achieved in Haskell using HOpenGL, Data.Array.UArray, and OpenGL vertex arrays.

As a useful visualization, it gives you two things.  First, given a set of queries, it characterizes how relevant the collection is to that set as a whole versus another set of queries (which may well be topics instead of standard web queries).  Second, it gives you a sense of how textually similar documents are to each other.  If you have a small subset of documents that you know are relevant to you, you can (conceivably, although this is not implemented in the prototype) highlight them in the docuverse and then select subsets of documents to be output for later analysis.  In high-recall applications, I think this kind of collection browsing will be essential for larger document collections.

I've been able to spin this up with a 6 million document collection and have it retain some level of interactivity.  The collection you see pictured here is NIST's wt10g collection, a 10 gigabyte web-crawl done several years ago for benchmarking search engines in the Text Retrieval Conference (TREC).  One day, I'll have time to work on it again, and I'll release more than just a prototype and be able to open-source the code without shame.

ProteinVis: Visualizing a large tree in Haskell and OpenGL

Written by Jeff Heard on January 18th, 2009

The following is a screenshot of ProteinVis, a tree viewer for hierarchical clusterings of proteins in the human genome.  I did this project a while back with a colleague, Xiaojun Guan, and a professor of biology at UNC Chapel Hill, Dr. William Kaufman.  Clicking on a node in the tree builds a set of proteins that share Gene Ontology terms (actually, these are lazily computed from sets declared to be part of the tree at load time).

ProteinVis: A viewer for clusterings of human proteins

Click for a larger view

The basic idea is to look at the Gene Ontology terms that have been assigned to individual proteins in the context of other similar proteins, to see if the Gene Ontology terms make sense in context.  This was one of my first visualizations in OpenGL using Haskell, and so the code isn't what I'd call pretty.  In any case, I'm releasing the source code here under Renci's open source license, which is included in the source distribution.  I don't really expect it to be of much interest, though, except to see what I did to get the visualization up and running.  These days I'm writing much better code.

ProteinVis, for win32 environments

ProteinVis, for Linux x86_64 environments

ProteinVis darcs repository

Simple Futures in Haskell

Written by Jeff Heard on January 17th, 2009

One of those things I have to do fairly often in multithreaded programming is send off a whole bunch of threads to do their thing while I do something else on the main thread until they're done.  For example, imagine you're downloading a bunch of images from the web.  You don't want to httpGet one image right after another, because network resources are slow and processing them takes up almost no CPU time.  But on the other hand, forkIO doesn't return the thread's result, so a thread's thunk has to put its contents somewhere you can access them later.  Thus my short, simple solution, far too small to bother putting up on Hackage:

module Control.Concurrent.Future where

import Control.Concurrent

future :: IO a -> IO (MVar a)
future thunk = do
    ref <- newEmptyMVar
    forkIO $ thunk >>= putMVar ref
    return ref

forceAll :: [MVar a] -> IO [a]
forceAll = mapM takeMVar

To use this, simply do a mapM (future . thunk) parms, where thunk :: a -> IO b builds the action for one parameter set and parms :: [a] is the list of parameter sets (one item per thread).  This will spark off a whole bunch of threads that will do their thing while you continue along in the program.  Then, when you need all the data, forceAll (also known as a "barrier" in multithreaded programming lingo) takes a list of these asynchronous results and forces the program to wait for them all to complete.  If you need all the results immediately after you spark the threads, then the following simple line will do:

stuff' <- mapM future stuff >>= forceAll
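
Here's a minimal, self-contained sketch of the pattern (the worker is a stand-in; substitute your image fetch):

import Control.Concurrent (threadDelay)
import Control.Concurrent.Future

-- Stand-in for a slow network action, e.g. fetching one image.
fetch :: Int -> IO Int
fetch n = threadDelay 100000 >> return (n * n)

main :: IO ()
main = do
    pending <- mapM (future . fetch) [1..10]  -- one thread per parameter set
    putStrLn "main thread keeps working..."
    results <- forceAll pending               -- barrier: wait for every thread
    print results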

iBiblio traffic, search engine hits, and cross-traffic

Written by Jeff Heard on January 17th, 2009

Here's a visualization concept I came up with a while back to look at the way search engines and word-of-mouth affect hit frequency in the iBiblio web-traffic logs.  iBiblio consists of around 420 sites.   Each one of the circles you see represents one of the websites.  The size of each pie slice inside grows with the number of hits by individual search engines (see the legend for which ones).  The size of the circle grows with the overall number of hits by people other than search engines.  Hits are counted by the number of unique incoming IP addresses per day.  Links get drawn between cliques of websites where more than 1/4th of the unique IP addresses are the same on that day, meaning, more or less, that those sites often share traffic.
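
As a sketch of that linking rule (my reconstruction, not the original code; in particular, which site's visitor count the quarter is measured against is an assumption):

import qualified Data.Set as Set

-- One day's unique visitor IPs for a site.
type Visitors = Set.Set String

-- Link two sites when more than 1/4 of the smaller site's unique IPs
-- also appear at the other site that day.
sharesTraffic :: Visitors -> Visitors -> Bool
sharesTraffic a b = 4 * overlap > min (Set.size a) (Set.size b)
  where overlap = Set.size (Set.intersection a b)

main :: IO ()
main = do
    let a = Set.fromList ["1.2.3.4", "5.6.7.8", "9.9.9.9", "8.8.8.8"]
        b = Set.fromList ["5.6.7.8", "9.9.9.9", "7.7.7.7"]
    print (sharesTraffic a b)   -- True: 2 of b's 3 IPs also hit a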

This viz was developed entirely using Haskell and Cairo to crunch the weblogs and draw the data.  The total amount of data was around 10TB (yes, terabytes), and the visualization took about a day to process into a static animation.  Note that both images are size-compressed; the original is meant to run on a wall-sized display (16′x9′) or on our specialized visualization dome.  If you click on the image, you can see the original size.

A month in the life of iBiblio

A day in the life of iBiblio