infoviz browsing by tag


Plotting timeseries in space filling curves

Monday, August 24th, 2009

Fitting many timeseries into the same area and formatting them for quick comparison is a challenging and important problem.  For example, say that you want to pick stocks in the S&P 500 that perform similarly or that respond similarly to certain numerical transformations (say they have the same trend lines or the same noise frequency).  Line graphs are inappropriate for this because they’re too visually noisy and because they are easy to compare only when stacked vertically, not when placed side by side, which limits the number of graphs that fit on the page at once.

For certain comparisons, we are less interested in the individual values of a timeseries or its trend over time than in how it relates to other timeseries in the same dataset.  For this purpose, over the weekend I created these:


These are timeseries simply plotted along a Hilbert curve.  The 8 you see here are the first 8 stocks of the S&P 500 in alphabetical order, plotted with one segment per day for all trading days in the last decade.  Blues are lower values.  Oranges are higher values.  Each series has been normalized to its own minimum and maximum price, so the shape of the curve matters more than the absolute highs or lows.
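To make the layout concrete, here is a minimal sketch of the idea in Python. It is not the Haskell and Hieroglyph code used for the glyphs above. The index-to-coordinate routine is the standard iterative Hilbert mapping; the function names, the choice of grid size, and the exact blue and orange RGB values are assumptions made for illustration.

```python
def hilbert_d2xy(order, d):
    """Map position d along a Hilbert curve to (x, y) on a 2**order square grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so consecutive cells stay adjacent.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def normalize(series):
    """Rescale a non-empty series to [0, 1] using its own minimum and maximum."""
    lo, hi = min(series), max(series)
    span = hi - lo
    return [0.5 if span == 0 else (v - lo) / span for v in series]


# Assumed endpoint colours; the original palette is not specified beyond
# "blues are lower, oranges are higher".
BLUE, ORANGE = (31, 119, 180), (255, 127, 14)


def colour(v):
    """Linearly blend from BLUE (v = 0) to ORANGE (v = 1)."""
    return tuple(round(b + (o - b) * v) for b, o in zip(BLUE, ORANGE))


def hilbert_glyph(series):
    """Return [((x, y), normalized_value), ...], one cell per observation."""
    order = 1
    while (1 << order) ** 2 < len(series):  # smallest grid that fits the series
        order += 1
    return [(hilbert_d2xy(order, i), v) for i, v in enumerate(normalize(series))]
```

With one observation per trading day, a decade of prices (roughly 2,500 points) lands on a 64x64 grid. Because the curve keeps neighbouring days in neighbouring cells, a sustained rise or fall shows up as a contiguous patch of colour rather than a scattered one.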

In this figure, we see two glyphs that are similar: the first and the next to last.  These correspond to Agilent and Adobe, respectively.  Let’s look at their 5 year charts to see how similar they in fact are:


Adobe (ADBE). Source: Google Finance

Agilent (A). Source: Google Finance

Despite some surface differences, you’ll note that the shapes of the graphs are indeed similar: when Agilent goes up, Adobe tends to go up, and when Agilent goes down, Adobe tends to as well.  Now, these two companies are unrelated in terms of market sector, so can we really say that these similarities matter?  If it were a short trend, maybe not, but this is a 5 year graph that we’re comparing here, and they’re remarkably similar to each other.  It’s worth delving deeper into.  They could be majority held by the same people, they could track each other for reasons that aren’t immediately obvious, they could have many of the same people on their boards of directors, or it could in the end be complete coincidence (but paradoxically predictive coincidence).

Here, by the way, for comparison is Apple, the third glyph from the Hilbert graph, shown as an area chart on the same timescale as A and ADBE:


Apple (AAPL). Source: Google Finance

One might expect, considering that the majority of Adobe’s projects run on Apple computers and a majority of Apple computers run Adobe products, that when Apple does well, Adobe does well, and vice versa.  We can see that this is not the case.  Just as the two Hilbert glyphs look unrelated, the performance of AAPL and that of ADBE are unrelated.

Note also that I managed to pack 8 timeseries into less total space using the Hilbert glyphs than the Google Finance graphs take up.

And yes, by the way, this was written in Haskell using Hieroglyph.  I don’t have the code on me right now, but I’ll post it later on today when I’m in front of my personal computer.

The Docuverse.

Monday, January 19th, 2009

Diagram of the Docuverse

Updated link to source code and a win32 executable

1.5 million documents on the screen at once.  That, despite what it looks like, is what you’re looking at.   Click through for a larger picture, although I have to say that it’s at its awesomest on the Renci VisWall, our 9′x16′, 16 projector display wall.  Yes, this was also done in Haskell, and fortunately I still have the code for the prototype.  The idea here is that I use standard formulae from the text retrieval world’s vector space model to position all the documents in a collection in relation to a series of queries.

Each document is represented by a single star on the image, and each query makes up the core of one of the galaxies that form.  Text retrieval often uses a linear combination of several factors in its model of “relevance”.  Instead of being combined into a single score, these factors are plotted along the axes of a cylindrical coordinate system.  In this docuverse diagram, the radius is what’s known as TFxIDF, or the dot product of the query vector and the document vector.  The angle around the core is dictated by the cosine similarity between the query vector and the document vector.  The height off the zero plane is the standard score (Z score) of the length of the document.
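The mapping described above can be sketched roughly as follows. This is Python written for illustration, not the Haskell prototype. It assumes documents arrive as lists of tokens, that term weights are raw term frequency times log inverse document frequency, and that the cosine is scaled linearly onto a full turn; the text does not say which TFxIDF variant or which cosine-to-angle mapping the prototype uses, and all function names here are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (sparse dict vectors, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / c) for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def star_positions(docs, query):
    """Return one (radius, angle, height) triple per document for one query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    lengths = [len(d) for d in docs]
    mean = sum(lengths) / len(lengths)
    std = math.sqrt(sum((n - mean) ** 2 for n in lengths) / len(lengths))
    stars = []
    for vec, length in zip(vectors, lengths):
        radius = dot(q, vec)                    # TFxIDF score
        angle = 2 * math.pi * cosine(q, vec)    # assumed cosine-to-angle mapping
        height = (length - mean) / std if std else 0.0  # Z score of length
        stars.append((radius, angle, height))
    return stars


def to_cartesian(radius, angle, height):
    """Convert cylindrical coordinates to (x, y, z) relative to the galaxy core."""
    return radius * math.cos(angle), radius * math.sin(angle), height
```

Adding the core's own position to each `to_cartesian` result would place a galaxy in the shared scene, one galaxy per query.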

In this diagram, the queries are positioned randomly, but in the future I’d like to have them positioned via multidimensional scaling with respect to common document terms, so a user can immediately see how similar they are.  The docuverse can also be used to represent document clusters, where the centroid is the core of the galaxy instead of a query.

The wonderful thing about this is that it’s interactive.  You can zoom around in the universe and get up close to galaxies, although as of yet I haven’t implemented navigating to documents by clicking on stars, or handling level of detail so that, close up, individual stars are labeled with the names of their documents.  This is achieved in Haskell using HOpenGL, Data.Array.UArray, and OpenGL Vertex Arrays.

As a useful visualization, it gives you two things.  One thing it gives you, given a set of queries, is a characterization of how relevant the collection is to your set of queries as a whole versus another set of queries (which may well be topics instead of standard web-queries).  The other thing it gives you is a sense of how similar documents are textually to each other.  If you have a small subset of documents that you know are relevant to you, you can (conceivably, although this is not implemented in the prototype) highlight them on the docuverse and then select subsets of documents to be output for analysis later.  In high-recall applications, this kind of collection browsing, I think, will be essential in larger document collections.

I’ve been able to spin this up with a 6 million document collection and have it retain some level of interactivity.  The collection you see pictured here is NIST’s wt10g collection, a 10 gigabyte web-crawl done several years ago for benchmarking search engines in the Text Retrieval Conference (TREC).  One day, I’ll have time to work on it again, and I’ll release more than just a prototype and be able to open source the code without shame.

ProteinVis: Visualizing a large tree in Haskell and OpenGL

Sunday, January 18th, 2009

The following is a screenshot of ProteinVis, a tree viewer for hierarchical clusterings of proteins in the human genome.  I did this project a while back with a colleague, Xiaojun Guan, and a professor of biology at UNC Chapel Hill, Dr William Kaufman.  Clicking on a node in the tree builds a set of proteins that share Gene Ontology terms (actually, these sets are lazily computed from sets that are declared to be part of the tree at load time).

ProteinVis: A viewer for clusterings of human proteins

The basic idea is to look at the Gene Ontology terms that have been assigned to individual proteins alongside other similar proteins, to see if the Gene Ontology terms make sense in context.  This was one of my first visualizations in OpenGL using Haskell, and so the code isn’t what I’d call pretty.  In any case, I’m releasing the source code here under Renci’s open source license, which is included in the source distribution.  I don’t really expect it to be of much interest, though, except to see what I did to get the visualization up and running.  These days I’m writing much better code.

ProteinVis, for win32 environments

ProteinVis, for Linux x86_64 environments

ProteinVis darcs repository

iBiblio traffic, search engine hits, and cross-traffic

Saturday, January 17th, 2009

Here’s a visualization concept I came up with a while back to look at the way search engines and word-of-mouth affect hit frequency in the iBiblio web-traffic log.  iBiblio consists of around 420 sites.  Each one of the circles you see represents one of the websites.  The size of each pie slice inside grows with the number of hits from individual search engines (see the legend for which ones).  The size of the circle grows with the overall number of hits from visitors other than search engines.  Hits are counted by number of unique incoming IP addresses per day.  Links get drawn between cliques of websites where more than 1/4th of the unique IP addresses are the same on that day, meaning, more or less, that those sites often share traffic.
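The counting and linking rules above can be sketched as follows. This is illustrative Python, not the Haskell that processed the real logs. It assumes the log has already been reduced to (day, site, ip) records with search-engine hits filtered out, and it measures the shared fraction against the smaller of the two sites' visitor sets, since the text does not say which denominator was used. It also links pairs of sites only; grouping linked sites into cliques is left out.

```python
from collections import defaultdict
from itertools import combinations


def unique_visitors(hits):
    """hits: iterable of (day, site, ip). Returns {(day, site): set of IPs}."""
    visitors = defaultdict(set)
    for day, site, ip in hits:
        visitors[(day, site)].add(ip)  # repeat hits from one IP count once
    return visitors


def shared_traffic_links(visitors, day, threshold=0.25):
    """Return pairs of sites whose unique visitors overlap by more than threshold."""
    sites = sorted(site for d, site in visitors if d == day)
    links = []
    for a, b in combinations(sites, 2):
        ips_a, ips_b = visitors[(day, a)], visitors[(day, b)]
        shared = len(ips_a & ips_b)
        # Assumed denominator: the smaller of the two visitor sets.
        if shared > threshold * min(len(ips_a), len(ips_b)):
            links.append((a, b))
    return links
```

In this sketch the circle size for a site on a given day would be driven by `len(visitors[(day, site)])`, and the same counting applied to search-engine hits, grouped by engine, would give the pie slices.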

This viz was developed entirely using Haskell and Cairo to crunch the weblogs and draw the data.  The total amount of data was around 10TB (yes, terabytes), and the visualization took about a day to process into a static animation.  Note that both images here are size-compressed.  The original is meant to run on a wall-sized (16′x9′) display or on our specialized visualization dome.  If you click on the image, you can see the original size.

A month in the life of iBiblio

A day in the life of iBiblio

A day in the life of iBiblio