1.5 million documents on the screen at once. That, despite what it looks like, is what you’re looking at. Click through for a larger picture, although I have to say that it’s at its awesomest on the Renci VisWall, our 9′x16′, 16-projector display wall. Yes, this was also done in Haskell, and fortunately I still have the code for the prototype. The idea here is that I use standard formulae from the text retrieval world’s vector space model to position all the documents in a collection in relation to a series of queries.
Each document is represented by a single star on the image, and each query makes up the core of one of the galaxies that form. Text retrieval often uses a linear combination of several factors in its model of “relevance”; here, instead of being combined into a single score, those factors are plotted along the axes of a cylindrical coordinate system. In this docuverse diagram, the radius is what’s known as TFxIDF, or the dot product of the query vector and the document vector. The angle around the core is dictated by the cosine of the angle between the query vector and the document vector. The height off the zero plane is the standard score (Z score) of the length of the document.
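To make the three coordinates concrete, here is a minimal sketch of the mapping in Haskell. It is not the prototype's code: the names (`TermVec`, `place`, `toCartesian`) are made up for illustration, the vectors are assumed to already carry TFxIDF weights, and scaling the cosine to a full turn is my guess at one reasonable way to turn a cosine into an angle around the core.

```haskell
import qualified Data.Map.Strict as M

-- Sparse term vector: term -> weight (assumed to be TFxIDF-weighted already).
type TermVec = M.Map String Double

-- Dot product over the terms the two vectors share.
dot :: TermVec -> TermVec -> Double
dot a b = sum (M.elems (M.intersectionWith (*) a b))

-- Euclidean length of a term vector.
norm :: TermVec -> Double
norm v = sqrt (dot v v)

-- Cosine of the angle between two vectors; 0 if either vector is empty.
cosine :: TermVec -> TermVec -> Double
cosine a b
  | na == 0 || nb == 0 = 0
  | otherwise          = dot a b / (na * nb)
  where
    na = norm a
    nb = norm b

-- Standard score of a document length, given the collection mean and
-- standard deviation; 0 if the collection has no spread.
zScore :: Double -> Double -> Double -> Double
zScore mean sd len
  | sd == 0   = 0
  | otherwise = (len - mean) / sd

-- A star's position relative to its galaxy core.
data Cyl = Cyl { radius :: Double, theta :: Double, height :: Double }
  deriving (Show, Eq)

-- Radius = dot product, height = Z score of document length.
-- ASSUMPTION: the angle is the cosine scaled to a full turn.
place :: Double -> Double -> TermVec -> TermVec -> Double -> Cyl
place mean sd query doc len =
  Cyl (dot query doc) (2 * pi * cosine query doc) (zScore mean sd len)

-- Convert to x/y/z offsets from the galaxy core for drawing.
toCartesian :: Cyl -> (Double, Double, Double)
toCartesian (Cyl r t h) = (r * cos t, r * sin t, h)
```

Adding the query's own position to the result of `toCartesian` would put each star in its galaxy.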
In this diagram, the queries are positioned randomly, but in the future I’d like to have them positioned via multidimensional scaling with respect to common document terms, so a user can immediately see how similar they are. The docuverse can also be used to represent document clusters, where the centroid is the core of the galaxy instead of a query.
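For the cluster variant, the centroid is just the term-wise mean of the cluster's document vectors. A small sketch, again with hypothetical names (`TermVec`, `centroid`) and assuming the same sparse term-to-weight representation:

```haskell
import qualified Data.Map.Strict as M

-- Sparse term vector: term -> weight.
type TermVec = M.Map String Double

-- Term-wise mean of a cluster's vectors. A term missing from a document
-- counts as weight 0, so we sum what is present and divide by cluster size.
centroid :: [TermVec] -> TermVec
centroid [] = M.empty
centroid vs = M.map (/ n) (M.unionsWith (+) vs)
  where
    n = fromIntegral (length vs)
```

The resulting vector can stand in for the query vector when placing that cluster's documents.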
The wonderful thing about this is that it’s interactive. You can zoom around in the universe and get up close to galaxies, although as of yet I haven’t implemented navigating to a document by clicking on its star, or level-of-detail handling so that individual stars are labeled with their document names when you get close up. This is achieved in Haskell using HOpenGL, unboxed arrays (UArray, from Data.Array.Unboxed), and OpenGL vertex arrays.
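The reason unboxed arrays matter here is that a vertex array wants one flat, contiguous run of coordinates rather than a list of boxed tuples. The sketch below shows only that packing step with `UArray`. The function names are hypothetical, and the sketch deliberately leaves out the HOpenGL calls, since how the prototype actually hands the data to OpenGL isn't described here.

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray, (!))

-- Flatten star positions into x0,y0,z0,x1,y1,z1,... in one unboxed array,
-- the layout a vertex array of 3-component floats expects.
packVertices :: [(Float, Float, Float)] -> UArray Int Float
packVertices ps =
  listArray (0, 3 * length ps - 1) (concat [[x, y, z] | (x, y, z) <- ps])

-- Number of vertices stored in a packed array.
vertexCount :: UArray Int Float -> Int
vertexCount arr = (hi - lo + 1) `div` 3
  where
    (lo, hi) = bounds arr

-- Read vertex i back out of the packed array.
vertexAt :: UArray Int Float -> Int -> (Float, Float, Float)
vertexAt arr i = (arr ! (3 * i), arr ! (3 * i + 1), arr ! (3 * i + 2))
```

At 1.5 million stars this comes to 4.5 million floats in a single block, which is the kind of layout that lets the whole collection be drawn in one call instead of one call per star.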
As a useful visualization, it gives you two things. One thing it gives you, given a set of queries, is a characterization of how relevant the collection is to your set of queries as a whole versus another set of queries (which may well be topics instead of standard web-queries). The other thing it gives you is a sense of how similar documents are textually to each other. If you have a small subset of documents that you know are relevant to you, you can (conceivably, although this is not implemented in the prototype) highlight them on the docuverse and then select subsets of documents to be output for analysis later. In high-recall applications, this kind of collection browsing, I think, will be essential in larger document collections.
I’ve been able to spin this up with a 6 million document collection and have it retain some level of interactivity. The collection you see pictured here is NIST’s wt10g collection, a 10 gigabyte web-crawl done several years ago for benchmarking search engines in the Text Retrieval Conference (TREC). One day, I’ll have time to work on it again, and I’ll release more than just a prototype and be able to open-source the code without shame.