snippets

...now browsing by tag

 
 

Snippet: sorting the elements of a map

Thursday, February 5th, 2009

So in an application where you have a lot of data columns indexed by name, it’s often necessary to sort that data based on one or more of the columns.  You might be tempted to do this the straightforward way of

  1. Extract the data from the map using toList
  2. unzip and transpose to create rows instead of columns
  3. sort the resultant list using Data.List.sort or Data.List.SortBy
  4. transpose, zip, and put the data into a new map.

Now that’s a reasonable way to do things, but it raises one minor issue, which might not be so minor if you’ve got a really large dataset.  It’s “head-strict”.  The second you access one of the elements of the map, the entire sort procedure takes place for all the data in the map.  If you’re only needing to access a column or two of a large multicolumn dataset once it’s sorted and on top of that, are only needing a few elements of those columns, this is SERIOUSLY overkill.  Here’s a better way:

module MapSort where

import qualified Data.Map as Map
import qualified Data.List as List
import qualified Data.IntMap as IntMap

sort mp key = Map.map reorder mp
    where sort' vs = List.sort . zip vs $ indices
          ordering = map snd . sort' $ (mp Map.! key)
          indices = enumFrom (0::Int)
          reorder vs = IntMap.toAscList . IntMap.fromList $ zip ordering vs

This solution is head-strict per column instead of for the whole map.  How did we achieve that?  We created an ordering out of the sort key column and then applied that ordering on each one of the lists stored in the map individually.  Simple and efficient.  There are probably even more efficiencies to be had here, but this one works for me quite well.  From here, it should be straightforward to implement sortBy and even use a list of keys instead of a single key as the sorting key for the map.

Simple Futures in Haskell

Saturday, January 17th, 2009

One of those things I have to do fairly often in multithreaded programming is send off a whole bunch of threads to do their thing while I do something else on the main thread until they’re done. For example, imagine you’re downloading a bunch of images from the web, you don’t want to call httpGet one image right after another, because network resources are slow and processing them takes up almost no CPU time.  But on the other hand, forkIO doesn’t return anything, so a thread thunk will have to put its contents somewhere you can access them later.  Thus, my short, simple solution, far too small to bother putting up on Hackage:

module Control.Concurrent.Future where

import Control.Concurrent

future :: IO a -> IO (MVar a)
future thunk = do
    ref <- newEmptyMVar
    forkIO $ thunk >>= putMVar ref
    return ref

forceAll :: [MVar a] -> IO [a]
forceAll = mapM takeMVar

To use this, simply do a mapM (future thunk) parms, where thunk corresponds to (thunk :: IO a -> IO (MVar a)) and parms is the list of parameter sets (one item per thread), an [a].  This will spark off a whole bunch of threads that will do their thing while you continue along in the program.  Then when you need all the data, forceAll (also known as a “barrier” in multithreaded programming lingo) will take a list of asynchronous actions and force the program to wait for them all to complete.  If you need all the results immediately after you spark the threads, then the following simple line will do:

stuff' <- mapM future stuff >>= forceAll