Haskell/Libraries/Data structures primer

This chapter introduces some of the data structures available from the libraries. We will focus on common and particularly useful examples that every Haskeller should learn about.

Trade-offs edit

This chapter continually emphasizes shortcomings with lists, but that does not mean you should quit using them! Lists are the default data structure in Haskell for good reasons: beyond their simplicity, lists have a pretty high power-to-weight ratio in a lazy, purely functional setting. Laziness makes it possible to use lists as streams where we sequentially process elements that are generated on demand. That process allows functions such as map, filter, foldr, takeWhile, and zipWith to work as effective replacements for many common uses of iterative control structures.

As powerful as they may be, lists are better suited to patterns like streaming and iteration control rather than simple data storage and retrieval. Of course, switching to a different data structure involves trade-offs. There will be advantages and disadvantages to any data structure, and the right choice depends on the problem at hand.

Lookups: Data.Map and co. edit

First, we'll consider a common problem: performing lookups on a data structure. Given a collection of associations between keys and values, we may want to retrieve the value, if any, corresponding to some key. We could simply store the associations as a list of pairs, [(k, v)]. Indeed, Prelude contains lookup :: Eq k => k -> [(k, v)] -> Maybe v, but to find a value from within a list, we have to go through all pairs in the list, testing the keys for equality until we reach either the one we are looking for (or exhaust the list). A lookup in a plain association list is an O(n) operation, as the expected number of steps needed to perform it grows proportionally to the length of the list. It is easy to see how that can become a problem when there are a lot of associations.

We can achieve better lookups by switching to a more appropriate data structure. The Map type provided by Data.Map from the containers package is a fine general-purpose choice. Note that Data.Map is generally imported qualified, to avert name clashes with Prelude functions.

GHCi> import qualified Data.Map as M
GHCi> :t M.empty
M.empty :: M.Map k a

In a Map, keys and values are arranged in a (size balanced, binary) tree. That tree form makes looking for a key work by simply going down a particular branch of the tree. Tree manipulation, however, happens entirely behind the scenes. Map, like many other data structures from the libraries, is used as an abstract type through an interface with no mention of the tree implementation backing it. In particular, constructors are not exported: a new Map is built e.g. by either inserting associations into an empty map or by using the utility function fromList:

GHCi> let foo = M.fromList [(1, "Robert"), (5, "Ian"), (6, "Bruce")]
GHCi> :t foo
foo :: M.Map Integer [Char]

The Data.Map interface provides O(log n) lookups...

GHCi> :t M.lookup
M.lookup :: Ord k => k -> M.Map k a -> Maybe a
GHCi> M.lookup 5 foo
Just "Ian"
GHCi> M.lookup 7 foo
Nothing

...as well as scores of other useful operations — unions, intersections, deletions and so forth. Instances for a handful of important type classes such as Functor are available as well.

GHCi> M.size $ M.union foo $ M.fromList [(11, "Andrew"), (17, "Mike")]
5
GHCi> fmap reverse foo
fromList [(1,"treboR"),(5,"naI"),(6,"ecurB")]

Variations edit

Other modules providing map and map-like data structures worth knowing about include:

  • Data.IntMap from containers provides a more efficient map implementation limited to Int keys.
  • Data.Set, also in containers, provides a set implementation. Sets are appropriate when the interesting operation is simply finding whether a value is in a collection, rather than retrieving a value given a key. They are a lot like a map in which only the keys matter, and so many considerations about performance and implementations apply to both sets and maps.
  • The unordered-containers package provides hash maps and sets. They offer efficiency gains (such as almost constant-time lookups) without limiting the type of the key to Int, at the cost of a comparatively limited interface and the loss of the ordering guarantees of the container tree-based maps.

Peeking at both ends with Data.Sequence edit

One of the peculiarities of lists is that they are asymmetric. Given that we construct and deconstruct lists by their head with `(:)`, operations at the head are more efficient than the corresponding operations at the tail. For instance, while prepending an element with `(:)` takes constant time, naïvely appending a single element with \xs x -> xs ++ [x] takes time proportional to the length of xs. That means building up a list by repeatedly appending will take quadratic time in the number of elements, which is really bad.

When lots of operations at the middle or at the tail have to be done, an excellent list-like alternative to lists are sequences, as provided by the Data.Sequence module, which is also part of containers. Sequences and lists are quite different from one another, even though many of the familiar list functions reappear in some guise in Data.Sequence. While lists are lazy and can be infinite, sequences are finite and strict. The trade-off which makes sequences useful is that, at the cost of some overhead with respect to lists, many operations which were troublesome with lists perform much better. Remarkably, we get both appending and prepending in constant time, length in constant time and also concatenation and random access in logarithmic time. All of that is available through a pleasant, purely functional interface.

GHCi> import qualified Data.Sequence as S
GHCi> import Data.Sequence((<|), (|>), (><), ViewL(..), ViewR(..))
GHCi> let foo = S.fromList [1, 3, 5, 2, 9]
GHCi> :t foo
foo :: S.Seq Integer

(<|) prepends, (|>) appends, and (><) concatenates.

GHCi> 0 <| foo
fromList [0,1,3,5,2,9]
GHCi> foo |> 18
fromList [1,3,5,2,9,18]
GHCi> foo >< foo
fromList [1,3,5,2,9,1,3,5,2,9]

You can also pattern match at both ends. For that, you use either viewl or viewr to get the desired view of the sequence and then match using EmptyL and (:<) and EmptyR and (:>) respectively.

GHCi> S.viewl foo
1 :< fromList [3,5,2,9]
GHCi> S.viewr foo
fromList [1,3,5,2] :> 9
GHCi> let xs :> x = S.viewr foo
GHCi> xs
fromList [1,3,5,2]
GHCi> x
9

Raw performance with arrays edit

Performance demands when dealing with large volumes of data can be quite stringent. For situations which require fast processing of bulk data, with laziness and streaming not being relevant concerns, Haskell offers true arrays in the vein of those found in C and elsewhere. Arrays are compact memory-wise, offer constant time random access and many blazingly fast operations (the main exceptions being those that require copying the arrays, such as immutable array concatenation), at the cost of a certain unwieldiness arising from the deep differences in behaviour between arrays and the purely functional data structures we normally deal with. There are several array libraries available; each of them generally providing a number of different kinds of arrays — from those whose usage do not feel very different from usual Haskell data structures to C-like mutable arrays of raw primitive values.[1] Here we will mention three of the most popular array libraries.

  • vector is a good default choice if you are just starting with arrays, or if you do not have specialised needs covered by other libraries. It provides one-dimensional arrays with an interface reasonably similar to that found in other data structure libraries such as containers.
  • array, in contrast, has a more intimidating interface. As for features, it supports multi-dimensional arrays and custom indexing. Importantly, it is also part of the language standard, and is bundled with GHC, which makes it useful for library writers unwilling to incur additional dependencies. We provide an overview of standard arrays in a separate chapter, which can be used as an introduction to array-related terminology.
  • repa is a sophisticated library providing state-of-the-art multi-dimensional parallelisable arrays. It is well suited for tasks such as image processing.

text, bytestring, and the problem with String edit

A quick glance at the base libraries might leave us with the impression that Strings are the preferred way of getting data into and out of a Haskell program. However, there are several issues with String which make it a poor fit for such a role.

  • The most obvious problem is performance. A String is just a list of Char. In applications that have to process even modestly large amounts of text or binary data, the advantages of general-purpose linked lists are overshadowed by the large losses of efficiency relative to what a specialised implementation would make possible.
  • With binary data, a deeper issue is that a Char-based representation makes little sense, as we are actually dealing with raw bytes.
  • Finally, while Haskell Chars are Unicode characters, overall the support for different encodings and internationalisation in the base libraries is somewhat lacking.

Those shortcomings of String are addressed by the libraries text and bytestring. Both are de facto standards; pretty much all modern libraries whose functionality involves any significant volumes of data input or output use them. They have clearly separate use cases:

  • text is for efficient processing of Unicode text. It supports conversion between encodings and, with the companion text-icu library, a wide assortment of Unicode services.
  • bytestring is for efficient processing of binary data, in all of its guises - cases include network packets, raw image data, serialization (through libraries such as binary and cereal); you name it.

The core types of both libraries, Text and ByteString, are implemented as specialised, monomorphic containers of Char and Word8 (i.e. raw bytes) respectively. The internal representation is array-based and very compact. In terms of interfaces, both libraries are quite straightforward. The main subtlety to be aware of is that in both cases there are strict and lazy variants of the types. The strict versions are well-suited for processing large volumes of small pieces of data, while the lazy ones are processed in chunks, and therefore allow for streaming and processing of large pieces of monolithic data without memory consumption woes.

A convenience feature worth being aware of when dealing with String replacements is that the OverloadedStrings GHC extension makes it possible to have automatic, type-directed conversion of string literals to Text or ByteString. This can be quite helpful, especially for Text.

  1. It goes without saying that mutable arrays must live in either the IO or the ST monad.