This chapter introduces some of the data structures available from the libraries. We will focus on common and particularly useful examples that every Haskeller should learn about.
Trade-offs
editThis chapter continually emphasizes shortcomings with lists, but that does not mean you should quit using them! Lists are the default data structure in Haskell for good reasons: beyond their simplicity, lists have a pretty high power-to-weight ratio in a lazy, purely functional setting. Laziness makes it possible to use lists as streams where we sequentially process elements that are generated on demand. That process allows functions such as map
, filter
, foldr
, takeWhile
, and zipWith
to work as effective replacements for many common uses of iterative control structures.
As powerful as they may be, lists are better suited to patterns like streaming and iteration control rather than simple data storage and retrieval. Of course, switching to a different data structure involves trade-offs. There will be advantages and disadvantages to any data structure, and the right choice depends on the problem at hand.
Lookups: Data.Map and co.
editFirst, we'll consider a common problem: performing lookups on a data structure. Given a collection of associations between keys and values, we may want to retrieve the value, if any, corresponding to some key. We could simply store the associations as a list of pairs, [(k, v)]
. Indeed, Prelude contains lookup :: Eq k => k -> [(k, v)] -> Maybe v
, but to find a value from within a list, we have to go through all pairs in the list, testing the keys for equality until we reach either the one we are looking for (or exhaust the list). A lookup in a plain association list is an O(n) operation, as the expected number of steps needed to perform it grows proportionally to the length of the list. It is easy to see how that can become a problem when there are a lot of associations.
We can achieve better lookups by switching to a more appropriate data structure. The Map
type provided by Data.Map
from the containers
package is a fine general-purpose choice. Note that Data.Map
is generally imported qualified, to avert name clashes with Prelude functions.
GHCi> import qualified Data.Map as M GHCi> :t M.empty M.empty :: M.Map k a
In a Map
, keys and values are arranged in a (size balanced, binary) tree. That tree form makes looking for a key work by simply going down a particular branch of the tree. Tree manipulation, however, happens entirely behind the scenes. Map
, like many other data structures from the libraries, is used as an abstract type through an interface with no mention of the tree implementation backing it. In particular, constructors are not exported: a new Map
is built e.g. by either inserting associations into an empty
map or by using the utility function fromList
:
GHCi> let foo = M.fromList [(1, "Robert"), (5, "Ian"), (6, "Bruce")] GHCi> :t foo foo :: M.Map Integer [Char]
The Data.Map
interface provides O(log n) lookups...
GHCi> :t M.lookup M.lookup :: Ord k => k -> M.Map k a -> Maybe a GHCi> M.lookup 5 foo Just "Ian" GHCi> M.lookup 7 foo Nothing
...as well as scores of other useful operations — unions, intersections, deletions and so forth. Instances for a handful of important type classes such as Functor
are available as well.
GHCi> M.size $ M.union foo $ M.fromList [(11, "Andrew"), (17, "Mike")] 5 GHCi> fmap reverse foo fromList [(1,"treboR"),(5,"naI"),(6,"ecurB")]
Variations
editOther modules providing map and map-like data structures worth knowing about include:
Data.IntMap
fromcontainers
provides a more efficient map implementation limited toInt
keys.Data.Set
, also incontainers
, provides a set implementation. Sets are appropriate when the interesting operation is simply finding whether a value is in a collection, rather than retrieving a value given a key. They are a lot like a map in which only the keys matter, and so many considerations about performance and implementations apply to both sets and maps.- The
unordered-containers
package provides hash maps and sets. They offer efficiency gains (such as almost constant-time lookups) without limiting the type of the key toInt
, at the cost of a comparatively limited interface and the loss of the ordering guarantees of thecontainer
tree-based maps.
Peeking at both ends with Data.Sequence
editOne of the peculiarities of lists is that they are asymmetric. Given that we construct and deconstruct lists by their head with `(:)`, operations at the head are more efficient than the corresponding operations at the tail. For instance, while prepending an element with `(:)` takes constant time, naïvely appending a single element with \xs x -> xs ++ [x]
takes time proportional to the length of xs
. That means building up a list by repeatedly appending will take quadratic time in the number of elements, which is really bad.
When lots of operations at the middle or at the tail have to be done, an excellent list-like alternative to lists are sequences, as provided by the Data.Sequence
module, which is also part of containers
. Sequences and lists are quite different from one another, even though many of the familiar list functions reappear in some guise in Data.Sequence
. While lists are lazy and can be infinite, sequences are finite and strict. The trade-off which makes sequences useful is that, at the cost of some overhead with respect to lists, many operations which were troublesome with lists perform much better. Remarkably, we get both appending and prepending in constant time, length in constant time and also concatenation and random access in logarithmic time. All of that is available through a pleasant, purely functional interface.
GHCi> import qualified Data.Sequence as S GHCi> import Data.Sequence((<|), (|>), (><), ViewL(..), ViewR(..)) GHCi> let foo = S.fromList [1, 3, 5, 2, 9] GHCi> :t foo foo :: S.Seq Integer
(<|)
prepends, (|>)
appends, and (><)
concatenates.
GHCi> 0 <| foo fromList [0,1,3,5,2,9] GHCi> foo |> 18 fromList [1,3,5,2,9,18] GHCi> foo >< foo fromList [1,3,5,2,9,1,3,5,2,9]
You can also pattern match at both ends. For that, you use either viewl
or viewr
to get the desired view of the sequence and then match using EmptyL
and (:<)
and EmptyR
and (:>)
respectively.
GHCi> S.viewl foo 1 :< fromList [3,5,2,9] GHCi> S.viewr foo fromList [1,3,5,2] :> 9 GHCi> let xs :> x = S.viewr foo GHCi> xs fromList [1,3,5,2] GHCi> x 9
Raw performance with arrays
editPerformance demands when dealing with large volumes of data can be quite stringent. For situations which require fast processing of bulk data, with laziness and streaming not being relevant concerns, Haskell offers true arrays in the vein of those found in C and elsewhere. Arrays are compact memory-wise, offer constant time random access and many blazingly fast operations (the main exceptions being those that require copying the arrays, such as immutable array concatenation), at the cost of a certain unwieldiness arising from the deep differences in behaviour between arrays and the purely functional data structures we normally deal with. There are several array libraries available; each of them generally providing a number of different kinds of arrays — from those whose usage do not feel very different from usual Haskell data structures to C-like mutable arrays of raw primitive values.[1] Here we will mention three of the most popular array libraries.
vector
is a good default choice if you are just starting with arrays, or if you do not have specialised needs covered by other libraries. It provides one-dimensional arrays with an interface reasonably similar to that found in other data structure libraries such ascontainers
.
array
, in contrast, has a more intimidating interface. As for features, it supports multi-dimensional arrays and custom indexing. Importantly, it is also part of the language standard, and is bundled with GHC, which makes it useful for library writers unwilling to incur additional dependencies. We provide an overview of standard arrays in a separate chapter, which can be used as an introduction to array-related terminology.
repa
is a sophisticated library providing state-of-the-art multi-dimensional parallelisable arrays. It is well suited for tasks such as image processing.
text, bytestring, and the problem with String
editA quick glance at the base
libraries might leave us with the impression that String
s are the preferred way of getting data into and out of a Haskell program. However, there are several issues with String
which make it a poor fit for such a role.
- The most obvious problem is performance. A
String
is just a list ofChar
. In applications that have to process even modestly large amounts of text or binary data, the advantages of general-purpose linked lists are overshadowed by the large losses of efficiency relative to what a specialised implementation would make possible.
- With binary data, a deeper issue is that a
Char
-based representation makes little sense, as we are actually dealing with raw bytes.
- Finally, while Haskell
Char
s are Unicode characters, overall the support for different encodings and internationalisation in thebase
libraries is somewhat lacking.
Those shortcomings of String
are addressed by the libraries text
and bytestring
. Both are de facto standards; pretty much all modern libraries whose functionality involves any significant volumes of data input or output use them. They have clearly separate use cases:
text
is for efficient processing of Unicode text. It supports conversion between encodings and, with the companiontext-icu
library, a wide assortment of Unicode services.
bytestring
is for efficient processing of binary data, in all of its guises - cases include network packets, raw image data, serialization (through libraries such asbinary
andcereal
); you name it.
The core types of both libraries, Text
and ByteString
, are implemented as specialised, monomorphic containers of Char
and Word8
(i.e. raw bytes) respectively. The internal representation is array-based and very compact. In terms of interfaces, both libraries are quite straightforward. The main subtlety to be aware of is that in both cases there are strict and lazy variants of the types. The strict versions are well-suited for processing large volumes of small pieces of data, while the lazy ones are processed in chunks, and therefore allow for streaming and processing of large pieces of monolithic data without memory consumption woes.
A convenience feature worth being aware of when dealing with String
replacements is that the OverloadedStrings
GHC extension makes it possible to have automatic, type-directed conversion of string literals to Text
or ByteString
. This can be quite helpful, especially for Text
.
- ↑ It goes without saying that mutable arrays must live in either the
IO
or theST
monad.