From Data to Insight
Introduction
“...what is the use of a book," thought Alice, "without pictures...?”
This book teaches the principles of transforming data into insights:
- All data represent one of 3 dimensions: WHAT | WHEN | WHERE?
- Every datagraphic visualizes a relationship among the dimensions.
- There is an appropriate visualization for every relationship.
- The visualization should be efficiently designed.
- The design should reveal the insight.
Data
A common view of data places them at the bottom of a hierarchy:
- Wisdom
- Knowledge
- Information
- Data
whose lower levels are more 'dense' and less 'useful' and whose higher levels are more 'reduced' and abstract. This book deals mainly with the lowest three levels: how to transform data, largely through visualization, into insightful knowledge about how the world seems to work.
Relationships
Numbers
Geometry
The Dimensions of Data
Phenomena
We begin with the idea of discrete objects that are enumerable: they can be counted to yield a single number, although each object may be measured in many ways. In what follows these objects won't be divided into their components.
As shown in the exhibit, the fundamental data distinction is between numbers and what could be called labels.
The simplest kinds of numbers are the integers I, which are used to count objects. As the number of objects becomes large (say > 100), the step between successive counts becomes small relative to the counts themselves, so that these counts behave as if they were nearly continuous. Truly continuous numbers, the real numbers R1, can assume all possible values.
Labels (sometimes called nominal values or factors) are used to distinguish objects without measuring an amount or any specific numeric quality, although labels can be ordered ("do not like", "indifferent", "like"). Counts can play the roles of labels when the values are small (e.g. single-person v. 2-person v. >2-person families). Combinatorics deals with how labeling can be used to collect the objects into various 'bins' whose counts will be discussed in data description below.
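The idea of small counts acting as labels can be sketched in a few lines of Python. This is only an illustration: the function name `household_label`, the label strings, and the household sizes are hypothetical and do not come from any real dataset.

```python
from collections import Counter


def household_label(size: int) -> str:
    """Turn a small count (people per household) into a label."""
    if size < 1:
        raise ValueError("a household has at least one person")
    if size == 1:
        return "single-person"
    if size == 2:
        return "2-person"
    return ">2-person"


# Hypothetical household sizes, invented for illustration only.
sizes = [1, 2, 4, 1, 3, 2, 5, 1]

# Collect the objects into bins and count how many fall into each bin.
bins = Counter(household_label(s) for s in sizes)
print(bins)
```

Once the counts are converted to labels, only the bin counts remain to be described; the original sizes 3, 4 and 5 are no longer distinguished.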
It is important to know what kind of measurements are being dealt with because there are descriptive and relational methods appropriate to each kind. Let's examine a few examples, in the present case from the field of environmental analysis.
- The number of trees in a quadrat is an integer ranging from 0 to less than infinity.
- The weight of a soil sample is a real number greater than 0 (although a sample could have zero weight, the question arises as to whether it is a sample at all, and, if so, whether there are therefore infinite such samples).
- The concentration of nitrates in a water sample, measured in parts per million, is a real number resulting from a division.
- The magnitude of an earthquake is a real number, but because such numbers come from long-tailed distributions, logarithms are often used.
- A land cover class ('forest', 'water', 'cropland', 'urban') is a label associated with no number, although each class could be assigned an arbitrary integer (say 2, 4, 1, 3). Sometimes integers are used to label measurements, but these numbers (sometimes called 'keys' in database terminology) are identifiers and cannot be described further; moreover, they should be the first column of the data matrix.
A particularly complicated kind of measurement results in various kinds of categories, which often require sophisticated analytical techniques. One example would be assigning a human patient to the class of patients suffering from West Nile Virus, which can be considered an anatomical condition (brain inflammation), an infection by a specific Flavivirus, or a collection of symptoms (fever, etc.). Measurement problems, such as issues of reliability and confidentiality, are among the reasons why epidemiological data description is so challenging.
It is useful to explore some of the unusual conditions that arise in the treatment of real numbers.
- Zero arises when there's nothing to count (as in a treeless region or an image pixel receiving no photons), or as the length of an instant of time ("2011-11-24 13:02:36.032 EST", where only milliseconds are shown).
- Negative numbers are needed, for example, in spatial data to show altitude below sea level or latitude south of the equator, in time data to show years before the present, and in phenomena data to show temperature below freezing.
- If some standard value is known then a 'rate' is computed most simply by dividing a measurement by that standard, as in a proportion of the whole or a percentage increase.
- Finally, unusual cases often arise in which a measurement is missing (NA = not available or not applicable) or infinite (a ratio resulting from division by zero).
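The special cases in the list above can be sketched in Python. The helper `rate` and the constant `NA` are hypothetical names chosen for this illustration. Plain Python raises `ZeroDivisionError` on division by zero, so returning an infinite or missing value instead is a design assumption of this sketch, not a rule taken from the text.

```python
import math

NA = math.nan  # stand-in for a missing measurement (an assumption of this sketch)


def rate(measurement: float, standard: float) -> float:
    """Divide a measurement by a standard value, tolerating special cases."""
    if math.isnan(measurement) or math.isnan(standard):
        return NA  # a missing input gives a missing result
    if standard == 0:
        if measurement == 0:
            return NA  # 0/0 is undefined
        # Non-zero divided by zero: report an infinite value with the right sign.
        return math.copysign(math.inf, measurement)
    return measurement / standard


print(rate(25, 100))   # a proportion of the whole
print(rate(3, 0))      # infinite
print(rate(NA, 10))    # missing

# Before describing the data, keep only the finite measurements.
readings = [4.2, NA, 7.5, math.inf]
usable = [x for x in readings if math.isfinite(x)]
print(usable)
```

Filtering with `math.isfinite` removes both missing and infinite values in one step, which is usually what is wanted before computing a range or drawing a plot.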
Critical to describing data is understanding how they behave. The exhibit shows that the fundamental continuum is between discrete and continuous measurements; if one keeps track of where a given measurement is on this continuum the appropriate descriptive approach should be clearer.
Description
Although most of this book discusses how to visualize relationships between different kinds of measurements, much of analysis is concerned with simple description. We begin with the basic data matrix, as in the table below:
ID       NOMINAL  REGULAR  INTEGER  RATIONAL
---------------------------------------------
Sigma    D        28       2        9.1
Gamma    B        29       7        9.7
Delta    B        30       6        7.6
Kappa    C        31       8        7.5
Mu       D        32       8        9.8
Beta     A        33       3        4.2
Pi       D        34       6        4.7
Epsilon  D        35       4        4.8
Tau      C        36       8        4.2
Lambda   C        37       10       2.0
Alpha    A        38       9        5.8
Rho      A        39       1        4.2
which is a 12 x 5 matrix of measurements that illustrate the key distinctions discussed so far:
- The ID variable is a unique label that identifies each row.
- NOMINAL is a not necessarily unique label consisting of a single alphabetic character, although it could easily be a number, or longer name.
- REGULAR is a variable that increments in unit steps, and so could be either an index (the second row could be the 29th case) or a time stamp (28 could stand for the year 1928).
- INTEGER is a non-unique whole number that might be counting something.
- RATIONAL is a real number that is a true measurement, say of wind speed in knots (and note that 'knots per hour' is incorrect unless you mean acceleration!).
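As a minimal sketch, the data matrix can be written down in Python and the properties just listed checked mechanically. The values are copied from the table; the `Row` type and the variable names are hypothetical choices made for this illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    id: str          # unique label for the row
    nominal: str     # label, not necessarily unique
    regular: int     # increments in unit steps
    integer: int     # whole number, perhaps a count
    rational: float  # a true measurement


ROWS = [
    Row("Sigma", "D", 28, 2, 9.1),
    Row("Gamma", "B", 29, 7, 9.7),
    Row("Delta", "B", 30, 6, 7.6),
    Row("Kappa", "C", 31, 8, 7.5),
    Row("Mu", "D", 32, 8, 9.8),
    Row("Beta", "A", 33, 3, 4.2),
    Row("Pi", "D", 34, 6, 4.7),
    Row("Epsilon", "D", 35, 4, 4.8),
    Row("Tau", "C", 36, 8, 4.2),
    Row("Lambda", "C", 37, 10, 2.0),
    Row("Alpha", "A", 38, 9, 5.8),
    Row("Rho", "A", 39, 1, 4.2),
]

n_rows, n_cols = len(ROWS), len(Row._fields)

# ID must identify each row uniquely.
ids_unique = len({r.id for r in ROWS}) == n_rows

# REGULAR must rise by exactly one from each row to the next.
regular = [r.regular for r in ROWS]
unit_steps = all(b - a == 1 for a, b in zip(regular, regular[1:]))

print(n_rows, n_cols, ids_unique, unit_steps)
```

Checks like these are worth running before any description or plotting, because they confirm that each column really plays the role it is assumed to play.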
The above table is a useful illustration of where to begin with data visualization. It might be the complete data matrix (although 12 rows isn't very much) that you want to present in a report, in which case this is a clear template to use. Or the matrix might just be a sample of the full dataset; provided this sampling is indicated, the format is also useful. Finally, the table might be presenting statistics for 12 variables. In each of these cases the table serves as a simple and clearly organized template.
It's obviously very important to understand the data and to be completely clear about what the data matrix represents, particularly the numbers of rows and columns, the role of each column, and what each column represents. Once this is clear, you can turn to the problem of describing each of the columns or variables.
Each kind of measurement allows its own set of descriptive statistics. Descriptive statistics present basic facts about your data. They are of two types, nonparametric and parametric, and they are easy to understand.
Nonparametric statistics
The range is the simplest description of your data: what are the smallest and largest values attained? For the numeric columns of the data above it's easy to see that REGULAR ranges from 28 to 39, INTEGER from 1 to 10, and RATIONAL from 2.0 to 9.8. For a larger dataset you can sort the numbers and choose the smallest and largest. These numbers will help determine the axes of any plots that are drawn.
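The range computation can be sketched in Python as follows. The column values are copied from the data matrix above; the helper name `value_range` is a hypothetical choice for this illustration.

```python
REGULAR = [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
INTEGER = [2, 7, 6, 8, 8, 3, 6, 4, 8, 10, 9, 1]
RATIONAL = [9.1, 9.7, 7.6, 7.5, 9.8, 4.2, 4.7, 4.8, 4.2, 2.0, 5.8, 4.2]


def value_range(values):
    """Return the (smallest, largest) pair of a non-empty list of numbers."""
    return min(values), max(values)


for name, column in [("REGULAR", REGULAR), ("INTEGER", INTEGER), ("RATIONAL", RATIONAL)]:
    low, high = value_range(column)
    print(f"{name}: {low} to {high}")
```

`min` and `max` each scan the list once, so no sorting is needed even for a large dataset; sorting is simply the easier approach when working by hand.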
The next statistic to examine is the mode, the most common value in the list. In a bar chart (the most appropriate visualization) the mode will have the longest bar.
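A minimal Python sketch of finding the mode is shown here, using the NOMINAL and INTEGER columns from the data matrix. The helper name `modes` is hypothetical. It returns a list because a column can have several equally common values, which is an assumption about how ties should be reported.

```python
from collections import Counter

NOMINAL = ["D", "B", "B", "C", "D", "A", "D", "D", "C", "C", "A", "A"]
INTEGER = [2, 7, 6, 8, 8, 3, 6, 4, 8, 10, 9, 1]


def modes(values):
    """Return every value that attains the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes(NOMINAL))  # most common label
print(modes(INTEGER))  # most common whole number
```

The standard library's `statistics.multimode` offers similar behaviour; the explicit version here just makes the counting step visible.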
The above exhibit is about the simplest visual description of the NOMINAL variable. Note that the vertical axis shows the values ordered by their frequencies, so you can easily see which values are the most and least common.
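A frequency-ordered bar chart of this kind can be approximated in plain text with a short Python sketch. The function name `text_bar_chart` and the use of `#` characters for bars are assumptions made for this illustration, not the method used to draw the exhibit.

```python
from collections import Counter

NOMINAL = ["D", "B", "B", "C", "D", "A", "D", "D", "C", "C", "A", "A"]


def text_bar_chart(values):
    """Return one text line per label, most frequent first."""
    counts = Counter(values)
    # Order by descending frequency; break ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{label} {'#' * n} {n}" for label, n in ordered]


for line in text_bar_chart(NOMINAL):
    print(line)
```

Ordering the bars by frequency rather than alphabetically puts the mode at the top and the rarest label at the bottom.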