Data Science: An Introduction

Chapter 02: A Mash-up of Disciplines

Chapter Summary

This is a very quick overview of the eight "parent" disciplines that contribute to the new Data Science discipline. It suggests generic questions that a data scientist should ask as they work through solving problems.

Discussion

As mentioned in Chapter 1, Data Science is a mash-up of several different disciplines. We also noted that an individual data scientist is most likely an expert in one or two of these disciplines and proficient in another two or three. There is probably no living person who is expert in all these disciplines, and an extremely rare person would be proficient in 5 or 6 of these disciplines. This means that data science must be practiced as a team where, across the membership of the team, there is expertise and proficiency across all the disciplines. Let us explore what these disciplines are and how they contribute to data science.

Data Engineering

(See Chapter 06: Thinking Like a Data Engineer for a more in-depth discussion)

Data Engineering is the data part of data science. According to the Wikipedia, Data Engineering involves acquiring, ingesting, transforming, storing, and retrieving data. Data engineering also includes adding metadata to the data. Because all these activities are inter-related, a data engineer must solve these issues as an integrated whole. For example, we must understand how we plan to store and retrieve the data in order to create a good ingestion process. Data engineering requires a thorough understanding of the general nature of the data science problems to be solved in order to formulate a robust data acquisition and management plan. Once the plan is well developed, the data engineer can begin to implement it in data management systems.

Acquiring - This is the process of laying our hands on the data. The data engineer part of the data scientist needs to ask the questions, "where is the data coming from?" and "what does the data look like?" and "how does our team get access to the data?" The data could come from many places, such as RSS feeds, a sensor network, or a preexisting data repository. The data could be numbers, text documents, images, or video. The data can be collected by the team or purchased from a vendor. For example, if we are going to investigate highways, we could have sensors on a stretch of freeway that measure how fast cars are going. These sensors send us the data as text messages that include the date, time, lane, and speed of every car that crosses the sensors.

Ingesting - This is the process of getting the data from the source into the computer systems we will use for our analysis. The data engineer part of the data scientist needs to ask the questions, "how much data is coming?" and "how fast is it coming?" and "where are we going to put the data?" and "do we have enough disk space for the data?" and "do I need to filter the incoming data in any way?" Data is measured in bytes. A byte is roughly equivalent to one character of a written word. A one-page document is about 1,000 bytes or one kilobyte (1K). For example, if we are going to investigate highways, we could be receiving car speed data at a rate of 10,000 bytes per second for a 1-week period. There are 604,800 seconds in a week. This means you will receive 6,048,000,000 bytes (6 gigabytes) of data in one week. No problem. That will fit on a thumb drive.
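The arithmetic in the ingestion example can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the rate of 10,000 bytes per second is the assumed figure from the example, not a measurement.

```python
# Back-of-the-envelope estimate of how much data arrives in one week.
BYTES_PER_SECOND = 10_000            # assumed incoming rate from the highway example
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

total_bytes = BYTES_PER_SECOND * SECONDS_PER_WEEK
total_gigabytes = total_bytes / 1_000_000_000  # decimal gigabytes (10**9 bytes)

print(f"{total_bytes:,} bytes is about {total_gigabytes:.1f} GB")
```

Changing the rate or the collection period in the two constants shows quickly whether the data would still fit on the storage we have.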

Transforming - This is the process of converting the data from the form in which it was collected to the form it needs to be in for our analysis. The data engineer part of the data scientist needs to ask the questions, "what is the form of the raw data?" and "what does the form of the processed data need to be?" A common raw data format is comma-separated values (CSV) which looks like:

20120709,135214,157,3,57.4
20120709,135523,13,2,62.1

For example, if we are investigating highways, we might receive data that looks like the example above. The fields in the first row are: date, July 9, 2012; time, 1:52:14 pm; sensor, #157; lane, #3; and speed, 57.4 mph. The data needs to be transformed from CSV format to something akin to a spreadsheet format like the following:

Year	Month	Day	24-Hour	Minute	Second	Sensor #	Lane #	MPH
2012	07	09	13	52	14	157	3	57.4
2012	07	09	13	55	23	13	2	62.1

Understanding the various "from" and "to" formats is very important for a data scientist.
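One possible way to carry out this particular transformation is sketched below in Python. The function name transform and the choice to keep the date and time parts as text (so that leading zeros such as "07" survive) are assumptions made for illustration, not part of the original example.

```python
import csv
import io

# Two raw records in the CSV layout shown above: date, time, sensor, lane, speed.
RAW = "20120709,135214,157,3,57.4\n20120709,135523,13,2,62.1\n"

def transform(fields):
    """Split one raw record into the nine columns of the target table."""
    date, time, sensor, lane, mph = fields
    return {
        "Year": date[0:4], "Month": date[4:6], "Day": date[6:8],
        "24-Hour": time[0:2], "Minute": time[2:4], "Second": time[4:6],
        "Sensor #": int(sensor), "Lane #": int(lane), "MPH": float(mph),
    }

rows = [transform(fields) for fields in csv.reader(io.StringIO(RAW))]

# Print a header line followed by one tab-separated line per record.
print("\t".join(rows[0].keys()))
for row in rows:
    print("\t".join(str(value) for value in row.values()))
```

In practice the raw text would be read from a file or a message stream rather than from a string constant.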

Metadata - The Wikipedia says that metadata is commonly called data about data. In our case above, the data is the MPH and the Lane. The Sensor is a proxy for "where" on the surface of the earth the data was collected, and the date and time are data about "when" it was collected. We could add other metadata to our data, like weather conditions at the time and the quality of the road. We could derive other metadata, such as whether it was a weekday, holiday, or weekend, and whether it was rush hour or not. We might also add metadata that indicates who may see the data under what conditions, like "not for public dissemination until 1 year after collected." Metadata is often added both at ingestion time and at transformation time.
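Some of the derived metadata mentioned above can be computed directly from the date and time fields. The following Python sketch is one way to do it; the function name derive_metadata and the rush-hour windows (7 to 9 am and 4 to 6 pm on weekdays) are assumptions chosen for illustration, and holidays are left out because they would need a separate calendar.

```python
from datetime import datetime

def derive_metadata(date_text, time_text):
    """Derive "when" metadata from the raw date and time fields."""
    moment = datetime.strptime(date_text + time_text, "%Y%m%d%H%M%S")
    is_weekend = moment.weekday() >= 5  # Monday is 0, Sunday is 6
    # Assumed rush-hour windows; real definitions vary from city to city.
    in_window = 7 <= moment.hour < 9 or 16 <= moment.hour < 18
    return {
        "day_of_week": moment.strftime("%A"),
        "is_weekend": is_weekend,
        "is_rush_hour": in_window and not is_weekend,
    }

metadata = derive_metadata("20120709", "135214")
print(metadata)
```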

Storing - This is the process of putting the data into a data management system. The data engineer part of the data scientist needs to ask the questions, "what kind of a system is best for storing our data?" and "how fast will the system be?" and "how much extra space will this system need?" We can store data in files in a file system. File systems are generally very fast, but have very little functionality. We can store data in a database. These are often slower than a file system, but have much more functionality. For example, in our highway example, we might have 60 million lines of data in CSV format. (At 100 bytes per line, that would be about 6 gigabytes). We could store it in one big file in the file system. It would be fast to read it, but in that format we could not compute averages by time and location. Alternatively, we could store it in a database where it would be easy to compute averages by location and time, though it would take more time to read through the data.

Retrieving - This is the process of getting the data back out. The data engineer part of the data scientist needs to ask the questions, "how will we ask questions of our data?" and "how will we display our data?" We can search the data through a query system and we can display subsets of the data in a table. For example, in our highway example, we might want to search for only those measurements from one sensor during morning rush hour. We might then want to display a table that shows the average rush hour speed by day. In this case, we would be better off if the data had been stored in a database. Thus, knowing what kind of analysis we want to perform will help us with our data storage strategy.
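The storing and retrieving steps can be sketched with SQLite, a small database that ships with Python. Everything specific here is an assumption for illustration: the table name speeds, its columns, the sample records, and the definition of morning rush hour as the hours 7 and 8.

```python
import sqlite3

# Hypothetical sample records: (day, hour, sensor, lane, mph).
SAMPLE = [
    ("2012-07-09", 7, 157, 3, 41.0),
    ("2012-07-09", 8, 157, 2, 45.0),
    ("2012-07-09", 13, 157, 3, 57.4),
    ("2012-07-10", 7, 157, 1, 38.0),
    ("2012-07-10", 8, 13, 2, 62.1),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute(
    "CREATE TABLE speeds (day TEXT, hour INTEGER, sensor INTEGER, lane INTEGER, mph REAL)"
)
conn.executemany("INSERT INTO speeds VALUES (?, ?, ?, ?, ?)", SAMPLE)

# Average morning rush-hour speed per day for one sensor.
query = """
    SELECT day, AVG(mph)
    FROM speeds
    WHERE sensor = ? AND hour BETWEEN 7 AND 8
    GROUP BY day
    ORDER BY day
"""
averages = conn.execute(query, (157,)).fetchall()
conn.close()

for day, avg_mph in averages:
    print(day, round(avg_mph, 1))
```

The same question asked of one large CSV file would require reading every line and doing the grouping by hand, which is the trade-off between file systems and databases described above.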

Scientific Method

(See Chapter 09: Thinking Like a Scientist for a more in-depth discussion)

The Scientific Method is the science part of data science. According to the Wikipedia, the Scientific Method is a process for acquiring new knowledge by applying the principles of reasoning on empirical evidence derived from testing hypotheses through repeatable experiments. When a scientist hears someone make an assertion about a fact, they naturally want to know both what is the evidence and what is the standard of acceptance for that evidence.

Reasoning Principles - There are two general forms of logical reasoning: inductive and deductive. Simply stated, inductive reasoning arrives at general principles from specific observations, while deductive reasoning arrives at specific conclusions based on general principles. Consider the following two examples:

Inductive argument:

Every life form that everyone knows of depends on liquid water to exist.
Therefore, all known life depends on liquid water to exist.

Deductive argument:

All men are mortal.
Socrates is a man.
Therefore, Socrates is mortal.

Most of the scientific knowledge we have is based on inductive reasoning. The scientist part of the data scientist needs to ask the question, "what is the reasoning behind a particular conclusion?"

Empirical Evidence - Evidence that is empirical is data that is produced by an observation or experiment. This is in contrast to data that is derived from logical arguments or conclusions that are propagated by myths and legends.

The classic example is the trial of Galileo. At the time (1633), the Catholic church held to Aristotle's logical argument that the earth was the center of the cosmos. Galileo's observations with his newly invented telescope provided evidence of Copernicus's assertion that the earth revolved around the sun. The outcome of the trial was that Galileo was sentenced to house arrest for heresy. In 2000, Pope John Paul II apologized for the injustice done to Galileo.

The scientist part of the data scientist needs to ask the question, "what is the evidence that leads to a particular conclusion?"

Hypothesis Testing - This process generally asserts two propositions, only one of which can be true. The scientist gathers empirical evidence for and against each proposition, and then accepts one and rejects the other. Often, one of the hypotheses is known as the null hypothesis, and the other as the alternative hypothesis. The null hypothesis is usually a proposition about the way we currently understand the universe to work. The alternative is a proposition about how we think the universe really works. A criminal trial is the classic analogy for understanding hypothesis testing.

A defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough incriminating evidence is the defendant convicted. At the start of the procedure, there are two hypotheses: "the defendant is not guilty" and "the defendant is guilty." The first one is called the null hypothesis, and is accepted for the time being. The second one is called the alternative hypothesis. It is the hypothesis one tries to prove. The hypothesis of innocence is only rejected when an erroneous conviction is very unlikely, because one doesn't want to convict an innocent defendant.

The scientist part of the data scientist needs to ask the question, "what were the null and alternative hypotheses examined to come to a particular conclusion?"
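The logic of accepting or rejecting a null hypothesis can be illustrated with a small, made-up example in Python. Suppose the null hypothesis is that half of all cars on a stretch of highway exceed the speed limit, and we observe 60 speeders among 100 cars. The sketch below computes an exact two-sided binomial p-value; the counts, the function name, and the 0.05 threshold (a common convention) are all assumptions for illustration.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    def prob(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = prob(successes)
    # A small tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(trials + 1) if prob(k) <= observed * (1 + 1e-9))

p_value = binomial_two_sided_p(60, 100)
ALPHA = 0.05                    # assumed standard of acceptance for the evidence
reject_null = p_value < ALPHA

print(f"p-value = {p_value:.4f}; reject the null hypothesis: {reject_null}")
```

With these made-up counts the p-value comes out slightly above 0.05, so the null hypothesis is not rejected. As in the trial analogy, this does not prove the null hypothesis true; it only means the evidence falls short of the chosen standard.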

Repeatable Experiments - According to the Wikipedia, an experiment is a methodical trial and error procedure carried out with the goal of verifying, falsifying, or establishing the validity of a hypothesis. Experiments vary greatly in their goal and scale, but always rely on repeatable procedure and logical analysis of the results. A child may carry out basic experiments to understand the nature of gravity, while teams of scientists may take years of systematic investigation to advance the understanding of subatomic particles.

One prominent example is the "inclined plane," or "ball and ramp experiment." In this experiment Galileo used an inclined plane and several steel balls of different weights. With this design, Galileo was able to slow down the falling motion and record, with reasonable accuracy, the times at which a steel ball passed certain markings on a beam. Galileo disproved Aristotle's assertion that weight affects the speed of an object's fall. According to Aristotle's Theory of Falling Bodies, the heavier steel ball would reach the ground before the lighter steel ball. Galileo's hypothesis was that the two balls would reach the ground at the same time.

The scientist part of the data scientist needs to ask the question, "is there enough information about the methods and data of this experiment that I can replicate it?"

Math

(See Chapter 10: Thinking Like a Mathematician for a more in-depth discussion)

Mathematics (along with statistics) is the cerebral part of Data Science. According to the Wikipedia, mathematics is the study of quantity, structure, space, and change. When these are used to solve practical problems, it is called applied mathematics.

Quantity - By this we simply mean numbers. The mathematician part of the data scientist needs to ask the questions, "how will the thing I am interested in be represented by numbers?" and "what kind of numbers will best represent the thing I am interested in?" The numbers could be integers, fractions, real numbers, or complex numbers. For example, if we are going to investigate highways, we could measure the length of highways in miles as represented by integers. We also need to think about the kinds of operations we will perform on numbers. We use arithmetic to operate on and represent the quantities in our data.

Structure - Most sets of mathematical objects exhibit internal structure. The mathematician part of the data scientist needs to ask the questions, "what sort of internal structure does the thing I am interested in have?" and "what set of equations will expose the structure?" The structures could be a constant progression like $3, 6, 9, 12, \ldots$, or a simple linear relationship like $Y = X + 3$. For example, if we are going to investigate highways, we might like to know the structure of speed limits or the structure of lane widths. We use algebra to operate on and represent the structure of our data.

Space - The things we investigate often have some relationship to two- or three-dimensional space. When thinking like a mathematician, a data scientist needs to ask the questions, "does the thing I am interested in have a spatial component, either actual or theoretical?" and "how do I capture and represent that spatial component?" The spatial component could be latitude and longitude or it could have a surface that is important. For example, if we are going to investigate highways, we might like to know exactly where particular highway segments are located or how smooth the surface of the highway is. We use geometry and trigonometry to operate on and represent the spatial components of our data.

Change - The things we investigate often change—possibly over time or over distance. The mathematician part of the data scientist needs to ask the questions, "does the relationship between the things I am interested in change?" and "how will I describe the changing relationship?" The changes could be . . . For example, if we are investigating highways, the sharpness of curves in the road may change with the speed limit at that part of the highway, or the depth of the asphalt may change the number of cars per hour that may safely drive in that location. We use calculus to operate on and represent the changing relationships within our data.

Applied Math - This is math with specialized knowledge. Generally speaking, this is the kind of math that Data Scientists practice.

Statistics

(See Chapter 11: Thinking Like a Statistician for a more in-depth discussion)

Statistics (along with mathematics) is the cerebral part of Data Science. The Wikipedia states that statistics is the study of the collection, organization, analysis, and interpretation of data. It involves the methods for exploring data, discovering patterns and relationships, creating models, and making inferences about the future. Statistics is the discipline that has the straightest-line pedigree to data science. The statistician is responsible for understanding the analysis that will be done on the data, so that it can be collected and organized appropriately.

Collection - A statistician, working with data engineers, ensures that data generation and collection is undertaken in a way that allows valid conclusions to be drawn. The statistician creates the research design, including, if appropriate, the experimental design, that governs the collection of data. The statistician part of the data scientist needs to ask, "what research procedures will be used to generate the data?"

Organization - A statistician, working with data engineers, ensures that data is coded and archived so that information is retained and made useful not just for analysis internal to the project, but also for sharing with others. The statistician is responsible for creating a data dictionary, which is database neutral. A data engineer would create a database schema, which is database specific, based on the data dictionary compiled by the statistician. The data dictionary specifies the variables, the valid values, and the format of the data. The database schema describes how the particular database management system will store the data. The statistician part of the data scientist needs to ask, "are the data stored in such a way as to facilitate the statistical analysis that will be done?"
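To make the idea of a data dictionary concrete, here is a minimal Python sketch for the highway records. It lists each variable with its type, a rule for valid values, and a description of the format, and it is independent of any particular database. The variable names, the valid ranges, and the validate function are hypothetical choices made for illustration.

```python
# A hypothetical, database-neutral data dictionary for the highway records.
DATA_DICTIONARY = {
    "sensor": {"type": int, "valid": lambda v: v > 0, "format": "sensor ID number"},
    "lane": {"type": int, "valid": lambda v: 1 <= v <= 8, "format": "lane number"},
    "mph": {"type": float, "valid": lambda v: 0 <= v <= 150, "format": "miles per hour"},
}

def validate(record):
    """Return a list of problems found in one record; an empty list means it is valid."""
    problems = []
    for name, rule in DATA_DICTIONARY.items():
        if name not in record:
            problems.append(f"{name}: missing")
        elif not isinstance(record[name], rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
        elif not rule["valid"](record[name]):
            problems.append(f"{name}: value {record[name]} out of range")
    return problems

good = validate({"sensor": 157, "lane": 3, "mph": 57.4})
bad = validate({"sensor": 157, "lane": 12, "mph": "fast"})
print(good)
print(bad)
```

A data engineer could translate the same dictionary into the column types and constraints of whichever database management system the team chooses.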

Analysis - A statistician, working with a mathematician, summarizes, aggregates, correlates, and creates models of the data. The statistician is an expert in analyzing data using descriptive and inferential statistics. This includes creating summaries of the data (such as averages) as well as testing for differences (is this average significantly higher than that average?). The statistician part of the data scientist needs to ask, "given the data, which descriptive and inferential statistics ought to be used to test the hypotheses?"
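The difference between a descriptive summary and an inferential test can be sketched in Python with the standard library. The two samples of speeds are invented for illustration, and Welch's t statistic is just one of several reasonable choices for comparing two averages; it is not presented here as the method the book prescribes.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical speed samples (mph) from two sensors.
sensor_a = [57.4, 60.1, 58.3, 61.0, 59.2]
sensor_b = [62.1, 64.0, 63.5, 61.8, 65.1]

def welch_t(x, y):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / standard_error

# Descriptive statistics: summarize each sample.
print(f"mean A = {mean(sensor_a):.2f}, sd A = {stdev(sensor_a):.2f}")
print(f"mean B = {mean(sensor_b):.2f}, sd B = {stdev(sensor_b):.2f}")

# Inferential statistic: how large is the difference relative to its uncertainty?
t_statistic = welch_t(sensor_a, sensor_b)
print(f"Welch t = {t_statistic:.2f}")
```

Turning the t statistic into a p-value requires the t distribution, which is not in Python's standard library; a statistics package or a printed table would be used for that step.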

Interpretation - A statistician, working with both a subject matter expert and a visual artist, reports results and summarized data (tables and graphs) in ways that are comprehensible to those who need to make use of them. The statistician part of the data scientist needs to ask, "who is going to get the results, and what do they want to know?"

Advanced Computing

(See Chapter 08: Thinking Like a Programmer for a more in-depth discussion)

Advanced computing is the heavy lifting of data science. According to the Wikipedia, computer programming (often shortened to programming or coding) is the process of designing, writing, testing, debugging, and maintaining the source code of computer programs. This source code is written in one or more programming languages. The purpose of programming is to create a set of instructions that computers use to perform specific operations or to exhibit desired behaviors. The process of writing source code often requires expertise in many different subjects, including knowledge of the application domain, specialized algorithms and formal logic.

Software Design - According to the Wikipedia, software design is a process of turning the purpose and specifications of software into a plan that includes low-level components and algorithm implementations in an overall architectural view. Programmers implement the software design by writing source code. Software designers will often use a modeling language, such as UML, to create designs.

The programmer part of the data scientist needs to ask the question, "what components and algorithms do we need in order to solve the problem we are working on?"

Programming Language - According to the Wikipedia, a programming language is an artificial language designed to communicate instructions to a computer. Programming languages are used to create programs that control the behavior of the computer and external devices such as printers, disk drives, and robots. Programs also express algorithms precisely. Programming languages can be thought of as "low-level," such as "assembly languages" that have a near one-to-one correspondence to the machine language functions built into the hardware central processing unit (CPU). More commonly, programmers use "high-level" languages, such as Java, Python, and C++, which aggregate many machine-level functions together into human-level functions such as "read data" and "print." The programmer part of the data scientist needs to ask the question, "which programming language should I use to solve the problem at hand?"

Source Code - According to the Wikipedia, source code is any collection of computer instructions (with comments) written using some human-readable computer language, usually as text. When executed, the source code is translated into machine code that the computer can directly read and execute. Programmers often use an integrated development environment (IDE) that allows them to type in, debug, and execute the source code. Here are examples of source code for the traditional "Hello World" program as written in Java and Python:

/**
 * Traditional "Hello World" program in Java
 */

class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}

#
# Traditional "Hello World" program in Python
# (the parenthesized form runs under both Python 2.x and Python 3)
#

print("Hello World!")

The programmer part of the data scientist needs to ask the question, "what source code already exists to help solve the problem we are working on?"

Visualization

(See Chapter 11: Thinking Like a Visual Artist for a more in-depth discussion)

Visualization is the pretty face of data science. According to Wikipedia, information visualization is the visual representation of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information. The Wikipedia also describes graphic design as a creative process undertaken in order to convey a specific message to a targeted audience. A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.

Creative Process - Wikipedia defines creativity as the process of producing something that is both original and worthwhile. The process includes divergent thinking, which involves the generation of multiple answers to a problem; conceptual blending, in which solutions arise from the intersection of two quite different frames of reference; and honing, in which an acceptable solution emerges from iterating over many successive unacceptable versions of the solution. The visual artist part of the data scientist needs to ask, "what are several different ways we can show this data?" and "how can we improve this visualization over the next several iterations?"

Data Abstraction - Wikipedia defines data abstraction as handling data bits in meaningful ways. This implies that we do not want to visualize all the raw data, but that we need to visualize manipulations of the data (aggregations, summarizations, correlations, predictions) that are meaningful in the context of the problem we are trying to solve. The visual artist part of the data scientist needs to ask, "how can we simplify the content of the data so it can be visualized meaningfully?"

Informationally Interesting - According to Wiktionary, humans pay attention to things that are interesting and/or attractive. Something that is attractive or beautiful is pleasing to the senses. While beauty is in the eye of the beholder, there are some more or less agreed upon principles of beauty, such as symmetry and harmony. Surprise, within the context of harmony, is especially interesting to humans. The visual artist part of the data scientist needs to ask, "how can we visualize the content of the data so it is pleasing with a touch of surprise?"

Consider the following graphic. It is a partial map of the Internet in early 2005. Each line is drawn between two nodes, representing two IP addresses. Notice that it abstracts only a subset of data about the Internet. It clearly went through a number of iterations to arrive at such a harmonious color scheme. It has an overall symmetry, with some surprises in the details (the bright "stars"). Finally, it is meaningful in the context of understanding the World Wide Web.

Hacker mindset

(See Chapter 06: Thinking Like a Hacker for a more in-depth discussion)

Hacking is the secret sauce of data science. According to the Wikipedia, hacking is modifying one's own computer system, including building, rebuilding, modifying, and creating software, electronic hardware, or peripherals, in order to make it better, make it faster, give it added features, and/or make it do something it was never intended to do. For the data scientist, hacking goes beyond the computer system to the whole enterprise of solving data problems. Think of it as an advanced do-it-yourself (DIY) mode of working.

Data science hacking involves inventing new models, exploring new data structures, and mashing the 8 parent disciplines in unconventional ways. Hacking requires boldness, creativity, vision, and persistence. Here are two examples. (Even though they involve hardware, they are presented because they are readily understandable in a few sentences. More complex data science examples are given in chapter four.)

A famous example is Steve Wozniak's hand-made Apple I computer. It was built from parts scrounged from Hewlett-Packard's trash and from electronic surplus supply stores. Wozniak wanted to give the plans away, but his partner, Steve Jobs, convinced him that they should sell ready-made machines. The rest, as they say, is history.
Another example is the Carnegie Mellon Internet Coke Machine.^[1] In the early days of the internet before the web, students at Carnegie Mellon instrumented and wired their local Coke Machine to the internet. The students could check to see which internal dispenser columns had been loaded most recently, so they could be sure to buy cold, not warm, sodas. This was important because the machine sold one Coke every 12 minutes and was re-loaded several times a day.

Data scientists often need the data equivalent of a hackerspace, where they can congregate to help each other invent new analytic solutions. The hacker part of a data scientist needs to ask, "do we need to modify our tools or create anything new to solve our problem?" and "how do we combine our different disciplines to come up with an insightful conclusion?"

Domain Expertise

(See Chapter 12: Thinking Like a Domain Expert for a more in-depth discussion)

Domain Expertise is the glue that holds data science together. According to the Wikipedia, subject matter or domain expertise is proficiency, with special knowledge or skills, in a particular area or topic. Spoken references to subject matter experts sometimes spell out the acronym "SME" ("S-M-E") and other times it is voiced as a word ("smee"). Any domain of knowledge can be subject to a data science inquiry, including—but not limited to—medicine, politics, the physical and biological sciences, marketing, information security, demographics, and even literature. Every data science team must include at least one person who is a subject matter expert on the problem being solved.

Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know, and how best to package the knowledge so it can be readily absorbed by their customers. For example,

Edwin Chen, a data scientist at Twitter, computed and visualized the geographic distribution of tweets that refer to soft drinks as "soda," as "pop," and as "coke."^[2] Just observing that the Midwest uses "pop" and the Northeast uses "soda" is interesting but lacks an explanation. In order to understand WHY these geographic divisions exist, we would need to consult with domain experts in sociology, linguistics, US history, and maybe anthropology, none of whom may know anything about data science. Why do you think these geographic linguistic differences exist?

Nate Silver is a statistician and domain expert in US politics. His blog^[3] regularly combines both the data and an explanation of what it means. In his posting, "How Romney’s Pick of a Running Mate Could Sway the Outcome,"^[4] he not only tells us what the differences are based on his mathematical model, he explains why those outcomes fell out the way they did.

The domain expert part of the data scientist needs to ask, "what is important about the problem we are solving?" and "what exactly should our customers know about our findings?"

Assignment/Exercise

Become familiar with the R programming environment. Get into a group of 3 to 4 students from the class. Work in study sessions together as a team on the following items. See if you can explain to each other what you are doing. Help each other understand what is going on. You will have to try some things several ways until it works right. That is ok. Some of you will "get it" faster than others. Please help each other so you all "get it."

Print a copy and read over Google's R Style Guide.^[5] Right now, most of the guide will not make a lot of sense, but it will make more sense as we progress through the book. Keep the printed copy for future reference.
Search the web for "introduction to R," "R tutorial," "R basics," and "list of R commands." Pick four or five of these web sites to work on. Try working through the first few examples of each site. Many of the introductions go too fast or assume too much prior knowledge, so when it gets too confusing just try another site.
Try the commands:

library(help="utils")
library(help="stats")
library(help="datasets")
library(help="graphics")
demo()
demo(graphics)
demo(persp)

Write a short 5 to 7 line program that will execute without errors and save it. Be sure to include the names of all those who contributed in the comment section.
Make a list of the sites the team worked from, and indicate which was the most helpful.
Make a list of the top 10 unanswered questions the team has at the end of the study session.

References

[1] CS Department Coke Machine (14 February 2005). "The 'Only' Coke Machine on the Internet". Carnegie Mellon University Computer Science. Retrieved 8 August 2012.
[2] Edwin Chen (6 July 2012). "Soda vs. Pop with Twitter". Edwin Chen's Blog. Retrieved 8 August 2012.
[3] Nate Silver. "FiveThirtyEight". Blog. New York Times. Retrieved 8 August 2012.
[4] Nate Silver (8 August 2012). "How Romney's Pick of a Running Mate Could Sway the Outcome". Blog. New York Times. Retrieved 8 August 2012.
[5] "R Style Guide". Google, Inc. Retrieved 6 July 2012.

Copyright Notice

You are free:

to Share — to copy, distribute, display, and perform the work (pages from this wiki)
to Remix — to adapt or make derivative works

Under the following conditions:

Attribution — You must attribute this work to Wikibooks. You may not suggest that Wikibooks, in any way, endorses you or your use of this work.
Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
Other Rights — In no way are any of the following rights affected by the license:

Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
The author's moral rights;
Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice: For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the following web page.

http://creativecommons.org/licenses/by-nc-sa/3.0/
