Data Science: An Introduction

Chapter 01: A History of Data Science

Chapter Summary

Data Science is a composite of a number of pre-existing disciplines. It is a young profession and academic discipline. The term was coined in 2001. Its popularity has exploded since 2010, driven by the need for teams of people to analyze the big data that corporations and governments are collecting. The Google search engine is a classic example of the power of data science.

Discussion

Data science is a discipline that incorporates varying degrees of Data Engineering, Scientific Method, Math, Statistics, Advanced Computing, Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data analysis problems.

Origins

The term "Data Science" was coined at the beginning of the 21st century. It is attributed to William S. Cleveland^[1] who, in 2001, wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics."^[2] About a year later, in April 2002, the International Council for Science: Committee on Data for Science and Technology^[3] began publishing the CODATA Data Science Journal.^[4] Shortly thereafter, in January 2003, Columbia University began publishing The Journal of Data Science.^[5]

Development

During the "dot-com" bubble of 1998-2000, hard drives became very cheap, so corporations and governments started buying lots of them. One corollary of Parkinson's Law is that data always expands to fill the disk space available. The “disk-data” interaction is a positive feedback loop between buying ever more disks and accumulating ever more data. This cycle produces big data, a term used to describe data sets so large and complex that they become awkward to work with using regular database management tools.

Once we have acquired big data, we have to do something with it besides just store it. We need big computing architectures. Companies like Google, Yahoo!, and Amazon invented a new computing architecture, which we call cloud computing. One of the most important inventions within cloud computing is called MapReduce. MapReduce has been codified into the software known as Hadoop. We use Hadoop to do big computing on big data in the cloud.

The normal computing paradigm is that we move data to the algorithm. For example, we read data off a hard drive and load it into a spreadsheet program to process. The MapReduce computing paradigm is just the opposite. The data are so big that we cannot bring them all to the algorithm. Instead, we push many copies of the algorithm out to the data.
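The idea of sending the algorithm to the data can be illustrated with a small word-count sketch. This is a single-machine imitation in Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are hypothetical, and each string in the list stands in for a chunk of data that would live on a different machine in a real cluster.

```python
from collections import defaultdict


def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all the values that were emitted for one key."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over every chunk and return word totals."""
    mapped = []
    # In a real cluster, a copy of map_phase would run on the machine that
    # stores each chunk; here we simply loop over the chunks in one process.
    for chunk in chunks:
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data: two "chunks" standing in for two storage nodes.
chunks = ["big data needs big computing", "data science uses big data"]
counts = word_count(chunks)
print(counts)
```

The mapper only ever sees one chunk at a time, which is what allows many copies of it to run in parallel next to the data; only the small (word, count) pairs need to travel across the network.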

It turns out that Hadoop is difficult to use. It requires advanced computer science capabilities. This opens up a market for analytics tools, with simpler interfaces, that run on top of Hadoop. This class of tools is called “Mass Analytic Tools,” that is, tools for the analysis of massive data. Examples of these are “recommender systems,” “machine learning,” and “complex event processing.” These tools, while offering a simpler interface to Hadoop, have complex mathematical underpinnings, which also require specialization.

So, with the advent of mass analytic tools, we need people who understand the tools and actually do the analysis of big data. We call these people “Data Scientists.” They are able to tease out new analytic insights never before possible in the world of small data. The scale of the problems that are solved by analyzing big data is such that no single person can do all the data processing and analytic synthesis required. Therefore, data science is best practiced in teams.

In sum: cheap disks --> big data --> cloud computing --> mass analytic tools --> data scientists --> data science teams --> new analytic insights.

Popularization

Mike Loukides,^[6] Vice President of Content Strategy for O'Reilly Media, helped to bring Data Science into the mainstream vernacular in 2010 with his article "What is data science?"^[7] In the last few years, data science has been increasingly associated with the analysis of big data. In the mid-2000s, DJ Patil at LinkedIn and Jeff Hammerbacher at Facebook created data science teams specifically to derive business value out of the extremely large data being generated by their websites.^[8]^[9]

There are now several ongoing conferences devoted to big data and data science, such as O'Reilly's Strata Conferences^[10] and Greenplum's Data Science Summits.^[11]

The job title has similarly become very popular. On one heavily used employment site, the number of job postings for "data scientist" increased more than 10,000 percent between January 2010 and July 2012.^[12]

Academic Programs

Several universities have begun programs in data science, such as the graduate programs at the Institute for Advanced Analytics at North Carolina State University^[13] and the McCormick School of Engineering at Northwestern University,^[14] and the now-discontinued six-week summer program at the University of Illinois.^[15]

Professional Organizations

A few professional organizations have sprung up recently. Data Science Central^[16] and Kaggle^[17] are two such examples. Kaggle is an interesting case. It crowdsources data science solutions to difficult problems. For example, a company will post a hard problem on Kaggle. Data scientists from around the world sign up with Kaggle, then compete with each other to find the best solution. The company then pays for the best solution. There are over 30,000 data scientists registered with Kaggle.

Case Study

In the mid- to late-1990s, AltaVista was the most popular search engine on the internet. It sent "crawlers" to extract the text from all the pages on the web. The crawlers brought the text back to AltaVista, which indexed all of it. So, when a person searched for a keyword, AltaVista could find the web pages that contained that word. AltaVista then presented the results as an ordered list of web pages, with the pages that mentioned the term most frequently at the top. This is a straightforward computer science solution, though at the time, its builders solved some very difficult scaling problems.
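The index-then-rank approach described above can be sketched in a few lines of Python. This is only an illustration of ranking by keyword frequency, not a reconstruction of AltaVista's actual system; the page names, page texts, and the functions build_index and search are all hypothetical.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {page name: number of times the word occurs there}."""
    index = defaultdict(dict)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word][name] = index[word].get(name, 0) + 1
    return index


def search(index, keyword):
    """Return the pages containing keyword, most frequent mentions first."""
    hits = index.get(keyword.lower(), {})
    # Sort by descending count; ties are broken alphabetically by page name.
    return sorted(hits, key=lambda name: (-hits[name], name))


# Hypothetical "crawled" pages: page name -> extracted text.
pages = {
    "a.html": "data science is the science of data",
    "b.html": "big data",
    "c.html": "search engines crawl the web",
}
index = build_index(pages)
print(search(index, "data"))
```

Because the ranking depends only on the words on each page, a page can rise to the top simply by repeating a term many times, which is one weakness of purely frequency-based ordering.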

In the late 1990s, the founders of Google invented a different way to do search. They combined math, statistics, data engineering, advanced computation, and the hacker spirit to create a search engine that displaced AltaVista. The algorithm is known as PageRank. PageRank looks not only at the words on the page but at the hyperlinks as well. PageRank assumes that an inbound hyperlink is an indicator that some other person thought the current page was important enough to link to from their own page. Thus pages with many inbound hyperlinks, especially links from pages that are themselves highly ranked, tend to end up at the top of the list of search results. PageRank captures the human knowledge about web pages, in addition to the content.
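A minimal sketch of the link-based idea follows, written in Python as a simple power iteration. It is a textbook-style simplification, not Google's production algorithm: the function name pagerank, the four-page link graph, the damping value of 0.85 (a commonly cited choice), and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a rank for each page from a dict of page -> outbound links."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal ranks

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "home"; nothing links to "blog".
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "blog": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, "home" collects the highest rank because every other page links to it, while "blog" stays at the baseline because no page links to it. The ranks always sum to 1, so each value can be read as a page's share of the total importance.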

Google not only crawled the web, it ingested the web. That is big data. Google then had to run the PageRank algorithm across that big data, which requires massive computing power. Then it had to make search and search results fast for everyone. Google search is a triumph of data science (though it was not called data science when it started).

Assignment/Exercise

(This section was imported from the R Programming Wikibook chapter on Settings and then modified.)

Get into groups of 2 or 3 students. Download and install the R programming language onto your computer. Help each other get R up and running.

Go to the R website: http://www.r-project.org/
Click on the CRAN mirror link
Click on the Linux, Mac OS X, or Windows link

Linux

Installing R on Debian-based GNU/Linux distributions (e.g. Ubuntu or Debian itself) is as simple as typing sudo aptitude install r-base or sudo apt-get install r-base, or installing the package r-base using your favourite package manager, for example Synaptic.

There are also many packages that extend R for different purposes. Their names begin with r-. Take a closer look at the package r-recommended. It is a metapackage that depends on a set of packages that are recommended by the upstream R core team as part of a complete R distribution. It is possible to install R by installing just this package, as it depends on r-base.

Installation with apt-get (Debian, Ubuntu and all Linux distributions based on Debian)

sudo apt-get install r-base
sudo apt-get install r-recommended

Installation with aptitude (Debian, Ubuntu and all Linux distributions based on Debian)

sudo aptitude install r-base
sudo aptitude install r-recommended

Mac OS

Installation: Download the disk image (dmg file) and install R.

The default graphical user interface for Mac is much better than the one for Windows. It includes

a dataframe manager,
a history of all commands,
a program editor which supports syntax highlighting.

Windows

(This section was imported from the Wikiversity project: "How to use R" course chapter on installation.)

To install R under the Windows operating system, you have to download the binaries from the web. First go to the R-Project website (listed above), click CRAN under the download section on the left panel, and select a mirror site from which to download the required content. The best idea is to pick the mirror closest to your actual geographical location, but other ones should work as well. Then click Windows and, in the subdirectories, base. The Windows binary is the exe file, of the form R-x.x.x-win32.exe, where x.x.x denotes the actual version of the program. Regardless of the version, the setup has the same steps.

As usual in Windows, if you just keep clicking the Next button, you will install the program without any problems. However, there are a few things that you can alter.

On the welcome screen click Next.
Read or just notice the GNU license, and click Next.
Select the location where R should be installed. If you don't prefer a particular location on your hard disk, the default choice will be fine.
During the next step you can specify which parts of R you want to install. The choices are: User installation, Minimal user installation, Full installation and Custom installation. Notice the required space under the selection panel (it varies between 20 and 66 MB). If you are a beginner in R, choose the default User installation.
In this step you can choose between two options. If you accept the defaults, you skip the three "extra" steps during installation (described below).
You can specify the Start menu folder.
In the next step you can choose between shortcut possibilities (desktop icon and/or quick launch icon) and specify registry entries.

With the following three extra steps you can customize the R graphical user interface.

You can choose whether you want an R graphical user interface covering the whole screen (MDI) or a smaller window (SDI).
You can select the style in which the Help screen is displayed in R. You will use help a lot, so this may be an important decision. It is up to you which style you prefer. Please note that the content of the help file will be the same regardless of your choice. Here you specify just the appearance of that particular window.
In the next step you can specify whether you want to use internet2.dll. If you are a beginner, pick the Standard option here.

Portable R for Windows

If you want to install R on your USB stick, go to the Portable R^[18] website. This is useful if you don't have admin rights on a computer. The basic installation requires something like 115 MB, but you may need more if you want to install add-on packages.

References

[1] William S. Cleveland. "Faculty Page". Retrieved 6 July 2012.
[2] Cleveland, W. S. (2001). "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics". International Statistical Review / Revue Internationale de Statistique. 69 (1).
[3] "International Council for Science: Committee on Data for Science and Technology". Retrieved 6 July 2012.
[4] "CODATA Data Science Journal". Volume 1, Issue 1. Retrieved from Japan Science and Technology Information Aggregator. April 2002. Retrieved 6 July 2012.
[5] "The Journal of Data Science". Volume 1, Issue 1. Columbia University. January 2003. Retrieved 6 July 2012.
[6] "Mike Loukides". O'Reilly Media, Inc. Retrieved 7 July 2012.
[7] Mike Loukides (June 2010). "What is Data Science?". O'Reilly Media, Inc. Retrieved 7 July 2012.
[8] Patil, DJ (2011). Building Data Science Teams. Sebastopol, CA: O'Reilly Media, Inc.
[9] DJ Patil (16 September 2011). "Building Data Science Teams". O'Reilly Media, Inc. Retrieved 7 July 2012.
[10] "Strata Conference 2012". O'Reilly Media, Inc. Retrieved 7 July 2012.
[11] "Data Science Summits". Greenplum, Inc. Retrieved 7 July 2012.
[12] "Data Science Job Trends". Indeed.com. Retrieved 7 July 2012.
[13] "Institute for Advanced Analytics". North Carolina State University. Retrieved 7 July 2012.
[14] "Master of Science in Analytics". Northwestern University. Retrieved 7 July 2012.
[15] "Data Sciences Summer Institute". University of Illinois at Urbana-Champaign. Retrieved 7 July 2012.
[16] "Data Science Central". Data Science Central. Retrieved 7 July 2012.
[17] "kaggle". Kaggle Inc. Retrieved 13 July 2012.
[18] "Portable R". Retrieved 14 July 2012.

Copyright Notice

You are free:

to Share — to copy, distribute, display, and perform the work (pages from this wiki)
to Remix — to adapt or make derivative works

Under the following conditions:

Attribution — You must attribute this work to Wikibooks. You may not suggest that Wikibooks, in any way, endorses you or your use of this work.
Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
Other Rights — In no way are any of the following rights affected by the license:

Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
The author's moral rights;
Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the following web page.

http://creativecommons.org/licenses/by-sa/3.0/
