The World of P2P: What is P2P, a Computer Science Perspective
From a Computer Science Perspective
Technically, a true peer-to-peer application must implement only peering protocols that do not recognize the concepts of "server" and "client". Such pure peer applications and networks are rare. Most networks and applications described as peer-to-peer actually contain or rely on some non-peer elements, such as DNS. Also, real-world applications often use multiple protocols and act as client, server, and peer simultaneously, or over time.
From a computer science perspective, P2P opens interesting new fields of research, not only because of the not-so-recent switch of roles among network components, but also because of the unforeseen benefits and resource optimizations it enables in network efficiency and stability.
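To make the idea of one application acting as both client and server more concrete, here is a minimal sketch in Python. It is not taken from any particular P2P protocol: the `Peer` class, its `ask` method, and the one-line text exchange are all hypothetical names and conventions chosen for illustration, and the peers run on the local machine only.

```python
import socket
import socketserver
import threading


class Peer:
    """A node that both serves requests and issues them (hypothetical sketch)."""

    def __init__(self, name):
        self.name = name
        peer = self

        class Handler(socketserver.StreamRequestHandler):
            def handle(self):
                # Server role: answer one line-based request from another peer.
                request = self.rfile.readline().decode().strip()
                reply = f"{peer.name} got '{request}'\n"
                self.wfile.write(reply.encode())

        # Port 0 lets the operating system choose a free port.
        self.server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), Handler)
        self.address = self.server.server_address
        threading.Thread(target=self.server.serve_forever, daemon=True).start()

    def ask(self, other_address, message):
        # Client role: connect to another peer and send it a request.
        with socket.create_connection(other_address, timeout=5) as conn:
            conn.sendall((message + "\n").encode())
            return conn.makefile().readline().strip()

    def close(self):
        self.server.shutdown()
        self.server.server_close()


alice, bob = Peer("alice"), Peer("bob")
print(alice.ask(bob.address, "ping"))   # bob got 'ping'
print(bob.ask(alice.address, "pong"))   # alice got 'pong'
alice.close()
bob.close()
```

Both nodes run the same code, and neither is designated "the server". Each listens for requests and can also start one, which is the symmetry the paragraph above describes.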
Peer-to-peer systems and applications have attracted a great deal of attention from computer science research; some prominent research projects include the Chord lookup service, the PAST storage utility, and the CoopNet content distribution system (see below for external links related to these projects).
It is also important to notice that the computer is primarily an information device, whose primary function is to copy data from location to location, even more than performing other types of computation. This makes digital duplication intrinsic to the normal function of any computer, so it is impossible to realize the goal of general-purpose open computing with any type of copy protection. Enforcement of copyright in the digital era should not be seen as a technical issue but as a new reality that society needs to adapt to.
Distributed systems are becoming key components of IT companies for data-centric computing. A general example of these systems is the Google infrastructure or any similar system. Today most of the evolution of these systems is focused on how to analyze and improve performance. A P2P system is also a distributed system and shares, depending on the implementation, the characteristics and problems of distributed systems (error and failure detection, aligning machine time, etc.).
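As an illustration of the failure detection problem mentioned above, the following Python sketch shows one common approach: a timeout-based detector fed by periodic heartbeats. The class name, the method names, and the 10-second timeout are assumptions made for this example and do not describe any specific system. Time is passed in explicitly so that the example is deterministic.

```python
class HeartbeatMonitor:
    """Timeout-based failure detector (illustrative; names are hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before suspecting a node
        self.last_seen = {}         # node id -> time of the last heartbeat

    def heartbeat(self, node, now):
        # Record that `node` reported in at time `now`.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=10.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=0.0)
monitor.heartbeat("node-a", now=8.0)    # node-b stays silent
print(monitor.suspected(now=12.0))      # ['node-b']
```

A detector like this can only suspect a failure. A silent node may simply be slow or cut off by the network, which is one reason failure detection is listed among the hard problems of distributed systems.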
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency.
Ganglia has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes. ( http://ganglia.info/ )
The basic premise behind distributed computation is to spread computational tasks between several machines distributed in space. Most of the new projects focus on harnessing the idle processing power of "personal" distributed machines, the normal home user PC. This current trend is an exciting technology area that touches on a subset of distributed systems topics (client/server communication, protocols, server design, databases, and testing).
This new implementation of an old concept has its roots in the realization that there is now a staggering number of computers in our homes that are vastly underutilized. This applies not only to home computers: few businesses use their computers the full 24 hours of any day. In fact, seemingly active computers may be using only a small part of their processing power. Word processing, email, and web browsing require very few CPU resources. So the "new" concept is to tap this underutilized resource (CPU cycles), which can surpass several supercomputers at substantially lower cost, since the machines are individually owned and operated by the general public.
One of the most famous distributed computation projects is SETI@home, hosted by the Space Sciences Laboratory at the University of California, Berkeley, in the United States. SETI is an acronym for the Search for Extra-Terrestrial Intelligence. SETI@home was released to the public on May 17, 1999.
On average it used hundreds of thousands of home Internet-connected computers in the search for extraterrestrial intelligence. The whole point of the program is to use your CPU cycles when they would otherwise be idle. The original project has since been deprecated and folded into BOINC.
BOINC stands for Berkeley Open Infrastructure for Network Computing. It is a non-commercial (free and open source software, released under the LGPL) middleware system for volunteer computing, originally developed to support the SETI@home project and still hosted at ( http://boinc.berkeley.edu/ ), but intended to be useful for other applications in areas as diverse as mathematics, medicine, molecular biology, climatology, and astrophysics. As an open-source software platform for computing with volunteered resources, it extends the original concept and lets you donate computing power to other scientific research projects such as:
- Climateprediction.net: study climate change.
- Einstein@home: search for gravitational signals emitted by pulsars.
- LHC@home: improve the design of the CERN LHC particle accelerator.
- Predictor@home: investigate protein-related diseases.
- Rosetta@home: help researchers develop cures for human diseases.
- SETI@home: look for radio evidence of extraterrestrial life.
- Folding@Home ( http://www.stanford.edu/group/pandegroup/folding/ ): to understand protein folding, misfolding, and related diseases.
- Cell Computing biomedical research. (Japanese; requires nonstandard client software)
- World Community Grid: advance our knowledge of human disease. (Requires 5.2.1 or greater)
As a "quasi-supercomputing" platform, BOINC has over 435,000 active computers (hosts) worldwide. BOINC is funded by the National Science Foundation through awards SCI/0221529, SCI/0438443, and SCI/0506411.
It is also used commercially, as some private companies are beginning to use the platform to assist in their own research. The framework supports various operating systems: Windows (XP/2K/2003/NT/98/ME), Unix (GNU/Linux, FreeBSD) and Mac OS X.
World Community Grid (WCG)
Created by IBM, World Community Grid ( http://www.worldcommunitygrid.org/ ) is similar to the above systems. Fourteen IBM servers serve as "command central" for WCG. When they receive a research assignment from an organization, they will scour it for security bugs, parse it into data units, encrypt them, run them through a scheduler and dispatch them out in triplicate to the army of volunteer PCs.
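The split-and-dispatch step described above can be sketched in a few lines of Python. This is not WCG's actual scheduler: the function names are hypothetical, the security scan and encryption steps are left out, and the majority-vote rule in `accept` is an assumption about why a system might send each unit out three times.

```python
from collections import Counter
from itertools import cycle


def split_into_units(data, unit_size):
    # Cut the assignment into fixed-size work units.
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def dispatch(units, volunteers, copies=3):
    # Assign every unit to `copies` distinct volunteers, round-robin.
    if len(volunteers) < copies:
        raise ValueError("need at least as many volunteers as copies")
    ring = cycle(volunteers)
    return {index: [next(ring) for _ in range(copies)]
            for index in range(len(units))}


def accept(results):
    # Assumed validation rule: keep the answer reported by a strict majority.
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(results) / 2 else None


units = split_into_units(list(range(10)), unit_size=4)
plan = dispatch(units, ["pc1", "pc2", "pc3", "pc4"])
print(plan[0])                 # ['pc1', 'pc2', 'pc3']
print(accept([10, 10, 7]))     # 10
print(accept([1, 2, 3]))       # None
```

Sending each unit to three machines costs three times the computation, but it lets the coordinator tolerate one volunteer that is slow, faulty, or returns a wrong result.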
To be a volunteer one only needs to download a free, small software agent (similar to a screensaver).
Projects are selected based on their potential to benefit from WCG technology and to address humanitarian concerns, and are chosen by an independent, external board of philanthropists, scientists and officials.
The software is open source (LGPL), written in C/C++ with wxWidgets, and is available for Windows, Mac, or Linux.
Grids first emerged in the use of supercomputers in the U.S., as scientists and engineers sought access to scarce high-performance computing resources that were concentrated at a few sites.
Open Science Grid
The Open Science Grid ( http://www.opensciencegrid.org/ ) was built and is operated by the OSG Consortium. It is a U.S. grid computing infrastructure that supports scientific computing via an open collaboration of science researchers and software developers from universities and national laboratories, and storage and network providers.
The Globus Alliance ( http://www.globus.org/ ) is a community of organizations and individuals developing fundamental technologies behind the "Grid," which lets people share computing power, databases, instruments, and other on-line tools securely across corporate, institutional, and geographic boundaries without sacrificing local autonomy.
The Globus Alliance also provides the Globus Toolkit, an open source software toolkit used for building robust, secure, grid systems (peer-to-peer distributed computing on supercomputers, clusters, and other high-performance systems) and applications. A Wiki is available to the Globus developer community ( http://dev.globus.org/wiki/Welcome ).
High Throughput Computing (HTC)
Some scientists try to extract more floating point operations per second (FLOPS), or per minute, from their computing environment; we refer to these environments as High Performance Computing (HPC) environments. Others concentrate on the same goal over larger time scales, like months or years; we refer to these environments as High Throughput Computing (HTC) environments.
The term HTC was coined in a seminar at the NASA Goddard Space Flight Center in July of 1996 as a distinction between High Performance Computing (HPC) and High Throughput Computing (HTC).
The focus of HTC is on processing power and not on the network, but such systems can also be built over a network and so be seen as a Grid network optimized for processing power.
The goal of the Condor Project ( http://www.cs.wisc.edu/condor/ ) is to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. Guided by both the technological and sociological challenges of such a computing environment, the Condor Team has been building software tools that enable scientists and engineers to increase their computing throughput.
IBM Grid Computing
IBM, among other big fish in the IT pond, spends some resources investigating Grid Computing. Its efforts around grid computing are listed on the projects portal page ( http://www.ibm.com/grid ). All seem to attempt to leverage the company's enterprise position in server machines to provide grid services to customers. The most active project is Grid Medical Archive Solution, a scalable virtual storage solution for healthcare, research and pharmaceutical clients.
The traditional method of distributing large files is to put them on a central server. The server and the client can then share the content across the network using agreed-upon protocols (from HTTP and FTP to any number of variations). When using IP connections, the data can be sent over TCP, UDP, or a mix of the two; the choice depends mostly on the requirements of the service, machines, and network, and on many security considerations.
The advantages of optimizing speed, availability and consistency of service through optimal localization are nothing unheard of. Akamai Technologies and Limelight Networks, among other similar solutions, have attempted to address this issue commercially; even Google has distributed the location of its data centers to improve the response time of its services. This has addressed the requirement of large content and service distribution, but it is not a full decentralization of the control structure.
P2P evolved to solve a distinct problem: central servers do not scale well. Bandwidth, storage space and CPU each constitute a point of failure that can easily bring the function of a system to an end, as with any centralization of services.
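The scaling argument can be made concrete with a simple textbook-style model of file distribution. The Python sketch below computes lower bounds on the time needed to deliver one file to every peer, first when only the server uploads and then when peers also upload to each other. The function names and all the numbers (file size, link speeds, peer count) are assumptions chosen for illustration, and the model ignores protocol overhead and peers leaving the network.

```python
def client_server_time(file_bits, server_up, peer_downs):
    # The server must upload one full copy per peer, and no peer can
    # finish faster than its own download link allows.
    n = len(peer_downs)
    return max(n * file_bits / server_up, file_bits / min(peer_downs))


def p2p_time(file_bits, server_up, peer_ups, peer_downs):
    # The server uploads at least one copy; afterwards the combined
    # upload capacity of the server and all peers carries the n copies.
    n = len(peer_downs)
    total_up = server_up + sum(peer_ups)
    return max(file_bits / server_up,
               file_bits / min(peer_downs),
               n * file_bits / total_up)


FILE = 8_000_000_000            # 1 GB file, in bits (assumed)
SERVER_UP = 100_000_000         # 100 Mbit/s server upload (assumed)
N = 100                         # number of peers (assumed)
ups = [10_000_000] * N          # 10 Mbit/s upload per peer (assumed)
downs = [50_000_000] * N        # 50 Mbit/s download per peer (assumed)

print(client_server_time(FILE, SERVER_UP, downs))     # 8000.0
print(round(p2p_time(FILE, SERVER_UP, ups, downs)))   # 727
```

In this model the client-server time grows linearly with the number of peers, while in the P2P case every new peer also adds upload capacity, so the bound grows much more slowly.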
Conferences and papers
P2P is not yet a well-established field of research, or even a field specific to computer science. P2P technology covers so many subjects that it is still hard to delimit all of these interactions as a field in itself. As a result, much relevant information is hard to find.
For conferences, one of the locations that has up-to-date information from a non-commercial and platform-agnostic viewpoint is the list provided by the GNUnet project at https://gnunet.org/conferences .