FOSS Education/Research using FOSS

FOSS Education


Traditionally, academic research is carried out in an open manner where the publication of research findings is preceded by a peer review process. All of the assumptions, calculations and experiments that lead to the results are scrutinized before the findings are accepted by journals for publication. Researchers do not usually acquire ownership of their findings and discoveries, and they are expected to publish them.

Computer software is often used not only in Computer Science and ICT research but also in research in many other fields. "However, scientists rarely make their software available to other scientists for scrutiny - and even if they did, they often used closed-source programs in which the underlying source code is protected by copyright and trade secrecy claims. But this practice strikes at the heart of science, namely, the notion of verifiability. To be accepted as valid, all calculations and assumptions that go into a given scientific assumption must be open to public scrutiny. Yet closed-source software makes such scrutiny impossible."^[1]

In contrast, the open philosophy of FOSS is entirely consistent with the process of academic research since the source code of the software is also available for examination. Researchers should use FOSS as a tool in their work as far as possible. Bryan Pfaffenberger goes further to argue that "It's not enough for scientists to use open-source software; they must also use an open-source operating system."^[2]

As mentioned earlier, Free/Open Source programming languages, database systems, spreadsheet software and other applications that can be used for computations and data analysis in research are available. More specialized FOSS is also available. Some important examples are presented here:

Publishing

Many research papers in physics and mathematics are written using LaTeX.

Numerical Math

Numerical algorithms are applied to compute approximate solutions for problems where exact solutions are unknown or difficult to obtain. The output of a numerical algorithm usually consists of many numbers that represent variables of interest at different points in time or space.

Numerical values that belong together are usually represented as arrays. Matrices and vectors from linear algebra, as well as tensors, can be conveniently expressed as arrays. Nearly all numerical algorithms are therefore written in terms of arrays.
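To illustrate how a matrix and a vector map onto arrays, here is a minimal sketch in plain Python that stores a matrix as a list of rows and computes a matrix-vector product. The function name mat_vec and the sample values are assumptions chosen for illustration; they do not come from any of the packages discussed in this chapter.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    # Each entry of the result is the dot product of one row with the vector.
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Illustrative values only: a 2x2 matrix and a vector of length 2.
A = [[2.0, 0.0],
     [1.0, 3.0]]
v = [1.0, 2.0]

result = mat_vec(A, v)
print(result)  # [2.0, 7.0]
```

The languages described below provide this kind of operation as a built-in, so the explicit loop over rows disappears from the user's code.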

Array Manipulation and Prototyping of Algorithms

Numerical algorithms are frequently developed with easy-to-learn, interpreted programming languages. These languages feature built-in array and matrix objects. They come with a wide variety of subroutines for numerical computing, the creation of high-quality graphics, and file input and output. Their ease of use results in a high development speed that compensates for their relatively slow speed of execution.


The interpreted languages mentioned here are themselves 20 to 100 times slower than compiled languages (C, C++, Fortran). However, their numerical subroutines are written in a compiled language and execute fast. Therefore, as matrices grow larger, the speed penalty of the interpreted languages becomes less pronounced. Time-critical parts of an algorithm can also be written in a compiled language, which is reasonably easy. The speed comparisons here [1], [2] show a 5 to 10 times speed penalty for a simple finite difference algorithm. The test here [3] compares the performance of linear algebra routines.
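The difference between interpreted code and a compiled subroutine can be observed with a small experiment. The sketch below, written for the standard CPython interpreter, sums the same list twice: once with an explicit Python loop and once with the built-in sum function, which is implemented in C. The function name python_loop_sum, the list size and the repetition count are assumptions chosen for illustration; the measured times depend on the machine, so no particular ratio is claimed here.

```python
import timeit


def python_loop_sum(values):
    """Sum a list with an explicit loop executed by the interpreter."""
    total = 0.0
    for v in values:
        total += v
    return total


# Illustrative workload: 100,000 floating point numbers.
values = [float(i) for i in range(100_000)]

# Time both variants on the same data (20 repetitions each).
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(f"explicit loop: {loop_time:.4f} s")
print(f"built-in sum:  {builtin_time:.4f} s")
```

Both variants return the same result; only the place where the loop runs (interpreter or compiled code) differs.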

Matlab - Proprietary and Widespread

The standard in this field is the proprietary language Matlab. The language is specialized for array manipulation. The documentation is good and very extensive; it is also available online [4].

Matlab is proprietary software. It is sold as a base package plus many separate toolboxes that cost extra. Pricing information (commercial license, one user, USA, November 2007): $1,900 for the base language; toolboxes range from several hundred to several thousand dollars. There are academic discounts.

For a longer discussion of the language itself, see the section on the almost identical Octave language.

Scilab

Scilab is a special-purpose language for matrix manipulation. Its syntax is similar to that of the Matlab language but not identical. It comes with a very large collection of packages to solve problems from many scientific disciplines.

Since version 5, Scilab is free software, distributed under the CeCILL license. [5] [6]


Scilab's Homepage

Octave

Octave is a free clone of the proprietary Matlab language. It is a special-purpose language for array manipulation, and its syntax is very well suited to the task. The language has operators for both elementwise and linear algebra operations. It is also easy to learn. Octave covers only the core of Matlab; toolboxes are separate projects, and many do not yet exist. The most common algorithms, however, do exist.

The documentation is fairly good. There exists a tutorial [7].

It is very easy to write extension modules for Octave in C++, C and Fortran (easier than for Python). Octave comes with a C++ library for matrix manipulation that is also used internally [8].

Octave has no object orientation (or only obscure support for it), so organizing big projects is difficult.

Because Octave is a special-purpose language, it is very difficult to do anything that lies outside Octave's domain. A frequently encountered problem is reading unusual file formats. Octave has neither good string processing facilities nor an XML reader. Graphical user interfaces would have to be written in a different language.

Octave does not come with an IDE, but many editors can do syntax coloring for the (identical) Matlab language. There is also an Emacs mode for Octave [9], which is very powerful but only useful for people who like the rather special Emacs [10] editor.

Octave's Homepage

Numpy, Scipy, Matplotlib

Numpy, Scipy and Matplotlib are libraries for the Python programming language. Together they give Python the capabilities of a numerical prototyping language.

Numpy provides the array object and all basic array and matrix functionality.
Scipy is a collection of scientific algorithms (many from Netlib) with a Python wrapper.
Matplotlib provides 2D (and very simple 3D) plotting capabilities.
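As a brief illustration of the array functionality that Numpy provides, the following sketch creates two small arrays and applies an elementwise product, a matrix product and a linear solve. It assumes that Numpy is installed; the array values are chosen purely for illustration.

```python
import numpy as np

# Two small 2x2 arrays with illustrative values.
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# '*' multiplies matching entries; '@' is the linear algebra matrix product.
elementwise = a * b
matrix_product = a @ b

# Solve the linear system a x = rhs for x.
rhs = np.array([5.0, 11.0])
x = np.linalg.solve(a, rhs)

print(elementwise)     # [[0. 2.] [3. 0.]]
print(matrix_product)  # [[2. 1.] [4. 3.]]
print(x)               # approximately [1. 2.]
```

The two multiplication operators correspond to the distinction between elementwise and linear algebra operations that special-purpose array languages also make.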

Python [11] is an object-oriented, general-purpose programming language. Nevertheless, Python is very easy to learn, and Python programs are easy to understand. It has a large standard library.

R

R supports a wide variety of statistical and numerical techniques. It is also highly extensible through the use of packages, which are user-submitted libraries for specific functions or specific areas of study. Due to its S heritage, R has stronger object-oriented programming facilities than most statistical computing languages. Extending R is also eased by its permissive lexical scoping rules.

Another of R's strengths is its graphical facilities, which produce publication-quality graphs which can include mathematical symbols.

Although R is mostly used by statisticians and other practitioners requiring an environment for statistical computation and software development, it can also be used as a general matrix calculation toolbox with comparable benchmark results to GNU Octave and its proprietary counterpart, MATLAB (version < 7).

Libraries

Netlib

GNU Scientific Library

Object-Oriented Numerics

Symbolic Math (Computer Algebra Systems)

Symbolic math is formula manipulation of the kind you would do with pen and paper (at least from an engineer's perspective). Computer algebra systems (CAS) can, for example, simplify expressions, differentiate, integrate, or solve systems of equations.
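To give a feel for what formula manipulation means, here is a deliberately tiny sketch in plain Python that differentiates and integrates polynomials exactly. It is not how a real CAS is implemented: it handles polynomials only, represented as a list of coefficients where index k holds the coefficient of x^k. The function names differentiate and integrate are assumptions chosen for illustration.

```python
from fractions import Fraction


def differentiate(coeffs):
    """Return the coefficients of the derivative of a polynomial.

    coeffs[k] is the coefficient of x**k.
    """
    # d/dx (c * x**k) = k * c * x**(k - 1); the constant term drops out.
    derived = [Fraction(k) * c for k, c in enumerate(coeffs)][1:]
    return derived or [Fraction(0)]


def integrate(coeffs):
    """Return an antiderivative with integration constant 0."""
    # Integral of c * x**k is c / (k + 1) * x**(k + 1).
    return [Fraction(0)] + [Fraction(c) / (k + 1) for k, c in enumerate(coeffs)]


# p(x) = 1 + 3*x**2
p = [1, 0, 3]

print(differentiate(p))  # coefficients of 6*x
print(integrate(p))      # coefficients of x + x**3
```

Exact fractions are used instead of floating point numbers because symbolic results, unlike numerical ones, are expected to be exact.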

Maple, Mathematica - Proprietary

Well-known proprietary packages are Maple and Mathematica. Both packages are very comprehensive and also useful for a general audience. They both have good graphical user interfaces. They are very expensive, but there are student discounts.

Maxima

Maxima is a powerful CAS and can compete with Maple and Mathematica. However, it lacks a good graphical front end.

Graphical frontends:

Links

[12] List of mathematical software from the Maxima people.
[13] Very big list of symbolic math software on Wikipedia.

Bioinformatics

Bioinformatics, in general, is the use of computers to handle biological information. It is the use of computers to characterize the molecular components of living things (computational molecular biology). The most prominent achievement of bioinformatics is the Human Genome Project, an attempt to map the complete set of human genes. A tremendous amount of data needs to be handled in molecular biology and this is clearly possible only with the aid of computers and software.

FOSS features prominently in bioinformatics. Ewan Birney ^[3] argues that "open source makes sense because it follows good and well-known scientific principles. Traditionally, scientific practice has involved openly sharing and discussing results, and providing enough information to allow third-party confirmation of results. Clearly open source software fits well into this model." The second reason for using FOSS is that the "actual data matters much more than the tools used to process it." Sharing the software used to conduct research reduces duplication of effort to develop the software.

The Bioinformatics Organization, Inc. (http://www.bioinformatics.org) was founded in 1999 to facilitate worldwide communications and collaborations in bioinformatics research and to provide free and open access to methods and materials in such work. Its website hosts extensive resources, including software and databases, and provides a forum for activities that facilitate the development of such resources.

High-end Computing

GNU/Linux and FOSS have been used in projects to provide affordable high-end computing capabilities. This is done by combining the processing power of multiple low-cost servers and workstations into a system that can deliver supercomputer power. According to Cook, "The reason these systems are so effective is that there are a great many very big, very complicated problems that naturally break down into a bunch of iterations of the same, much simpler, problem. That describes everything from forecasting the weather to doing computer animation."^[4]
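The idea of breaking a big problem into many iterations of the same simpler problem can be sketched on a single machine. In the example below, worker threads from Python's standard library stand in for the nodes of a cluster; this only illustrates the decomposition, not the performance of a real cluster. The function simulate_cell and its workload are hypothetical placeholders for one independent sub-problem.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_cell(cell_id):
    """Hypothetical stand-in for one small, independent sub-problem."""
    return sum(i * i for i in range(cell_id * 1000))


def run_in_parallel(cell_ids, workers=4):
    """Hand each sub-problem to a pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in the same order as the inputs.
        return list(pool.map(simulate_cell, cell_ids))


results = run_in_parallel(range(8))
print(results)
```

Because the sub-problems do not depend on each other, the same results are obtained no matter how many workers share the work.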

Beowulf is the name of the architecture used for building a massively parallel system out of commercially available PCs. The computers used can be 486 systems, Pentium systems and Alpha computers; they need not be homogeneous. Even old PCs that would otherwise be discarded can be used to build such a system. At Oak Ridge National Laboratory in the US, the Stone SuperComputer was built from old PCs connected by a standard Ethernet network and was used to solve a mapping problem.^[5] The system has a theoretical peak performance of 1.2 gigaflops. (FLOPS stands for floating point operations per second and is used as an approximate measure of computing speed; one gigaflops is one billion FLOPS.)

Another example is the supercomputer launched by the State University of New York, which consists of over 2,000 computers running GNU/Linux to conduct drug research to combat cancer, Alzheimer's disease and AIDS.
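To make the unit concrete, the short calculation below converts a peak rate of 1.2 gigaflops into a lower bound on running time. The workload of six billion operations is a hypothetical figure chosen for illustration, and real programs rarely reach the theoretical peak, so the actual time would be longer.

```python
GIGA = 1e9  # one gigaflops is one billion FLOPS

# Theoretical peak performance quoted above, in FLOPS.
peak_flops = 1.2 * GIGA


def min_runtime_seconds(operations, flops):
    """Lower bound on running time if the machine ran at its peak rate."""
    return operations / flops


# Hypothetical workload: six billion floating point operations.
hypothetical_operations = 6e9
seconds = min_runtime_seconds(hypothetical_operations, peak_flops)
print(f"at least {seconds:.1f} seconds")
```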

Footnotes

[1] Pfaffenberger, B., "Linux in Higher Education: Open Source, Open Minds, Social Justice", Linux Journal, March 2, 2000; available from http://www.linuxjournal.com/article.php?sid=5071.
[2] Pfaffenberger, B., "Linux in Higher Education: Open Source, Open Minds, Social Justice", Linux Journal, March 2, 2000; available from http://www.linuxjournal.com/article.php?sid=5071.
[3] Stewart, B., "Ewan Birney's Keynote: A Case for Open Source Bioinformatics", O'Reilly Network, 2002; available from www.oreillynet.com/lpt/a/1511.
[4] Cook, R., "Supercomputers on the cheap", April 2000; available from http://www.cnn.com/2000/TECH/computing/04/13/cheap.super.idg.
[5] Hargrove, W. W., Hoffman, F. M. and Sterling, T., "The Do-It-Yourself Supercomputer", Scientific American.com, August 16, 2001; available from www.sciam.com/article.cfm?articleID=000E238B-33EC-1C6F-84A9809EC588EF21.
