Next Generation Sequencing (NGS)/Ray

A basic knowledge of the UNIX command line is assumed.

In this tutorial, Ray is built from source code downloaded to $HOME/sources and installed in $HOME/software. A dataset is downloaded to $HOME/datasets and assembled de novo with Ray in $HOME/projects.

Installing Ray

First, create $HOME/sources, download the Ray source tarball (Ray-v2.1.0.tar.bz2) into it, and extract it.

mkdir -p $HOME/sources
cd $HOME/sources
tar -xjf Ray-v2.1.0.tar.bz2

An MPI library, make and a C++ compiler are required to build Ray. On Ubuntu or Debian, the package names are: openmpi-bin, libopenmpi-dev, make, g++.

Optionally, native support for compressed files can be included in Ray. This requires zlib and/or libbz2. On Ubuntu or Debian, the package names are: zlib1g-dev, libbz2-dev.
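Before building, it can help to confirm that the required tools are on the PATH. The sketch below is not part of Ray: check_tool is a hypothetical helper, and it assumes the MPI compiler wrapper is called mpicxx (as it is with Open MPI), which may not hold for other MPI libraries.

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# Tools assumed to be needed for the build: the MPI compiler wrapper,
# the MPI launcher, make, and g++.
for tool in mpicxx mpiexec make g++; do
    check_tool "$tool"
done
```

Any line starting with "missing:" points to a package that still has to be installed.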

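With MPI installed, Ray can now be compiled and installed. The HAVE_LIBZ=y and HAVE_LIBBZ2=y options enable the optional support for compressed files; omit them if the corresponding libraries are not installed.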

mkdir -p $HOME/software/ray
cd $HOME/sources/Ray-v2.1.0
make HAVE_LIBZ=y HAVE_LIBBZ2=y PREFIX=$HOME/software/ray/2.1.0
make install
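A quick way to confirm that the installation succeeded is to look for the executable under the prefix. This is a minimal sketch, assuming that the install step places the Ray executable directly in the PREFIX directory (the path used when Ray is launched later in this tutorial); ray_status is a hypothetical helper.

```shell
# Hypothetical helper: report whether an executable file exists at the given path.
ray_status() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "installed: $1"
    else
        echo "not found: $1"
    fi
}

ray_status "$HOME/software/ray/2.1.0/Ray"
```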

Obtaining data

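The E. coli dataset (SRA001125) is kept in its own directory. The four read files used later in this tutorial (SRR001665_1.fastq.bz2, SRR001665_2.fastq.bz2, SRR001666_1.fastq.bz2 and SRR001666_2.fastq.bz2) must be downloaded into it.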

mkdir -p $HOME/datasets/SRA001125
cd $HOME/datasets/SRA001125

Running Ray

It is good practice to keep each project in its own directory, so one is created for this tutorial.

mkdir -p $HOME/projects/Ray-tutorial
cd $HOME/projects/Ray-tutorial

Next, we create symbolic links to the data files so that long paths are not required.

ln -s $HOME/datasets/SRA001125/SRR001665_1.fastq.bz2
ln -s $HOME/datasets/SRA001125/SRR001665_2.fastq.bz2
ln -s $HOME/datasets/SRA001125/SRR001666_1.fastq.bz2
ln -s $HOME/datasets/SRA001125/SRR001666_2.fastq.bz2
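Because ln -s creates a link even when its target does not exist, a dangling link is easy to miss until the assembler tries to read it. The loop below is a small sketch to catch that early; check_inputs is a hypothetical helper, not something shipped with Ray.

```shell
# Hypothetical helper: print the name of every argument that is not a readable
# file (dangling symbolic links count as unreadable).
# Returns non-zero if at least one input is missing.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "unreadable: $f"
            status=1
        fi
    done
    return "$status"
}

check_inputs SRR001665_1.fastq.bz2 SRR001665_2.fastq.bz2 \
             SRR001666_1.fastq.bz2 SRR001666_2.fastq.bz2 \
    || echo "fix the links before launching Ray"
```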

An arbitrary number of Ray processes can be launched. In this example, 4 Ray processes are launched. These processes can be on several computers or on a single computer.

mpiexec -n 4 $HOME/software/ray/2.1.0/Ray \
-k 21 -o EcoliAssembly \
-p SRR001665_1.fastq.bz2 SRR001665_2.fastq.bz2 \
-p SRR001666_1.fastq.bz2 SRR001666_2.fastq.bz2

The -k parameter sets the length of k-mers, -o names the output directory, and each -p gives the two files of one paired-end library.
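To make the role of -k concrete: a k-mer is a substring of length k, and a read of length L contains L - k + 1 of them. The sketch below lists the k-mers of a toy sequence with awk. The kmers function and the example sequence are made up for illustration and say nothing about how Ray handles k-mers internally.

```shell
# Hypothetical helper: print every k-mer of a sequence, one per line.
# Usage: kmers SEQUENCE K
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

# A 7-base toy sequence with k = 5 yields 7 - 5 + 1 = 3 k-mers:
# GATTA, ATTAC and TTACA.
kmers GATTACA 5
```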

Assessing the assembly

Ray writes all of its output files to a single directory, here EcoliAssembly, and runs several automated quality-control tests.

You can list the produced files with:

ls EcoliAssembly

The most important files can be inspected with less:

less EcoliAssembly/OutputNumbers.txt
less EcoliAssembly/Contigs.fasta
less EcoliAssembly/Scaffolds.fasta
less EcoliAssembly/CoverageDistribution.txt
less EcoliAssembly/LibraryStatistics.txt
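A few summary numbers are often enough for a first look at the contigs. The following sketch counts the sequences in a FASTA file and computes their total length and N50 with standard tools. fasta_stats is a hypothetical helper, not a Ray utility, and it assumes a plain FASTA file with ">" header lines.

```shell
# Hypothetical helper: print sequence count, total length and N50 of a FASTA file.
fasta_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous length
             { len += length($0) }                       # sequence line: accumulate
        END  { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lengths[NR] = $1; total += $1 }
        END {
            # N50: length of the sequence at which the running sum, taken from
            # the longest sequence down, first reaches half of the total length.
            for (i = 1; i <= NR; i++) {
                running += lengths[i]
                if (running * 2 >= total) { n50 = lengths[i]; break }
            }
            printf "sequences=%d total=%d N50=%d\n", NR, total, n50
        }'
}

# Only run on the assembly if it exists in the current directory.
if [ -r EcoliAssembly/Contigs.fasta ]; then
    fasta_stats EcoliAssembly/Contigs.fasta
fi
```

The same function can be pointed at EcoliAssembly/Scaffolds.fasta to compare contigs with scaffolds.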