The CGL library is packaged like a normal CPAN module (insofar as there is such a beast...), so installing it should be straightforward. A variety of sources within the Perl community describe the process, including one at CPAN and the perlmodinstall man/pod page, which should be available on your system (you should be able to read it with "perldoc perlmodinstall").
CGL depends on a few other Perl packages though, and they in turn have dependencies on other Perl packages and libraries. Getting everything in the right place and satisfying all of these dependencies can require a bit of attention and the process depends on the operating system you're using.
In particular, CGL itself depends on BioPerl and the Gnome project's XML parser. Depending on how you intend to use it, you'll also probably need the Chaos-xml toolset for creating the XML files that CGL uses as input and the Datastore module to help manage large collections of those files.
We'll discuss the process in detail and give some operating system specific hints below.
We can't really tell you the best way to acquire and install these prerequisites (but we have some helpful hints down below). In some cases you'll need to choose between downloading a precompiled package that's specific to your operating system and building the package from source. In other cases (e.g. perl modules in CPAN), you'll have to decide whether to fetch them and build them by hand or to use perl's CPAN module to automate it. Some of the perl modules (e.g. XML::LibXML) can't be built by any method until their underlying libraries have been installed, and thinking about that brings you right back to this basic question. The FreeBSD ports system automates all of this beautifully, while fink and darwinports make a pretty good stab at it for Mac OS X. Other operating systems offer similar systems, but you might even choose to build and install all of the dependencies yourself.
There's no right answer. The trick that's most expedient now may make upgrading later more difficult, but sometimes the simplest path is the best. If you have any doubts about what will work well in your environment, you should check with your local gurus.
We've tried hard to describe the things that you need to do below, and give some operating system specific examples for simple situations. Hopefully it'll get you heading in the right direction.
We have been developing CGL on a variety of modern Unix-like systems using the Perl 5.8 line of releases. We've tested this release with the following set of operating systems and perl releases:
- FreeBSD-stable on i386, with perl, v5.8.5 built for i386-freebsd-64int.
- Redhat's Fedora Core 3 Linux on i386 with perl, v5.8.5 built for i386-linux-thread-multi.
- Ubuntu 5.04 Linux on i386 with perl, v5.8.4 built for i386-linux-thread-multi.
- Mac OS X Panther with perl, v5.8.1-RC3 built for darwin-thread-multi-2level.
You'll need to install the underlying C libraries (from source, or using your platform's favorite package manager), then grab and install the XML::LibXML package and its dependencies from CPAN.
CGL uses and/or extends a variety of modules provided by BioPerl. While all of the functionality that CGL needs is included in BioPerl's "stable" 1.4 release, the Chaos-xml tools for transforming various sources of genomic data into the formats that CGL expects require a newer version of the library.
We recommend that you use the BioPerl 1.5 "developers" release and give some pointers to obtaining and installing it below. We have also tested the process with bioperl-live from their CVS repository, but since it's something of a moving target, we won't go into particular details about installing it. If you're into bleeding edge stuff and notice any CGL problems when you're playing with bioperl-live, please drop us a note.
The BioPerl library provides an impressive array of functionality and in turn has a set of dependencies which may seem overwhelming. Since only a small portion of this functionality is used by the CGL or Chaos-XML libraries, we've concentrated on describing a minimalist BioPerl installation with the functionality that we require. Installing a fully capable version is left as an exercise; there's lots of great information on the BioPerl web site and a very strong user community to lend a hand.
The Chaos-XML Library contains software and specifications for the Chaos XML format for representing biological sequences and sequence features. It grew out of the Berkeley Drosophila Genome Project's need for an annotation data exchange format: an XML format capable of expressing the rich annotation data that they were representing in their Chado database.
More information about the design and implementation of the library is available at its web site.
The Datastore modules provide a graceful way to work with a large number of subdirectories (e.g. one subdirectory per gene in a genome). Many filesystems have performance problems with large numbers of subdirectories in a directory and even when the underlying filesystems handle things gracefully, access via network filesystems can be an issue.
The Datastore modules create a hierarchy of subdirectory layers, starting from a "base", and map end-users' identifiers to the corresponding subdirectories. There are provisions for a variety of ID-to-subdirectory mappings. We use a random mapping based on MD5 hashes (Datastore::MD5) exclusively, but the package includes an example that uses digits from a Drosophila "CG" identifier. For example, a datastore with a depth of 2 and a root at /tmp would map "CG1040" to "/tmp/0A/70/CG1040".
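To make the idea of a hash-based mapping concrete, here is a rough sketch of a depth-2 layout. It is an illustration only: the helper names md5_hex_upper and ds_path are hypothetical, and the assumption that each level uses two consecutive hex digits from the start of the digest is ours, so the paths it prints may not match what Datastore::MD5 actually produces.

```shell
# Hypothetical helper: upper-case hex MD5 digest of a string.
# Uses md5sum where available and falls back to perl's Digest::MD5.
md5_hex_upper() {
    if command -v md5sum >/dev/null 2>&1; then
        printf '%s' "$1" | md5sum | cut -d' ' -f1
    else
        perl -MDigest::MD5=md5_hex -e 'print md5_hex($ARGV[0]), "\n"' "$1"
    fi | tr 'a-f' 'A-F'
}

# Hypothetical depth-2 mapping: root/<digits 1-2>/<digits 3-4>/<id>.
# ASSUMPTION: the real module may choose different digits.
ds_path() {
    root=$1
    id=$2
    digest=$(md5_hex_upper "$id")
    level1=$(printf '%s' "$digest" | cut -c1-2)
    level2=$(printf '%s' "$digest" | cut -c3-4)
    printf '%s/%s/%s/%s\n' "$root" "$level1" "$level2" "$id"
}

ds_path /tmp CG1040
```

Because the digest is effectively random with respect to the identifier, entries spread evenly across the intermediate directories, which is what keeps any single directory from growing too large.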
The package includes library routines for cd'ing into a Datastore directory given its ID, iterating over all of the IDs in a Datastore, iterating over some of the IDs in a Datastore, etc. There are also a pair of scripts, ds_dir and ds_do, that provide shell-level access to Datastores.
You can download a CPAN style package containing the datastore modules below.
You'll need to go through the following steps to install the CGL library:
- Make sure that you understand how to build and install CPAN style perl modules on your platform. We recommend that you read through the 'perlmodinstall' document; you should be able to read it by running "perldoc perlmodinstall", and the current version is also available on the web. If you're an Apple OS X user, you should follow the "Unix" guidelines; the "mac" guidelines refer to OS 9 and earlier.
- Think about what your setting for PREFIX should be. See the previous item if you don't know if/when/how you should set a PREFIX. You'll probably need to if you don't have permission to install stuff into the system directories, and you may want to if you're setting up your own private set of perl libraries.
Take care of the prerequisites:
- XML::LibXML and its underlying C libraries
- BioPerl
- Chaos-xml [optional]
- Datastore [optional]
Unpack the CGL distribution and build it
- Use tar [and possibly gzip] to unpack the distribution, e.g. tar -xvzf CGL-1.00.gz.
- Change your current working directory to the top level of the distribution, e.g. cd CGL-1.0.
- And finally, build the Makefile from the Makefile.PL, setting PREFIX if you've decided you should, e.g. perl Makefile.PL.
- Test the distribution
Set the CGL_SO_SOURCE environment variable to point to
your copy of the sequence ontology source file. The CGL
distribution includes a copy of the version we used for
our testing in the sample_data directory.
If you use a C shell derivative (e.g. csh, tcsh), you'll want to use something like this:
setenv CGL_SO_SOURCE /usr/home/moose/src/CGL-1.0/sample_data/so.obo
setenv CGL_GO_SOURCE /usr/home/moose/src/CGL-1.0/sample_data/gene_ontology.obo
If you use a Bourne Shell derivative (e.g. sh, bash, ash), you'll need something like:
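The Bourne-shell equivalent of the setenv lines above would be along these lines (same example paths; adjust them to wherever you unpacked the distribution):

```shell
# Example paths only, mirroring the C shell example above.
CGL_SO_SOURCE=/usr/home/moose/src/CGL-1.0/sample_data/so.obo
CGL_GO_SOURCE=/usr/home/moose/src/CGL-1.0/sample_data/gene_ontology.obo
# Export them so that "make test" and the CGL scripts can see them.
export CGL_SO_SOURCE CGL_GO_SOURCE
```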
- Run make test, which should run to completion, without any failures.
- Install it
Run make install, possibly
as the root user, depending on your system and if/how you
set your PREFIX.
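The build, test and install steps above can be strung together roughly as follows. This is a sketch under stated assumptions, not an official installer: run and install_cgl are hypothetical names, the function expects an absolute path to the unpacked distribution, and PREFIX is passed on the Makefile.PL command line in the usual MakeMaker way. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: install_cgl /absolute/path/to/CGL-1.0 [PREFIX]
# The body runs in a subshell, so the caller's working directory and
# environment are left untouched.
install_cgl() (
    src_dir=$1
    prefix=${2:-}

    # Point the tests at the ontology files shipped in sample_data.
    CGL_SO_SOURCE=$src_dir/sample_data/so.obo
    CGL_GO_SOURCE=$src_dir/sample_data/gene_ontology.obo
    export CGL_SO_SOURCE CGL_GO_SOURCE

    run cd "$src_dir" || exit 1
    if [ -n "$prefix" ]; then
        run perl Makefile.PL PREFIX="$prefix" || exit 1
    else
        run perl Makefile.PL || exit 1
    fi
    run make test || exit 1
    # Without a PREFIX this step may need to be run as root.
    run make install
)
```

For example, `DRY_RUN=1 install_cgl "$HOME/src/CGL-1.0" "$HOME/perl"` shows what would be executed for a private installation without touching anything.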
Operating system-specific and documentation
CGL installs a set of sample scripts that should help you become familiar with how to use the libraries. These scripts should have been installed when you installed the CGL libraries, in whatever directory perl normally installs such things in (/usr/local/bin, or PREFIX/bin if you set a PREFIX when you created the Makefile).
The samples include:
- cgl_validate: This script validates a chaos-XML data file, and is a good
way to check that all of the XML parsing tools are correctly
installed. You can test it by running:
> cgl_validate [path to the CGL distribution]/sample_data/dmel.sample.chaos.xml
DOCUMENT IS VALID
- cgl_tutorial: This script provides examples of manipulating
annotations using CGL. Try:
> cgl_tutorial [path to the CGL distribution]/sample_data/dmel.sample.chaos.xml
- cgl_phat_tutorial: This script demonstrates CGL's BLAST and Phat HSP support. Try:
> cgl_phat_tutorial -t blastp [path to the CGL distribution]/sample_data/blastp.sample.report
Finally, have a look at the CGL TUTORIAL PDF on the CGL web page. This document explains how to combine the Phat Hit and Phat HSP aspects of CGL with its ability to manipulate annotations. Together, these two aspects of CGL make it possible to align genes and sequences to one another in completely new ways.
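If you want to script the cgl_validate check as a post-install smoke test, something like the following sketch may be useful. The function name chaos_is_valid is hypothetical, and it relies only on the "DOCUMENT IS VALID" message shown in the sample run; what cgl_validate prints for an invalid document is not assumed.

```shell
# Hypothetical helper: succeeds only if cgl_validate reports the document
# as valid, based on the "DOCUMENT IS VALID" line in its output.
chaos_is_valid() {
    cgl_validate "$1" 2>/dev/null | grep -q 'DOCUMENT IS VALID'
}
```

It could then be used as `chaos_is_valid sample_data/dmel.sample.chaos.xml && echo "XML tools look fine"` from the top of the CGL distribution.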
Converting GenBank to Chaos.xml
This section walks through downloading a genome and its annotations from GenBank and converting those annotations to a datastore full of chaos.xml documents.
Download and install the Datastore modules
If you haven't already downloaded and installed the CGL Datastore modules, you should do so now. Grab the tarball from the Downloads section and install them (they use the traditional CPAN procedure):
> gunzip Datastore-0.xx.tar.gz
> tar xvf Datastore-0.xx.tar
> cd Datastore-0.xx
> perl Makefile.PL
> make test
> sudo make install # you'll only need to sudo to install system-wide
The Bio-Chaos modules
Next you'll need to download and install the Bio-Chaos modules. Read the Chaos XML section of this page and follow the link to the Chaos XML site. After grabbing the library source tarball, install it in the traditional CPAN way. The Chaos libraries can make use of a perl module named XML::Parser::PerlSAX and will complain if you don't have it installed. We don't need it for our purposes though, so you can just keep going.
> gunzip Bio-Chaos-0.xx.tar.gz
> tar xvf Bio-Chaos-0.xx.tar
> cd Bio-Chaos-0.xx
> perl Makefile.PL
> make
> make test
> sudo make install # you'll only need sudo to install system-wide.
Now have a look in the Bio-Chaos-0.xx/bin directory. Note the script called cx-genbank2chaos.pl. If you run it with its "-h" option you can see the various options it supports:
Often, a GenBank file contains more than one gene. Using the -islands option will create one chaos.xml file for every gene in the GenBank flat file. This is the option you will want to use.
Generally, GenBank flat files have a *.gbk suffix. Try the command on some of the sample data contained in the Bio-Chaos sample-data directory:
>cx-genbank2chaos.pl -islands sample-data/AE003734.gbk
Note that this command creates a directory named AE003644.3. Within this directory are chaos.xml documents describing each of the genes contained within the file AE003734.gbk. You may notice that the files have funny names, for example:
Downloading an annotated genome from GenBank
Now download a genome from GenBank. Keep in mind that GenBank sometimes moves things around. Basically you are looking for the "Genomes" division. To download the latest D. melanogaster assembly and its annotations, use this command:
>wget -r -np -nv ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster
When this finishes, you will find a directory named ftp.ncbi.nih.gov beneath your current working directory; have a look inside:
> ls
CHR_2 CHR_4 CHR_X RELEASE_4_1
CHR_3 CHR_Un README
Converting the genbank files to chaos.xml
Genomes are big things, so you might want to test things out before proceeding to attempt to make an entire datastore. As it turns out CHR_4 of Drosophila is quite small. This is therefore a good chromosome to test things on before you try the others, as it only contains about 80 genes.
Try this command to generate chaos.xml documents for every annotation on chromosome 4 (don't forget to check the errors file when it finishes):
>cx-genbank2chaos.pl -islands -ds_root chaos_datastore CHR_4/*.gbk >& errors
To dump the entire genome use this command:
>cx-genbank2chaos.pl -islands -ds_root chaos_datastore CHR_*/*.gbk >& errors
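Note that ">& errors" is C shell syntax; in a Bourne-style shell the portable spelling is "> errors 2>&1". As a sketch of how the conversion might be scripted one chromosome at a time, with a separate log per chromosome, consider the following. The function name convert_chromosomes and the CONVERTER override are hypothetical; the only flags used are the -islands and -ds_root options shown above, and converting chromosome by chromosome into one datastore is assumed to be equivalent to the single command.

```shell
# Hypothetical wrapper (not part of Bio-Chaos): convert each chromosome
# directory separately and keep one log file per chromosome.
# Usage: convert_chromosomes DS_ROOT CHR_DIR...
convert_chromosomes() {
    ds_root=$1
    shift
    for chr_dir in "$@"; do
        log="errors.$(basename "$chr_dir")"
        # "> file 2>&1" captures both stdout and stderr, like csh's ">& file".
        "${CONVERTER:-cx-genbank2chaos.pl}" -islands -ds_root "$ds_root" \
            "$chr_dir"/*.gbk > "$log" 2>&1 ||
            echo "conversion failed for $chr_dir, see $log" >&2
    done
}
```

Running `convert_chromosomes chaos_datastore CHR_4` first, and checking errors.CHR_4, gives a cheap test before committing to `convert_chromosomes chaos_datastore CHR_*`.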
Creating a datastore index using ds_do and cgl_validate
If you read the documentation for the Datastore library you will find that the code comes with some helpful scripts for navigating a datastore. One of these is called "ds_do". This script can be used in a variety of ways to learn about the contents of a datastore. For example, try typing:
>ds_do -all ls
ds_do can be used together with the scripts provided in the CGL script directory to find out even more about the contents of a datastore. Try using cgl_validate together with ds_do to create an index of a datastore
This command will accomplish two desirable tasks at once. First, it will create a tab-delimited file with the following columns:
Finally, the CGL cgl_tutorial provides a quick jumpstart to using CGL & Chaos.xml documents. Try it out and have a look at the code within for hints as to how to access different parts of an annotation.
Here's a tutorial (in PDF format) about using CGL.
Fleshing out the documentation for the various modules is still under way. Most of the modules have both class and method documentation, and we're in the process of finishing off the rest.