A recent publication on Galaxy proposes that it is the missing graphical interface for genomics. Let’s find out.
The tag line of Michael Schatz’s article in Genome Biology states, “The Galaxy package empowers regular users to perform rich DNA sequence analysis through a much-needed and user-friendly graphical web interface.” I take this description, and Schatz’s later comment that “the ambitious goal of Galaxy is to empower regular users to carry out their own computational analysis without having to be an expert in computational biology or computer science,” to mean that someone like a biologist, who does not have much bioinformatics or computer experience, could use the system to analyze data from a microarray or next-gen sequencing experiment.
The Galaxy package is a software framework for running bioinformatics programs and assembling those programs into complex pipelines referred to as workflows. It employs a web interface and ships with a collection of tools and examples to get a biologist up and running quickly. Given that Galaxy targets bioinformatics, it is reasonable to assume that its regular users are biologists. So the appropriate question is, how much does a biologist have to know about computers to use the system?
To test this question I decided to install the package. I have a Mac, and as a biologist, whether I’m on a Mac or PC, I expect that, if I’m given a download option, the software will be easy to download and can be installed using a double click installer program. Galaxy does not have this. Instead, it uses a command line tool (oh, I need to use Terminal) that requires Mercurial (hg). Hmm, what’s Mercurial? Mercurial is a version control system that supports distributed development projects. This is not quite what I expected, but I’ll give it a try. I go to the hg (someone has a chemistry sense of humor) site and without too much trouble find a Mac OS X package, which uses a double click installer program. I’m in luck - of course I’ll ignore the note that you might have to add export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8 to your ~/.profile file - hg installs and works.
Now back to my terminal. I ignore the Python version check and path setup commands, and type hg clone http://www.bx.psu.edu/hg/galaxy galaxy_dist; things happen. I follow the rest of the instructions - cd galaxy_dist; sh setup.sh - and finally I start Galaxy with the sh run.sh command. I go to my web browser and type http://localhost:8080 and Galaxy is running! Kudos to the Galaxy team for making a typically complicated process relatively simple. I’m also glad that I had none of the possible documented problems. However, to get this far, I had to tap into my Unix experience.
Next I want to upload a fastq file of NGS reads. I’ll ignore the note that files greater than 2GB should be uploaded by an http/ftp URL, because I don’t know what they are talking about. Instead I’ll make a small test file with a few thousand reads. I’ll also ignore the URL/text box, the choice to convert spaces to tabs, and the genome menu that seems to have hundreds of genomes loaded, as these options have nothing to do with a fastq file. I’ll assume “execute” means “save” and click it.
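For anyone who wants to make a similar small test file, here is a minimal Python sketch of one way to do it. It is not what Galaxy does, and it assumes the fastq file uses plain four-line records (no wrapped sequence lines). The function names and the default of 5000 reads are my own choices for illustration.

```python
from itertools import islice


def take_reads(lines, n_reads):
    """Return the lines of the first n_reads records.

    Assumption: each FASTQ record is exactly four lines
    (header, sequence, '+', qualities).
    """
    return list(islice(lines, 4 * n_reads))


def write_subset(src_path, dst_path, n_reads=5000):
    """Copy the first n_reads records of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(take_reads(src, n_reads))
```

Calling write_subset("reads.fastq", "test.fastq") would write the first 5000 reads to test.fastq; both file names are placeholders. Taking the head of a file is the simplest option, though the first reads of a run are not necessarily representative of the rest.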
After clicking execute, some activity appears in the right-hand menu indicating that my file is being uploaded. After a few minutes, my NGS file is in the system. To look at quality information, I select the “NGS: QC and manipulation” menu to find a tool. There are 18 options for tools to split files, join files, convert files, and convert quality data in files; this stuff is complicated. Since all I want to do is start by creating some summary statistics, I find and select "FASTQ summary statistics." This opens a page in the main window where I can select the file that I uploaded and click the execute button to generate a big 20-column table that contains one row per base position in the reads. The columns contain information about the frequency of bases and statistical values derived from the quality values in the file. These data are displayed in a text table that is hard to read, so the next step is to view the data graphically in histograms and box plots.
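To give a sense of what a table like this contains, here is a rough Python sketch that computes a few per-position columns. It is not Galaxy's implementation and covers only a handful of the 20 columns. It assumes four-line records and Phred+33 quality encoding, and the function and column names are made up for illustration.

```python
from collections import Counter


def fastq_records(lines):
    """Yield (sequence, quality) pairs, assuming four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines, e.g. at the end of the file
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield seq, qual


def per_position_summary(lines, offset=33):
    """Return one dict per read position with quality stats and base counts.

    Assumption: qualities are Phred+33 encoded (offset=33).
    """
    quals = []  # quals[i] holds every quality score seen at position i
    bases = []  # bases[i] counts the bases seen at position i
    for seq, qual in fastq_records(lines):
        for i, (base, q) in enumerate(zip(seq, qual)):
            if i == len(quals):
                quals.append([])
                bases.append(Counter())
            quals[i].append(ord(q) - offset)
            bases[i][base] += 1
    rows = []
    for pos, (qs, counts) in enumerate(zip(quals, bases), start=1):
        row = {"position": pos, "count": len(qs), "min": min(qs),
               "max": max(qs), "mean": sum(qs) / len(qs)}
        for base in "ACGTN":
            row[base] = counts[base]
        rows.append(row)
    return rows
```

Passing an open file handle, as in per_position_summary(open("test.fastq")), returns one row per position, which is the same shape as the table described above.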
Graphing tools are listed under a different menu, “Graph/Display Data.” I like box plots, so I’ll select that choice. In the main window I select my summary stats file, create a title for the plot, set the plot’s dimensions (in pixels), define x and y axes titles, and select the columns from the big table that contains the appropriate data. I click the execute button to create files containing the graphs. Oops, I get an error message. It says “/bin/sh gnuplot command not found.” I have to install gnuplot. To get gnuplot going I have to download source, compile the package, and install. To do this I will need developer tools installed along with gnuplot’s other dependencies for image drawing. This is getting to be more work than I bargained for ...
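In hindsight, a quick check for the command line programs involved would have warned me before I clicked execute. Here is a minimal sketch using Python's standard shutil.which. The list of tool names is my own guess based on what I ran into here, not an official Galaxy dependency list.

```python
import shutil


def missing_tools(tools):
    """Return the names from tools that are not found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed list: the programs this walkthrough needed.
    missing = missing_tools(["hg", "python", "gnuplot"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```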
When Schatz said “regular user” he must have meant a Unix-savvy biologist who understands bioinformatics terminology, file formats, and other conventions, and who can install software from source code.
Alternatively, I can upload my data into GeneSifter, select the QC analysis pipeline, navigate to the file summary page, and click the view results link. After all, GeneSifter was designed by biologists for biologists.