Saturday, September 11, 2010

The Interface Needs an Interface

A recent publication on Galaxy proposes that it is the missing graphical interface for genomics. Let’s find out.

The tag line of Michael Schatz’s article in Genome Biology states, “The Galaxy package empowers regular users to perform rich DNA sequence analysis through a much-needed and user-friendly graphical web interface.” I take this description, together with Schatz’s later comment that “the ambitious goal of Galaxy is to empower regular users to carry out their own computational analysis without having to be an expert in computational biology or computer science,” to mean that someone like a biologist, who does not have much bioinformatics or computer experience, could use the system to analyze data from a microarray or next-gen sequencing experiment.

The Galaxy package is a software framework for running bioinformatics programs and assembling those programs into complex pipelines referred to as workflows. It employs a web interface and ships with a collection of tools and examples to get a biologist up and running quickly. Given that Galaxy targets bioinformatics, it is reasonable to assume that its regular users are biologists. So the appropriate question is: how much does a biologist have to know about computers to use the system?

To answer this question I decided to install the package. I have a Mac, and as a biologist, whether I’m on a Mac or PC, I expect that, if I’m given a download option, the software will be easy to download and can be installed using a double click installer program. Galaxy does not have this. Instead, it uses a command line tool (oh, I need to use terminal) that requires Mercurial (hg). Hmm, what’s Mercurial? Mercurial is a version control system that supports distributed development projects. This is not quite what I expected, but I’ll give it a try. I go to the hg (someone has a chemistry sense of humor) site and without too much trouble find a Mac OS X package, which uses a double click installer program. I’m in luck: hg installs and works. Of course, I’ll ignore the note that I might have to add export LC_ALL=en_US.UTF-8 and export LANG=en_US.UTF-8 to my ~/.profile file.

Now back to my terminal. I ignore the python version check and path setup commands, and type hg clone http://www.bx.psu.edu/hg/galaxy galaxy_dist; things happen. I follow the rest of the instructions (cd galaxy_dist; sh setup.sh) and finally start Galaxy with the sh run.sh command. I go to my web browser, type http://localhost:8080, and Galaxy is running! Kudos to the Galaxy team for making a typically complicated process relatively simple. I’m also glad that I had none of the possible documented problems. However, to get this far, I had to tap into my Unix experience.

With Galaxy running, I can now see if Schatz’s claims stand up. What should I do? The left hand menu gives me a huge number of choices. There are 31 categories that organize input/output functions, file manipulation tools, graphing tools, statistical analysis tools, analysis tools, NGS tools, and SNP tools, perhaps 200 choices of things to do. I’ll start with something simple like displaying the quality values in an Illumina NGS file. To do this, I click on “upload file” under the get data menu. Wow! There are 56 choices of file formats, and 17 have explanations. Fortunately there is an auto-detect. I leave that option selected and go to the choose file button to select an NGS file on my hard drive and load it in. I’ll ignore the comment that files greater than 2GB should be uploaded by an http/ftp URL, because I don’t know what they are talking about. Instead I’ll make a small test file with a few thousand reads. I’ll also ignore the URL/text box, the choice to convert spaces to tabs, and the genome menu that seems to have hundreds of genomes loaded, as these options have nothing to do with a fastq file. I’ll assume “execute” means “save” and click it.
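For anyone who wants to script the "small test file" step rather than do it by hand, here is a minimal Python sketch. It is not part of Galaxy; the function name head_fastq and the default of 2000 reads are made up for illustration, and it assumes the common four-line FASTQ layout (header, sequence, "+", qualities).

```python
from itertools import islice


def head_fastq(src_path, dst_path, n_reads=2000):
    """Copy the first n_reads records from src_path to dst_path.

    Assumes four lines per record; multi-line FASTQ records are not handled.
    Returns the number of records actually written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        while written < n_reads:
            record = list(islice(src, 4))  # read the next four lines
            if len(record) < 4:  # end of file or a truncated final record
                break
            dst.writelines(record)
            written += 1
    return written
```

Note that this takes the first reads in the file, not a random sample, so the quality profile may not represent the whole run.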

After clicking execute, some activity appears in the right hand menu indicating that my file is being uploaded. After a few minutes, my NGS file is in the system. To look at quality information, I select the “NGS: QC and manipulation” menu to find a tool. There are 18 options for tools to split files, join files, convert files, and convert quality data in files; this stuff is complicated. Since all I want to do is start by creating some summary statistics, I find and select "FASTQ summary statistics." This opens a page in the main window where I can select the file that I uploaded and click the execute button to generate a big 20 column table that contains one row per base in the reads. The columns contain information about the frequency of bases and statistical values derived from the quality values in the file. These data are displayed in a text table that is hard to read, so the next step is to view the data graphically in histograms and box plots.
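To make the idea of a per-base summary table concrete, here is a rough Python sketch that computes a few of those statistics for each read position. It is an illustration only, not Galaxy's implementation, and it covers a handful of columns rather than all 20. The function name per_position_quality is hypothetical, and the default offset of 33 assumes Phred+33 quality encoding; Illumina files of this era often used an offset of 64, so check your data.

```python
from statistics import mean, median


def per_position_quality(fastq_lines, offset=33):
    """Return one dict per read position summarizing the quality scores.

    fastq_lines is any iterable of lines in four-line FASTQ layout.
    offset=33 is an assumption (Phred+33); pass 64 for older Illumina files.
    """
    columns = []  # columns[i] collects every score seen at position i
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\r\n")):
            if pos == len(columns):
                columns.append([])
            columns[pos].append(ord(ch) - offset)
    return [
        {
            "position": pos + 1,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for pos, scores in enumerate(columns)
    ]
```

Calling it with an open file handle, for example per_position_quality(open("test.fastq")), yields one row per base position, which is the shape a box plot needs.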

Graphing tools are listed under a different menu, “Graph/Display Data.” I like box plots, so I’ll select that choice. In the main window I select my summary stats file, create a title for the plot, set the plot’s dimensions (in pixels), define x and y axis titles, and select the columns from the big table that contain the appropriate data. I click the execute button to create files containing the graphs. Oops, I get an error message. It says “/bin/sh gnuplot command not found.” I have to install gnuplot. To get gnuplot going I have to download the source, compile the package, and install it. To do this I will need developer tools installed, along with gnuplot’s other dependencies for image drawing. This is getting to be more work than I bargained for ...
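A surprise like this can be caught before clicking execute by checking up front whether the external programs are on the PATH. The Python sketch below does that with the standard library; the helper name missing_tools is made up for illustration, and the list of programs is just the two named in this post (hg and gnuplot), not a complete list of what Galaxy needs.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in tools that cannot be found on the PATH.

    The which parameter exists only so the lookup can be swapped out in tests.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # hg and gnuplot are the external programs mentioned in this post.
    for name in missing_tools(["hg", "gnuplot"]):
        print(f"{name}: not found on PATH")
```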

When Schatz said “regular user” he must have meant a Unix-savvy biologist who understands bioinformatics terminology, file formats, and other conventions, and who can install software from source code.

Alternatively, I can upload my data into GeneSifter, select the QC analysis pipeline, navigate to the file summary page, and click the view results link. After all, GeneSifter was designed by biologists for biologists.

4 comments:

Anonymous said...

> Alternatively, I can upload my data into GeneSifter

Alternatively, you can upload your data to the public Galaxy server at http://main.g2.bx.psu.edu/

You are comparing *using* a webtool on an external server (GeneSifter) with *installing* your own server. Is this really fair?

Todd Smith said...

I agree, it is not fair to compare an installation experience to a web experience. They are different things. One of the points that I am making is that IT specialists are needed to get the system running. That's OK, but groups should be aware this is an investment they need to make.

However, I did test the web server as you suggested to make a more apples to apples comparison. Unfortunately, I had less success than with the installed version. I could upload my file, but I could not select it from the history when I tried to do my analysis. If you let me know when the problem is fixed, I will happily perform a better comparison in a future post.

Anonymous said...

That's too bad. We've had a lot of success getting users up and running. Of course, most users don't need to install Galaxy; they'll run it from the web site.

But if you would like you can watch the introductory tutorial we have to orient people to Galaxy:

http://www.openhelix.com/galaxy

We find that with a little introduction people can get pretty far. Which is true of a lot of software.

Todd Smith said...

Thanks for the note, Mary. I understood the software just fine. My problems were: 1. the installation requires additional packages that I did not want to spend time installing, and 2. the available web version appears to have a bug that prevented me from selecting my uploaded file.