Sunday, November 22, 2009

Supercomputing 09

Teraflops, exaflops, exabytes, exascale, extreme, high dimensionality, 3D Internet, sustainability, green power, high performance computation, 400 Gbps networks, and massive storage were just some of the many buzzwords seen and heard last week at the 21st annual supercomputing conference in Portland, Oregon.

Supercomputing technologies and applications are important to Geospiza. As biology becomes more data intensive, Geospiza follows the latest science and technology developments by attending conferences like Supercomputing. This year, we participated in the conference through a "birds of a feather" session focused on sharing recent progress in the BioHDF project.

Each year the Supercomputing (SC) conference has focus areas called "thrusts." This year the thrusts were 3D Internet, Biocomputing, and Sustainability. Each day of the technical session started with a keynote presentation that focused on one of the thrusts. Highlights from the keynotes are discussed below.

First thrust: the 3D Internet

The technical program kicked off with a keynote from Justin Rattner, VP and CTO at Intel. In his address, Rattner discussed the business reality that high performance computing (HPC) is an $8 billion business with little annual growth (3% AGR). The primary sources of HPC funding are government and trickle-up technology from PC sales. To break the dependence on government funding, Rattner argued that HPC needs a "killer app" and suggested that the 3D Internet might just be that app. He went on to elaborate on the kinds of problems that are solved with HPC, such as continuously simulating environments, realistic animation, dynamic modeling, and continuous perspectives. Also, because immersive and collaborative virtual environments can be created, the 3D Internet provides a platform for enabling many kinds of novel work.

To illustrate, Rattner was joined by Aaron Duffy, a researcher at Utah State. More precisely, Duffy’s avatar joined us, because his presentation took place in the Science SIM environment. Science SIM is a virtual reality system that is used to model environments and how they respond to change. For example, Utah State is studying how ferns respond to and propagate genetic changes in FernLand. Another example showed how 3D modeling can save time and materials in fashion design.

Next, Rattner described how the current 3D Internet resembles the early days of the Internet, when people struggled with the isolated networks of AOL, Prodigy, and CompuServe. It wasn't until Tim Berners-Lee and Marc Andreessen introduced the World Wide Web's HTTP protocol and the Mosaic web browser that the Internet had a platform on which to standardize. Similarly, the 3D Internet needs such a platform. Rattner introduced OpenSim as a possibility. In the OpenSim platform, extensible modules can be used to create different worlds. Because these worlds are built with a common infrastructure, users could have an avatar that could move between worlds, rather than have a new avatar for each world as they do today.

Second thrust: biocomputing

Leroy Hood kicked off the second day with a keynote on how supercomputing can be applied to systems biology and personalized medicine. Hood expects that within 10 years diagnostic assays will be characterized by billions of measurements. We will have two primary kinds of information feeding these assays: the digital data of the organism and data from the environment. The challenge is measuring how the environment affects the organism. To make this work we need to integrate biology, technology, and computers in better ways than we do today.

In terms of personalized medicine, Hood described different kinds of analyses and their purpose. For example, global analysis - such as sequencing a genome, measuring gene expression, or comprehensive protein analysis - creates catalogs. These catalogs then form the foundation for future data collection and analysis. The goal of such analysis is to create predictive, actionable models. Biological data, however, are noisy, and meaningful signals can be difficult to detect, so improving the signal-to-noise ratio requires the ability to integrate large volumes of multi-scalar data with diverse data types, including biological knowledge. Because the goal is to develop predictive, actionable models, we need supercomputers capable of dynamically quantifying information.

As an example, Hood presented work showing how disease states result in perturbations in regulated networks. In prion disease, the expression of many genes changes over time as non-disease states move toward disease states. Interestingly, as disease progression is followed in mouse models, one can see expression levels change in genes that were not thought to be involved in prion disease. More importantly, these genes show expression changes before the physiological effects are observed. In other words, by observing gene expression patterns, one can detect a disease much earlier than one would by observing symptoms. Because diseases detected early are easier to treat, early detection can help reduce health care costs. However, measuring gene expression changes by observing changes in RNA levels is currently impractical. The logical next step is to see if gene expression can be measured by detecting changes in the levels of blood proteins. Of course, Hood and his team are doing that too, and he showed data from the prion model indicating that this is a feasible approach.

Using the above example, and others from whole genome sequencing, Hood painted a picture of future diagnostics where we will have our genomes sequenced at birth and each of us will have a profile created of organ-specific proteins. In Hood's world, this would require 50 measurements from 50 organs. Blood protein profiles will be used as controls in regular diagnostic assays. In other assays, like cancer diagnostics, thousands of individual transcriptomes will be measured simultaneously in single assays. Similarly, 10,000 B-cells or T-cells could be sequenced to assess immune states and diagnose autoimmune disorders. In the not too distant future, it will be possible to interrogate databases containing billions of data points from hundreds of millions of patients.

With these possibilities on the horizon, there are a number of challenges that must be overcome. Data space is infinite, so queries must be constructed carefully. The data that need to be analyzed have high dimensionality, so we need new ways to work with these data. In addition, multi-scale datasets must be integrated and data analysis systems must be interoperable. Meeting these last challenges requires that standards for working with data be developed and adopted. Finally, Hood made the point that groups like his can solve some of the scientific issues related to computation, but not the infrastructure issues that must also be solved to make the vision a reality.

Fortunately, Geospiza is investigating technologies to meet current and future biocomputing challenges through the company’s product development and standards initiatives like the BioHDF project.

Third thrust: sustainability

Al Gore gave the third day’s keynote address and much of his talk addressed climate change. Gore reminded us that 400 years ago, Galileo collected the data that supported Copernicus’ theory that the earth’s rotation creates the illusion of the sun moving across the sky. He went on to explain how Copernicus reasoned that the illusion is created because the sun is so far away. Gore also explained how difficult it was for people of Copernicus’ or Galileo’s time to accept that the universe does not rotate around the earth.

Similarly, when we look into the sky we see an expansive atmosphere that seems to go on forever. Pictures from space, however, tell a different story. Those pictures show us that our atmosphere is a thin band, only 1/1000 of the earth’s volume. The finite volume of our atmosphere explains how we can change our climate when we pump billions of tons of CO2 into the atmosphere as we are doing now. It is also hard for many to conceptualize that the CO2 is affecting the climate when they do not see or feel direct or immediate effects. Gore added the interesting connection that the first oil well, drilled by “Colonel” Edwin Drake in Pennsylvania, and the discovery, by John Tyndall, that CO2 absorbs infrared radiation both occurred in 1859. So 150 years ago we not only had the means to create climate change, but also understood how it would work.

Gore outlined a number of ways in which supercomputing and the supercomputing community can help with global warming. Climate modeling and climate prediction are two important areas where supercomputers are used. Conference presentations and demonstrations on the exhibit floor made this clear. Less obvious applications involve modeling new electrical grids and more efficient modes of transportation. Many of the things we rely on daily are based on infrastructures that are close to 100 years old. From our internal combustion engines to our centralized electrical systems, inefficiency can be measured in billions of dollars that are lost annually to system failures or energy consumption that is not effective.

Gore went on to remind us that Moore’s law is a law of self-fulfilling expectations. When first proposed, it was a recognition of design and implementation capabilities with an eye to the future. Moore’s law worked because R&D funding was established to stay on track. We now have an estimated one billion transistors for every person on the planet. If we commit similar efforts to improving energy efficiency in ways analogous to Moore’s law, we can create a new self-fulfilling paradigm. The benefits of such a commitment would be significant. As Gore pointed out, our energy, climate, and economic crises are intertwined. Much of our national policy is in reaction to oil production disruption or the threat of disruption, and the costs of our policies are significant.

In closing, Gore stated that supercomputing is the most powerful technology we have today and represents the third form of knowledge creation, the first two being inductive and deductive reasoning. With supercomputers we can collect massive amounts of data, develop models, and use simulation to develop predictive and testable hypotheses. Gore noted that humans have a low bit rate, but high resolution. This means that while our ability to absorb data is slow, we are very good at recognizing patterns. Thus computers, with their ability to store and organize data, can be programmed to convert data into information and display information in new ways to give us new insights for solutions to the most vexing problems.

This last point resonated through all three keynotes. Computers are giving us new ways to work with data and understand problems; they are also providing new ways to share information and communicate with each other.

Geospiza is keenly aware of this potential and a significant focus of our research and development is directed toward solving data analysis, visualization, and data sharing problems in genomics and genetic analysis. In the area of Next Generation Sequencing (NGS), we have been developing new ways to organize and visualize the information contained in NGS datasets to easily spot patterns amidst the noise.

Sunday, November 8, 2009

Expeditiously Exponential: Data Sharing and Standardization

We can all agree that our ability to produce genomics and other kinds of data is increasing at exponential rates. Less clear are the consequences for how these data will be shared and ultimately used. These topics were explored in last month's (Oct. 9, 2009) policy forum feature in the journal Science.

The first article, listed under the category "megascience," dealt with issues about sharing 'omics data. The challenge is that systems biology research demands that data from many kinds of instrument platforms (DNA sequencing, mass spectrometry, flow cytometry, microscopy, and others) be combined in different ways to produce a complete picture of a biological system. Today, each platform generates its own kind of "big" data that, to be useful, must be computationally processed and transformed into standard outputs. Moreover, the data are often collected by different research groups focused on particular aspects of a common problem. Hence, the full utility of the data being produced can only be realized when the data are made open and shared throughout the scientific community. The article listed past efforts in developing sharing policies and the central table included 12 data sharing policies that are already in effect.

Sharing data solves only half of the problem; the other half is being able to use the data once shared. This requires that data be structured and annotated in ways that make them understandable by a wide range of research groups. Such standards typically include minimum information checklists that define specific annotations and which data should be kept from different platforms. The data and metadata are stored in structured documents that reflect a community's view about what is important to know with respect to how data were collected and the samples the data were collected from. The problem is that annotation standards are developed by diverse groups and, like the data, are expanding. This expansion creates new challenges with making data interoperable, which is the very problem standards try to address.
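
As a rough illustration of how a minimum information checklist works in practice, the sketch below checks one sample's metadata against a list of required annotations. The field names are hypothetical and are not taken from any published standard; a real checklist would define its own fields and allowed values.

```python
# Hypothetical checklist: annotations every sequencing sample must carry.
# These field names are assumptions for illustration only.
REQUIRED_FIELDS = ["sample_id", "organism", "platform", "collection_date"]


def missing_annotations(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in metadata."""
    missing = []
    for field in required:
        value = metadata.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# A record with a blank platform and no collection date.
record = {"sample_id": "S-001", "organism": "Mus musculus", "platform": ""}
print(missing_annotations(record))
```

A check like this is only the first step; it confirms that annotations exist, not that two groups mean the same thing by them.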

The article closed with high-level recommendations for enforcing policy through funding and publication requirements and acknowledged that full compliance requires that general concerns with pre-publication data use and patient information be addressed. More importantly, the article acknowledged that meeting data sharing and formatting standards has economic implications. That is, researchers need time-efficient data management systems, the right kinds of tools, and informatics expertise to meet standards. We also need to develop the right kind of global infrastructure to support data sharing.

Fortunately, complying with data standards is an area where Geospiza can help. First, our software systems rely on open, scientifically valid tools and technologies. In DNA sequencing we support community-developed alignment algorithms. The statistical analysis tools in GeneSifter Analysis Edition utilize R and Bioconductor to compare gene expression data from both microarrays and DNA sequencing. Further, we participate in the community by contributing additional open-source tools and standards through efforts like the BioHDF project. Second, the GeneSifter Analysis and Laboratory platforms provide the time-efficient data management solutions needed to move data through its complete life cycle from collection, to intermediate analysis, to publishing files in standard formats.

GeneSifter lowers researchers' economic barriers to meeting data sharing and annotation standards, keeping the focus on doing good science with the data.

Sunday, November 1, 2009

GeneSifter Laboratory Edition Update

GeneSifter Laboratory Edition has been updated to version 3.13. This release has many new features and improvements that further enhance its ability to support all forms of DNA sequencing and microarray sample processing and data collection.

Geospiza Products

Geospiza's two primary products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), form a complete software system that supports many kinds of genomics and genetic analysis applications. GSLE is the LIMS (Laboratory Information Management System) that is used by core labs and service companies worldwide that offer DNA sequencing (Sanger and Next Generation), microarray analysis, fragment analysis, and other forms of genotyping. GSAE is the analysis system researchers use to analyze their data and make discoveries. Both products are actively updated to keep current with the latest scientific and technological advances.

The new release of GSLE helps labs share workflows, perform barcode-based searching, view new data reports, simplify invoicing, and automate data entry through a new API (application programming interface).

Sharing Workflows

GSLE laboratory workflows make it possible for labs to define and track their protocols and data that are collected when samples are processed. Each step in a protocol can be configured to collect any kind of data, like OD values, bead counts, gel images and comments, that are used to record sample quality. In earlier versions, protocols could be downloaded as PDF files that list the steps and their data. With 3.13, a complete workflow (steps, rules, custom data) can be downloaded as an XML file that can be uploaded into another GSLE system to recreate the entire protocol with just a few clicks. This feature simplifies protocol sharing and makes it possible for labs to test procedures in one system and add them to another when they are ready for production.
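
To make the idea of a portable workflow file concrete, here is a minimal sketch of writing a protocol to XML and reading it back. The element and attribute names (`workflow`, `step`, `field`, `order`) are invented for this example and are not the GSLE export format, which is not described here.

```python
import xml.etree.ElementTree as ET


def workflow_to_xml(name, steps):
    """Serialize a workflow name and its ordered steps to an XML string."""
    root = ET.Element("workflow", {"name": name})
    for position, step in enumerate(steps, start=1):
        step_el = ET.SubElement(
            root, "step", {"order": str(position), "name": step["name"]}
        )
        # Each step lists the custom data fields it collects.
        for field in step["fields"]:
            ET.SubElement(step_el, "field", {"name": field})
    return ET.tostring(root, encoding="unicode")


def workflow_from_xml(text):
    """Rebuild the workflow name and ordered steps from an XML string."""
    root = ET.fromstring(text)
    ordered = sorted(root.findall("step"), key=lambda el: int(el.get("order")))
    steps = [
        {"name": el.get("name"), "fields": [f.get("name") for f in el.findall("field")]}
        for el in ordered
    ]
    return root.get("name"), steps


protocol = [
    {"name": "Quantify DNA", "fields": ["OD value", "comments"]},
    {"name": "Run gel", "fields": ["gel image"]},
]
exported = workflow_to_xml("Library prep", protocol)
print(workflow_from_xml(exported))
```

The round trip is the property that matters for sharing: whatever one system writes, another system can read back into the same steps in the same order.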

Barcode Searching and Sample Organization

Sometimes a lab needs to organize separate tubes in 96-well racks for sample preparation. Assigning each tube's rack location can be an arduous process. However, if the tubes are labeled with barcode identifiers, a bed scanner can be used to make the assignments. GSLE 3.13 provides an interface to upload bed scanner data and assign tube locations in a single step. Also, new search capabilities have been added to find orders in the system using sample or primer identifiers. For example, orders can be retrieved by scanning a barcode from a tube in the search interface.
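
The rack assignment step can be sketched as follows. The example assumes the scanner produces one `well,barcode` line per rack position; that file layout is an assumption made for illustration, since scanner output formats vary and the format GSLE accepts is not specified here.

```python
import csv
import io

# Well names for a standard 96-well rack: rows A-H, columns 1-12.
VALID_WELLS = {f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)}


def assign_tube_locations(scan_text):
    """Map each tube barcode to its well from 'well,barcode' scan lines."""
    locations = {}
    for row in csv.reader(io.StringIO(scan_text)):
        if not row:
            continue  # skip blank lines
        well, barcode = (item.strip() for item in row)
        well = well.upper()
        if well not in VALID_WELLS:
            raise ValueError(f"unknown well position: {well}")
        if not barcode:
            continue  # empty position in the rack
        if barcode in locations:
            raise ValueError(f"barcode scanned twice: {barcode}")
        locations[barcode] = well
    return locations


scan = "A1,TUBE0001\nA2,TUBE0002\nA3,\n"
print(assign_tube_locations(scan))
```

Rejecting unknown wells and repeated barcodes up front keeps a bad scan from silently assigning two tubes to the same sample.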

Reports and Data

Throughout GSLE, many details about data can be reviewed using predefined reports. In some cases, pages can be quite long, but only a portion of the report is interesting. GSLE now lets you collapse sections of report pages to focus on specific details. New download features have also been added to better support access to those very large NGS data files.

GSLE has always been good at identifying duplicate data in the system, but not always as good at letting you decide how duplicate data are managed. Managing duplicate data is now more flexible to better support situations where data need to be reanalyzed and reloaded.

The GSLE data model makes it possible to query the database using SQL. In 3.13, the view tables interface has been expanded so that the data stored in each table can be reviewed with a single click.
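
As an example of the kind of question SQL access makes easy to answer, the sketch below counts samples per completed order. It runs against an in-memory SQLite database with two made-up tables; the table and column names are assumptions and do not reflect the actual GSLE schema.

```python
import sqlite3

# Build a tiny stand-in database. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT);
    CREATE TABLE samples (id INTEGER PRIMARY KEY, order_id INTEGER, barcode TEXT);
    INSERT INTO orders VALUES (1, 'Lab A', 'complete'), (2, 'Lab B', 'pending');
    INSERT INTO samples VALUES
        (1, 1, 'TUBE0001'), (2, 1, 'TUBE0002'), (3, 2, 'TUBE0003');
    """
)


def samples_per_order(connection, status):
    """Return (order id, customer, sample count) for orders with a status."""
    query = """
        SELECT o.id, o.customer, COUNT(s.id)
        FROM orders AS o
        LEFT JOIN samples AS s ON s.order_id = o.id
        WHERE o.status = ?
        GROUP BY o.id, o.customer
        ORDER BY o.id
    """
    return connection.execute(query, (status,)).fetchall()


print(samples_per_order(conn, "complete"))
```

The status value is passed as a query parameter rather than pasted into the SQL string, which is the safer habit when values come from users or other systems.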


Core labs that send invoices will benefit from changes that make it possible to download many PDF-formatted orders and invoices into a single zipped folder. Configurable automation capabilities have also been added to set invoice due dates and generate multiple invoices from a set of completed orders.

API Tools

As automation and system integration needs increase, external programs are used to enter data from other systems. GSLE 3.13 supports automated data entry through a novel self-documenting API. The API takes advantage of GSLE's built-in data validation features that are used by the system's web-based forms. At each site, the API can be turned on and off by on-site administrators and its access can be limited to specific users. This way, all system transactions are easily tracked using existing GSLE logging capabilities. Each API-enabled form has a header that includes key codes, example documentation, and features to view and manually upload formatted data to test automation programs and help system integrators get their work done. GSLE 3.13 further supports enterprise environments with an improved API that is used to query external password authentication servers.
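
The combination of an on/off switch, per-user access, shared validation rules, and logging can be sketched in a few lines. Everything in this example (setting names, field names, and the `submit` function) is hypothetical and only illustrates the pattern; it is not the GSLE API.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api_sketch")

# Hypothetical site settings and form rules; none of these names come from GSLE.
API_ENABLED = True
API_USERS = {"integrator"}
SAMPLE_FORM_RULES = {"sample_name": str, "volume_ul": float}


def validate(record, rules):
    """Return a list of problems found when checking record against rules."""
    errors = []
    for field, expected_type in rules.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    return errors


def submit(user, record, enabled=API_ENABLED, allowed=API_USERS):
    """Apply the site switch, the user check, and then the form rules."""
    if not enabled:
        errors = ["API is turned off"]
    elif user not in allowed:
        errors = ["user is not permitted to use the API"]
    else:
        errors = validate(record, SAMPLE_FORM_RULES)
    # Every attempt is logged, whether it was accepted or rejected.
    log.info("user=%s accepted=%s", user, not errors)
    return {"ok": not errors, "errors": errors}


print(submit("integrator", {"sample_name": "S1", "volume_ul": 12.5}))
```

Because the automated path and the interactive form would share one set of rules in this design, data entered by a program is held to the same standard as data typed by a person.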