Sunday, November 22, 2009

Supercomputing 09

Teraflops, exaflops, exabytes, exascale, extreme, high dimensionality, 3D Internet, sustainability, green power, high performance computation, 400 Gbps networks, and massive storage were just some of the many buzzwords seen and heard last week at the 21st annual supercomputing conference in Portland, Oregon.

Supercomputing technologies and applications are important to Geospiza. As biology becomes more data intensive, Geospiza follows the latest science and technology developments by attending conferences like Supercomputing. This year, we participated in the conference through a "birds of a feather" session focused on sharing recent progress in the BioHDF project.

Each year the Supercomputing (SC) conference has focus areas called "thrusts." This year the thrusts were 3D Internet, Biocomputing, and Sustainability. Each day of the technical session started with a keynote presentation that focused on one of the thrusts. Highlights from the keynotes are discussed below.

First thrust: the 3D Internet

The technical program kicked off with a keynote from Justin Rattner, VP and CTO at Intel. In his address, Rattner discussed the business reality that high performance computing (HPC) is an $8 billion business with little annual growth (about 3% per year). The primary sources of HPC funding are government programs and trickle-up technology from PC sales. To break the dependence on government funding, Rattner argued that HPC needs a "killer app," and suggested that the 3D Internet might just be that app. He went on to elaborate on the kinds of problems that are solved with HPC, such as continuously simulating environments, realistic animation, dynamic modeling, and continuous perspectives. Also, because immersive and collaborative virtual environments can be created, the 3D Internet provides a platform for enabling many kinds of novel work.

To illustrate, Rattner was joined by Aaron Duffy, a researcher at Utah State. Or rather, Duffy's avatar joined us, because his presentation was delivered from within the Science SIM environment. Science SIM is a virtual reality system that is used to model environments and how they respond to change. For example, Utah State is studying how ferns respond to and propagate genetic changes in FernLand. Another example showed how 3D modeling can save time and materials in fashion design.

Next, Rattner described how the current 3D Internet resembles the early days of the Internet, when people struggled with the isolated networks of AOL, Prodigy, and CompuServe. It wasn't until Tim Berners-Lee and Marc Andreessen introduced the World Wide Web's HTTP protocol and the Mosaic web browser that the Internet had a platform on which to standardize. Similarly, the 3D Internet needs such a platform. Rattner introduced OpenSim as a possibility. In the OpenSim platform, extensible modules can be used to create different worlds. Because these worlds are built with a common infrastructure, users could have an avatar that moves between worlds, rather than a new avatar for each world as they do today.

Second thrust: biocomputing

Leroy Hood kicked off the second day with a keynote on how supercomputing can be applied to systems biology and personalized medicine. Hood expects that within 10 years diagnostic assays will be characterized by billions of measurements. We will have two primary kinds of information feeding these assays: the digital data of the organism and data from the environment. The challenge is measuring how the environment affects the organism. To make this work we need to integrate biology, technology, and computers in better ways than we do today.

In terms of personalized medicine, Hood described different kinds of analyses and their purposes. For example, global analysis - such as sequencing a genome, measuring gene expression, or comprehensive protein analysis - creates catalogs. These catalogs then form the foundation for future data collection and analysis. The goal of such analysis is to create predictive, actionable models. Biological data, however, are noisy, and meaningful signals can be difficult to detect. Improving the signal-to-noise ratio requires the ability to integrate large volumes of multi-scalar data with diverse data types, including biological knowledge. Because the goal is to develop predictive, actionable models, we need supercomputers capable of dynamically quantifying information.

As an example, Hood presented work showing how disease states result in perturbations in regulated networks. In prion disease, the expression of many genes changes over time as non-disease states move toward disease states. Interestingly, as disease progression is followed in mouse models, one can see expression levels change in genes that were not thought to be involved in prion disease. More importantly, these genes show expression changes before the physiological effects are observed. In other words, by observing gene expression patterns, one can detect a disease much earlier than one would by observing symptoms. Because diseases detected early are easier to treat, early detection can have beneficial consequences for reducing health care costs. However, measuring gene expression changes by observing changes in RNA levels is currently impractical. The logical next step is to see if gene expression can be measured by detecting changes in the levels of blood proteins. Of course, Hood and team are doing that too, and he showed data, from the prion model, that this is a feasible approach.
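The early-detection idea can be sketched with a toy calculation (the gene names, numbers, and z-score cutoff below are invented for illustration, not taken from the prion study): flag genes whose disease-model expression departs from the control distribution before the time point at which symptoms appear.

```python
from statistics import mean, stdev

# Hypothetical time-course expression values (arbitrary units) for three
# genes in control and disease-model mice; all numbers are invented.
control = {
    "GeneA": [10.1, 10.3, 9.8, 10.0],
    "GeneB": [5.2, 5.0, 5.1, 4.9],
    "GeneC": [20.0, 19.5, 20.5, 20.2],
}
disease = {
    "GeneA": [10.0, 10.4, 9.9, 10.2],   # stays within the control range
    "GeneB": [5.1, 6.8, 9.4, 12.0],     # diverges well before symptoms
    "GeneC": [20.1, 20.0, 21.0, 26.0],  # diverges only at symptom onset
}

def early_divergers(control, disease, z_cutoff=3.0, symptom_timepoint=3):
    """Flag genes whose disease expression departs from the control
    distribution before the symptomatic time point."""
    flagged = []
    for gene, ctrl in control.items():
        mu, sigma = mean(ctrl), stdev(ctrl)
        for t, value in enumerate(disease[gene]):
            if t >= symptom_timepoint:
                break
            if abs(value - mu) > z_cutoff * sigma:
                flagged.append((gene, t))  # record gene and first divergent time
                break
    return flagged

print(early_divergers(control, disease))  # → [('GeneB', 1)]
```

Only GeneB is flagged, at the second time point, which is the signature Hood described: an expression signal that precedes the physiology.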

Using the above example, and others from whole genome sequencing, Hood painted a picture of future diagnostics where we will have our genomes sequenced at birth and each of us will have a profile created of organ-specific proteins. In Hood's world, this would require 50 measurements from 50 organs. Blood protein profiles will be used as controls in regular diagnostic assays. In other assays, like cancer diagnostics, thousands of individual transcriptomes will be measured simultaneously in single assays. Similarly, 10,000 B-cells or T-cells could be sequenced to assess immune states and diagnose autoimmune disorders. In the not-too-distant future, it will be possible to interrogate databases containing billions of data points from hundreds of millions of patients.

With these possibilities on the horizon, there are a number of challenges that must be overcome. Data space is infinite, so queries must be constructed carefully. The data that need to be analyzed have high dimensionality, so we need new ways to work with these data. In addition, multi-scale datasets must be integrated, and data analysis systems must be interoperable; meeting these last challenges requires that standards for working with data be developed and adopted. Finally, Hood made the point that groups like his can solve some of the scientific issues related to computation, but not the infrastructure issues that must also be solved to make the vision a reality.

Fortunately, Geospiza is investigating technologies to meet current and future biocomputing challenges through the company’s product development and standards initiatives like the BioHDF project.

Third thrust: sustainability

Al Gore gave the third day's keynote address, and much of his talk addressed climate change. Gore reminded us that 400 years ago, Galileo collected the data that supported Copernicus' theory that the earth's rotation creates the illusion of the sun moving across the sky. He went on to explain how Copernicus reasoned that the illusion is created because the sun is so far away. Gore also explained how difficult it was for people of Copernicus' or Galileo's time to accept that the universe does not revolve around the earth.

Similarly, when we look into the sky we see an expansive atmosphere that seems to go on forever. Pictures from space, however, tell a different story. Those pictures show us that our atmosphere is a thin band, only about 1/1000 of the earth's volume. The finite volume of our atmosphere explains how we can change our climate when we pump billions of tons of CO2 into it, as we are doing now. It is also hard for many to conceptualize that CO2 is affecting the climate when they do not see or feel direct or immediate effects. Gore added the interesting connection that the first oil well, drilled by "Colonel" Edwin Drake in Pennsylvania, and John Tyndall's discovery that CO2 absorbs infrared radiation both occurred in 1859. Thus, 150 years ago we not only had the means to create climate change, but also understood how it would work.

Gore outlined a number of ways in which supercomputing and the supercomputing community can help with global warming. Climate modeling and climate prediction are two important areas where supercomputers are used. Conference presentations and demonstrations on the exhibit floor made this clear. Less obvious applications involve modeling new electrical grids and more efficient modes of transportation. Many of the things we rely on daily are based on infrastructures that are close to 100 years old. From our internal combustion engines to our centralized electrical systems, inefficiency can be measured in the billions of dollars that are lost annually to system failures and wasted energy.

Gore went on to remind us that Moore's law is a law of self-fulfilling expectations. When first proposed, it was a recognition of design and implementation capabilities with an eye to the future. Moore's law worked because R&D funding was established to stay on track. We now have an estimated one billion transistors for every person on the planet. If we commit similar efforts to improving energy efficiency in ways analogous to Moore's law, we can create a new self-fulfilling paradigm. The benefits of such a commitment would be significant. As Gore pointed out, our energy, climate, and economic crises are intertwined. Much of our national policy is in reaction to oil production disruption, or the threat of disruption, and the costs of our policies are significant.

In closing, Gore stated that supercomputing is the most powerful technology we have today and represents the third form of knowledge creation, the first two being inductive and deductive reasoning. With supercomputers we can collect massive amounts of data, develop models, and use simulation to develop predictive and testable hypotheses. Gore noted that humans have a low bit rate, but high resolution: while our ability to absorb data is slow, we are very good at recognizing patterns. Thus computers, with their ability to store and organize data, can be programmed to convert data into information and display that information in new ways, giving us new insights for solutions to the most vexing problems.

This last point resonated through all three keynotes. Computers are giving us new ways to work with data and understand problems; they are also providing new ways to share information and communicate with each other.

Geospiza is keenly aware of this potential and a significant focus of our research and development is directed toward solving data analysis, visualization, and data sharing problems in genomics and genetic analysis. In the area of Next Generation Sequencing (NGS), we have been developing new ways to organize and visualize the information contained in NGS datasets to easily spot patterns amidst the noise.
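As a small, hypothetical illustration of this kind of pattern-spotting (a toy sketch, not Geospiza's actual implementation), per-cycle base quality summaries computed across many reads can reveal systematic trends, such as quality decay late in a sequencing run, that are hard to see when looking at reads one at a time. The FASTQ records below are made up for the example.

```python
# A few invented FASTQ records (Sanger/Phred+33 quality encoding:
# 'I' = Q40, ';' = Q26, '#' = Q2).
fastq = """\
@read1
ACGTACGT
+
IIIIII;;
@read2
TTGCAACG
+
IIIII;;;
@read3
GGCATCGA
+
IIIIII;#
"""

def per_cycle_quality(text):
    """Return the mean Phred quality score at each cycle (base position),
    averaged across all reads in a FASTQ string of equal-length reads."""
    lines = text.strip().split("\n")
    quals = [lines[i] for i in range(3, len(lines), 4)]  # every 4th line
    cycles = zip(*quals)  # transpose: one tuple of characters per position
    return [round(sum(ord(c) - 33 for c in col) / len(quals), 1)
            for col in cycles]

print(per_cycle_quality(fastq))
# → [40.0, 40.0, 40.0, 40.0, 40.0, 35.3, 26.0, 18.0]
```

The summary makes the late-cycle quality drop obvious even though each individual read looks unremarkable, which is the essence of turning noisy per-read data into a visible pattern.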
