Moving Toward a Parallel Universe
High-performance Computing for Geospatial Data Processing
Massively Parallel Technologies Inc.
Philosophically, I’m attuned to the way Yogi Berra, the great Yankee catcher, looks at the world. He is a source of unique wisdom that runs much deeper than it may first appear. One of my favorite Yogiisms is “You can observe a lot by watching.”
At the recent ASPRS/MAPPS (American Society for Photogrammetry and Remote Sensing/Management Association for Private Photogrammetric Surveyors) conference in Charleston, South Carolina, I noticed there were fewer of the MAPPS guys with the white beards who built the aerial survey firms from the bottom up and more guys with short, black goatees and European accents. Only one exhibitor displayed a system for exploiting hardcopy source data. It struck me that, just as a new generation of people is beginning to drive the geospatial industry, we are at a transition point marked by a proliferation of new, high-resolution sensors and digital processing methods.
The future is roaring in like a jet fighter in full afterburner. Sensors such as high-resolution satellite imagers, hyperspectral scanners, LIDAR, SAR, and IFSAR provide us with information at a level of detail we only dreamed about a few years ago. The operators of these sensors promised that great things could be done with their data, but many of those promises have gone unfulfilled. As Yogi said, “It's hard to make predictions, especially about the future.”
Studies such as the year 2000 Report of the Independent Commission on the National Imagery and Mapping Agency (reference: http://www.fas.org/irp/agency/nima/commission/toc.htm) identified both Department of Defense and commercial geospatial data collection agencies as collection-centric, neglecting data exploitation capabilities. There is a pressing need to improve the processing and use of geospatial data. Three converging technologies, when employed together, hold the promise of significantly improving both the way geospatial data is processed and how users are able to exploit it: mass storage, high-speed data communications, and high-productivity computing.
Over the last few years the geospatial industry has moved from individual workstations with limited hard disk storage to large, inexpensive RAID configurations, and from there to mass storage media that can be configured as Network Attached Storage or as highly capable Storage Area Networks handling terabytes of data and beyond. The full potential of mass storage, however, is best realized by coupling very responsive RAID with cost-efficient, robotic tape libraries linked by automated file management systems that minimize file management issues.
While the dot-com failures delayed widespread access to high-speed data systems, the market is recovering and technological advances are improving capabilities. Connection speeds have not kept pace with processing power; an early 2000 report estimated that computing power had grown 15 times faster than connection speeds over the preceding eighteen months. More recent reports bemoan the fact that other countries are ahead of the United States in deploying broadband infrastructure, achieving speeds ten or more times those available in this country. Even so, after a slow start, the United States is catching up, mostly because the cable industry has picked up the ball.
The most obvious convergence of these two technologies in the geospatial world appears in the data warehouses provided by companies such as Pixxures, i-cubed, and GIS Data Depot, or by Microsoft's TerraServer. These warehouses provide ready access to data through easy-to-use interfaces, but offer only limited capabilities for manipulating and analyzing data across the Net.
The technology with the greatest potential to change the way we process and work with geospatial data is high productivity computing. In the past, high performance was considered the realm of supercomputers. Supercomputers were expensive to build and complex to operate. To a great extent, access was limited to large, government-driven research programs and sophisticated industrial applications. The focus now is moving from high-performance to high-productivity systems that open the door to a whole new range of uses and users.
High-productivity computing systems have the potential to greatly reduce both processing times and costs while shortening time to market. Deployed in conjunction with advanced storage and data transmission capabilities, they could significantly improve the way users interact with geospatial data. Users could access data in warehouses across the Internet and manipulate and analyze it with very powerful tools. The data need not reside on the user’s workstation; only the results need be downloaded. The potential exists to reach current data more quickly and to perform interactive data manipulations, advanced modeling, or simulations that let users explore a range of analyses in search of optimal solutions.
To many geospatial operators, the convergence of the three technologies for advanced geospatial operations may appear to be a costly, complex process with significant development risks. It may not even appear to be realistic within the foreseeable future. However, in the geophysical industry the three technologies have converged to provide some proven and very powerful seismic data exploitation capabilities.
It is interesting to analyze why the geophysical industry has accomplished this ahead of the geospatial industry. Two reasons may be that the oil exploration firms own the data process, from data collection through data exploitation, and that they see the need to merge the technologies to stay competitive. The geospatial industry has taken a more incremental approach to technology integration. The industry is not as homogeneous, with separate data collectors and data conversion companies and with more of the exploitation processing relegated to users. Geospatial companies assume more risk in making the investment because they don’t control parts of the data flow chain. Nevertheless, the geospatial industry can benefit from the experience and development efforts of the geophysical industry.
The movement in both the geospatial and geophysical markets is toward a parallel universe. Over the past decade, tremendous advances have been made in parallel processing, with considerable research funding from programs such as the Department of Energy's Accelerated Strategic Computing Initiative (ASCI), which supports nuclear stockpile studies. The bulk of the systems on the “Top 500” supercomputer list are no longer built from large-memory, high-performance proprietary processors. Today’s supercomputing systems comprise thousands of less proprietary, lower-cost processors running free operating systems such as Linux. With this trend, the differences between supercomputers and other computers are becoming less distinct.
Linux is popular because it runs on multiple hardware platforms and users can develop software without being locked into a single vendor. The open-source approach also means that scientists can share application software and improvements. Advances in memory and computing power, along with developments like blade servers, increasingly point to cluster technologies as leading players in the high productivity computing market.
Yogi reminded us of another simple truth when he said, “We made too many wrong mistakes.” Maturation of parallel processing systems and cluster technologies has now enabled us to learn from early implementers who paid the price of making “too many wrong mistakes.” George Spix of Microsoft has an excellent webcast on High Performance Computing Essentials at http://www.microsoft.com/hpc. Before deciding to invest in a parallel processing system, buyers need to understand at least two key elements of parallel systems: Amdahl’s Law, and the total cost of ownership.
Amdahl’s Law establishes the definitive equation for determination of efficiency, speedup, and scalability of parallel processing systems. It defines the maximum theoretical speedup as a function of the amount of parallel activity within the system. In other words, the effect of any serial activity must be minimized as much as possible to limit degradation of the system’s speedup capacity. Unless a problem is embarrassingly parallel, where each processor works independently of the other processors, most current parallel processing systems achieve a maximum parallel activity of only about 95%, regardless of the number of processors used in the cluster. This translates to a maximum potential speedup of 20x, a barrier inherent in traditional approaches to parallel processing.
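For readers who want the numbers, Amdahl’s Law can be written as S(n) = 1 / ((1 − p) + p/n), where p is the fraction of the work that runs in parallel and n is the number of processors. A quick sketch in Python (my own illustration, not anyone’s production code) makes the 20x ceiling at 95% parallel activity concrete:

```python
def amdahl_speedup(p, n):
    """Maximum theoretical speedup for parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% parallel activity, speedup plateaus near 20x no matter how
# many processors are added: the 5% serial fraction dominates.
for n in (10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

Even a million processors cannot push past 1/0.05 = 20x, which is why raising the parallel fraction matters far more than adding nodes.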
The second element is total cost of ownership. Because current cluster systems are limited in achievable speedup by the degree of parallel activity, the incremental benefits gained from adding more processors are both expensive and time consuming to obtain. The benefits of a 1000-node cluster are generally not more than twice those achievable with a 100-node cluster, making it difficult to justify the 10x cost differential. Continued improvements in processor performance have exposed I/O subsystems as a significant bottleneck. Conventional cluster systems attempt to achieve greater efficiency through expensive I/O systems that typically cost three or more times as much as the processing hardware. Geophysical companies have built clusters of thousands of nodes, yet have achieved only the efficiencies dictated by Amdahl’s Law and are just beginning to address I/O problems.
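The 100-node versus 1000-node comparison falls straight out of Amdahl’s Law. Assuming the roughly 95% parallel fraction typical of current clusters (my back-of-the-envelope check, not a vendor benchmark):

```python
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

p = 0.95                         # assumed parallel fraction for a typical cluster
s100 = amdahl_speedup(p, 100)    # roughly 16.8x
s1000 = amdahl_speedup(p, 1000)  # roughly 19.6x

# Ten times the nodes (and roughly ten times the hardware cost) buys
# well under a 2x improvement in achievable speedup.
print(round(s1000 / s100, 2))
```

The ratio comes out to about 1.17, which is exactly why the 10x cost differential is so hard to justify.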
Cost drivers are both the need to maintain a staff with the expertise to program and operate a parallel processing system, and the need to replace components as more powerful components are introduced. In addition, general operating costs increase significantly due to the massive amounts of power and cooling required for large clusters.
The general state of high-performance computing is best summarized by an August 2002 statement by Ron Brachman of the Defense Advanced Research Projects Agency (DARPA) Information Processing Technology Office. He stated that “computational performance was increasing, but productivity and effectiveness were not keeping up. System complexity may actually be reversing the information revolution. The cost of building and maintaining systems is growing out of control. Systems have short life spans with decreasing return on investment. Demands on expertise of users are constantly increasing. Users have to adapt to system interfaces, rather than vice versa.” While this may seem harsh, it provides a roadmap for improving the situation, and improvements may be on the horizon.
The development in Japan in the spring of 2002 of the NEC Earth Simulator, the world’s most powerful supercomputer, has spurred a new round of development in the United States. DARPA’s latest effort is its High Productivity Computing Systems (HPCS) program. Currently, three competitors are participating in a three-year Phase II effort. Cray will work on new processor architectures, processor-in-memory technology, and software models. IBM is working on an effort dubbed Productive, Easy-to-use, Reliable Computing Systems, or PERCS. Sun Microsystems is working on a simplified, single-system architecture that would reduce the cumbersome programming efforts clustered systems require. While a commercially viable Phase III system might not be available until near the end of the decade, development efforts should provide benefits to the users of high-productivity computing systems before then.
Massively Parallel Technologies (MPT, Louisville, Colo.) is also participating in the HPCS program through a Small Business Innovation Research grant. MPT has developed a revolutionary approach to parallel computing that achieves a degree of parallel activity in excess of 99.9%, yielding speedups in the 100x to 200x range and beyond. MPT’s initial system is very cost efficient, running on off-the-shelf Windows NT-based computers with 100Base-T connectivity. The technology resembles a cluster in topology but functions more like a supercomputer. I/O efficiency is an integral part of MPT’s solution. Much of MPT’s early development work for DARPA involved image processing. MPT is beginning commercial operations in the first quarter of 2004.
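Those claimed speedups are at least consistent with Amdahl’s Law at a 99.9% parallel fraction, as a quick check shows (again my own sketch using the standard formula, not MPT’s figures or methods):

```python
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# At 99.9% parallel activity the serial fraction is only 0.1%, so a few
# hundred commodity nodes already yield triple-digit speedups.
for n in (128, 256, 1024):
    print(n, round(amdahl_speedup(0.999, n)))
```

Around 128 nodes the formula gives roughly 114x, and around 256 nodes roughly 204x, squarely in the 100x-to-200x range; pushing the parallel fraction even higher is what opens the door to the "and beyond."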
The rapid growth of high-resolution geospatial sensors is creating a growing demand for high-performance computing to process large datasets. Many of the image-processing functions are compute and I/O intensive. Parallel processing clusters using commodity computers can provide a good ratio of price to performance to meet evolving needs. You don’t have to be a “Top 500” supercomputer site to use parallel processing to improve geospatial operations, but you do have to be aware of the issues involved in implementing a parallel processing solution. After all, as Yogi said, “If you don’t know where you are going, you may not get there.”