Horst Simon (LBNL)

Using HPC to Drive Innovation and Knowledge Discovery from Petascale Data

Virtually all fields of scientific endeavor base their hypothesis-testing methodologies on some form of data analysis. Scientific disciplines vary in how they produce data (via observation or simulation), in how they manage data (storage, retrieval, archiving, indexing, summaries, sharing across the science team), and in how they analyze data and communicate results. It is widely agreed that one of the primary bottlenecks in modern science is managing data and discovering knowledge in the face of the tsunami of data resulting from increasing computational capacity and the increasing fidelity of scientific observational instruments[1]. Further, as data becomes too large to move, we are evolving towards a model where data-intensive services are centrally located[2]. These services span a diverse set of activities that form the basis of the planned "NERSC Data" effort, including but not limited to: community-oriented data repositories; browsing, exploration, and analysis capabilities that operate on the centrally located community repositories; and the provision and maintenance of the centrally located hardware and software infrastructure that enables these capabilities. In this presentation I will discuss how an HPC center such as NERSC will have to address the future data challenge, and what potential scientific pay-offs we will be able to derive from dealing efficiently with scientific data at the petascale level.

[1] Richard P. Mount, ed., The Office of Science Data-Management Challenge: Report from the DOE Office of Science Data-Management Workshops, March-May 2004; http://www.sc.doe.gov/ascr/ProgramDocuments/Final-report-v26.pdf.
[2] Gordon Bell, Jim Gray, and Alex Szalay, "Petascale Computational Systems", IEEE Computer 39(1), January 2006.