Archive for November, 2007

The Horror of Recognition

November 26, 2007

[harvard college, 1981]

I came across the above photo as I was browsing through old yearbook photos on my alma mater’s web site several months ago.

My first reaction was, “Oh my, they look incredibly geeky.”

My second reaction was, “Hey, I recognize the guy with the sideburns!”

My third reaction was, “Oh, wait. Crap–that’s me sitting next to him! What a pair.”

Inexplicably, my fourth reaction was “Hey, I should post this on the blog.” So there you go–a glimpse into computing and attire at Harvard in the early eighties. And eyeglass styles– let’s not forget that.

HPC Podcast: Innovating@Sun

November 23, 2007

Hal Stern and I recently discussed Sun and High Performance Computing on his podcast show, Innovating@Sun. We talked about Sun’s Constellation System components for HPC, trends and challenges in HPC, and Sun’s history in HPC. We also discussed software, including Solaris for HPC. Check it out on blogs.sun.com or on iTunes. Running time, 17 minutes.

Cool Math: Pick’s Theorem

November 22, 2007

[pick's theorem examples]

You see on the left three simple polygons (a polygon is simple if its boundary does not cross itself). How would you determine the areas of these shapes? The rectangle is easy. The blue polygon is a little more tedious, since you need to count the number of interior grid squares. But how would you find the area of the red polygon?

As it turns out, you can easily compute the area of any simple polygon whose vertices are aligned on a regular, square grid using Pick’s Theorem, which says the area of such a polygon can be found as I + B/2 – 1 where I is the number of grid points on the interior of the polygon and B is the number of grid points lying along the boundary of the polygon. I find it amazing that this works for any simple polygon.

We can see by inspection that the green rectangle has area 42 (6 × 7). Let’s apply Pick’s Theorem. There are 30 grid points in the interior and 26 grid points on the boundary. 30 + 26/2 – 1 = 42. Magic. 🙂

Now let’s try the blue polygon. I = 25, B = 52, so Pick’s Theorem says the area is 25 + 52/2 – 1 = 50, which is correct by inspection. By my count, the red polygon’s area is 70 + 24/2 – 1 = 81. My lines are a little fat, so I made some (consistent) judgment calls about “in” or “on”–your count may be slightly different from mine.
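
If you want to check the counts yourself, below is a small C sketch of my own (not from either of the pages linked below) that verifies Pick’s Theorem for a simple lattice polygon: it computes the area with the shoelace formula, counts B by summing gcd(|dx|, |dy|) over the edges, counts I by brute-force point-in-polygon testing over the bounding box, and prints both the shoelace area and I + B/2 – 1 so you can see them agree. The vertices used are those of the green rectangle; swap in your own.

    /* pick.c - check Pick's Theorem (A = I + B/2 - 1) for a simple lattice polygon.
     * Illustrative sketch; the vertices below are the 6x7 rectangle from the post. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { long x, y; } Pt;

    static long gcdl(long a, long b) {
        a = labs(a); b = labs(b);
        while (b) { long t = a % b; a = b; b = t; }
        return a;
    }

    /* 1 if lattice point p lies on the closed segment a-b */
    static int on_segment(Pt p, Pt a, Pt b) {
        long cross = (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
        if (cross != 0) return 0;
        return (p.x >= (a.x < b.x ? a.x : b.x)) && (p.x <= (a.x < b.x ? b.x : a.x)) &&
               (p.y >= (a.y < b.y ? a.y : b.y)) && (p.y <= (a.y < b.y ? b.y : a.y));
    }

    int main(void) {
        /* Green rectangle: 6 wide, 7 tall, anchored at the origin */
        Pt v[] = { {0,0}, {6,0}, {6,7}, {0,7} };
        int n = sizeof v / sizeof v[0];

        /* Doubled signed area via the shoelace formula, and boundary count B:
         * each edge contributes gcd(|dx|,|dy|) lattice points (excluding its start). */
        long area2 = 0, B = 0;
        long minx = v[0].x, maxx = v[0].x, miny = v[0].y, maxy = v[0].y;
        for (int i = 0; i < n; i++) {
            Pt a = v[i], b = v[(i + 1) % n];
            area2 += a.x * b.y - b.x * a.y;
            B += gcdl(b.x - a.x, b.y - a.y);
            if (a.x < minx) minx = a.x;
            if (a.x > maxx) maxx = a.x;
            if (a.y < miny) miny = a.y;
            if (a.y > maxy) maxy = a.y;
        }
        double area = labs(area2) / 2.0;

        /* Count interior lattice points I by brute force: for every lattice point in
         * the bounding box that is not on the boundary, cast a ray and count crossings. */
        long I = 0;
        for (long x = minx; x <= maxx; x++) {
            for (long y = miny; y <= maxy; y++) {
                Pt p = { x, y };
                int boundary = 0, inside = 0;
                for (int i = 0; i < n; i++)
                    if (on_segment(p, v[i], v[(i + 1) % n])) { boundary = 1; break; }
                if (boundary) continue;
                for (int i = 0; i < n; i++) {          /* ray casting toward +x */
                    Pt a = v[i], b = v[(i + 1) % n];
                    if ((a.y > y) != (b.y > y)) {
                        double xint = a.x + (double)(y - a.y) * (b.x - a.x) / (b.y - a.y);
                        if (xint > x) inside = !inside;
                    }
                }
                I += inside;
            }
        }

        printf("shoelace area = %.1f, I = %ld, B = %ld, I + B/2 - 1 = %.1f\n",
               area, I, B, I + B / 2.0 - 1.0);
        return 0;
    }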

Visit this page to explore Pick’s Theorem with an interactive Java applet. See this page for one proof of the theorem.

HPC Consortium Presentations now Online

November 21, 2007

Most of the slides used at the HPC Consortium meeting in Reno are now posted to the consortium web site. These include customer talks, many or all of the partner talks, and some of the Sun talks. I counted 29 presentations as of this writing. Go here.

Lustre Update

November 20, 2007

Peter Braam, founder of Cluster File Systems Inc. and now VP of Lustre at Sun, gave an update on Lustre to the 150+ Sun HPC customers who attended the HPC Consortium meeting in Reno prior to the Supercomputing ’07 conference.

He spoke briefly about Lustre’s place in the HPC market, citing examples of its use in the TOP500. Subsequent to his talk we learned that the most current list (which was released at Supercomputing) shows that Lustre is used on 7 of the 10 largest supercomputers in the world. While it is used at the very high end, Lustre also has a strong presence in Oil & Gas, in digital animation, in EDA (electronic design automation), and at several large ISPs, to name a few areas.

The current Lustre release is 1.6, which brings some major usability improvements over 1.4, a release still in use by some customers today. Version 1.8 is targeted for the 2nd quarter of 2008 and will include support for ZFS and Solaris on the storage server. Version 2.0 is scheduled for the 4th quarter of 2008 and will add a clustered metadata capability and server network striping. These are the plans. Insert standard caveats about engineering plans here.

Peter talked about the post-acquisition integration into Sun and described it as smooth. While it was disruptive to some (notably, Peter himself), the team has remained largely structured in the same way. I do know that some Sun managers and engineers have joined the Lustre team, which I think is a great way to help the Lustre team continue its transition into Sun. We’ve also paired Lustre engineers and managers with old Sun hands as mentors to help ease the transition. It helps, I’m sure, that some of these mentors themselves came into Sun from small companies through acquisition. From all I’ve heard, the integration seems to be going quite well.

In terms of the business ramifications of the acquisition, continuity is the theme. Still open source, still the same model for customer support, and we will continue business with Lustre’s various OEMs. And, of course, Linux continues to be a focus while we also work to expand Solaris support.

So, what’s up with ZFS and Lustre? Lustre servers today are built on the Linux ext3 local filesystem, and CFS was able to achieve extreme performance with it. Version 1.8 will add support for ZFS with the intent of hardening Lustre and driving for even higher levels of scalability. The servers will run in user space, using the user-space ZFS code, and there will be server migration tools available for those customers wishing to migrate from an ext3-based server to one based on ZFS.

Version 1.8 will also see the addition of a network request scheduler to improve I/O scheduling, based on work done at Oak Ridge National Laboratory (ORNL), a Lustre Center of Excellence, on Jaguar, their 8000-client HPC cluster.

One funny point Peter made: Sun is back in Phase III of the DARPA HPCS program. Recall (perhaps) that Sun was not selected to proceed from Phase II to Phase III–but CFS was to supply the file system for Cray’s solution, and so Sun is back. 🙂 As part of our involvement in HPCS Phase III, there will be significant future enhancements to Lustre to support some fairly daunting requirements on file creation rates, client bandwidths, and extremely large file counts. All good news for the HPC community at large.

ClusterTools 7.1 Now Available for Free Download

November 19, 2007

ClusterTools 7.1, which includes the latest version of Sun’s MPI library for Solaris x86/x64 and SPARC, is now available for free download here. This release adds support for 32- and 64-bit Intel-based platforms, improves support for 3rd-party parallel debuggers, includes improved memory usage for communication, adds PBS Pro validation, and bundles additional bug fixes contributed by the Open MPI community.
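
If you are installing ClusterTools for the first time, a quick sanity check is the usual Open MPI workflow: compile a trivial program with the mpicc wrapper and launch it with mpirun. The program below is a generic MPI hello-world of my own for that purpose, not something shipped with ClusterTools; paths, hostfiles, and process counts will of course depend on your installation.

    /* hello_mpi.c - minimal MPI sanity check for a fresh ClusterTools install */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, namelen;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many were launched?  */
        MPI_Get_processor_name(name, &namelen); /* which node am I on?      */

        printf("Hello from rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }

Once the wrappers are on your PATH, something like "mpicc -o hello_mpi hello_mpi.c" followed by "mpirun -np 4 ./hello_mpi" should print one line per rank.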

University of Warsaw: New HPC Perspectives and Prospects

November 12, 2007

Marek Niezgodka, Director of the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) at the University of Warsaw, spoke this weekend at the HPC Consortium meeting in Reno.

ICM is a high-end computing center for research and applications in Poland, a national laboratory in computational and informational sciences, and a partner and leader on multiple grid projects.

ICM research focuses in several areas, including:

  • Distributed information systems and grids, including healthcare infrastructure and distributed high-end computing
  • Quantitative biology of systems, including bioinformatics, functional proteomics, protein engineering, etc.
  • Design and characterization of functional materials, including nanomaterials and bionanomaterials
  • Biomedical modelling of blood circulation and physiology, tissue engineering, and imaging
  • Non-linear process dynamics of complex networks

In addition to research activities, the Center is heavily involved in delivering wide-area services: for example, numerical weather prediction for central Europe at 4 km horizontal resolution, with additional forecasts for the northern Atlantic and Asia. ICM also functions as a knowledge repository and as a healthcare grid for cardiology, and it offers large-scale data processing and analysis for industry and the public sector.

ICM is currently undergoing a significant infrastructure expansion, including a doubling of staff to approximately 300 by 2010 and an expansion of data capacity to between 5 and 10 petabytes by 2009. Compute capability will be expanded to a total of approximately 100 TFLOPs. This deployment is currently underway and will be completed in 2008. The core of the system is built with Sun Constellation components, including Thumper (X4500) storage.

Multicore Performance Analysis Tools from Academia

November 12, 2007

Karl Fuerlinger, from the Innovative Computing Laboratory at the University of Tennessee at Knoxville, spoke about multicore performance analysis tools at the HPC Consortium meeting here in Reno yesterday. He focused on tools available from academia rather than vendor-supplied tools.

In Karl’s view, the vendor tools are powerful, commercially supported, and typically limited to the vendor’s own platform, while academic tools are generally cross-platform, often include advanced or experimental techniques like automated performance analysis, and often focus more on high levels of scalability.

Popular academic tools include:

  • PAPI, which supports platform-independent access to hardware counters. PAPI has recently been expanded to support access to additional counter types beyond CPU counters. Temperature sensors, HW events on NICs, and instrumentation on memory interfaces are examples. It is possible to generate composite displays showing time-lines of FLOP rates, system temperature, etc.
  • TAU, which has extensive support for tracing and profiling and is considered by many to be the Swiss Army knife of profiling tools.
  • KOJAK/SCALASCA, which offers trace-based automatic performance analysis. It does this by automatically searching traces for patterns of inefficiency, with demonstrated scalability to 22K processes.
  • Vampir, a tracefile visualization tool for MPI that has applicability for other programming models as well.
  • ompP, a profiling tool for OpenMP and the focus of Karl’s work. ompP uses a source-based instrumentation approach to gain independence from specific compilers and runtimes. It is tested and supported on Linux, Solaris, and AIX, and with the Pathscale, PGI, gcc, IBM, and Sun compilers. Codes are instrumented to understand how much time is spent in imbalance, synchronization, limited parallelism, and thread management states (a toy kernel illustrating the first two appears at the end of this post). Incremental and continuous profiling are supported.

Karl pointed out that these academic tools tend to interoperate with each other. For example, PAPI can be used by most of the above tools to access performance counter information. Profiles can be gathered by several of these tools and then visualized with TAU. And trace data collected with these tools can be fed into the KOJAK/SCALASCA automatic trace analysis capabilities. Traces generated from TAU or KOJAK/SCALASCA can be visualized with Vampir.
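
To make the ompP description above more concrete, here is a toy OpenMP kernel of my own (not from Karl’s talk) with the two pathologies that ompP’s overhead categories are designed to expose: a statically scheduled triangular loop, so later iterations do far more work than earlier ones (imbalance), and a critical section that serializes the accumulation (synchronization).

    /* imbalance.c - toy OpenMP kernel with deliberate load imbalance and
     * lock contention, the kind of behavior a profiler like ompP makes visible. */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int n = 20000;
        double total = 0.0;

        /* Static scheduling of a triangular loop: iteration i does i units of
         * work, so the threads holding the last chunks do far more than the first. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double local = 0.0;
            for (int j = 0; j < i; j++)
                local += 1.0 / (1.0 + i + j);

            /* Serializing the accumulation adds synchronization time on top of
             * the imbalance; a reduction clause would be the idiomatic fix. */
            #pragma omp critical
            total += local;
        }

        printf("total = %f (up to %d threads)\n", total, omp_get_max_threads());
        return 0;
    }

Profiling this, then rebuilding with schedule(dynamic) or, better, a reduction(+:total) clause and comparing the two profiles, is a simple way to see what those overhead categories mean in practice.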

100 TFLOPs Insufficient?

November 12, 2007

James Leylek, Executive Director of the Clemson University Computational Center for Mobility Systems, spoke at the HPC Consortium meeting about the computational requirements for simulation of vehicle-related phenomena.

A main point of Dr. Leylek’s talk was that unsteady simulations are required to adequately model the physical behavior of mobility systems. There are many cases in which unsteady or turbulent mechanisms dominate in this class of problems: boundary layer issues, laminar-to-turbulent transitions, so-called Type II transient flows, and so on. The key, though, is finding appropriate numerical techniques to perform these simulations.

Typical mobility application areas include Formula 1 race cars, airplane wing design, engine fan design, aircraft carriers, submarines, engine block cooling, and blood flow through artificial hearts.

As an example of the problem sizes in this space, Leylek described what is required to simulate the aerodynamics of a Formula 1 race car. It requires 300M finite element volumes, with eight equations per volume, for a total of about 2.4B equations to be solved. And because of the unsteady nature of flows around these bodies, the simulations must be run for tens of thousands of time steps. This essentially means that dedicating even 100 TFLOPs to one team would not be sufficient to allow the dozens of “what if” experiments needed during the vehicle design phase. When one realizes that aerodynamics is just one of a number of attributes that must be simulated for this one application area, the situation becomes even more daunting.
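
Just to put the scale in one place, here is the back-of-the-envelope arithmetic in code form. The 300M volumes and eight equations per volume are from the talk; the time-step and run counts are my own stand-ins for “tens of thousands of time steps” and “dozens of experiments.”

    /* scale.c - raw counts behind the Formula 1 aerodynamics example.
     * Volume and equation counts are from the talk; the step and run
     * counts are illustrative placeholders, not Leylek's figures. */
    #include <stdio.h>

    int main(void) {
        double volumes = 300e6;  /* finite element volumes               */
        double eqns    = 8.0;    /* equations per volume                 */
        double steps   = 2e4;    /* assumed: "tens of thousands" of steps */
        double runs    = 50.0;   /* assumed: "dozens" of what-if runs     */

        double unknowns = volumes * eqns;   /* ~2.4e9 equations per time step */
        printf("equations per time step  : %.2e\n", unknowns);
        printf("equation updates per run : %.2e\n", unknowns * steps);
        printf("updates for %.0f runs     : %.2e\n", runs, unknowns * steps * runs);
        return 0;
    }

Whatever per-equation solver cost you assume, multiplying it into roughly 10^15 equation updates makes it easy to see why even a dedicated 100 TFLOPs does not go as far as it sounds for a single design team.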

There are a number of numerical methods that can be used to perform these simulations. Full unsteady simulation will remain impractical until much larger computational facilities are available at a more affordable cost. In the meantime, what to do? The Computational Center for Mobility Systems at Clemson brings together a large amount of Sun HPC gear and the algorithmic expertise needed to team with companies and other organizations, using the unique capabilities of semi-deterministic stress model (SDSM) techniques to deliver value to its partners in the shorter term. The point is to be smarter about how these problems are solved and not be intimidated by the computational requirements predicted by extrapolations from brute-force methodologies.

UltraSPARC T2 for HPC: A Customer Assessment

November 12, 2007

Dieter an Mey, HPC Team Lead at RWTH Aachen’s Center for Computing and Communication, presented an evaluation of the suitability of Sun’s UltraSPARC T2 processor for High Performance Computing at the HPC Consortium meeting in Reno.

The Aachen study compares systems with the T2 processor against systems with Sun’s UltraSPARC IV processor, with AMD Opteron processors, and with Intel Woodcrest and Clovertown processors. The test cases used were representative of a range of applications and attributes that are important to users at Aachen.

I will briefly summarize the results here and recommend those interested in more detail visit this page for a full explanation of the methodology and to view the detailed results.

Aachen examined several performance kernels: memory bandwidth, LINPACK, and sparse matrix-vector multiplication. They also examined results for several applications, including TFS, which is used to model nasal flow for computer-aided surgery. This code can be run in several ways using OpenMP for parallelization. They also ran FLOWer and a code that does contact analysis of bevel gears. In addition to these application tests, Aachen ran multiple instances of applications simultaneously to assess the throughput capabilities of each system. A power and performance/power analysis was also done.

The results showed that a combination of T2-based systems and x64/x86 systems would be ideal for Aachen. Very cache-friendly codes did not benefit as much from the T2 architecture and performed better on the Intel- and AMD-based systems; the bevel gear code is an example. TFS, on the other hand, performed better in throughput mode on the T2 system. In both cases the best results were about 2X better than the alternative. That is, the Intel/AMD systems generally did about 2X better than the T2 system on cache-friendly codes, while the T2 system was 2X better in cases where memory bandwidth was the limiting factor.
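
For readers who want to reproduce the flavor of these results, the memory bandwidth case is the easiest to approximate: a STREAM-style triad (the sketch below is mine, not Aachen’s actual kernel) streams three large arrays through memory with essentially no cache reuse, which is exactly the regime where the T2 did well in Aachen’s measurements.

    /* triad.c - STREAM-style triad: a memory-bandwidth-bound kernel with
     * essentially no cache reuse.  Illustrative only, not the Aachen benchmark. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)   /* ~17M doubles per array, far larger than any cache */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }

        #pragma omp parallel for
        for (int i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];        /* 2 loads + 1 store per iteration */
        double t1 = omp_get_wtime();

        /* Three arrays of 8-byte doubles move through memory each pass */
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        printf("triad: %.3f s, ~%.1f GB/s with up to %d threads\n",
               t1 - t0, gbytes / (t1 - t0), omp_get_max_threads());

        free(a); free(b); free(c);
        return 0;
    }

The more a code looks like this loop, the better the T2 fared in Aachen’s measurements; the more it looks like the cache-friendly bevel gear code, the more the advantage swings back to the x86/x64 systems.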