Archive for June, 2007

If you think High Performance Computing (HPC) is a Small Market….

June 30, 2007

Then consider the fact that at IDC’s breakfast briefing at ISC in Dresden this week, we were told HPC currently accounts for 19% of ALL worldwide server sales and 25% of (server) processors. With “niches” like that…

We also learned that while overall worldwide server sales growth is flattening at around 3%, HPC continues to grow at a healthy 9% rate. I’ll post more details after I receive the briefing materials from IDC.


HPC Consortium: A Brief History of Solaris

June 30, 2007

[wicked bible]

Phil Harman from Sun’s Solaris group gave an informative and amusing talk at the HPC Consortium meeting in Dresden this week titled, “A Brief History of Solaris.” I’m hoping the full talk will be posted on the Consortium site at some point.

Phil began his history of Solaris by reminding us of some of the “prehistoric” innovations in SunOS. For example, who but Sun was doing open network computing back in the 1980s with innovations like NFS, NIS, the automounter, XDR, and RPC? How about the STREAMS abstraction? mmap? ld.so?

He then moved to innovations done by Sun “within living memory.” His list included loadable, configurable kernels; dynamic system domains; /proc; truss; the p-tools; and /etc/nsswitch.conf. Not to mention “audacious” SMP scalability and a compatible 32/64-bit strategy that preserved binary investments through the transition to 64-bit computing. Oh yes, and there was that Java thing as well…

Innovations done “just yesterday” included Hierarchical Lgroup Support (HLS), Multiple Page Size Support (MPSS), containers, Service Management Facility (SMF), zones, BrandZ, ZFS, and DTrace.

He finished with some comments on ZFS, which he motivated with the graphic I’ve placed at the top of this blog post. It illustrates the problems of single-bit errors. In this case, a printer was fined by the King of England, for what amounted to a life’s wages, for omitting the word “not” from the seventh commandment (“Thou shalt not commit adultery”) in a 1631 edition of the King James Bible, known ever since as the Wicked Bible. “Got checksums?”, asked Phil, noting that ZFS protects the data path all the way from the rotating rust (the disk) to memory.

Does the “I” in RAID mean “Inexpensive” or “Independent”? The former is correct, so why do some in our industry prefer the “independent” interpretation? Phil explained why during his talk and also in this blog entry.

HPC Consortium: Big SMPs in Education

June 30, 2007

[bernd dammann, dtu]

Bernd Dammann, Associate Professor at the Technical University of Denmark, spoke this week at Sun’s HPC Consortium meeting in Dresden. The title of his talk was “Using Large SMP machines for research and education — some experiences from the Technical University of Denmark.”

As part of his introduction, Bernd mentioned that the University was founded in 1829 by H.C. Ørsted. The school was relocated in the 1960s to the site of a former airport, which is evident if you look at the site layout.

The University is strong in a number of areas, notably wind turbine design and materials optimization (e.g., how much material can be cut away from a jet to reduce weight while still maintaining safety and structural integrity). Work has also been done on satellite-based magnetic imaging of the Earth, and we were told that students at the University are heavily involved in corporate-sponsored eco-vehicle design contests. DTU is a Sun Center of Excellence in interval arithmetic and dynamic systems.

The HPC Center at DTU is built around a large amount of Sun “big iron” (large SMP machines), which forms the core of the Center’s computational capability; they also have several other Sun hardware models in their machine room. Through a series of acquisitions, DTU now has on site two Sun Fire E25Ks with 96 and 72 cores, three Sun Fire E6900s with 48 cores each, 10 V440s, and a Sun Fire T2000 system.

All of their SPARC machines are kept at the same revision of Solaris (currently S10 11/06), which makes the complex easy to maintain and administer with two part-time system administrators.

In addition to their central compute infrastructure, DTU has the largest deployment of Sunray thin clients in Scandinavia with over 600 in use. Of these, they use about 24 as part of a “mobile classroom” that can be deployed on short notice in locations for temporary use. Students love the thin clients and appreciate the ability to access their desktop sessions from any Sunray on campus using their smartcards.

Bernd made several interesting points with respect to their Sun compute infrastructure. First, the variety of SPARC implementations and system architectures in their compute complex is used to advantage in their High Performance Computing course to expose students to a range of systems. They are also able to explore both OpenMP and MPI on their systems. In addition, because they use this single environment for both education and research, students who move on to become researchers are already familiar with the full range of scientific and productivity tools deployed on the HPC compute infrastructure.
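To make the OpenMP/MPI point a bit more concrete, here is a minimal hybrid “hello world” sketch of the kind of exercise such a course might use to show both programming models side by side. This is my own illustrative code, not anything from Bernd’s talk; it only assumes a working MPI installation and an OpenMP-capable C compiler.

```c
/* Illustrative hybrid MPI + OpenMP "hello world" -- not from the talk.
 * Each MPI process spawns a team of OpenMP threads; every thread reports
 * its MPI rank and OpenMP thread number.
 *
 * Build with something like:
 *   mpicc -xopenmp hello_hybrid.c -o hello_hybrid   (Sun Studio)
 *   mpicc -fopenmp hello_hybrid.c -o hello_hybrid   (GCC)
 */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    #pragma omp parallel
    {
        /* Every OpenMP thread in every MPI process identifies itself. */
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Run it under something like mpirun -np 4 with OMP_NUM_THREADS set, and both levels of parallelism are visible at once, which maps nicely onto an environment that mixes large SMP nodes with smaller cluster nodes.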

He summarized the value proposition as follows. They no longer have wasted desktop cycles tied up in thick clients. They have lower administration costs, and they have consolidated their software licenses onto their central infrastructure. In addition, they can deploy software centrally, they can ensure that students do not tamper with configurations while still allowing them the freedom to install their own software in $HOME, and they do not have virus issues.

As a drawback, Bernd mentioned that their thin client environment was unfortunately not suitable for supporting heavy OpenGL 3D graphics for their users. After the talk, I introduced Bernd to Linda Fellingham, the engineering manager in charge of Sun’s shared and scalable visualization products. As it turns out, Sun has a solution for DTU that will allow them to install an existing, but noisy, high-end graphics workstation in their central machine room and then route 3D graphics output directly to Sunrays in a seamless way. It’s a pretty slick software solution (read more here). When I asked Bernd later whether he had found the Consortium meeting useful, he cited this interaction as an example of how meeting with Sun’s engineering and other staff at such events is very useful for him.

Dresden at Night

June 28, 2007

[dresden at night, view 1]

[dresden at night, view 2]

HPC Consortium: Making Solaris Transparent with DTrace

June 28, 2007

Thomas Nau, self-described Solaris Geek and head of the Infrastructure Department at the University of Ulm, gave a talk this week on DTrace at the HPC Consortium meeting in Dresden. In his view, DTrace is the tool of choice for understanding system and application performance issues.

Thomas described briefly how DTrace works to support dynamic, lightweight instrumentation of both kernel and user code, with some 40,000 probe points available for use within Solaris and the D scripting language available to perform custom processing. He particularly likes DTrace’s aggregation facilities, which support gathering and condensing data for easier interpretation (for example, creating simple, text-based histograms of data values collected during a run).

He also pointed out that, contrary to what some say, DTrace does not need to be run as root. Instead, one can use the Solaris RBAC (role-based access control) facility to grant particular, DTrace-specific privileges to users. These privileges are dtrace_proc, dtrace_user, and dtrace_kernel. See here for more details.

For those wanting to go beyond the simple ASCII text output and graphics created by DTrace, Thomas recommended a utility called the Chime Visualization Tool, which is available on the OpenSolaris community website.


Chime sample output.

Thomas mentioned that /usr/demo/dtrace contains lots of D scripts that can help the new user learn DTrace. But the Solaris Dynamic Tracing (DTrace) Guide available on docs.sun.com remains his favorite reference.

HPC System: Ranger System at TACC

June 28, 2007

As they say, things are just bigger in Texas.

On Monday, several members of the staff from the Texas Advanced Computing Center (TACC) joined the HPC Consortium meeting in Dresden by remote link. The audio was unfortunately not very good, but we managed to hear most of what was said. Speakers were Jay Boisseau (TACC Director) and (I believe) Tommy Minyard (Assistant Director). If there was a third speaker, I apologize–as I said, the audio was not good.

Ranger will be a 504 TFLOPS system, built using the Sun Constellation System architecture with two ultra-dense switches and almost 4,000 Sun four-socket, quad-core nodes using close to 16,000 AMD Barcelona processors. With 2 GB of memory per core, there will be over a hundred terabytes of memory in the system, along with 72 Sun “Thumper” storage systems providing a total of 1.7 PB of raw storage. The InfiniBand interconnect is a 7-stage, non-blocking Clos network with latency and bandwidth of approximately 2.3 µs and 950 MB/sec, respectively.
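For context on those latency and bandwidth numbers, they are the kind of figures one typically measures with a simple two-rank ping-pong micro-benchmark. Here is a rough sketch of such a test; it is generic illustrative code (the message size and iteration count are arbitrary), not TACC’s actual benchmark.

```c
/* Generic MPI ping-pong sketch for estimating point-to-point latency and
 * bandwidth between two ranks. Illustrative only -- not TACC's benchmark.
 * Run with exactly two ranks; use small messages to estimate latency and
 * large messages (as here) to estimate bandwidth.
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NITER 1000
#define MSG_BYTES (1 << 20)   /* 1 MB messages */

int main(int argc, char **argv)
{
    int rank, i;
    double t0, elapsed, one_way;
    char *buf = malloc(MSG_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    for (i = 0; i < NITER; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    elapsed = MPI_Wtime() - t0;
    if (rank == 0) {
        one_way = elapsed / (2.0 * NITER);  /* each iteration is a round trip */
        printf("avg one-way time: %g s, bandwidth: %g MB/s\n",
               one_way, (MSG_BYTES / one_way) / 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Running a test like this between nodes in different parts of the fabric is a common way to check how uniform an interconnect’s latency really is.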

Physically, the system will reside in about 90 racks in six rows. It will require about 3.4 MW of power.

Ranger will run Linux and the OpenFabrics InfiniBand stack. It will use Lustre as its cluster file system and will run two MPI libraries: MVAPICH and Open MPI (the code base on which Sun’s MPI for Solaris is based). TACC will use multiple compiler suites, including Sun Studio. Sun Grid Engine will be used as Ranger’s distributed resource management system.

HPC Consortium: TSUBAME

June 27, 2007

On Monday Professor Satoshi Matsuoka updated the HPC Consortium attendees on TSUBAME, Asia’s largest supercomputer center and the largest HPC system to date built with Sun technology. His talk was titled, “TSUBAME Update — The People’s Supercomputer at Tokyo Institute of Technology.”

Since its installation, the TSUBAME user population has grown to 10,000 registered users, including 1,200 registered supercomputer-class users. The system has had a lot of media exposure in Japan, including several magazine cover articles and coverage on national television.

As part of creating the vision of a people’s supercomputer that supports a wide variety of users, TSUBAME implements multiple usage models, ranging from best-effort service to a higher quality of service that is billed per unit of use. Professor Matsuoka gave us a brief overview of some of the work they’ve done in this area and recommended a detailed Sun Blueprint titled Sun N1 Grid Engine Software and the Tokyo Institute of Technology Supercomputer Grid for further reading.

Usage continues to increase. One of Professor Matsuoka’s deputies has been known to launch 20,000 simultaneous Gaussian jobs on TSUBAME! It is not surprising, then, that in under 12 months over 1.3 million ISV application runs have been done on the machine.

Work was done recently to improve TSUBAME’s LINPACK performance number, which is how a machine’s ranking in the TOP500 list is determined. TSUBAME moved from #9 to #14 on the list in spite of having submitted a new LINPACK run demonstrating an additional 1.5 TFLOPS, which indicates just how competitive this list is. TSUBAME remains the largest supercomputer in Asia.

PetaScale Unveiled: Photos From Dresden

June 26, 2007

A few photos from the unveiling of the ultra-dense components of Sun’s new Constellation System architecture tonight in Dresden.


Andy Bechtolsheim, Marc Hamilton, and a magnum of champagne


Behold The Switch!


The ultra-dense switch with cable management


Overall booth shot: crowd, press, etc.


A sea of switch ports…

Sun Constellation System: Petascale Computing Done Right

June 26, 2007

Today Sun is revealing, in a technology preview at the International Supercomputing Conference in Dresden, the approach we will use to build TACC’s 500+ TFLOPS Ranger system later this year, as well as other large machines that have not yet been announced.

We call systems built with this approach Sun Constellation Systems. Such systems can scale from TeraFLOPS up into the PetaFLOPS range. To my mind, Sun’s approach to petascale starts with this:

[sun constellation system connector]

Yes, it’s a connector. Specifically, this connector allows three 4X InfiniBand links to be run across a single cable rather than three separate cables. The cable is also both higher quality and significantly less bulky than the three separate cables taken together. The connector itself is also mechanically and electrically superior to standard InfiniBand connectors. Sun (by which I mean AndyB and others working closely with him) has put a lot of thought into this, because to build a petascale system effectively one needs to closely examine current approaches and assess whether just “doing more of the same” really gets you where you need to be, or whether new thinking is needed.

The above dense cabling approach complements our ultra-dense blade server and ultra-dense InfiniBand switch components. The blade server supports 48 blades (768 cores) in a single rack/chassis (the rack is the chassis in this case), and the switch sports over 3,000 4X InfiniBand ports, with all of the complexity of a multi-stage network compressed into a single chassis, removing the need for a huge number of cables and a large number of intermediate, discrete InfiniBand switches. Compared to a conventional approach to building a large cluster, the Sun Constellation System uses 1/6 the cables, has a 20% smaller footprint, and needs one switch rather than 300 (not a typo).

The switch component is a double-wide chassis and it looks like this:

[ultra-dense switch]

The blade chassis looks like this:

[ultra-dense blade chassis]

I’m walking over to the show floor at ISC here in Dresden in an hour or so and will post some photos later tonight.

HPC Consortium: Shared Memory Parallelization on Multi-Core Processors

June 26, 2007

[dieter an mey, rwth aachen]

While I enjoyed Barton and Ruud’s talks about the Niagara 2 processor yesterday at Sun’s HPC Consortium meeting in Dresden, I always get more of a kick from customer presentations. In this case, Dieter an Mey from RWTH Aachen University gave a nice talk about the pitfalls and benefits of multi-core processors for programmers. While it was delivered here at a High Performance Computing event, the observations and lessons are applicable to anyone interested in application performance on multi-core processors. Given the direction of our industry, that encompasses a lot of programmers.

Dieter and his colleague Christian Terboven examined performance on several systems based on a variety of processors: the UltraSPARC IV, UltraSPARC T2 (Niagara 2), Intel Woodcrest and Clovertown, and a quad-core AMD Opteron. For each of these systems, they measured achievable aggregate bandwidth over a variety of active thread counts, processor bindings, and memory placements. I’ve included a few of his slides with the Niagara 2 performance results removed (sorry). As Dieter said, the results aren’t surprising once you look carefully at the non-uniformities in the underlying system architectures. If, however, programmers do not understand these issues, they will likely achieve very sub-optimal application performance. My own concern is that as multi-core and multi-threaded processors become the norm across the computer industry, programmers will not understand these issues in the way a seasoned HPC programmer might. This is one small part of the challenge the software industry faces in helping programmers achieve high performance on these new processors.
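As a rough illustration (and emphatically not Dieter’s actual benchmark), the kind of aggregate-bandwidth measurement described above is often done with a STREAM-style triad like the sketch below. The array sizes here are arbitrary, and thread binding and memory placement would be controlled outside the program, for example with environment variables or the operating system’s placement tools.

```c
/* STREAM-style triad sketch for probing aggregate memory bandwidth with
 * OpenMP. Illustrative only -- not the RWTH Aachen benchmark. N should be
 * large enough that the arrays overflow all caches.
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (32 * 1024 * 1024)   /* 32M doubles per array, ~256 MB each */
#define NTRIALS 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0;
    double t0, elapsed, gbytes;
    long i;
    int t;

    /* Initialize in parallel so that, with a first-touch placement policy,
     * each thread's pages land in its own locality group on a NUMA system. */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }

    t0 = omp_get_wtime();
    for (t = 0; t < NTRIALS; t++) {
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];
    }
    elapsed = omp_get_wtime() - t0;

    /* The triad touches three arrays of N doubles per trial. */
    gbytes = 3.0 * N * sizeof(double) * NTRIALS / 1e9;
    printf("threads=%d, aggregate bandwidth ~ %.1f GB/s\n",
           omp_get_max_threads(), gbytes / elapsed);

    free(a); free(b); free(c);
    return 0;
}
```

Varying OMP_NUM_THREADS and the thread binding makes the non-uniformities Dieter described visible: aggregate bandwidth typically stops scaling, or even drops, once threads start competing for the same memory controllers or spill across sockets and boards.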

In addition to running the above bandwidth tests, Dieter and Christian ran two applications on each of these systems. The first was a very cache-friendly code used to compute contact interactions between bevel gears. The second was a Navier-Stokes code that puts high stress on the memory sub-system due to its manipulation of sparse data structures. They also ran additional throughput tests, running multiple copies of these applications on each system.

The results for Niagara 2 demonstrated the value of the CMT approach in hiding latency for throughput workloads, and showed that, with the increased floating-point capabilities of this new processor, it can do so even for floating-point intensive codes.

Oh, and one more thing. The graph below shows performance results for the bevel gear code run with different numbers of threads. Look at Columns 5 and 6. These results were generated on Solaris and Linux using the same compilers and the exact same hardware. Higher is better. 🙂