Archive for April, 2008

Winter Into Spring

April 26, 2008

Two final aerial photos from this Winter’s travels as we head now into Spring…

HPC User Forum: Operating System Panel

April 26, 2008

As I mentioned in an earlier entry, I participated in the HPC interconnect panel discussion at IDC’s HPC User Forum meeting in Norfolk, Virginia last week. I also sat on the Operating System panel, the subject of this blog entry.

Because the organizers opted against panel member presentations, the following slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.

[os panel slide 1]

My fellow panelists during this session were Kenneth Rozendal from IBM, John Hesterberg from SGI, Benoit Marchand from eXludus, Ron Brightwell from Sandia National Laboratories, John Vert from Microsoft, Ramesh Joginpaulli from AMD, and Richard Walsh from IDC.

The framing topic areas suggested by Richard Walsh prior to the conference were used to guide the discussion:

  • Ensuring scheduling efficiency on fat nodes
  • Managing cache and bandwidth resources on multi-core chips
  • Linux and Windows: strengths, weaknesses, and alternatives
  • OS scalability and resiliency requirements of petascale systems

As you’ll see, we covered a wider array of topics during the course of the panel session.

[os panel slide 2]

Beowulf clusters have been popular with the HPC community since about 1998. The idea arose in part as a reaction against expensive, “fat” SMP systems and proprietary, expensive software. Typical Beowulf clusters were built of “thin” nodes (typically one or two single-CPU sockets), commodity ethernet, and an open source software stack customized for HPC environments.

With multi-core and multi-threaded processors now becoming the norm, nodes are maintaining their svelte one or two rack unit form factors, but are becoming much beefier internally. As an extreme example, consider Sun’s new SPARC Enterprise T5140 server, which crams 128 hardware threads, 16 FPUs, 64 GB of memory, and close to 600 GB of storage into a single rack unit (1.75″) form factor. Or the two rack-unit version (the T5240) that doubles the memory to 128 GB and supports up to almost 2.4 TB of local disk storage in the chassis. I call nodes like these Sparta nodes because they are slim and trim…and very powerful. Intel and AMD’s embracing of multicore ensures that future systems will generally become more Spartan over time.

Clusters need an interconnect. While traditional Beowulf clusters have used commodity Ethernet, they have often done so at the expense of performance for distributed applications with significant bandwidth and/or latency requirements. As was discussed in the interconnect panel session at the HPC User Forum, InfiniBand (IB) is now making significant inroads into HPC at attractive price points and will continue to do so. Ethernet will also continue to play a role, but commodity 1 GbE is not in the same league as IB with respect to either bandwidth or latency. And InfiniBand currently enjoys a significant price advantage over 10 GbE, which does offer (at least currently) comparable bandwidth to IB, though without a latency solution. The use of IB in Sparta clusters allows the nodes to be more tightly coupled, in the sense that a broader range of distributed applications will perform well on these systems due to the increased bandwidth and the much lower latencies achievable with InfiniBand OS-bypass capabilities.

This trend towards beefier nodes will have profound effects on HPC operating system requirements. Or said in a different way, this trend (and others discussed below) will alter the view of the HPC community towards operating systems. The traditional HPC view of an OS is one of “software that gets in the way of my application.” In this new world, while we must still pay attention to OS overhead and deliver good application performance, the role of the OS will expand and deliver significant value for HPC.

[os panel slide 3]

The above is a photo I shot of the T5240 I described earlier. This is the 2RU server that recently set a new two-socket SPEComp record as well as a SPECcpu record. Details on the benchmarks are here. If you’d like a quick walkthrough of this system’s physical layout, check out my annotated version of the above photo here.

[os panel slide 4]

The industry shift towards multicore processors has created concern within the HPC community and more broadly as well. There are several challenges to be addressed if the value and power of these processors are to be realized.

The increased number of CPUs and hardware threads within these systems will require careful attention be paid to operating system scalability to ensure that application performance does not suffer due to inefficiencies in the underlying OS. Vendors like Sun, IBM, and SGI, who have worked on OS scaling issues for many years, have experience in this area, but there will doubtless be continuing scalability and performance challenges as these more closely coupled hardware complexes become available with ever larger memory configurations and with ever faster IO subsystems.

There was some disagreement within the panel session over the ramifications to application architectures of these beefier nodes when they are used as part of an HPC cluster. Will users continue to run one MPI process per CPU or thread, or will fewer MPI processes be used per node, with each process then consuming additional on-node parallelism via OpenMP or some other threading model? I am of the opinion that the mixed/hybrid style (combined MPI and threads) will be necessary for scaling to very large clusters because at some point scaling MPI will become problematic. In addition, regardless of the cluster size under consideration, using MPI within a node is not very efficient. MPI libraries can be optimized in how they use shared memory segments for transferring message data between MPI processes, but such data transfers are much less efficient than a threading model, which takes full advantage of the fact that all of the memory on a node is immediately accessible to all of the threads within one address space.

The tradeoff is that this shift from pure MPI programming to hybrid programming does require application changes, and the mixed model can be more difficult since it requires thinking about two levels of parallelism. If this shift to multi-core and multi-threaded processors were not such a fundamental sea change, I would agree that recoding would not be worthwhile. However, I view this shift as being as profound as the one that caused the HPC community to move to distributed programming models with PVM and MPI and to recode their applications at that time.
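To make those two levels of parallelism concrete, here is a minimal hybrid sketch, not taken from the panel slides: one MPI rank per node, with OpenMP threads consuming the on-node parallelism. The loop body and sizes are placeholder work I’ve invented purely for illustration.

```c
/* Hybrid MPI + OpenMP sketch: one MPI rank per node, OpenMP threads
 * exploit the cores/hardware threads within the node.
 * Compile with something like: mpicc -fopenmp hybrid.c (or -xopenmp
 * with Sun Studio). The work inside the loop is a placeholder. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for an MPI library that tolerates threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* On-node parallelism: threads share the rank's address space,
     * so no message passing is needed inside the node. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++) {
        local_sum += (double)i / nranks;   /* placeholder work */
    }

    /* Inter-node parallelism: MPI only between nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("threads per rank: %d, global sum: %g\n",
               omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

The point of the sketch is simply that the programmer now reasons about both the OpenMP pragma (on-node) and the MPI call (between nodes), which is exactly the extra burden discussed above.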

Another challenge is that of efficient use of available memory bandwidth, both between sockets and between sockets and memory. As more computational power is crammed into a socket, it becomes more important for 1) processor and system designers to increase available memory bandwidth, 2) operating system designers to provide capabilities that allow applications to make effective use of that bandwidth, and 3) tool vendors to provide visibility into application memory utilization to help application programmers optimize their use of the memory subsystem. In many cases, memory performance, rather than CPU, will become the gating factor on performance.

As the compute, memory, and IO capacities of these beefier nodes continue to grow, the resiliency of these nodes will become a more important factor. With more applications and more state within a node, downtime will be less acceptable within the HPC community. This will be especially true in commercial HPC environments where many ISV applications are commonly used and where these applications may often be able to run within a single beefy node. In such circumstances, OS capabilities like proactive fault management, which identifies nascent problems and takes action to avoid system interruption, become much more important to HPC customers. This is an interesting development, since capabilities like fault management have traditionally been developed for the enterprise computing market.

The last item–interconnect pressures–is fairly obvious. As nodes get beefier and perform more work, they put a larger demand on the cluster interconnect, both for compute-related communication and for storage data transfers. InfiniBand, with its aggressive bandwidth roadmap, and an ability to construct multi-rail fabrics, will play an important role in maintaining system balance. Well-crafted low level software (OS, IB stack, MPI) will be needed to handle the larger loads at scale.

[os panel slide 5]

Beyond the challenges of multi-core and multi-threading, there are opportunities. For all but the highest end customers, beefier nodes will allow node counts to grow more slowly, allowing datacenter footprints to grow more slowly, decreasing the rate of complexity growth and scaling issues.

Much more important, however, is that with more hardware resources per node it will now be possible to dedicate a small to moderate amount of processing power to handling OS tasks while minimizing the impact of that processing on application performance. The ability to fence off the application workload from OS processing, using mechanisms like processor sets, processor bindings, and the ability to turn off device interrupt processing on selected CPUs, should allow applications to run with reduced jitter while still supporting a full, standard OS environment on each node, without having to resort to microkernel or other approaches to deliver low jitter. Using standard OSes (possibly stripped down by removing or disabling unneeded functions) is very important for several reasons.
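As one hedged illustration of the fencing idea: Solaris exposes processor_bind(2) (and processor sets via psrset) for pinning work to particular CPUs. The sketch below is my own minimal example, not anything from the panel slides, and the processor ID it uses is an arbitrary placeholder.

```c
/* Minimal sketch of pinning the current process to one CPU on Solaris.
 * Processor ID 4 is an arbitrary placeholder; in practice you would
 * choose CPUs that are not servicing device interrupts or OS daemons,
 * or use processor sets managed by the administrator or batch system. */
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <stdio.h>

int main(void)
{
    processorid_t previous;

    /* Bind this process (P_PID, P_MYID) to processor 4. */
    if (processor_bind(P_PID, P_MYID, 4, &previous) != 0) {
        perror("processor_bind");
        return 1;
    }

    /* Application work now runs on CPU 4, away from CPUs left
     * to handle interrupts and other OS activity. */
    printf("bound; previous binding was %d\n", (int)previous);
    return 0;
}
```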

As mentioned earlier, OS capabilities are becoming more important, not less. I’ve mentioned fault management and scalability. In a few slides we’ll talk about power management, virtualization and other capabilities that will be needed for successful HPC installations in the future. Attempt to build all of that into a microkernel and you end up with a kernel. You might as well start with all of the learning and innovation that has accrued to standard OSes and minimize and improve where necessary rather than building one-off or few-off custom software environments.

I worry about how well-served the very high end of the HPC market will be in the future. It isn’t a large market or one that is growing like other segments of HPC. While it is a segment that solves incredibly important problems, it is also quite a difficult market for vendors to satisfy. The systems are huge, the software scaling issues tremendously difficult, and, frankly, both the volume and the margins are low. This segment has argued for many years that meeting their scaling and other requirements guarantees that a vendor will be well-positioned to satisfy any other HPC customer’s requirements. It is essentially the “scale down” argument. But that argument is in jeopardy to the extent that the high-end community embraces a different approach than is needed for the bulk of the HPC market. Commercial HPC customers want and need a full instance of Solaris or Linux on their compute nodes because they have both throughput and capability problems to run and because they run lots of ISV applications. They don’t want a microkernel or some other funky software environment.

I absolutely understand and respect the software work being done at our national labs and elsewhere to take advantage of the large-scale systems they have deployed. But this does not stop me from worrying about the ramifications of the high-end delaminating from the rest of the HPC market.

[os panel slide 6]

The HPC community is accustomed to being on the leading/bleeding edge, creating new technologies that eventually flow into the larger IT markets. InfiniBand is one such example that is still in process. Parallel computing techniques may be another as the need for parallelization begins to be felt far beyond HPC due to the tailing off of clock speed increases and the emergence of multi-core CPUs.

Virtualization is an example of the opposite. This is a trend that is taking the enterprise computing markets by storm. The HPC community to date has not been interested in a technology that, in its view, simply adds another layer of “stuff” between the hardware and its applications, reducing performance. I would argue that virtualization is coming and will be ubiquitous, and that the HPC community needs to actively engage in order to 1) influence virtualization technology to align it with HPC needs, and 2) find ways in which virtualization can be used to advantage within the HPC community rather than simply being victimized by it. It is coming, so let’s embrace it.

The two immediate challenges are both performance related: base performance of applications in a virtualized environment and virtualizing the InfiniBand (or Ethernet) interconnect while still being able to deliver high performance on distributed applications, including both compute and storage. The first issue may not be a large one since most HPC codes are compute intensive and such code should run at nearly full speed in a virtualized environment. And early research on virtualized IB, for example by DK Panda‘s Network-based Computing Laboratory at OSU, has shown promising results. In addition, the PCI-IOV (IO Virtualization) standard will add hardware support for PCI virtualization that should help achieve high performance.

What about the potential benefits of virtualization for HPC? I can think of several possibilities:

  • Coupling live migration [PDF] with fault management to dynamically shift a running guest OS instance off of a failing node, thereby avoiding application interruptions (a rough sketch follows after this list).
  • Using the clean interface between hypervisor and Guest OS instances to perform checkpointing of a guest OS instance (or many instances in the case of an MPI job) rather than attempting to checkpoint individual processes within an OS instance. The HPC community has tried for many years to create the latter capability, but there are always limitations. Perhaps we can do a more complete job by working at this lower level.
  • Virtualization can enable higher utilization of available resources (those beefy nodes again) while maintaining a security and failure barrier between applications and users. This is ideal in academic or other service environments in which multi-tenancy is an issue, for example in cases where academic users and industry partners with privacy or security concerns share the same compute resources.
  • Virtualization can also be used to decrease the administrative burden on system administration staff and allow it to be more responsive to the needs of its user population. For example, a Solaris or Linux-based HPC installation could easily allow virtualized Windows-based ISV applications to be run dynamically on its systems without having to permanently maintain Windows systems in the environment.
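To make the live-migration bullet a little more concrete, here is a rough sketch of what triggering a migration from a fault-management handler might look like. I’ve used the libvirt C API purely for illustration; the host URI and domain name are invented placeholders, and nothing here is specific to any particular hypervisor or to what any vendor has actually shipped.

```c
/* Hypothetical sketch: on a fault-management warning for this node,
 * live-migrate a guest to a healthy node using libvirt. The host URI
 * and domain name below are invented placeholders. */
#include <libvirt/libvirt.h>
#include <stdio.h>

int migrate_off_failing_node(const char *domain_name,
                             const char *healthy_host_uri)
{
    virConnectPtr src = virConnectOpen(NULL);             /* local hypervisor */
    virConnectPtr dst = virConnectOpen(healthy_host_uri); /* e.g. "xen+ssh://node42/" */
    if (src == NULL || dst == NULL) {
        fprintf(stderr, "could not connect to hypervisor(s)\n");
        return -1;
    }

    virDomainPtr dom = virDomainLookupByName(src, domain_name);
    if (dom == NULL) {
        fprintf(stderr, "no such domain: %s\n", domain_name);
        return -1;
    }

    /* VIR_MIGRATE_LIVE keeps the guest (and any MPI processes inside
     * it) running while its memory is copied to the destination. */
    virDomainPtr moved = virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE,
                                          NULL, NULL, 0);
    if (moved == NULL) {
        fprintf(stderr, "migration failed\n");
        return -1;
    }

    virDomainFree(moved);
    virDomainFree(dom);
    virConnectClose(dst);
    virConnectClose(src);
    return 0;
}

int main(void)
{
    /* In practice this would be triggered by a fault-management event
     * (e.g. a correctable-error threshold being crossed), not by hand. */
    return migrate_off_failing_node("hpc-guest-17", "xen+ssh://node42/") ? 1 : 0;
}
```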

The key point is that virtualization is coming and we as a community should find the best ways of using the technology to our advantage. The above are some ideas; I’d like to hear others.

[os panel slide 7]

The power and cooling challenges are straightforward; the solutions are not. We must deliver effective power management capabilities for compute, storage, and interconnect that support high performance, but also deliver significant improvements over current practice. To do this effectively requires work across the entire hardware and software stack. Processor and system design. Operating system design. System management framework design. And it will require a very comprehensive review of coding practices at all levels of the stack. Polling is not eco-efficient, for example.

Power management issues are yet another reason why operating systems become more important to HPC as capabilities developed primarily for the much larger enterprise computing markets gain relevance for HPC customers.

[os panel slide 8]

Here I sketch a little of Sun’s approach with respect to OSes for HPC. First, we will offer both Linux- and Solaris-based HPC solutions, including a full stack of HPC software on top of the base operating systems. We recognize quite clearly the position currently held by Linux in HPC and see no reason why we should not be a preferred provider of such systems. At the same time, we believe there is a strong value proposition for Solaris in HPC and that we can deliver performance along with an array of increasingly relevant enterprise-derived capabilities that will benefit the HPC community. We also realize it is incumbent upon us to prove this point to you, and we intend to do so.

I will finish by commenting on one bullet on this final slide. For the other products and technologies, I will defer to future blog posts. The item I want to end with is Project Indiana and OpenSolaris due to its relevance to HPC customers.

In 2005, Sun created an open source and open development effort based on the source code for the Solaris Operating System. Called OpenSolaris, the community now numbers well over 75,000 members and it continues to grow.

Project Indiana is an OpenSolaris project whose goal is to produce OpenSolaris binary distributions that will be made freely available to anyone with optional for-fee support available from Sun. An important part of this project is a modernization effort that moves Solaris to a network-based package management system, updates many of the open-source utilities that are included in the distro, and adds open-source programs and utilities that are commonly expected to be present. To those familiar with Linux, the OpenSolaris user experience should become much more familiar as these changes roll out. In my view, this was a necessary and long-overdue step towards lowering the barrier for Linux (and other) users, enabling them to more easily step into the Solaris environment and benefit from the many innovations we’ve introduced (see slide for some examples.)

In addition to OpenSolaris binary distros, you will see other derivative distros appearing. In particular, we are working to define an OpenSolaris-based distro that will include a full HPC software stack and will address both developers and deployers. This effort has been running within Sun for a while and will soon transition to an OpenSolaris project so we can more easily solicit community involvement. This Solaris HPC distro is meant to complement similar work being done by a Linux-focused engineering team within our expanded Lustre group, which is also doing its work in the open and also encourages community involvement.

There was some grumbling at the HPC User Forum about the general Linux community and its lack of focus or interest in HPC. While clearly there have been some successes (for example, some of the Linux scaling work done by SGI), there is frustration. One specific example mentioned was the difficulty in getting InfiniBand support into the Linux kernel. My comment on that? We don’t need to ask Linus’ permission to put HPC-enabling features into Solaris. In fact, with our CEO making it very clear that HPC is one of the top three strategic focus areas for Sun [PDF, 2MB], we welcome HPC community involvement in OpenSolaris. It’s free, it’s open, and we want Solaris to be your operating system of choice for HPC.

Microsoft Releases Windows Explodé

April 26, 2008

Last week, Microsoft quietly released a new Vista feature called Windows Assured Shutdown (WAS) that allows users to press F9 on their keyboard and force a guaranteed, controlled and immediate shutdown of Vista. Microsoft suggests this feature be used in situations in which the system appears hung or otherwise unresponsive and has apparently implemented this in reaction to numerous instances of data loss due to frustrated users simply powering off their machines when they appear to hang.

Of course, the Apple community immediately leaped on this and dubbed the feature Windows Explodé, a riff on Apple’s Exposé feature in Mac OS X. More details on WAS here.

HPC User Forum: Interconnect Panel

April 21, 2008

Last week I attended the IDC HPC User Forum meeting in Norfolk, Virginia. The meeting brings together HPC users and vendors for two days of presentation and discussion about all things HPC.

The theme of this meeting (26th in the series) was CFD — Computational Fluid Dynamics. In addition to numerous talks on various aspects of CFD, there were a series of short vendor talks and four panel sessions with both industry and user members. I sat on both the interconnect and operating system panels and prepared a very short slide set for each session. This blog entry covers the interconnect session.

Because the organizers opted against panel member presentations at the last minute, these slides were not actually shown at the conference, though they will be included in the conference proceedings. I use them here to highlight some of the main points I made during the panel discussion.

[interconnect panel slide #1]

I participated on the interconnect panel as a stand-in for David Caplan who was unable to attend the conference due to a late-breaking conflict. Thanks to David for supplying a slide deck from which I extracted a few slides for my own use. Everything else is my fault. 🙂

Fellow members of the interconnect panel included Ron Brightwell from Sandia National Lab, Patrick Geoffray from Myricom, Christian Bell from QLogic, Gilad Shainer from Mellanox, Larry Stewart from SiCortex, Richard Walsh from IDC, and (I believe; we were at separate tables and it was hard to see) Nick Nystrom from PSC and Pete Wyckoff from OSC.

The framing questions suggested by Richard Walsh prior to the meeting were used to guide the panel’s discussion. The five areas were:

  • competing technologies — five year look ahead
  • unifying the interconnect software stack
  • interconnects and parallel programming models
  • interconnect transceiver and media trends
  • performance analysis tools

I had little to say about media and transceivers, though we all seemed to agree that optical was the future direction for interconnects. There was some discussion earlier in the conference about future optical connectors and approaches (Luxtera and Hitachi Cable both gave interesting presentations) and this was not covered further in the panel.

As a whole, the panel did not have much useful to say about performance analysis. It was clear from comments that the users want a toolset that allows them to understand at the application level how well they are using their underlying interconnect fabric for computational and storage transfers. While much low-level data can be extracted from current systems, from the HCAs, and from fabric switches, there seemed to be general agreement that the vendor community has not created any facility that would allow these data to be aggregated and displayed in a manner that would be intuitive and useful to the application programmer or end user.

[interconnect panel slide #2]

Moving on to my first slide, there was little disagreement that both Ethernet and InfiniBand will continue as important HPC interconnects. It was widely agreed that 10 GbE in particular will gain broader acceptance as it becomes less expensive, and several attendees suggested that this begins to occur when vendors integrate 10 GbE on their system motherboards rather than relying on separate plug-in network interface cards. As you will see on a later slide, the fact that Sun is already integrating 10 GbE on board, and in some cases on chip, in current systems was one of my themes during this panel.

InfiniBand (IB) has both a significant latency and bandwidth advantage over 10 GbE and it has rapidly gained favor within the HPC community. The fact that we are seeing penetration of IB into Financial Services is especially interesting because that community (which we do consider “HPC” from a computational perspective) is beginning to adopt IB not as an enabler of scalable distributed computing (e.g. MPI), but rather as an enabler of the higher messaging rates needed for distributed trading applications. In addition, we see an increased interest in InfiniBand for Oracle RAC performance using the RDS protocol.

The HPC community benefits any time its requirements or its underlying technologies begin to see significant adoption in the wider enterprise computing space. As the much larger enterprise market gets on the bandwagon, the ecosystem grows and strengthens as vendors invest more and as more enterprise customers embrace related products.

While I declared myself a fan of iWARP (RDMA over Ethernet) as a way of broadening the appeal of Ethernet for latency-sensitive applications and for reducing the load on host CPUs, there was a bit of controversy in this area among panel members. The issue arose when the QLogic and Myricom representatives expressed disapproval of some comments made by the Mellanox representative related to OpenFabrics. The latter described the OpenFabrics effort as interconnect independent, while the former were more of a mind that iWARP, which had been united with the OpenIB effort to form what is now called OpenFabrics, was not well integrated in that certain API semantics were different between IB and Ethernet. I need to learn more from Ted Kim, our resident iWARP expert.

Speaking of OpenFabrics, a few points. First, Sun is a member of the OpenFabrics Alliance and as offerors of Linux-based HPC solutions we will support the OFED stack. Second, Sun does fully support InfiniBand on Solaris as an essential part of our HPC on Solaris program. We do this by implementing the important OpenFabrics APIs on Solaris and even sharing code for some of the upper-level protocols.

There was also some additional disagreement among the same three panel members regarding the OpenFabrics APIs and the semantics of InfiniBand generally. Both the QLogic and Myricom representatives felt strongly that extra cruft had accrued in both standards that is unnecessary for the efficient support of HPC applications. We did not have enough time to delve into the specifics of their disagreement, but I would certainly like to hear more at some point.

[interconnect panel slide #3]
[interconnect panel slide #4]

The above two slides (courtesy of David Caplan) illustrate several points. First, they show that attention to seemingly small details can yield very large benefits at large scale. In this case, the observation that we could run three separate 4x InfiniBand connections within a single 12x InfiniBand cable allowed us to build the 500 TFLOP Ranger system at TACC in Austin with essentially a single very large central InfiniBand switch (in actuality, two). The system used 1/6 the cables required for a conventional installation of this size and many fewer switching elements. And, second, to build successful, effective HPC solutions one must view the problem at a system level. In this case, Sun’s Constellation system architecture leverages the cabling (and connector) innovation and combines it with both the ultra-dense switch product and an ultra-dense blade chassis with 48 four-socket blades to create a nicely-packaged solution that scales into the petaflop range.

[interconnect panel slide #5]

I took this photo the day we announced our two latest CMT systems, the T5240 and T5140. I included it in the deck for several reasons. First, it allowed me to make the point that Sun has very aggressively embraced 10 GbE, putting it on the motherboard in these systems. Second, it is another example of system-level thinking in that we have carefully impedance-matched these 10 GbE capabilities to our multi-core and multi-threaded processors by supporting a large number of independent DMA channels in the 10 GbE hardware. In other words, as processors continue to get more parallel (more threads, more cores) one must keep everything in balance. Which is why you see so many DMA channels here and why we greatly increased the off-chip memory bandwidth (about 50 GB/s) and coherency bandwidth (about the same) of these UltraSPARC T2plus based systems over that available in commodity systems. And it isn’t just about the hardware: our new Crossbow networking stack in OpenSolaris allows application workloads to benefit from the underlying throughput and parallelism available in all this new hardware. At the end of the day, it’s about the system.

[interconnect panel slide #6]

This last slide again emphasizes the value of being able to optimize at a system level. In this case, we integrated dual 10 GbE interfaces on-chip in our UltraSPARC T2-based systems because it made sense to tightly couple the networking resources with computation for best latency and performance. You might be confused if you’ve noticed I said earlier that our very latest CMT systems have on-board 10 GbE, but not on-chip 10 GbE. This illustrates yet another advantage held by true system vendors: the ability to shift optimization points at will to add value for customers. In this case, we opted to shift the 10 GbE interfaces off-chip in the UltraSPARC T2plus to make room for the 50 GB/s of coherency logic we added to allow two T2plus sockets to be run together as a single, large 128-thread, 16-FPU system.

Another system vendor advantage is illustrated by our involvement in the DARPA-funded program UNIC, which stands for Ultraperformance Nanophotonic Intrachip Communications. This project seeks to create macro-chip assemblies that tie multiple physical pieces of silicon together using extremely high bandwidth and low latency optical connections. Think of the macro-chip approach as a way of creating very large scale silicon with the economics of small-scale silicon. The project is exploring the physical-level technologies that might support a bandwidth increase of one or two orders of magnitude, well into the multi-terabyte-per-second range. From my HPC perspective, it remains to be seen what protocol would run across these links: coherent links lead to macro-chips; non-coherent or perhaps semi-coherent links lead to some very interesting clustering capabilities with some truly revolutionary interconnect characteristics. How we use this technology once it has been developed remains to be seen, but the possibilities are all extremely interesting.

To wrap up this overly-long entry, I’ll close with the observation that one of the framing questions used for the panel session was stated backwards. In essence, the question asked what programming model changes for HPC would be engendered by evolution in the interconnect space. As I pointed out during the session, the semantics of the interconnect should be dictated by the requirements of the programming models and not the other way ’round. This led to a good, but frustrating, discussion about future programming models for HPC.

It was a good topic because it is an important topic that needs to be discussed. While there was broad agreement that MPI as a programming model will be used for HPC for a very long time, it is also clear that new models should be examined. New models are needed to 1) help new entrants into HPC, as well as others, more easily take advantage of multi-core and multi-threaded processors, 2) create scalable, high-performance applications, and 3) deal with the unavoidable hardware failures in underlying cluster hardware. It is, for example, ironic in the extreme that the HPC community has embraced one of the most brittle programming models available (i.e. MPI) and is attempting to use it to scale applications into the petascale range and beyond. There may be lessons to be learned from the enterprise computing community, which has long embedded reliability in its middleware layers to allow e-commerce sites, etc., to ride through failures with no service interruptions. More recently, it has been doing so with increasingly large-scale computational and storage infrastructures. On the software front, Google’s MapReduce, the open-source Hadoop project, Pig, etc. have all embraced resiliency as an important characteristic of their scalable application architectures. While I am not suggesting that these particular approaches be adopted for HPC with its very different algorithmic requirements, I do believe there are lessons to be learned. At some point soon I believe resiliency will become more important to HPC than peak performance. This was discussed in more detail during the operating system panel.
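As a small illustration of that brittleness (my own toy example, not something presented at the forum): by default an MPI library aborts the entire job when any operation fails, and even asking it to return error codes instead leaves recovery entirely up to the application, with no portable way to repair a communicator and carry on.

```c
/* Sketch of MPI's default failure behavior. With MPI_ERRORS_ARE_FATAL
 * (the default), any error on the communicator aborts every rank in
 * the job. Switching to MPI_ERRORS_RETURN merely reports the error;
 * the standard gives the application no portable way to repair the
 * communicator and continue. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask the library to return error codes instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A deliberately invalid send (destination rank out of range) to
     * show what error reporting looks like; a lost node would be no
     * more recoverable under the standard. */
    int payload = 42;
    int rc = MPI_Send(&payload, 1, MPI_INT, size, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: send failed: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}
```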

It was also a frustrating topic because as a vendor I would like some indication from the community as to promising directions to pursue. As with all vendors, we have limited resources and would prefer to place our bets in fewer places with increased chances for success. Are the PGAS languages (e.g. UPC, CaF) the right approach? Are there others? Is there only one answer, or are the needs of the commercial HPC community different from, for example, those of the highest-end of the HPC computing spectrum? I was accused at the meeting of saying the vendors weren’t going to do anything to improve the situation. Instead, my point was that we prefer to make data-driven decisions where possible, but the community isn’t really giving us the data. In the meantime, UPC does have some uptake with the intelligence community and perhaps more broadly. And Cray, IBM, and Sun have all defined new languages for HPC that aim to deliver good, scalable performance with higher productivity than existing languages. Where this will go, I do not know. But go it must…

My next entry will describe the Operating System Panel.

My Life as a Banker: An Update

April 13, 2008

[kiva logo]

As I’ve described before, I signed up almost a year ago with Kiva to fund micro-loans for people in need. I started with two loans totaling $200 and have now expanded that to eight active loans totaling $475. I mention the amounts to show that a modest investment can really make a difference to the people seeking these loans. I’m funding eight loans because I really like the micro-loan concept. You can start with a single loan and much less money and still make a difference.

You can view my loan portfolio and find out more about the businesses and people I’ve lent to here. While there is always a risk of default, the current default rate on my loans is zero, as is the delinquency rate.

As loans are repaid into my Kiva account, I retarget the money to other outstanding loan requests, helping other people. Having made the decision to put my money into the Kiva system, I like the idea that I can use it indefinitely to help more people over time.

If you would like to learn more about Kiva, go here. For more information about microfinance, go here.

Pecha Kucha Boston

April 11, 2008

Twenty seconds per slide. Twenty slides. Six minutes and forty seconds per presentation. This is the essence of Pecha Kucha, a presentation / performance style that originated in the Tokyo design community in 2003, and which has since spread across the world as a series of Pecha Kucha Nights, most often featuring fourteen speakers per event.

Pecha Kucha Night Boston Volume #4 was held last night at the Harvard Graduate School of Design in Cambridge. The event was free and open to the public. There were seventeen speakers, including several who were visiting from Iran. Talks covered a range of topics including slums, Mayan roadways, Icelandic urban development, aging, and a meditation on the similarities and differences between the garden city of Isfahan and Newark, New Jersey. The presentation styles varied from earnest efforts to speak as quickly as possible in the allotted time, to a Zen-like display of a series of design images accompanied by 5 seconds of description and 15 seconds of silence per slide. The latter worked better than the former and the best speakers presented at a casual rate that matched the cadence of their slides without seeming effort. These were the presentations that you know must have been practiced in advance. And that really seems to be the point: Think about what you want to say, pare it down, and then match visuals to the message. You can’t do a good Pecha Kucha presentation with conventional, text-bloated slides.

This was my first Pecha Kucha event and I observed that it worked well as an evening event with low lighting, some nice ambient music as background, liquid refreshments and periodic intermissions for mingling. I would definitely attend future events. I also think this would be a fun session to include in Sun’s next internal technology conference, Innovation@Sun 2008.

Pecha Kucha Night Boston Volume #5 will be held in June. See http://www.pecha-kucha.org/ for more information about events in Boston and other cities.

Maramba (T5240) Unplugged: A Look Under the Hood

April 10, 2008

Aside from being interesting machines for HPC, our latest systems are beautiful to behold. At least to an engineer, I suppose. We had a fun internal launch event (free ice cream!) on Sun’s Burlington campus yesterday where we had a chance to look inside the boxes and ask questions of the hardware guys who designed these systems. Here is my annotated guide to the layout of a T5240 system.

[annotated T5240 system]

A. Software engineers examining the Sun SPARC Enterprise T5240. Note the uniforms. Sometimes I think the hardware guys are better dressers than the software folks.

B. Front section of the machine. Disks, DVD, etc, go here.

C. Ten hot-swappable fans with room for two more. The system will compensate for a bad fan and continue to run. The system also automatically adjusts the fan speeds to control the amount of cooling needed, depending on demand. For example, when the mezzanine (K) is in place, more cooling is needed. If it isn’t, why spend the energy spinning the fans faster than you need?

D. This is a plexiglass air plenum, which is hard to see in the photo because it is clear, though the top part near the fans does have some white-background diagrams on it. The plenum sits at an angle near the top of the photo and then runs horizontally over the memory DIMMs (F). It captures the air coming out of the fans and forces it past the DIMMs for efficient cooling. The plenum is removed when installing the mezzanine.

E. These are the two heat sinks covering the two UltraSPARC T2plus processors. The two procs are linked with four point-to-point coherency links which are also buried under that area of the board.

F. The memory DIMMs. On the motherboard you can see there are eight DIMM slots per socket for a maximum of 64 GB of memory with current FBDIMMs. But see (K) if 64 GB isn’t enough memory for you.

G. There are four connectors shown here, with the rightmost connector filled with an air controller, as the others would be in an operating (running) system. The mezzanine assembly (K) mates to the motherboard here.

H. PCIe bridge chips are here in this area, supporting external IO connectivity.

I. This is the heat sink for Neptune, Sun’s multithreaded 10 GbE technology. It supports two 10 GbE links or some number of slower links and has a large number of DMA engines to offer a good impedance match with our multithreaded processors. On the last generation T2 processors, the Neptune technology was integrated onto the T2 silicon. In T2plus, we moved the 10 GbE off-chip to make room for the coherency logic that allows us to create a coherent 128-thread system with two sockets. As a systems company, we can make tradeoffs like this at will.

J. This is where the service processor for the system is located, including associated SDRAM memory. The software guys got all tingly when they saw where all their ILOM code actually runs. ILOM is Sun’s unified service processor that provides lights out management (LOM).

K. And, finally, the mezzanine assembly. This plugs into the four sockets described above (G) and allows the T5240 to be expanded to 128 GB…all in a two rack-unit enclosure. The density is incredible.

Kicking Butt with OpenMP: The Power of CMT

April 10, 2008

[t5240 internal shot]

Yesterday we launched our 3rd generation of multicore / multithreaded SPARC machines and again the systems should turn some heads in the HPC world. Last October, we introduced our first single-socket UltraSPARC T2 based systems with 64 hardware threads and eight floating point units. I blogged about the HPC aspects here. These systems showed great promise for HPC as illustrated in this analysis done by RWTH Aachen.

We have now followed that with two new systems that offer up to 128 hardware threads and 16 floating point units per node. In one or two rack units. With up to 128 GB of memory in the 2U system. These systems, the Sun SPARC Enterprise T5140 and T5240 Servers, both contain two UltraSPARC T2plus processors, which maintain coherency across four point-to-point links with a combined bandwidth of 50 GB per second. Further details are available in the Server Architecture whitepaper.

An HPC person might wonder how a system with a measly clock rate of 1.4 GHz and 128 hardware threads spread across two sockets might perform against a more conventional two-socket system running at over 4 GHz. Well, wonder no more. The OpenMP threaded parallelization model is a great way to parallelize codes on shared memory machines and a good illustration of the value proposition of our throughput computing approach. In short, we kick butt with these new systems on SPEComp and have established a new world record for two-socket systems. Details here.
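For readers new to OpenMP, here is a toy sketch, not one of the SPEComp codes, showing the basic model: a single pragma spreads a loop across however many hardware threads the OS presents, up to 128 of them on a T5240. The array size is an arbitrary placeholder.

```c
/* Toy OpenMP example: a parallel reduction across all available
 * hardware threads. Compile with Sun Studio: cc -xopenmp toy.c,
 * or with gcc: gcc -fopenmp toy.c. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *a = malloc(N * sizeof(double));
    if (a == NULL)
        return 1;

    double sum = 0.0;

    /* One pragma spreads the loop across the hardware threads the
     * OS presents; on a T5240 that can be up to 128. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;   /* placeholder work */
        sum += a[i];
    }

    printf("threads used: %d, sum = %g\n", omp_get_max_threads(), sum);
    free(a);
    return 0;
}
```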

Oh, and we also just released world record two-socket numbers for both SPECint_rate2006 and SPECfp_rate2006 using these new boxes. Check out the details here.

Apple Software Does it Again

April 2, 2008

Someone recently asked on Sun’s internal Mac mailing list for advice on how to merge two PDF files together. Many suggestions were offered, all of which involved using some third-party free or for-fee utility to do the job.

Finally, someone mentioned that if you open both PDF files with Preview and turn on the thumbnail view, you can simply drag the pages from one file into the other and then save the merged file to disk. A neat solution that is also far more general than a simple merge utility.

Yet another instance where Apple software surprises and delights me…and helps me remain patient, waiting for a fix, as my MacBook Pro continues to have wake-from-sleep problems (as of the recent, post-10.5.2 graphics update) that require me to power cycle my machine to reset it. I cringe every time I do it. Read more about the problem here.