Archive for April, 2009

Oracle is No IBM

April 20, 2009

I see some very exciting synergies in Oracle’s acquisition of Sun, many of which are covered in this joint presentation and also this FAQ on the Oracle website.

Very unlike the rumored IBM acquisition of Sun, which I think would have gone like this.

Tickless Clock for OpenSolaris

April 15, 2009

I’ve been talking a lot to people about the convergence we see happening between Enterprise and HPC IT requirements and how developments in each area can bring real benefits to the other. I should probably do an entire blog entry on specific aspects of this convergence, but for now I’d like to talk about the Tickless Clock OpenSolaris project.

Tickless kernel architectures will be familiar to HPC experts as one method for reducing application jitter on large clusters. For those not familiar with the issue, “jitter” refers to variability in the running time of application code due to underlying kernel activity, daemons, and other stray workloads. Since MPI programs typically run in alternating compute and communication phases and develop a natural synchronization as they do so, applications can be slowed down significantly when some nodes arrive late at these synchronization points. The larger the MPI job, the more likely it is that this type of noise will cause a problem. Measurements have shown surprisingly large slowdowns attributable to jitter.
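To make that concrete, here is a minimal sketch of the alternating compute/communicate pattern, written in C with MPI. The compute step and iteration counts are made up for illustration; the point is that every rank waits at the collective, so the slowest, most-jittered node sets the pace for the entire job:

    /* jitter_demo.c -- minimal sketch of an MPI compute/communicate loop.
     * Compile with something like: mpicc -std=c99 jitter_demo.c -o jitter_demo
     * The "compute" step is a stand-in for real application work. */
    #include <mpi.h>
    #include <stdio.h>

    static double compute_step(int n)
    {
        /* Stand-in for a compute phase; any kernel interruption here
         * delays this rank's arrival at the collective below. */
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += (double)i * 1e-9;
        return sum;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 100; step++) {
            double local = compute_step(1000000);
            double global;
            /* All ranks synchronize here: the last rank to arrive
             * (e.g. one delayed by OS noise) determines when everyone
             * can proceed to the next compute phase. */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        if (rank == 0)
            printf("done\n");
        MPI_Finalize();
        return 0;
    }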

Jitter can be lessened by reducing the number of daemons running on a system, by turning off non-essential kernel services, and so on. Even with these changes, however, other sources of jitter remain. One notable source is the clock interrupt used in virtually all current operating systems. This interrupt, which fires 100 times per second, is used to perform the periodic housekeeping chores required by the OS, and it is a known contributor to jitter. It is for this reason that IBM implemented a tickless kernel on its Blue Gene systems to reduce application jitter.

Sun is starting a Tickless Clock project in OpenSolaris to completely remove the clock interrupt and switch the kernel to an event-based architecture. While I expect this will be very useful for HPC users of OpenSolaris, HPC is not the primary motivator for the project.

As you’ll hear in the video interview with Eric Saxe, Senior Staff Engineer in Sun’s Kernel Engineering group, the primary reasons he is looking at Tickless Clock are power management and virtualization. For power management, it is important that when the system is idle, it really IS idle: waking up 100 times per second to do nothing wastes power and prevents the system from entering deeper power-saving states. For virtualization, since multiple OS instances may share the same physical server resources, it is important that idle guest OSes really do stay idle. Again, waking up 100 times per second to do nothing steals cycles from active guest OS instances, reducing performance in a virtualized environment.
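As a rough illustration of the difference, and emphatically not OpenSolaris kernel code, here is a small user-level C sketch using POSIX timers that contrasts a fixed 100 Hz tick with arming a one-shot timer only when the next piece of work is actually due:

    /* tick_vs_tickless.c -- user-level sketch contrasting a periodic
     * 100 Hz tick with an event-based (one-shot) timer.  This only
     * illustrates the idea; it is not how a kernel implements it.
     * Compile with something like: cc tick_vs_tickless.c -lrt */
    #include <signal.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static void wakeup(int sig) { (void)sig; }

    int main(void)
    {
        signal(SIGALRM, wakeup);

        timer_t t;
        timer_create(CLOCK_REALTIME, NULL, &t);  /* delivers SIGALRM by default */

        /* Ticked: wake up every 10 ms whether or not there is work. */
        struct itimerspec ticked = {
            .it_interval = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },
            .it_value    = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },
        };
        timer_settime(t, 0, &ticked, NULL);
        for (int i = 0; i < 5; i++)
            pause();                 /* 5 wakeups in ~50 ms, work or not */

        /* Tickless: arm a one-shot timer for the next known event only;
         * an idle system with no pending events simply never wakes up. */
        struct itimerspec oneshot = {
            .it_interval = { 0, 0 },
            .it_value    = { .tv_sec = 2, .tv_nsec = 0 },  /* next event */
        };
        timer_settime(t, 0, &oneshot, NULL);
        pause();                     /* sleep until that one event fires */
        printf("single event-driven wakeup\n");

        timer_delete(t);
        return 0;
    }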

While I would argue that both power management and virtualization will become increasingly important to HPC users (more of that convergence thing), it is interesting to me to see these traditional enterprise issues stimulating new projects that will benefit both enterprise and HPC customers in the future.

Interested in getting involved with implementing a tickless architecture for OpenSolaris? The project page is here.

MacBook Pro: Many Screws, All Tiny

April 15, 2009

This weekend I upgraded my 2.2 GHz Core 2 Duo MacBook Pro’s internal hard disk from 160 GB to 320 GB, following the excellent instructions at iFixit. The two dodgy steps are freeing the top-case assembly and carefully prying loose the ribbon cables attached to the top of the existing hard drive. For the latter you definitely need some sort of thin plastic tool to work gently underneath the ribbons and detach them. To keep track of the variety of tiny screws encountered (both Phillips and Torx), I organized them according to their iFixit disassembly step.

I chose the 7200 RPM Hitachi Travelstar 320 GB 16 MB SATA drive (model HTS723232L9A360) as a replacement. Since my 160 GB disk was also a 7200 RPM drive, I didn’t experience the noticeable performance improvement some people have reported when moving to a faster disk. If you do an upgrade, you should definitely choose a 7200 RPM drive. I bought mine at Other World Computing.

Before replacing the drive, I did a full backup onto an external FireWire disk, then swapped my new Travelstar into the external drive enclosure and did another full, bootable backup onto the new disk. Both backups were done using Carbon Copy Cloner. After booting from the now externally attached Travelstar to verify that the backup had worked correctly, I removed the Travelstar from the external disk enclosure and inserted it into the MBP following the iFixit instructions. Once done, my machine booted with no problem. I now have lots of space for my growing collection of RAW photos, which eat disk space at an alarming clip.


You Say Nehalem, I Say Nehali

April 14, 2009

Let me say this first: NEHALEM, NEHALEM, NEHALEM. You can call it the Intel Xeon Processor 5500 Series if you like, but the HPC community has been whispering about Nehalem and rubbing its collective hands together in anticipation of Nehalem for what seems like years now. So, let’s talk about Nehalem.

Actually, let’s not. I’m guessing that between the collective might of the Intel PR machine, the traditional press, and the blogosphere, you are already well steeped in the details of this new Intel processor and why it excites the HPC community. Rather than talk about the processor per se, let’s talk instead about the more typical HPC scenario: not a single Nehalem, but piles of Nehali and how best to deploy them in an HPC cluster configuration.

Our HPC clustering approach is based on the Sun Constellation architecture, which we’ve deployed at TACC and other large-scale HPC sites around the world. These existing systems house compute nodes in the Sun Blade 6048 chassis, which holds four blade shelves, each with twelve blades, for a total of 48 blades per chassis. Constellation also includes matching InfiniBand infrastructure, including InfiniBand Network Express Modules (NEMs) and a range of InfiniBand switches that can be used to build petascale compute clusters. You can see several of the 82 TACC Ranger Constellation chassis in this photo, interspersed with (black) inline cooling units.

As part of our continued focus on HPC customer requirements, we’ve done something interesting with our new Nehalem-based Vayu blade (officially, the Sun Blade X6275 Server Module): each blade houses two separate nodes. Here is a photo of the Vayu blade:

Each of the nodes is a diskless, two-socket Nehalem system with 12 DDR3 DIMM slots per node (up to 96 GB per node) and on-board QDR InfiniBand. It’s actually not quite correct to call these nodes diskless, because each blade includes two Sun Flash Module slots (one per node), each providing up to 24 GB of flash storage through a SATA interface. I am sure our HPC customers will find interesting applications for what amounts to an ultra-fast disk.

Using Vayu blades, each Sun Constellation chassis can now support a total of 2 nodes/blade * 12 blades/shelf * 4 shelves/chassis = 96 nodes with a peak floating-point performance of about 9 TFLOPs. While a chassis can support up to 96 GB / node * 96 nodes = 9.2 TB of memory, there are some subtleties involved in optimizing and configuring memory for Nehalem systems so I recommend reading John Nerl’s blog entry for a detailed discussion of this topic.
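If you’re curious where the roughly 9 TFLOPs figure comes from, here is a back-of-the-envelope sketch. The 2.93 GHz clock and 4 double-precision FLOPs per core per cycle are my own assumptions for a representative Nehalem part, not specifications for this blade:

    /* chassis_peak.c -- back-of-the-envelope peak numbers for a fully
     * populated Constellation chassis of Vayu blades.  Clock speed and
     * flops/cycle are assumptions, not product specifications. */
    #include <stdio.h>

    int main(void)
    {
        const int nodes_per_blade     = 2;
        const int blades_per_shelf    = 12;
        const int shelves_per_chassis = 4;
        const int sockets_per_node    = 2;
        const int cores_per_socket    = 4;
        const double ghz              = 2.93; /* assumed clock */
        const double flops_per_cycle  = 4.0;  /* assumed DP FLOPs/core/cycle */
        const double gb_per_node      = 96.0;

        int nodes = nodes_per_blade * blades_per_shelf * shelves_per_chassis;
        double tflops = nodes * sockets_per_node * cores_per_socket
                        * ghz * flops_per_cycle / 1000.0;
        double tb_mem = nodes * gb_per_node / 1000.0;

        printf("nodes per chassis : %d\n", nodes);          /* 96 */
        printf("peak performance  : ~%.1f TFLOPs\n", tflops); /* ~9.0 */
        printf("max memory        : ~%.1f TB\n", tb_mem);     /* ~9.2 */
        return 0;
    }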

For a quick visual tour of Vayu, see the annotated photo below. The major components are: A = Nehalem 4-core processor; B = memory DIMMs; C = Mellanox ConnectX QDR InfiniBand ASIC; D = Tylersburg I/O chipset; E = Base Management Controller (BMC) / service processor (one per node); F = Sun Flash Modules (SATA, 24 GB per node). The connector at the top right supports two PCIe2 interfaces, two InfiniBand interfaces, and two GbE interfaces. The GbE logic is hiding under the BMC daughter board.

To accommodate this high-density blade approach we’ve developed a new QDR InfiniBand NEM (officially, the Sun Blade 6048 QDR IB Switched NEM), which is shown below. This Network Express Module plugs into a blade shelf’s midplane and forms the first level of the InfiniBand fabric in a cluster configuration. Specifically, the two on-board 36-port Mellanox QDR switch chips act as leaf switches from which larger configurations can be built. Of the 72 total switch ports available, 24 are used for the Vayu blades and 9 from each switch chip are used to interconnect the two switches, leaving a total of 72 – 24 – 2*9 = 30 QDR links available for off-shelf connections. These links leave the NEM through 10 physical connectors, each of which carries three QDR 4X links over a single cable. As we discussed when the first version of Constellation was released, aggregating 4X links into 12X cables brings significant customer benefits in reliability, density, and reduced cabling complexity. These cables can be connected to Constellation switches to form larger, tree-based fabrics, or they can be used in switchless configurations to build torus-based topologies, with the ten cables carrying X, Y, and Z traffic between shelves and across chassis, as mentioned, for example, here. The NEM also provides GbE connectivity for each of the 24 nodes in the blade shelf.
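For those who like to check the arithmetic, here is a trivial sketch of the port budget using the per-shelf counts given above:

    /* nem_ports.c -- port budget for the Sun Blade 6048 QDR IB Switched
     * NEM, using the counts described above. */
    #include <stdio.h>

    int main(void)
    {
        const int switch_chips       = 2;
        const int ports_per_chip     = 36;
        const int nodes_per_shelf    = 24; /* 12 Vayu blades x 2 nodes */
        const int interchip_per_chip = 9;  /* links joining the two chips */
        const int links_per_cable    = 3;  /* three 4X links per 12X cable */

        int total_ports = switch_chips * ports_per_chip;          /* 72 */
        int offshelf    = total_ports - nodes_per_shelf
                          - switch_chips * interchip_per_chip;    /* 30 */
        int cables      = offshelf / links_per_cable;             /* 10 */

        printf("total switch ports : %d\n", total_ports);
        printf("off-shelf QDR links: %d\n", offshelf);
        printf("12X cables needed  : %d\n", cables);
        return 0;
    }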

Looking back at the TACC photo, we can now double the compute density shown there with our newest Constellation-based systems using the new Vayu blade. Oh, and by the way, we can also remove those inline cooling units and pack those chassis side-by-side for another significant increase in compute density. I’ll leave how we accomplish that last bit for a future blog entry.

smart…boston style!

April 5, 2009

Seen on Route 128 in the Boston area. Wicked!

HPC in Second Life (and Second Life in HPC)

April 3, 2009

We held an HPC panel session yesterday in Second Life for Sun employees interested in learning more about HPC. Our speakers were Cheryl Martin, Director of HPC Marketing; Peter Bojanic, Director for Lustre; Mike Vildibill, Director of Sun’s Strategic Engagement Team (SET); and myself. We covered several aspects of HPC: what it is, why it is important, and how Sun views it from a business perspective. We also talked about some of the hardware and software technologies and products that are key enablers for HPC: Constellation, Lustre, MPI, etc.

As we were all in-world at the time, I thought it would be interesting to ponder whether Second Life itself could be described as “HPC” and whether we were in fact holding the HPC meeting within an HPC application. Having viewed this excellent SL Architecture talk given by Ian (Wilkes) Linden, VP of Systems Engineering at Linden Lab, I conclude that SL is definitely an HPC application. Consider the following information taken from Ian’s presentation.

As you can see, the geography of SL has been exploding in size over the last 5–6 years. As of December 2008, that geography is simulated by more than 15K instances of the SL simulator process, which, in addition to computing the physics of SL, also run an average of 30 million simultaneous server-side scripts that create additional aspects of the SL user experience. And look at the size of their dataset: 100 TB is very respectable from an HPC perspective. And a billion files! Many HPC sites are worrying about what will happen when they get to that level of scale, while Linden Lab is already dealing with it. I was surprised they aren’t using Lustre, since I assume their storage needs are exploding as well. But I digress.

The SL simulator described above would be familiar to any HPC programmer. It’s a big C++ code. The problem space (the geography of SL) has been decomposed into 256m x 256m chunks, each assigned to one instance of the simulator. Each simulator process runs on its own CPU core, and “adjacent” simulator instances exchange edge data to ensure consistency across sub-domain boundaries. And it’s a high-level physics simulation. Smells like HPC to me.
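That edge-data exchange is the same halo-exchange pattern found in most distributed physics codes. Here is a minimal sketch of the idea, written in C with MPI rather than Linden Lab’s actual C++ code, using a simplified 1D decomposition and a made-up region size:

    /* halo_exchange.c -- minimal sketch of the edge-data exchange pattern,
     * using a 1D decomposition: each rank owns a strip of the world and
     * swaps its boundary rows with its neighbors each step.  Region size
     * and field contents are made up for illustration. */
    #include <mpi.h>
    #include <string.h>

    #define WIDTH 256  /* cells across one region edge */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Local strip (rows 1..WIDTH) plus one ghost row above and below. */
        double field[WIDTH + 2][WIDTH];
        memset(field, 0, sizeof field);

        int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int step = 0; step < 10; step++) {
            /* Send my top edge up; receive my lower ghost row from below. */
            MPI_Sendrecv(field[1],         WIDTH, MPI_DOUBLE, up,   0,
                         field[WIDTH + 1], WIDTH, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Send my bottom edge down; receive my upper ghost row from above. */
            MPI_Sendrecv(field[WIDTH],     WIDTH, MPI_DOUBLE, down, 1,
                         field[0],         WIDTH, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* ... update interior cells (the "physics") here ... */
        }

        MPI_Finalize();
        return 0;
    }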