HPC Consortium: University of Oslo

The last customer talk at the Sun HPC Consortium here in Dresden was given by Jostein Sundet, Program Director for Research Computing Services and Adjunct Professor in Atmospheric Sciences at the University of Oslo. He spoke about the University's HPC requirements and infrastructure, and about how their new Sun Constellation System enables both capability and capacity computing for the University and its customers.

The University of Oslo is the largest university in Norway. It is also now home to the largest InfiniBand switch in the world — the Sun Datacenter Switch 3456 (a.k.a. the Magnum switch), which is currently used to connect 96 Sun Blade 6048 systems into a Sun Constellation System. In addition, they have seven Sun Fire X4600 8-way systems with up to 256 GB of memory, 450 Sun Fire X2200 M2 nodes, each with two quad-core AMD processors, 96 Dell 1425 systems, and about 1 petabyte of storage.

The user base includes about 250 active users, spanning national (advanced) users and local University of Oslo users who may not be highly skilled in computing and who come from a wide range of disciplines, including politics. The site also acts as a backup site for the Norwegian MET Office's operational weather forecasts and functions as a Tier-1 and Tier-2 facility for CERN, while also participating in the National GRID.

The operational goals of the facility are to allow easy access for all users, both the high-end (capability) users at the top of the pyramid and the broadening base of capacity users. To support this, the University acquired its Sun Constellation System, a homogeneous InfiniBand fat-tree based HPC cluster, and coupled it with advanced job scheduling capabilities that support backfilling, suspend/resume/migrate, and resource pooling.

Backfilling is critical for increasing system utilization in a mixed-use environment. Introducing this capability raised utilization from about 60% to over 80%, since smaller jobs can now be scheduled ahead of larger jobs that are still waiting for resources. It has also allowed users to temporarily exceed their job quotas; in one case, a user was able to run 1263 concurrent jobs, well above their usual limit of 384.
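The talk didn't say which scheduler Titan uses or how its backfill is configured, but the basic idea is easy to sketch. The toy Python scheduler below is purely illustrative (the job names, core counts, and runtimes are made up, and this is not the site's actual implementation): it walks the queue in order, computes a reserved start time for a blocked job at the head, and lets smaller jobs jump ahead only if they would finish before that reservation.

```python
# Toy "conservative backfill" scheduler: jobs run in queue order, but a shorter
# job may start early if it fits in the idle cores AND will finish before the
# reserved start time of the blocked job at the head of the queue.
from collections import namedtuple

Job = namedtuple("Job", "name cores runtime")

def backfill_schedule(jobs, total_cores):
    """Return (job name, start time) pairs for a simple backfilling scheduler."""
    now, free = 0.0, total_cores
    running = []                      # list of (finish_time, cores)
    queue = list(jobs)
    schedule = []

    while queue:
        head = queue[0]
        if head.cores <= free:        # head of queue fits: start it now
            queue.pop(0)
            running.append((now + head.runtime, head.cores))
            free -= head.cores
            schedule.append((head.name, now))
            continue

        # Head is blocked: find the earliest time enough cores will be free
        # (its "reservation"), by walking future job completions in order.
        cores_free, reservation = free, now
        for finish, cores in sorted(running):
            cores_free += cores
            reservation = finish
            if cores_free >= head.cores:
                break

        # Backfill: start the first later job that fits now and would finish
        # before the reservation, so the head job is not delayed.
        for job in queue[1:]:
            if job.cores <= free and now + job.runtime <= reservation:
                queue.remove(job)
                running.append((now + job.runtime, job.cores))
                free -= job.cores
                schedule.append((job.name, now))
                break
        else:
            # Nothing can be backfilled: advance time to the next completion.
            finish, cores = min(running)
            running.remove((finish, cores))
            now, free = finish, free + cores

    return schedule

if __name__ == "__main__":
    # Hypothetical jobs: "small" backfills ahead of "big" while "big" waits.
    jobs = [Job("medium", 48, 3.0), Job("big", 32, 4.0), Job("small", 8, 1.0)]
    for name, start in backfill_schedule(jobs, total_cores=64):
        print(f"{name:6s} starts at t={start}")
```

In this example "big" cannot start until "medium" finishes, so "small" is scheduled ahead of it without delaying it — the same effect that lets a mixed workload keep the machine busy.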

Titan, the Sun Constellation System used for both capacity and capability workloads, is a 4000-core cluster built with quad-core AMD processors and using the Sun Datacenter Switch 3456 as the centralized switch providing InfiniBand connectivity for the system. The parallel file system is centralized and implemented with IBM's GPFS.

Initial performance measurements demonstrated 965 MB/sec between nodes (point-to-point) and 780 MB/sec per node when performing a 96-node all-to-all MPI communication. Latency was measured at about 2 microseconds.
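The talk didn't say which benchmark produced those numbers, but the point-to-point figures are the kind of thing a simple MPI ping-pong reports. Here is a minimal sketch using mpi4py, for illustration only; the message sizes and iteration counts are arbitrary choices, not the ones used on Titan.

```python
# Minimal MPI ping-pong between two ranks, reporting one-way latency and
# bandwidth. Illustrative only; run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly 2 ranks"

def one_way_time(nbytes, iters):
    """Average one-way message time in seconds for messages of nbytes."""
    buf = np.zeros(nbytes, dtype=np.uint8)
    t0 = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    return (MPI.Wtime() - t0) / (2 * iters)   # halve the round-trip time

lat = one_way_time(8, 1000)                   # tiny message -> latency
big = 4 * 1024 * 1024                         # 4 MB message -> bandwidth
bw = big / one_way_time(big, 100)

if rank == 0:
    print(f"latency  : {lat * 1e6:.2f} microseconds")
    print(f"bandwidth: {bw / 1e6:.1f} MB/s")
```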

The primary challenges were some early firmware issues, which Sun resolved, and connector and cable problems, which Karl Schulz also mentioned in an earlier Consortium talk.
