Teamwork Key to Meeting Tight Development Deadline

With more than 500 HPC deployments completed, the X-ISS team knows there is only one real constant – each cluster is unique and presents its own set of challenges. A recent deployment of a new 1400-node Dell cluster was no different. The client had a tight deadline: despite the cluster’s size, deployment, configuration and testing had to be completed in only seven weeks.

“We take great pride in the HPC expertise our personnel bring to every project,” said X-ISS President and CEO Deepak Khosla. “In this case, however, the ability of our expert staff to work together as a coordinated team was the key to bringing this project to fruition in the allotted time period.”

The client, a large manufacturer, selected Dell to design and build a new cluster with 1400 InfiniBand-connected compute nodes for deployment into an existing HPC environment. Although the new cluster would operate separately from the others, it would share existing storage capacity. Dell partnered with X-ISS to install and configure the cluster because of X-ISS’s reputation and skills with the technologies involved.

The assignment required validation testing of the cluster before handing it over in phases to internal client personnel tasked with installing applications. X-ISS was also asked to set up a separate testing environment in which the client could experiment with new graphical user interfaces for the platform cluster manager.

X-ISS assigned three network engineers to the project, each with extensive experience in Dell systems and HPC software stack technology. Although all three were rarely at the client site at the same time, they relied on best practices, standardized deployment templates and other tools to guide their work and ensure consistency, regardless of who was performing any given task.

Deploying the Cluster

Already a sophisticated user of HPC technology, the client specified xCAT as its cluster management system. Drawing on its extensive experience with this robust tool, X-ISS used custom scripts to standardize the configuration of all the nodes. The scripts also automated and accelerated the configuration process, cutting the time needed by approximately 90 percent compared with configuring 1400 nodes manually.

Ensuring consistent configuration of nodes is crucial to the efficiency of an InfiniBand-connected system. The advantage of InfiniBand is that it enables compute nodes to communicate at a much higher rate than otherwise possible. But for this data interconnect to operate with maximum efficiency, the nodes themselves must be tightly synchronized. Just one node running the wrong firmware version or loaded with different settings can slow down the entire cluster, which is why the xCAT scripts were so important.
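The kind of consistency check described above can be illustrated with a minimal Python sketch. It is not the script X-ISS used; the node names and firmware version strings are hypothetical, and the inventory is assumed to have already been collected into a dictionary by whatever tool the site prefers.

```python
from collections import Counter


def find_outliers(inventory):
    """Return the most common firmware version and the nodes that differ from it."""
    if not inventory:
        return "", {}
    counts = Counter(inventory.values())
    expected, _ = counts.most_common(1)[0]
    outliers = {node: ver for node, ver in inventory.items() if ver != expected}
    return expected, outliers


# Hypothetical inventory: 1400 nodes, one of which carries an older firmware.
inventory = {f"node{i:04d}": "2.30.8000" for i in range(1, 1401)}
inventory["node0713"] = "2.11.1250"

expected, outliers = find_outliers(inventory)
print(f"expected firmware: {expected}")
for node, ver in sorted(outliers.items()):
    print(f"{node} is running {ver}")
```

Treating the most common version as the expected one is a simplification; a real deployment would more likely compare against a version pinned in the deployment template.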

Another advantage of the xCAT scripts was that they served as templates to standardize the overall deployment process. The three X-ISS network engineers assigned to the project traveled to the client site in shifts, often staying for a week or two. Regardless of who was onsite working on the deployment, the results were consistent because of the standardized details written into the scripts.

“As our engineers traveled between the client site and X-ISS headquarters in Houston, they remained in communication with each other,” said Khosla. “We set up remote access to the client cluster so our team members could assist each other directly whether they were onsite or not.”

Meeting the Deadline

Once the xCAT installation was completed, the team created deployment images for the compute and infrastructure nodes. These images allowed the compute nodes to be rapidly deployed throughout the process. Later, an additional image was created for visualization nodes.

At the request of the client, X-ISS released the cluster in phases as multiple independent compute environments. During the deployment of each environment, the team validated the BIOS and firmware versions on the compute nodes. A tool was used to enforce BIOS settings on the nodes, again for system-wide consistency.
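The article does not name the tool used to enforce BIOS settings. As a rough illustration of the validation step only, the Python sketch below compares each node's reported settings against a baseline and lists the deviations. The setting names, values and node names are hypothetical.

```python
def bios_deviations(baseline, node_settings):
    """Map each non-compliant node to {setting: (expected, actual)}."""
    report = {}
    for node, settings in node_settings.items():
        diffs = {
            key: (want, settings.get(key))
            for key, want in baseline.items()
            if settings.get(key) != want
        }
        if diffs:
            report[node] = diffs
    return report


# Hypothetical baseline and per-node settings, for illustration only.
baseline = {"LogicalProcessor": "Disabled", "SysProfile": "PerfOptimized"}
nodes = {
    "node0001": {"LogicalProcessor": "Disabled", "SysProfile": "PerfOptimized"},
    "node0002": {"LogicalProcessor": "Enabled", "SysProfile": "PerfOptimized"},
    "node0003": {"LogicalProcessor": "Disabled"},  # setting missing from report
}

report = bios_deviations(baseline, nodes)
for node, diffs in sorted(report.items()):
    for key, (want, got) in diffs.items():
        print(f"{node}: {key} expected {want}, found {got}")
```

A missing setting is reported with an actual value of None, so an incomplete inventory shows up as a deviation rather than passing silently.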

Finally, the X-ISS engineers ran High-Performance LINPACK (HPL) benchmark tests on the cluster to help identify and resolve issues caused by the misconfigurations and hardware failures typical of such large installations. Many minor hardware issues occur during shipping and can be easily sorted out between the engineers and the hardware vendor during installation.
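One common way to use benchmark results for this purpose is to compare per-node scores and flag the nodes that fall well below their peers. The Python sketch below assumes that single-node HPL scores are available, which the article does not state; the scores, node names and 10 percent tolerance are hypothetical.

```python
from statistics import median


def slow_nodes(gflops, tolerance=0.10):
    """Return nodes scoring more than `tolerance` below the median score."""
    if not gflops:
        return {}
    cutoff = median(gflops.values()) * (1.0 - tolerance)
    return {node: score for node, score in gflops.items() if score < cutoff}


# Hypothetical single-node HPL results in GFLOPS.
results = {
    "node0001": 402.0,
    "node0002": 398.5,
    "node0003": 401.2,
    "node0004": 310.7,
    "node0005": 399.9,
}

flagged = slow_nodes(results)
for node, score in sorted(flagged.items()):
    print(f"{node}: {score} GFLOPS is below the cutoff")
```

The median is used rather than the mean so that a few badly underperforming nodes do not drag the reference value down with them.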

In many projects, X-ISS installs and validates application software, but this client maintained an internal team that performed that work. The internal group also ran a series of its own tests on the new system before putting it into full production mode. The deployment and configuration done by X-ISS passed all acceptance testing by the client – all within the seven-week deadline.

X-ISS Sets Up Diskless Windows HPC Cluster for Secure Military Environment

A Department of Defense site needed a powerful Microsoft Windows HPC cluster to run mission-critical simulation applications. At 196 nodes, the cluster was relatively large, and due to security constraints, it had to be diskless.

In a diskless cluster, a central storage area network is typically loaded with a small number of physical hard drives storing files that serve as virtual hard drives to boot the compute nodes. Diskless Linux HPC systems were already relatively common at the time, but a diskless Windows HPC deployment was not.

The DoD site chose Dell to deliver this system, and since X-ISS had already been a long-time HPC-delivery partner, Dell called on X-ISS to assist with the job.

Proud of the platform-neutral reputation built over the past 15 years, X-ISS quickly dispatched a Senior Windows Analyst to the Dell integration facility to assist with building the cluster from the ground up. Specifically, X-ISS was tasked with customizing and installing the Windows cluster management system and testing the cluster.

“Starting with basic system architecture, we had to figure out how to make this work,” said Deepak Khosla. “Diskless booting with Windows is complex. It requires detailed planning to ensure all hardware is configured and set up to meet specific requirements.”

Making Diskless Windows Work

From a practical and financial perspective, diskless clusters make a lot of sense for any secure facility, Deepak Khosla explained. Organizations handling classified information deal with stringent security protocols for their computer networks. Among these is the mandated periodic DoD-grade wiping or outright destruction of disk drives containing sensitive data. For the military customer, this would have meant time-consuming and expensive cleansing – or destruction – of the 392 drives required for a standard 196-node system.

To customize the diskless Windows cluster, X-ISS interfaced extensively with Dell, Microsoft and the client.

After several conversations with Microsoft, X-ISS concluded that differencing disk technology would be key to a diskless system that met the military base’s requirement for system speed while also minimizing the number of hard drives. The differencing disks would enable the client to minimize the physical drive count and to run and modify the simulation numerous times without ever changing the master boot image. Each change, or simulation modification, is saved to a differencing disk on a virtual drive.

Rather than set up hundreds of virtual drives, each taking up 15 GB of space, the team created an equivalent number of differencing disks against a single 15 GB virtual drive. The savings in disk space were enormous, and the system speed was not impaired.
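The scale of the savings can be estimated with a short Python calculation. The node count and image size come from the case study. The average size of a differencing disk is not given, so the 1 GB figure below is purely an assumption, as is the premise of one virtual drive per compute node.

```python
def full_copies_gb(nodes, image_gb):
    """Space needed if every node gets its own full virtual drive."""
    return nodes * image_gb


def differencing_gb(nodes, image_gb, diff_gb_per_node):
    """Space needed for one shared image plus one differencing disk per node."""
    return image_gb + nodes * diff_gb_per_node


NODES = 196      # from the case study
IMAGE_GB = 15    # from the case study
DIFF_GB = 1.0    # assumption: average differencing disk size, not in the source

full = full_copies_gb(NODES, IMAGE_GB)
shared = differencing_gb(NODES, IMAGE_GB, DIFF_GB)
print(f"full copies:        {full} GB")
print(f"shared image:       {shared} GB")
print(f"estimated savings:  {full - shared} GB")
```

Full copies would occupy 196 × 15 = 2,940 GB. The shared-image figure depends entirely on how much each node writes, so it should be recalculated with measured differencing disk sizes.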
