Teamwork Key to Meeting Tight Development Deadline
With over 500 HPC deployments completed, the X-ISS team knows there is only one real constant – each cluster is unique and presents its own set of challenges. A recent deployment of a new 1400-node Dell cluster was no different. The client had a tight deadline: despite the size of the cluster, deployment, configuration and testing were to be completed in only seven weeks.
“We take great pride in the HPC expertise our personnel bring to every project,” said X-ISS President and CEO Deepak Khosla. “In this case, however, the ability of our expert staff to work together as a coordinated team was the key to bringing this project to fruition in the allotted time period.”
The client, a large manufacturer, selected Dell to design and build a new cluster with 1400 InfiniBand-connected compute nodes for deployment into an existing HPC environment. Although the new cluster would operate separately from the others, it would share existing storage capacity. Dell partnered with X-ISS to install and configure the cluster because of X-ISS’s reputation and skills with the technologies involved.
The assignment required validation testing of the cluster before handing it over in phases to internal client personnel tasked with installing applications. X-ISS was also asked to set up a separate testing environment in which the client could experiment with new graphical user interfaces for the platform cluster manager.
X-ISS assigned three network engineers to the project, each with extensive experience in Dell systems and HPC software stack technology. Because all three were rarely at the client site at the same time, they relied on best practices, standardized deployment templates and other tools to guide their work and ensure consistency, regardless of who was performing any given task.
Deploying the Cluster
Already a sophisticated user of HPC technology, the client specified xCAT software as its cluster management system. Drawing on its extensive experience with this robust tool, X-ISS used custom scripts to standardize the configuration of all the nodes. The scripts also automated and accelerated the configuration process, cutting the time required by approximately 90 percent compared with configuring 1400 nodes manually.
Ensuring consistent configuration of nodes is crucial to the efficiency of an InfiniBand-connected system. The advantage of InfiniBand is that it enables compute nodes to communicate at a much higher rate than otherwise possible. But for this data interconnect to operate with maximum efficiency, the nodes themselves must be tightly synchronized. Just one node running the wrong firmware version or loaded with different settings can slow down the entire cluster, which is why the xCAT scripts were so important.
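To illustrate the kind of consistency check this implies, the following is a minimal Python sketch, not the scripts X-ISS actually used. It assumes a firmware inventory has already been collected into a dictionary of node name to version string; the node names, version numbers and the function name find_firmware_outliers are all hypothetical.

```python
from collections import Counter


def find_firmware_outliers(versions):
    """Return (majority_version, outliers) for a {node: version} mapping.

    The majority version is treated as the intended one; every node
    reporting anything else is listed in the outliers dictionary.
    """
    if not versions:
        return None, {}
    majority, _count = Counter(versions.values()).most_common(1)[0]
    outliers = {node: ver for node, ver in versions.items() if ver != majority}
    return majority, outliers


# Hypothetical inventory; in practice this would come from the cluster itself.
inventory = {
    "node0001": "2.10.1",
    "node0002": "2.10.1",
    "node0003": "2.9.0",
    "node0004": "2.10.1",
}

majority, outliers = find_firmware_outliers(inventory)
print("expected version:", majority)
print("nodes to fix:", outliers)
```

Treating the most common version as the reference is a simplification; a real deployment would more likely compare against an explicitly approved version.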
The xCAT scripts also served as templates that standardized the overall deployment process. The three X-ISS network engineers assigned to the project traveled to the client site in shifts, often staying for a week or two. Regardless of who was onsite working on the deployment, the results were consistent because of the standardized details written into the scripts.
“As our engineers traveled between the client site and X-ISS headquarters in Houston, they remained in communication with each other,” said Khosla. “We set up remote access to the client cluster so our team members could assist each other directly whether they were onsite or not.”
Meeting the Deadline
Once the xCAT installation was completed, the team created deployment images for the compute and infrastructure nodes. These images allowed the compute nodes to be rapidly deployed throughout the process. Later, an additional image was created for visualization nodes.
At the request of the client, X-ISS released the cluster in phases as multiple independent compute environments. During the deployment of each individual environment, the team validated the BIOS and firmware versions on the compute nodes. A tool was used to enforce BIOS settings on the nodes, again for system-wide consistency.
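The case study does not name the tool used to enforce BIOS settings. As a rough sketch of the underlying idea only, the Python below compares each node's reported settings against a single baseline and reports any drift. The setting names, values, node names and the function bios_drift are hypothetical and do not describe any particular vendor tool.

```python
def bios_drift(baseline, actual):
    """Return {setting: (expected, found)} for settings that differ.

    A setting missing from the node's report is shown with found=None.
    """
    drift = {}
    for setting, expected in baseline.items():
        found = actual.get(setting)
        if found != expected:
            drift[setting] = (expected, found)
    return drift


# Hypothetical baseline and per-node reports, for illustration only.
BASELINE = {"LogicalProcessor": "Disabled", "SystemProfile": "Performance"}

node_settings = {
    "node0001": {"LogicalProcessor": "Disabled", "SystemProfile": "Performance"},
    "node0002": {"LogicalProcessor": "Enabled", "SystemProfile": "Performance"},
}

# Keep only the nodes that actually deviate from the baseline.
report = {}
for node, settings in node_settings.items():
    drift = bios_drift(BASELINE, settings)
    if drift:
        report[node] = drift

print(report)
```

A report like this only detects drift; applying the corrected settings would still be the job of the enforcement tool.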
Finally, the X-ISS engineers ran High-Performance LINPACK (HPL) benchmark tests on the cluster to help identify and resolve any issues caused by the misconfigurations or hardware failures typical of such large installations. Many minor hardware issues occur during shipping and can be easily sorted out between the engineers and the hardware vendor during installation.
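One common way benchmark results are used to find problem hardware is to compare each node against its peers. The sketch below assumes, purely for illustration, that a per-node HPL score in GFLOPS is available; the case study does not say how the benchmarks were run or analyzed. The scores, the 5 percent tolerance and the function flag_slow_nodes are hypothetical.

```python
from statistics import median


def flag_slow_nodes(gflops, tolerance=0.05):
    """Return a sorted list of nodes scoring below median * (1 - tolerance)."""
    if not gflops:
        return []
    floor = median(gflops.values()) * (1 - tolerance)
    return sorted(node for node, score in gflops.items() if score < floor)


# Hypothetical per-node HPL results in GFLOPS.
results = {
    "node0001": 410.2,
    "node0002": 408.9,
    "node0003": 352.7,
    "node0004": 411.5,
}

slow = flag_slow_nodes(results)
print("nodes to investigate:", slow)
```

The median is used rather than the mean so that a few badly underperforming nodes do not drag down the reference value they are being compared against.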
In many projects, X-ISS installs and validates application software, but this client maintained an internal team that performed that work. The internal group also ran a series of its own tests on the new system before putting it into full production mode. The deployment and configuration done by X-ISS passed all acceptance testing by the client – all within the seven-week deadline.