X-ISS Upgrades xCAT Cluster Manager with Custom BitTorrent Deployment

X-ISS successfully upgraded a client’s cluster manager from xCAT 2.3 to xCAT 2.8 without any system downtime. As part of the upgrade, X-ISS designed and implemented BitTorrent as the compute-node image transfer technology, possibly the first such deployment on an xCAT cluster. The use of BitTorrent has resulted in faster boot times while requiring fewer service nodes.

The client was ION GX Technology (GXT), a leader in providing advanced seismic data processing and visualization services to oil and gas clients through a series of worldwide geophysical centers. The computing power that drives these processing services comes from a large internal cloud-based HPC data center hub in Houston with more than 300 racks housing over 7,000 nodes, networking equipment, and more than 10 petabytes of storage.

“GXT is an image processing company that can’t afford cluster downtime,” said X-ISS CEO Deepak Khosla. “We completed this upgrade with no service disruptions on their cluster.”

The primary challenge in upgrading to the newest version of xCAT was the large gap between versions 2.3 and 2.8. This meant that some of the company’s legacy hardware, notably its older IBM blades, was not officially supported by the latest software. X-ISS wrote custom Perl modules so xCAT 2.8 could accommodate the blades as compute nodes.

As a long-time service provider to GXT, X-ISS had performed much of the code customization for xCAT 2.3 several years earlier. The client had requested that work at the time to enable remote console support for otherwise unsupported hardware, functionality GXT wanted to retain in the latest cluster manager. The X-ISS team migrated the data and updated the custom code to work with xCAT 2.8.

In the course of planning the xCAT upgrade, the X-ISS team realized that significant performance enhancements, including faster scaling of compute nodes, could be achieved by implementing BitTorrent. Its major advantage is that it distributes the heavy load of transferring the compute image across the compute nodes themselves rather than placing it solely on the service nodes, which allows more compute nodes to boot at the same time. This would reduce the number of new service nodes the client had to purchase while improving overall performance: a win-win for the client.

However, there was no record of BitTorrent having been implemented with any version of xCAT. To achieve this first-of-its-kind deployment, X-ISS added several components that work together to let compute nodes download the compute image quickly: Aria2 for Metalink parsing and simultaneous image downloading from multiple sources, a customized version of MirrorBrain for automatic Metalink and torrent generation upon file request, and OpenTracker as the BitTorrent tracker that coordinates the Aria2 clients.
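On a compute node, the general flow could look something like the following minimal sketch. This is a hypothetical illustration only: the host name, image path, destination directory, and boot-time integration are assumptions, not details taken from the X-ISS implementation.

    #!/usr/bin/env python3
    # Hypothetical boot-time fetch step on a compute node. The host name,
    # image path, and destination directory are illustrative only.
    import subprocess

    IMAGE_URL = "http://images.example.com/xcat/compute.rootimg.gz"
    DEST_DIR = "/tmp/xcat-image"

    def fetch_image():
        # MirrorBrain can serve a Metalink for a file when ".metalink" is
        # appended to its URL; the Metalink lists HTTP sources and the
        # torrent, so aria2 can pull pieces from the service nodes and from
        # peer compute nodes at the same time.
        subprocess.run(
            [
                "aria2c",
                "--follow-metalink=mem",   # parse the downloaded Metalink in memory
                "--seed-time=10",          # seed briefly so later-booting nodes can pull from peers
                "--allow-overwrite=true",
                "--dir", DEST_DIR,
                IMAGE_URL + ".metalink",
            ],
            check=True,
        )

    if __name__ == "__main__":
        fetch_image()

Seeding for a short window after the download is what spreads the transfer load across nodes that have already booted, rather than concentrating it on the service nodes.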

“We wrote code to allow the compute nodes to know how to download the image,” said Khosla. “This involved developing a new utility to replace the stock image creation utility that comes with xCAT.”
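On the management side, a replacement image utility would, at minimum, need to publish the packed image where MirrorBrain can serve it and create a torrent announcing to the OpenTracker instance. The sketch below is a hypothetical outline of that step; the image path, publish directory, tracker URL, and the use of mktorrent are assumptions, not the actual X-ISS utility.

    #!/usr/bin/env python3
    # Hypothetical server-side publishing step after the compute image is built.
    # Paths and the tracker URL are illustrative only.
    import shutil
    import subprocess

    IMAGE = "/install/netboot/centos6/x86_64/compute/rootimg.gz"
    PUBLISH_DIR = "/srv/mirrorbrain/xcat"
    TRACKER = "http://tracker.example.com:6969/announce"

    def publish_image():
        # Copy the packed image into the directory MirrorBrain serves.
        published = shutil.copy(IMAGE, PUBLISH_DIR)
        # Build a torrent that announces to OpenTracker so the Aria2 clients
        # on the compute nodes can discover one another as peers.
        subprocess.run(
            ["mktorrent", "-a", TRACKER, "-o", published + ".torrent", published],
            check=True,
        )
        return published

    if __name__ == "__main__":
        print(publish_image())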

Once the BitTorrent implementation was completed, the X-ISS team focused on a variety of other challenges in the upgrade at GXT:

  • Rewrote the local scratch creation scripts to use GPT partitioning so disks larger than 2 TB are supported (a minimal sketch of this step follows the list).
  • Created and implemented automated RAID configuration scripts so that hardware can be added to the GXT cluster in the future without extensive programming.
  • Set up an image generation host and tied it into the xCAT cluster manager, allowing users to create and boot CentOS 5 images from the CentOS 6-based cluster manager so GXT can leverage the capabilities of both operating systems.
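For the first item, the sketch below shows what GPT-based scratch-disk creation can look like; the device name and filesystem choice are assumptions, and the actual X-ISS scripts are not reproduced here. The motivation is that MBR partition tables top out around 2 TB, so larger disks need a GPT label to be fully addressable.

    #!/usr/bin/env python3
    # Hypothetical GPT scratch-disk setup; /dev/sdb and ext4 are illustrative.
    import subprocess

    DISK = "/dev/sdb"

    def make_scratch(disk=DISK):
        # Write a GPT label so the full capacity of a >2 TB disk is addressable.
        subprocess.run(["parted", "-s", disk, "mklabel", "gpt"], check=True)
        # Create a single partition spanning the whole disk.
        subprocess.run(["parted", "-s", disk, "mkpart", "primary", "0%", "100%"], check=True)
        # Format the new partition as local scratch space.
        subprocess.run(["mkfs.ext4", "-F", disk + "1"], check=True)

    if __name__ == "__main__":
        make_scratch()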

Following various tweaks and enhancements made in collaboration with the xCAT developers, the latest version of the cluster manager was deployed. To eliminate downtime, xCAT 2.3 was kept running in parallel as work continued on the upgrade. Whenever portions of the cluster were down for maintenance, just those portions were rebooted from the new cluster manager. Eventually, these rolling reboots brought the entire cluster onto xCAT 2.8 with no additional downtime required!

Download this case study: xCatUpgrade.CaseStudy3