HPC Pain Points Survey
X-ISS recently surveyed members of the HPC community about the challenges of working with and managing their HPC systems. The survey asked participants to rate ten questions on a scale from zero (No Problem) to ten (Major Issue). The questions covered a variety of HPC challenges, or “HPC pain points,” that we have heard over decades of working closely with customers’ and partners’ HPC installations, management and support. Over the next few months, results of the survey will be posted here.
This initial installment will cover details on the respondents and basic statistics of the survey.
Survey HPC System Statistics
- 80 surveys completed (total)*
- 54 respondent records used in results
- Number of nodes on the HPC systems ranged from 2 to 20,000
- Average number of nodes = 1,970
- Shared storage ranged from 25 GB – 40 PB
- Total shared storage across all respondents = 144 PB
- 21 respondents had 1 PB or more
- 20 respondents had 1 TB – 400 TB
- 3 respondents had <1 TB
- 45 (or 83%) of the respondents are using InfiniBand fabric.
*Note: The published findings include 54 of the 80 total respondents. The remaining 26 surveys were excluded because they were incomplete, anomalous (and would have skewed the results), or illegible.
Challenges with HPC vary depending on the system, workload characteristics, staffing, compliance, security, the organization and many other factors. The questions asked for this survey are listed below. Only questions 1 and 2 are covered in this Part 1 of the results.
- Part 1
- Question 1 – Integration of HPC cluster into enterprise infrastructure
- Question 2 – Managing multi-vendor hardware, cluster managers and schedulers
- Additional Metrics: Respondents market segmentation and HPC system statistics (See above.)
- Part 2
- Question 3 – Implementing efficient scheduler configurations
- Question 4 – Monitoring HPC clusters with IT infrastructure tools
- Additional Metrics: Commercial application count statistics
- Part 3
- Question 5 – Lack of HPC IT expertise within the organization
- Question 6 – Limited HPC support staff for number of systems and users
- Additional Metrics: Open source application count statistics
- Part 4
- Question 7 – Lack of HPC cloud / virtualization expertise in-house
- Question 8 – Poor experience for Windows users accessing Linux clusters
- Additional Metrics: In-house developed application count statistics
- Part 5
- Question 9 – Insufficient remote access and collaboration for HPC applications
- Question 10 – Ineffective reporting on job / application / project performance across multiple clusters
- Additional Metrics: Survey conclusions and take-away
Question 1 – Integration of HPC Cluster into Enterprise Infrastructure
This question targeted challenges with integrating HPC clusters into the organization’s enterprise infrastructure. This would include networking and communications, authentication integration (e.g. LDAP, Active Directory, etc.), remote access, storage, and security to name a few.
The results show a varied distribution: 22% of respondents have no issues with enterprise integration, while a fairly large group, 34%, report a pain point of 8 or higher.
Question 2 – Managing Multi-vendor Hardware, Cluster Managers and Schedulers
This question touches on the level of vendor product diversity in the HPC cluster: the time (possibly pain) required to manage that diversity, to establish integration and interaction between multi-vendor products, and to maintain the same on an ongoing basis. Some organizations may deploy multiple clusters running disparate software stacks, schedulers and cluster managers on similar or different hardware platforms. The support matrix for these organizations can grow large, fast!
HPC talent within the organization who manage these clusters may have their own product preferences and domain knowledge, further compounding this pain point. High diversity of multi-vendor hardware, cluster managers and schedulers often results in larger staffing demands, less standardization, slower support response times and a less performant HPC infrastructure.
Only 7% of respondents tell us that this is a major issue, and 19% have no issues. Generally, the distribution of responses indicates that two-thirds of organizations have little or mild pain in this area. We do feel for the ~25% in the 8, 9 and 10 ranges.
Question 3 – Implementing Efficient Scheduler Configurations
High-performance computing (HPC) solutions, and other forms of centralized computing, are cost effective provided they are managed and utilized to extract maximum value. The opportunity in HPC for high utilization and reduced operating costs comes, in large part, from efficient resource allocation, prioritization, and initiation and termination of jobs. Robust and feature-rich scheduling software eases this burden, but in complex and dynamic environments, where re-provisioning and prioritization are frequent, the configuration, scripting and coordination of schedulers can be a monumental task. This question asks just how painful implementing efficient scheduling is.
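As a small illustration of what this configuration work can look like, consider a scheduler configuration excerpt in the style of Slurm, combining backfill scheduling with multifactor (fair-share) priority and per-partition limits. The node names, weights and limits below are hypothetical examples for discussion, not recommendations:

```ini
# Hypothetical slurm.conf excerpt (illustrative values only).
# Backfill lets small jobs run in gaps left by large reservations.
SchedulerType=sched/backfill
SelectType=select/cons_tres

# Multifactor priority: weight fair-share heavily so under-served
# accounts move up the queue; age the rest slowly.
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightJobSize=500

# Two partitions with different time limits for different workloads.
PartitionName=debug Nodes=node[01-04] MaxTime=00:30:00 Default=YES State=UP
PartitionName=batch Nodes=node[05-64] MaxTime=2-00:00:00 State=UP
```

Even a modest excerpt like this has to be coordinated with accounting, QOS and re-provisioning policies, which is where the pain tends to accumulate.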
As suggested in the discussion above, there is high variability in cluster scheduler configuration needs. The results mirror that variability, with a fairly random distribution across the pain scale. One notable statistic is that 41% of respondents reported a pain level of 8 or higher, with the largest single count, 17%, at a pain point of 9.
Question 4 – Monitoring HPC Clusters with IT Infrastructure Tools
Clusters are monitored for different reasons, possibly for different metrics, with various degrees of fidelity and frequency. Whether you are monitoring for cluster health, utilization, predicting hardware failures, job queues, etc., the metrics collected can be from multiple sources and gathered by various centralized or distributed tools. In addition, these tools not only need to scale, they also need to aggregate data and summarize monitors as there may be thousands of nodes in the cluster.
For operational teams responsible for the day-to-day care and feeding of the cluster, one of the primary responsibilities is to keep the cluster up and running 24×7. Downtime is expensive especially if the failures affect a large number of nodes or users. Software tools are needed to automate and alert when failures occur (or ideally before they occur). We asked our respondents their opinion on how effective the current set of tools is to help them stay on top of these failures.
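Stripped to its essentials, this kind of tooling collects per-node metrics, aggregates them into cluster-wide rollups, and alerts when thresholds are crossed. The sketch below illustrates that pattern only; the node names, metrics and thresholds are hypothetical, not any respondent's actual tooling:

```python
# Minimal sketch of threshold-based cluster health alerting.
# Node names, metric fields and thresholds are hypothetical examples.

def check_nodes(metrics, max_temp_c=80.0, max_load=0.95):
    """Return a list of (node, problem) alerts for metrics that
    meet or exceed the given thresholds."""
    alerts = []
    for node, m in metrics.items():
        if m["temp_c"] >= max_temp_c:
            alerts.append((node, f"temperature {m['temp_c']}C"))
        if m["load"] >= max_load:
            alerts.append((node, f"load {m['load']:.0%}"))
    return alerts

def summarize(metrics):
    """Aggregate per-node metrics into a cluster-wide summary;
    with thousands of nodes, rollups matter more than raw data."""
    n = len(metrics)
    return {
        "nodes": n,
        "avg_load": sum(m["load"] for m in metrics.values()) / n,
        "hottest": max(metrics, key=lambda k: metrics[k]["temp_c"]),
    }

if __name__ == "__main__":
    metrics = {
        "node01": {"temp_c": 62.0, "load": 0.80},
        "node02": {"temp_c": 85.5, "load": 0.40},  # over temperature
        "node03": {"temp_c": 58.0, "load": 0.99},  # over load
    }
    for node, problem in check_nodes(metrics):
        print(f"ALERT {node}: {problem}")
    print(summarize(metrics))
```

Real deployments layer scalable collection, history and escalation on top of this core loop, which is exactly where off-the-shelf IT tools may or may not fit an HPC cluster.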
More than a third of respondents, or 35%, selected the 5 to 7 pain range. Only 8% of respondents placed themselves in the top threshold positions of 9 and 10, feeling major pain in monitoring the cluster. The largest single distribution, 17%, is at a threshold of 3. Fifty-five percent were in the 1 to 5 range and 45% in the 6 to 10 range. There is a lot of variability in the numbers, but it seems that very few people feel severe pain in monitoring HPC clusters with IT infrastructure tools. This is a good thing.
Commercial Applications
As part of the survey we asked respondents the number of commercial applications installed on their HPC system. The results are below.
- 33% of respondents were running between 1 and 5 commercial applications on their HPC systems.
- 9% of respondents were running between 6 and 25 commercial applications on their HPC systems.
- 7% of respondents were running between 26 and 75 commercial applications on their HPC systems.
- 5% of respondents were running between 76 and 100 commercial applications on their HPC systems.
- 1 respondent (2%) was running 180 commercial applications on their HPC system.
- 27% of respondents provided no response.
- 17% of respondents were not running any commercial applications on their HPC systems. Of these respondents, all but one were universities.
Question 5 – Lack of HPC IT Expertise within the Organization
This question asks the respondent to rate specifically the organization’s HPC IT expertise. All clusters are unique in their own right. Very rarely will you find two that are identical, and in part this uniqueness will dictate the skills and knowledge needed within the organization.
As mentioned in question 2 of this survey “Managing Multi-vendor Hardware, Cluster Managers and Schedulers,” there’s vendor-offering diversity and as such a need for specific skill sets for a particular cluster. At a minimum, managing an HPC cluster will require breadth of skills across operating system, hardware, fabrics, networking, software, cluster management tools, schedulers and a variety of other components. For larger clusters the demand for administration and staffing may require several administrators with specific domains of expertise rather than broad expertise.
One final note: this question was not intended to account for a “lack of expertise” within groups such as management or users. That said, the variability described above suggests a wide range of possible rationales behind respondents’ ratings, so let’s look at the responses.
Wow: a full third, or 33%, of all respondents are confident there is no lack of HPC IT skills within their organization. This group of respondents obviously didn’t take the management group into consideration! But seriously now, rankings 1 and 2 collectively represent more than 50% of all responses. The lowest percentage was 2%, in position 6. Splitting the responses, only 23% are in the 6 to 10 pain range, leaving a whopping 77% of the distribution in the 1 to 5 range. To date, this is the most decided set of responses in our survey. Good job, HPC admins!
Question 6 – Limited HPC Support Staff for Number of Systems and Users
This question asks the respondents whether they feel there is adequate HPC support staffing for the size of the HPC systems and number of users in their environment.
We offer similar comments here as in the previous question, as we are still talking about support. In this case, however, the word ‘support’ implies a support workload, not just capabilities. As mentioned several times, when a cluster is properly deployed and configured, it runs at peak operational performance. The same is true of support staff, who will be more effective and efficient if sufficiently experienced and trained with the systems, applications and workflows.
Support staff, like clusters, can vary in breadth and capabilities. For small clusters, a few people may be able to provide adequate support. For large clusters, especially ones with hundreds of applications, such as universities, there is a need for a considerable number of staff and domain knowledge. So let’s see how well staffed our respondents feel their organization is for HPC systems and users.
The results indicate a fairly broad distribution, with 60% of respondents in the lower half of the scale feeling little pain. The remaining 40%, in the 6 to 10 pain point range, are not so lucky. More noticeably, a third of all respondents, at the top of the scale, report pain points of 8 to 10. This top group may be suffering from serious under-staffing.
Interestingly, when reviewed alongside the answers to the previous question, it appears that HPC support staff expertise is not an issue, but for many, the size of the staff may be. Under-staffing is usually a budgeting issue, or a consequence of the limited availability of experienced systems administrators and support staff.
Open Source Applications
As part of the survey we asked respondents the number of open source applications installed on their HPC system. The results are below.
- 29% of respondents provided no response.
- 7% of respondents were not running any open source applications on their HPC system. Of these respondents, only one was not a commercial company.
- 31% of respondents were running between 1 and 5 open source applications on their HPC systems.
- 13% of respondents were running between 6 and 25 open source applications on their HPC systems.
- 6% of respondents were running between 26 and 75 open source applications on their HPC systems.
- 9% of respondents were running between 76 and 100 open source applications on their HPC systems.
- 5% of respondents had more than 100 open source applications on their HPC systems.
Question 7 – Lack of HPC Cloud / Virtualization Expertise In-house
The last couple of survey questions focused on the lack, or limitation, of support for a traditional cluster and its users. Question 7 continues that theme, but in the area of a less mature, emerging HPC deployment model: HPC in the cloud. We asked this particular question for a few reasons: HPC’s continued proliferation into commercial markets; HPC’s need for rapid provisioning, elasticity, and vertical and horizontal scalability; and, finally, the opportunity to take advantage of the public cloud for infrequent high-performance workloads.
Looking at the results, one may wonder why there is such a high percentage reporting cloud / virtualization expertise, collectively more than 40% in pain levels 1, 2 and 3. There could be several reasons for this, including the similarity in skills between traditional HPC deployment methods and cloud computing deployments, e.g., image management, deployment and re-provisioning, multi-tenancy, high availability / fault tolerance, resiliency, etc. Just as likely, there is growing use of virtualization and cloud-based technologies within HPC installations, or in place of traditional HPC solutions.
In the higher pain points, only 12% of respondents, within the 9 and 10 pain levels, consider themselves to have a lack of cloud and virtualization expertise. We believe organizations that manage clusters will increasingly have expertise in cloud technologies, along with a continued need for virtualization experience.
Question 8 – Poor Experience for Windows Users Accessing Linux Clusters
Some might ask why we asked this question. In our business, standing up and supporting Linux and Windows clusters, we interact with both Linux and Windows users. Excluding Windows users using a Windows cluster and Linux users accessing a Linux cluster, this leaves two scenarios: Linux users accessing a Windows cluster (less common) and Windows users accessing Linux clusters (common). The latter case is the topic of this question.
Poor experience for Windows users on Linux clusters typically falls into two main areas. The first is related to familiarity with Linux: working at the command line, using Linux tools and generally having to operate within a new, unfamiliar environment, and, when things go wrong, not being able to isolate and resolve issues. The second area is related to cluster access and the integration of accounts, authentication and privileges between the Windows user’s familiar environment and the foreign Linux environment. In other words, making the Windows and Linux worlds not only play together agreeably, but share resources. Software like Samba, which provides file and print sharing, authentication and authorization, name resolution and service announcement, can be difficult to set up, manage and troubleshoot.
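For illustration only, a minimal Samba configuration that joins a cluster head node to Active Directory and exports users' home directories might look like the excerpt below. The realm, workgroup and ID ranges are placeholders, and a real deployment involves considerably more (Kerberos, DNS, winbind tuning):

```ini
# Hypothetical smb.conf excerpt (placeholder realm and ranges).
[global]
    security = ads
    realm = EXAMPLE.COM
    workgroup = EXAMPLE
    # Map AD accounts to Unix UIDs/GIDs via winbind.
    idmap config * : backend = tdb
    idmap config * : range = 10000-999999

[homes]
    comment = Cluster home directories
    browseable = no
    read only = no
    valid users = %S
```

Each of these directives interacts with the cluster's authentication and storage setup, which is why this integration is a common source of pain.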
Given all the variables, from training to integrated access, there is a considerable amount that can go wrong for a Windows user. Let’s see what the respondents had to say.
Generally, the respondents do not see a lot of poor experience for Windows users on a Linux cluster. Only 19% of respondents felt there is an 8 or higher pain point for Windows users’ experience. Somewhat surprising is the almost one-third of respondents who report a 1 or 2 pain point for Windows users. At a higher level, 60% of respondents scored this question at a pain point of 5 or less. The only caveat we need to make here is that there was no qualifying question as to whether Windows users were accessing the respondent’s cluster. We suspect that if that question had been asked as a filter, we’d see very different results.
In-house Applications
As part of the survey we asked respondents the number of in-house applications installed on their HPC system. The results are below.
- 22% of respondents provided no response.
- 11% of respondents were not running any in-house applications on their HPC system. Of these respondents only one was not a university.
- 31% of respondents were running between 1 and 5 in-house applications on their HPC systems.
- 20% of respondents were running between 6 and 25 in-house applications on their HPC systems.
- 8% of respondents were running between 26 and 75 in-house applications on their HPC systems.
- 8% of respondents had more than 100 in-house applications on their HPC systems.
Question 9 – Insufficient Remote Access and Collaboration for HPC Applications
The computer technology market has gone through several generations, moving from predominantly centralized computing to distributed computing and back again to centralized computing in some form. In the world of traditional HPC, computational assets have been predominantly centralized, meaning users reach the various processing elements through remote access to a single HPC system. Remote access to some systems is simply not allowed for security reasons. For those who are granted remote access, the performance of the remote connection and tools can limit productivity, depending on the level of interaction required with the system, data set sizes and visualization requirements, to name a few.
On the same note, your analysis and workflow may require you to share data with other applications, tools or individuals as input for further analysis, post-processing, visualization, etc. The most common of these is sharing results in a meaningful manner, and that typically means in a graphical, visual form. As such, remote visualization is the area where we see the greatest demand for productivity and performance in remote access and collaboration on HPC clusters. How well the tools and components allow collaboration has a direct impact on productivity. So we asked respondents to rate their pain point for “Insufficient remote access and collaboration for HPC applications.” Let’s look at the results.
Almost one-third of all respondents have no pain with remote access or collaboration for HPC applications. This is the good news. Contributing to this high percentage are advancements in security, collaborative software, remote connection speeds and remoting tools. Pain points 1 through 3 collectively represent 60% of the responses. Wow! An interesting side note: this is the only question in the survey where no respondent selected a 9 or 10 pain point. Hands down, this question was definitive in terms of least pain.
Question 10 – Ineffective Reporting on Job/Application/Project Performance Across Multiple Clusters
“If you can’t measure it, you can’t manage it.” “Without the data, it is just another opinion.” These are a couple of the adages regarding data, measurement and management, but are these truisms for HPC clusters? We think they are, so we asked our respondents their opinion on how painful it is to use existing IT tools to collect, consolidate and report on performance measurements across all their clusters.
With this question in mind, we need to consider how many survey respondents could be responsible for, or at least supporting, the collection and reporting of data across multiple clusters, mixed clusters (e.g., varying hardware, different HPC stacks, or different operating systems such as Windows and Linux) and remote clusters. In that context, we expect only a few individuals to have a vested interest in all clusters across the organization and their use. These individuals, typically in a management or executive role, or possibly in finance, require details of usage, performance and related metrics in order to make upgrade, resource allocation or financial business decisions.
In our experience, we predominantly see multiple people managing different clusters within an organization, using different tools, methods and data stores to collect and report the same information. This disparity tends to lead to inconsistencies in reporting, as well as inefficiencies when data must be merged from various sources. Some of these variances in tool selection are driven by the cluster and its stack, yet more often the choice is a preference of individuals or groups. As a final note before we look at the responses, these decisions to use disparate tools are not always preferential or political. In many cases they stem from gaps in reporting tool capabilities, where a single tool, like DecisionHPC, can meet organization-wide reporting objectives. On that note, let’s jump into the results.
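To make the consolidation problem concrete, the sketch below merges per-cluster job accounting records into a single per-project utilization report. This is an illustration of the pattern only (not DecisionHPC itself), and the record fields and cluster names are hypothetical:

```python
# Minimal sketch: consolidate job accounting records exported by
# several clusters into one per-project utilization report.
# Field names and cluster names are hypothetical examples.
from collections import defaultdict

def consolidate(records_by_cluster):
    """Merge per-cluster job records into total core-hours per project."""
    totals = defaultdict(float)
    for cluster, records in records_by_cluster.items():
        for job in records:
            # Normalize: each cluster reports core count and
            # wall-clock seconds, so core-hours are comparable.
            core_hours = job["cores"] * job["wall_secs"] / 3600.0
            totals[job["project"]] += core_hours
    return dict(totals)

if __name__ == "__main__":
    data = {
        "clusterA": [
            {"project": "cfd", "cores": 64, "wall_secs": 3600},
            {"project": "bio", "cores": 16, "wall_secs": 7200},
        ],
        "clusterB": [
            {"project": "cfd", "cores": 128, "wall_secs": 1800},
        ],
    }
    for project, hours in sorted(consolidate(data).items()):
        print(f"{project}: {hours:.1f} core-hours")
```

In practice the hard part is upstream of this merge: getting each cluster's tools to export records in any common, comparable form at all.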
On the bottom end, 19% of respondents have no pain in terms of ineffective reporting on job / application / project performance across multiple clusters. Collectively, 71% of respondents are in the 1 through 5 pain points. On the top end, a notable 12% of respondents report a 9 or 10 pain point in this area. So the split on this one is that 30% of respondents have a pain point of 6 or higher, and the other 70% are at 5 and under.
There are a few constants in the HPC market: one is change; others include variability and diversity. Through the HPC deployments we’ve completed for our X-ISS clients, and the hundreds of large-scale HPC software development projects I’ve personally been involved with, we’ve yet to find two HPC systems that are similar enough to be called identical. Subtleties in the deployment of the software stack, cluster managers, schedulers, hardware, firmware, fabrics, applications, use cases… the list goes on, making HPC as unique a market as the systems that represent it.
In the process of consolidating and analyzing the survey results, we were not surprised to find a few instances where variability and diversity may have colored the results.
We hope you have found the results of this survey valuable, or at least interesting.
Sr. Vice President of Technology and Services
eXcellence in IS Solutions, Inc. (X-ISS)