High Performance Computing Storage – Hybrid Cloud, Parallel File Systems, Key Challenges, and Top Vendors’ Products

Ashish Sharma Mar 16 - 10 min read

The toughest Terminator, the T-1000, can demonstrate rapid shapeshifting, near-perfect mimicry, and recovery from damage. This is because it is made of mimetic polyalloy with robust mechanical properties. T-1000s naturally require top-of-the-line speed, a hi-tech communication system, razor-sharp analytical ability, and the most powerful connectors and processors. Neural networks are also critical to the functioning of Terminators. A neural network holds an incredible amount of data in its nodes, which then communicate with the outer world depending on the input received.

We can infer one important thing: these Terminators produce an enormous amount of data. They must therefore rely on a sleek data storage system that scales and can support computation over massive datasets. That rings a bell: just like Terminators, High Performance Computing (HPC) requires equally robust storage to maintain compute performance.

HPC has been the driving force behind path-defining innovations and scientific discoveries. This is because HPC enables data processing and highly complex calculations at very high speed. Put simply, HPC aggregates compute power to deliver far higher performance than a typical machine. The rise of AI/ML, deep learning, edge computing, and IoT created a need to store and process incredible amounts of data. HPC therefore became a key enabler in bringing digital technologies within the realm of daily use. In layman’s terms, HPC systems can be thought of as supercomputers.

The Continual Coming of Age of HPC

The first supercomputer, the CDC 6600, reigned for five years from its introduction in 1964. The CDC 6600 was paramount to the critical operations of the US government and the US military. It was about three times faster than its nearest competitor, the IBM 7030 Stretch, with a speed of up to 3 million floating-point operations per second (FLOPS).

The need for complex computer modeling and simulation never stopped over the decades, and high-performance computers evolved along with it. These supercomputers were built from core components with more power and vaster memory to handle complex workloads and analyze datasets. Each new release of supercomputers would make its predecessors obsolete, just like new robots in the Terminator series.

The latest report by Hyperion Research states that iterative simulation workloads and new workloads such as AI and other Big Data jobs will drive the adoption of HPC storage.

Understanding Data Storage as an Enabler for HPC

Investing in HPC is expensive. It is therefore essential to have a robust and equally capable data storage system that runs alongside the HPC environment. Furthermore, HPC workloads differ by use case. For example, HPC at a government or military secret agency handles heavier workloads than HPC at a national research facility. This means HPC storage requires heavy customization, with a different storage architecture for each application.

Hybrid Cloud – An Optimal Solution for Data-Intensive HPC Storage

Thinking only about the perfect HPC storage will not help. There has to be an optimal solution that scales with HPC needs. Ideally, it has to be the right mix of the best of both: traditional storage (on-prem disk drives) and cloud (SSDs and HDDs). Complex, IOPS-intensive data access can be channeled to SSDs, while regular streaming data can be handled by disk drives. An efficient hybrid cloud combination of software-defined storage and hardware configuration ultimately helps scale performance, while eliminating the need for a separate storage tier. The software-defined storage must come with key characteristics: write-back, read persistence, performance statistics, dynamic flush, and an I/O histogram. Finally, the HPC storage should support parallel file systems by handling complex sequential I/O.
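
The routing idea above, with random or small I/O going to SSDs and large streaming I/O going to disk drives, can be sketched in a few lines of Python. This is only an illustration: the IORequest type, the choose_tier function, and the 128 KiB threshold are hypothetical and are not taken from any vendor product.

```python
from dataclasses import dataclass

# Hypothetical threshold: requests smaller than this are treated as IOPS-bound.
SMALL_IO_BYTES = 128 * 1024


@dataclass(frozen=True)
class IORequest:
    size_bytes: int
    sequential: bool  # True for streaming access, False for random access


def choose_tier(req: IORequest) -> str:
    """Return "ssd" for random or small I/O, "hdd" for large sequential streams."""
    if not req.sequential or req.size_bytes < SMALL_IO_BYTES:
        return "ssd"
    return "hdd"


requests = [
    IORequest(size_bytes=4 * 1024, sequential=False),        # small random read
    IORequest(size_bytes=8 * 1024 * 1024, sequential=True),  # large streaming write
    IORequest(size_bytes=16 * 1024, sequential=True),        # small sequential read
]
placement = [choose_tier(r) for r in requests]
print(placement)  # ['ssd', 'hdd', 'ssd']
```

A real software-defined storage layer would decide from observed statistics, such as an I/O histogram, and not from a fixed threshold, but the decision has the same shape.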

Long Term Support (LTS) Lustre for Parallel File System

More than 50 percent of global HPC storage architectures reportedly use Lustre, an open-source parallel file system, to support HPC clusters. For starters, it is free to install. Further, it provides massive data storage capabilities along with unified configuration, centralized management, simple installation, and powerful scalability. Its LTS community releases allow parallel I/O spanning multiple servers, clients, and storage devices. It offers open APIs for deep integration. Throughput can exceed 1 terabyte per second. It also offers integrated support for applications built on Hadoop MapReduce.
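
To make "parallel I/O spanning multiple servers" concrete, here is a simplified sketch of round-robin striping, the general technique parallel file systems use to spread one file across several storage targets. The stripe size and stripe count are assumed values, and the locate function is a hypothetical illustration of the arithmetic, not Lustre's actual implementation or API.

```python
from typing import Tuple


def locate(offset: int, stripe_size: int, stripe_count: int) -> Tuple[int, int]:
    """Map a file byte offset to (target index, byte offset inside that target's object)."""
    stripe_number = offset // stripe_size        # which stripe-sized chunk of the file
    target = stripe_number % stripe_count        # chunks rotate across the targets
    full_rounds = stripe_number // stripe_count  # completed passes over all targets
    object_offset = full_rounds * stripe_size + offset % stripe_size
    return target, object_offset


STRIPE_SIZE = 1024 * 1024  # 1 MiB, an assumed value
STRIPE_COUNT = 4           # assumed number of storage targets

for off in (0, 1_048_576, 4_194_304 + 10):
    print(off, locate(off, STRIPE_SIZE, STRIPE_COUNT))
# 0 (0, 0)
# 1048576 (1, 0)
# 4194314 (0, 1048586)
```

Because consecutive chunks land on different targets, a large sequential read or write can be served by all targets at once, which is where the aggregate throughput comes from.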

Challenges of Data Management in Hybrid HPC Storage

Inefficient Data Handling

The key challenge in implementing hybrid HPC storage is inefficient data handling. Dealing with large, complex datasets and accessing them over a WAN is time-consuming and tedious.
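
A rough back-of-the-envelope calculation shows why WAN access is time-consuming. The dataset size, link speeds, and the 80% link efficiency below are assumptions chosen for illustration, not measurements.

```python
def transfer_hours(dataset_bytes: float, link_bits_per_sec: float,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to move a dataset over a link at a given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    effective_bytes_per_sec = link_bits_per_sec * efficiency / 8
    return dataset_bytes / effective_bytes_per_sec / 3600


TB = 10**12  # decimal terabyte

hours_1g = transfer_hours(100 * TB, 1e9)    # 100 TB over a 1 Gbit/s link
hours_10g = transfer_hours(100 * TB, 10e9)  # 100 TB over a 10 Gbit/s link

print(f"1 Gbit/s:  {hours_1g:.1f} hours ({hours_1g / 24:.1f} days)")
print(f"10 Gbit/s: {hours_10g:.1f} hours")
# 1 Gbit/s:  277.8 hours (11.6 days)
# 10 Gbit/s: 27.8 hours
```

Even with a tenfold faster link, moving the assumed 100 TB takes more than a day, which is why hybrid designs try to keep hot data close to the compute.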

Data Security

Security is another complex affair for HPC storage. The hybrid cloud file system must also include built-in data security. Files, including small ones, must not be vulnerable to external threats. Providing SMBv3 encryption for files moving within the environment can help considerably. Further, building in snapshot replication can deliver integrated, seamless protection for the data.

Right HPC Product

End users usually find it difficult to choose the right product relevant to their services and industry. Hyperion Research presents an important fact. It states, “Although a large majority (82%) of respondents were relatively satisfied with their current HPC storage vendors, a substantial minority said they are likely to switch storage vendors the next time they upgrade their primary HPC system. The implication here is that a fair number of HPC storage buyers are scrutinizing vendors for competencies as well as price.”

Top HPC Storage Products

Let’s briefly look at a varied selection of the top HPC storage products on the market.

ClusterStor E1000 All Flash – By Cray (An HPE Company)

ClusterStor E1000 enables data handling at exascale speed. Its core is a combination of SSD and HDD. ClusterStor E1000 has a policy-driven architecture that enables you to move data intelligently. The HDD-based ClusterStor E1000 configuration offers up to 50% more performance with the same number of drives than its closest competitors. The all-flash configuration is suited mainly to small files, random access, and capacity requirements from terabytes to single-digit petabytes.

Source: Cray Website

HPE Apollo 2000 System – By HPE

The HPE Apollo 2000 Gen10 system is designed as an enterprise-level, density-optimized, 2U shared infrastructure chassis for up to four HPE ProLiant Gen10 hot-plug servers with all the traditional data center attributes: standard racks and cabling and rear-aisle serviceability access. A 42U rack fits up to 20 HPE Apollo 2000 system chassis, accommodating up to 80 servers per rack. It delivers the flexibility to tailor the system to the precise needs of your workload with the right compute, flexible I/O, and storage options. The servers can be “mixed and matched” within a single chassis to support different applications, and the chassis can even be deployed with a single server, leaving room to scale as the customer’s needs grow.

Source: HPE Website

PRIMERGY RX2530 M5 – By Fujitsu

The FUJITSU Server PRIMERGY RX2530 M5 is a dual-socket rack server that provides high performance of the new Intel® Xeon® Processor Scalable Family CPUs, expandability of up to 3TB of DDR4 memory and the capability to use Intel® Optane™ DC Persistent Memory, and up to 10x 2.5-inch storage devices – all in a 1U space saving housing. The system can also be equipped with the new 2nd generation processors of the Intel® Xeon® Scalable Family (CLX-R) delivering industry-leading frequencies. Accordingly, the PRIMERGY RX2530 M5 is the optimal system for large virtualization and scale-out scenarios, databases and for high-performance computing.

Source: Fujitsu Website

PowerSwitch Z9332F-ON – By Dell EMC

The Z9332F-ON 100/400GbE fixed switch comprises Dell EMC’s latest disaggregated hardware and software data center networking solutions, providing state-of-the-art, high-density 100/400GbE ports and a broad range of functionality to meet the growing demands of today’s data center environment. These innovative, next-generation open networking high-density aggregation switches offer optimum flexibility and cost-effectiveness for web 2.0, enterprise, mid-market, and cloud service providers with demanding compute and storage traffic environments. The compact PowerSwitch Z9332F-ON provides industry-leading density of either 32 ports of 400GbE in QSFP56-DD form factor, 128 ports of 100GbE, or up to 144 ports of 10/25/50GbE (via breakout), in a 1RU design.

Source: Dell EMC Website

E5700 – By NetApp

E5700 hybrid-flash storage systems deliver high IOPS with low latency and high bandwidth for your mixed workload apps. Requiring just 2U of rack space, the E5700 hybrid array combines extreme IOPS, sub-100 microsecond response times, and up to 21GBps of read bandwidth and 14GBps of write bandwidth. With fully redundant I/O paths, advanced data protection features, and extensive diagnostic capabilities, the E5700 storage systems enable you to achieve greater than 99.9999% availability and provide data integrity and security.

Source: NetApp Website
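
To put the E5700 availability figure above in perspective, an availability percentage can be converted into a yearly downtime budget. The calculation below is a generic illustration with a hypothetical helper function; it assumes a 365-day year and says nothing about how NetApp measures availability.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; leap years ignored


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per year, in seconds."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


six_nines = downtime_seconds_per_year(99.9999)
three_nines = downtime_seconds_per_year(99.9)

print(f"99.9999%: {six_nines:.1f} seconds per year")
print(f"99.9%:    {three_nines / 3600:.2f} hours per year")
# 99.9999%: 31.5 seconds per year
# 99.9%:    8.76 hours per year
```

In other words, "six nines" leaves a budget of roughly half a minute of downtime per year.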

ScaTeFS – By NEC Corporation

The NEC Scalable Technology File System (ScaTeFS) is a distributed and parallel file system designed for large-scale HPC systems requiring large capacity. To realize load balancing and scale-out, all typical basic functions of a file system (read/write operation, file/directory generation, etc.) are distributed to multiple IO servers uniformly since ScaTeFS does not need a master server for managing the entire file system such as a metadata server. Therefore, the throughput of the entire system increases, and parallel I/O processing can be used for large files.

Source: NEC Website

HPC-X ScalableHPC – By Mellanox

Mellanox HPC-X ScalableHPC toolkit is a comprehensive software package that includes MPI and SHMEM/PGAS communications libraries. HPC-X ScalableHPC also includes various acceleration packages to improve both the performance and scalability of high performance computing applications running on top of these libraries, including UCX (Unified Communication X), which accelerates point-to-point operations, and FCA (Fabric Collectives Accelerations), which accelerates collective operations used by the MPI/PGAS languages. This full-featured, tested, and packaged toolkit enables MPI and SHMEM/PGAS programming languages to achieve high performance, scalability, and efficiency, and ensures that the communication libraries are fully optimized for the Mellanox interconnect solutions.

Source: Mellanox Website

Panasas ActiveStor-18 – By Microway

Panasas® is the performance leader in hybrid scale-out NAS for unstructured data, driving industry and research innovation by accelerating workflows and simplifying data management. ActiveStor® appliances leverage the patented PanFS® storage operating system and DirectFlow® protocol to deliver high performance and reliability at scale from an appliance that is as easy to manage as it is fast to deploy. With flash technology speeding small file and metadata performance, ActiveStor provides significantly improved file system responsiveness while accelerating time-to-results. Based on a fifth-generation storage blade architecture and the proven Panasas PanFS storage operating system, ActiveStor offers an attractive low total cost of ownership for the energy, government, life sciences, manufacturing, media, and university research markets.

Source: Microway Website

Future Ahead

Datasets are growing enormously, and there will be no end to it. HPC storage must be able to process data fast enough to keep compute efficiency at peak levels. HPC storage should climb from petascale to exascale. It must have robust built-in security, be fault-tolerant, be modular in design, and, most importantly, scale seamlessly. HPC storage based on hybrid cloud technology is a sensible path ahead; however, efforts must be geared toward controlling its components at runtime. Further, focus should also be on dynamic marshaling via applet provisioning and a built-in automation engine. This will improve compute performance and reduce costs.
