The PC cluster for large-scale visualization (VCC) is a cluster system composed of 65 nodes. Each node has two Intel Xeon E5-2670v2 processors and 64 GB of main memory. These 65 nodes are interconnected by InfiniBand FDR and form a cluster.
The system also uses ExpEther, a system hardware virtualization technology. Each node can be connected to extension I/O nodes that provide GPU and SSD resources over a 20 Gbps ExpEther network. A major characteristic of this cluster system is that it can be reconfigured to suit a user's usage and purpose by changing the combination of nodes and extension I/O nodes.
In our center, a 2 PB storage system is managed with ScaTeFS (Scalable Technology File System), a fast distributed parallel file system developed by NEC. The large-scale computing systems, including VCC, can access this storage system.
Each node has two Intel Xeon E5-2670v2 processors and 64 GB of main memory. An Intel Xeon E5-2670v2 processor has 10 cores, runs at an operating frequency of 2.5 GHz, and has a peak performance of 200 GFlops. Thus, a single node has 20 cores in total and a peak performance of 400 GFlops.
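The 200 GFlops figure is consistent with the usual peak estimate of clock frequency times number of cores times floating-point operations per cycle. The value of 8 double-precision operations per cycle per core (one 4-wide AVX addition and one 4-wide AVX multiplication) is an assumption about this processor generation; it is not stated in the text above.

```latex
\begin{align*}
P_{\text{processor}} &= 2.5\ \text{GHz} \times 10\ \text{cores} \times 8\ \text{flops/cycle} = 200\ \text{GFlops} \\
P_{\text{node}} &= 2 \times P_{\text{processor}} = 400\ \text{GFlops}
\end{align*}
```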
Inter-node communication takes place on an InfiniBand FDR network, which provides a maximum transfer capability of 56 Gbps (bi-directional).
|Number of processors (cores)||2 (20)|
The PC cluster for large-scale visualization (VCC) is composed of 65 nodes. From this, the theoretical peak performance of the whole system is calculated as follows.
|PC cluster for large-scale visualization (VCC)|
|Number of processors (cores)||130 (1300)|
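Using the per-node figures given earlier (2 processors, 20 cores, 64 GB of main memory, and 400 GFlops per node), the totals for the 65 nodes work out as follows. This is a sketch of the arithmetic based on those figures, not a quoted specification.

```latex
\begin{align*}
N_{\text{processors}} &= 65 \times 2 = 130 \\
N_{\text{cores}} &= 65 \times 20 = 1300 \\
M_{\text{total}} &= 65 \times 64\ \text{GB} = 4160\ \text{GB} \\
P_{\text{VCC}} &= 65 \times 400\ \text{GFlops} = 26\,000\ \text{GFlops} = 26\ \text{TFlops}
\end{align*}
```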
In addition, the following reconfigurable resources can be attached to each VCC node through ExpEther according to users' needs. The resources currently available are listed below.
Not every reconfigurable resource can be attached to a single node at once, because the maximum number of resources that can be attached to a node is limited. Within that limit, an arbitrary combination of resources such as SSD and GPU is possible.
|Reconfigurable resource||Total number||Total performance or capacity|
|GPU: NVIDIA Tesla K20||59||69.03 TFlops|
|Storage: PCIe SAS (36 TB)||6||216 TB|
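The totals in the table follow from per-device figures. The 1.17 TFlops value used below is the double-precision peak commonly quoted for a Tesla K20 and is an assumption here, since the table lists only the total.

```latex
\begin{align*}
P_{\text{GPU, total}} &= 59 \times 1.17\ \text{TFlops} = 69.03\ \text{TFlops} \\
C_{\text{storage, total}} &= 6 \times 36\ \text{TB} = 216\ \text{TB}
\end{align*}
```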
VCC in our center runs CentOS 6.4 as its operating system. Users who already develop programs on a Linux-based OS can therefore easily develop and port their programs on this cluster.
At present, the following software is available.
|Intel Compiler (C/C++, Fortran)|
|Intel MKL (Ver.10.2)|
|AVS/Express MPE (Ver.8.1)|
|C/C++(Intel Compiler/GNU Compiler)|
|FORTRAN(Intel Compiler/GNU Compiler)|
An integrated scheduler composed of Job manipulator and NQSII is used for job management on VCC.
Performance tuning on VCC
Performance tuning on VCC can be categorized roughly into intra-node parallelism (shared-memory parallel processing) and inter-node parallelism (distributed-memory parallel processing). The former distributes the computational workload across the 20 cores of a single VCC node, and the latter distributes it across multiple nodes.
To harness the inherent performance of VCC, users need to use intra-node and inter-node parallelism together. In particular, since a single VCC node has a large number of cores, intra-node parallelism across all 20 cores is efficient and effective when the computation fits within the 64 GB of main memory.
Further performance tuning is possible by taking advantage of the reconfigurable resources of VCC. For example, GPUs can provide additional computational performance, and SSDs can help accelerate I/O.
Intra-node parallelism (shared-memory parallelism)
The main characteristic of this parallelism is that the 20 cores of a single node share one 64 GB main memory address space. In general, shared-memory parallel processing is easier than distributed-memory parallel processing in terms of programming and tuning. This also holds on VCC: intra-node parallelism is easier than inter-node parallelism.
Representative techniques for intra-node parallelism on VCC are OpenMP and pthread.
OpenMP defines a set of standard APIs (Application Programming Interfaces) for shared-memory parallel processing and targets shared-memory architecture systems. For example, as the code below shows, users only have to insert compiler directives, which are instructions to the compiler, into the source code and then compile it. The compiler then automatically generates an executable that runs with multiple threads. OpenMP is a parallel processing technique that beginners in parallel programming can take up with relative ease. Source code can be written in Fortran, C, or C++.
More detailed information on OpenMP can be obtained from books, articles, and the Internet. Please consult these sources.
C:
#pragma omp parallel
{
#pragma omp for
  for (i = 1; i < 1000; i = i + 1) x[i] = y[i] + z[i];
}
Fortran:
!$omp parallel
!$omp do
do i = 1, 1000
  x(i) = y(i) + z(i)
end do
!$omp end do
!$omp end parallel
A thread, as used in pthread (POSIX threads), is often called a lightweight process and is a fine-grained unit of execution. Because a context switch between threads is faster than one between processes, using threads rather than processes on a CPU can speed up computation.
Importantly, when using threads on VCC, users have to declare the use of all 20 CPU cores of a node as follows before submitting a job request:
#PBS -l cpunum_job=20
Inter-node parallelism (distributed-memory parallel processing)
The main characteristic of this parallelism is that it uses multiple memory address spaces on different nodes rather than a single shared address space. This is why distributed-memory parallel processing is more difficult than shared-memory parallel processing.
The representative technique for inter-node parallelism on VCC is MPI (Message Passing Interface).
MPI provides a set of libraries and APIs for distributed parallel programming based on the message-passing model. Taking into account the communication patterns that commonly occur in distributed-memory parallel processing environments, MPI offers intuitive APIs for point-to-point communication and for collective communication, in which multiple processes are involved, such as MPI_Bcast and MPI_Reduce. With MPI, developers must write source code with data movement and workload distribution among MPI processes in mind, so MPI is somewhat more difficult for beginners. However, once developers become intermediate or advanced programmers who can take hardware architecture and other characteristics into account, they will be able to write code that runs faster than code written with HPF (High Performance Fortran). Furthermore, MPI has become a de facto standard used by many computer simulations and analyses. Moreover, today's computer architectures are increasingly cluster-based, so mastering MPI is recommended.
Acceleration using GPU
On VCC, 59 NVIDIA Tesla K20 GPUs are provided as reconfigurable resources. For example, by attaching 2 GPUs to each of 4 VCC nodes, users can accelerate computation using 8 GPUs. CUDA (Compute Unified Device Architecture) is provided as the development environment in our center.
More detailed information is available here.