D3 Center, The University of Osaka » Blog Archive

VCC

This service has ended.

System overview

The PC cluster for large-scale visualization (VCC) is a cluster system composed of 65 nodes. Each node has two Intel Xeon E5-2670v2 processors and 64 GB of main memory. The 65 nodes are interconnected by InfiniBand FDR to form a cluster.

This system also adopts ExpEther, a system hardware virtualization technology. Each node can be connected to extension I/O nodes holding GPU and SSD resources over a 20 Gbps ExpEther network. A major characteristic of this cluster system is that it can be reconfigured to suit a user's usage and purpose by changing the combination of nodes and extension I/O nodes.

In our center, a 2 PB storage system is managed with ScaTeFS (Scalable Technology File System), a fast distributed parallel file system developed by NEC. The large-scale computing systems, including VCC, can access this storage system.

【TIPS】ScaTeFS

ScaTeFS is a file system that achieves a higher degree of fault tolerance by eliminating single points of failure and introducing automatic recovery. ScaTeFS also offers fast parallel I/O and efficient bulk operations by distributing metadata and data evenly across multiple I/O servers.

【TIPS】System hardware virtualization technology ExpEther

ExpEther virtualizes PCI Express over Ethernet. This technology allows the internal bus of an ordinary computer to be extended over Ethernet.

Node performance

Each node has two Intel Xeon E5-2670v2 processors and 64 GB of main memory. An Intel Xeon E5-2670v2 processor has 10 cores running at an operating frequency of 2.5 GHz and delivers 200 GFlops. Thus, a single node has 20 cores in total and delivers 400 GFlops.

Internode communication takes place on an InfiniBand FDR network which provides 56 Gbps (bi-directional) as maximum transfer capability.

	1node (vcc)
# of processor (core)	2(20)
main memory	64GB
disk	1TB
performance	400GFlops

System Performance

The PC cluster for the large-scale visualization (VCC) is composed of 65 nodes. From this, the theoretical peak performance is calculated as follows.

	PC cluster for large-scale visualization (VCC)
# of processor (core)	130(1300)
main memory	4.160TB
disk	62TB
performance	26.0TFlops
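
The peak figures in the tables follow from simple arithmetic, which the following C sketch reproduces. The value of 8 floating-point operations per core per cycle is not stated above; it is an assumption implied by the quoted 200 GFlops for a 10-core processor at 2.5 GHz (200 / (2.5 × 10) = 8). The function names are illustrative only.

```c
/* Peak GFlops of one processor:
   clock (GHz) x number of cores x floating-point operations per core per cycle.
   The flops-per-cycle value is an assumption derived from the quoted figures. */
double processor_peak_gflops(double clock_ghz, int cores, int flops_per_cycle)
{
    return clock_ghz * cores * flops_per_cycle;
}

/* Peak GFlops of one node built from identical processors. */
double node_peak_gflops(double processor_gflops, int processors_per_node)
{
    return processor_gflops * processors_per_node;
}

/* Peak TFlops of a cluster built from identical nodes. */
double system_peak_tflops(double node_gflops, int nodes)
{
    return node_gflops * nodes / 1000.0;
}
```

With the VCC values, processor_peak_gflops(2.5, 10, 8) gives 200 GFlops, node_peak_gflops(200.0, 2) gives 400 GFlops, and system_peak_tflops(400.0, 65) gives 26.0 TFlops.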

Also, the following reconfigurable resources can be attached to each node of VCC through the use of ExpEther, a system hardware virtualization technology, according to the users’ needs. At present, the reconfigurable resources are as below.

【TIPS】Precautions in attaching reconfigurable resources

Not all reconfigurable resources can be attached to a node, and the maximum number of resources that can be attached to a node is limited. However, an arbitrary combination of resources such as SSD and GPU is possible.

Reconfigurable resources	Total number of resources	Total performance / capacity
GPU:NVIDIA Tesla K20	59	69.03TFlops
SSD:ioDrive2(365GB)	4	1.46TB
Storage:PCIe SAS (36TB)	6	216 TB

Software

VCC in our center runs CentOS 6.4 as its operating system. Therefore, those who use a Linux-based OS for program development can easily develop programs on, and port programs to, this cluster.

At present, the following software is available.

Software
Intel Compiler (C/C++, Fortran)
Intel MPI(Ver.4.0.3)
Intel MKL (Ver.10.2)
AVS/Express PCE
AVS/Express MPE (Ver.8.1)
Gaussian09
GROMACS
OpenFOAM
LAMMPS

Programming language
C/C++(Intel Compiler/GNU Compiler)
FORTRAN(Intel Compiler/GNU Compiler)
Python
Octave
Julia

An integrated scheduler composed of Job manipulator and NQSII is used for job management on VCC.

Scheduler

Performance tuning on VCC

Performance tuning on VCC can be categorized roughly into intra-node parallelism (shared-memory parallel processing) and inter-node parallelism (distributed memory parallel processing). The former parallelism distributes the computational workload to 20 cores on a single VCC node, and the latter parallelism to multiple nodes.

To harness the inherent computational performance of VCC, users need to use intra-node and inter-node parallelism together. In particular, since a single VCC node has a large number of cores, intra-node parallelism across the 20 cores is efficient and effective when the computation fits within the 64 GB main memory.

Further performance tuning is possible by taking advantage of the reconfigurable resources of VCC. For example, GPUs can provide additional computational performance and SSDs can accelerate I/O.

Intra-node parallelism (shared-memory parallelism)

The main characteristic of this parallelism is that the 20 cores of a node share a single 64 GB main memory address space. In general, shared-memory parallel processing is easier to program and tune than distributed-memory parallel processing. The same holds on VCC: intra-node parallelism is easier than inter-node parallelism.

Representative techniques for intra-node parallelism on VCC are OpenMP and threads (pthread).

【TIPS】Shared-memory parallel processing with OpenMP

OpenMP defines a set of standard APIs (Application Programming Interfaces) for shared-memory parallel processing and targets shared-memory architecture systems. As the code below shows, users only have to insert compiler directives, which are instructions to the compiler, into the source code and then compile it. The compiler then automatically generates an executable that runs with multiple threads. OpenMP is a parallel processing technique that beginners in parallel programming can take up with relative ease. Source code can be written in Fortran, C, or C++.
More detailed information on OpenMP can be obtained from books, articles, and the Internet.

Example 1 (C):
#pragma omp parallel
{
    /* the iterations of the loop are divided among the threads */
    #pragma omp for
    for (i = 0; i < 1000; i++)
        x[i] = y[i] + z[i];
}

Example 2 (Fortran):
!$omp parallel
!$omp do
do i = 1, 1000
    x(i) = y(i) + z(i)
enddo
!$omp end do
!$omp end parallel

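
As a slightly fuller sketch of the same idea, the C functions below wrap the loop of Example 1 in a function and add a second loop that sums an array. The function names are illustrative, and the reduction clause is an addition not covered in the text above: it gives each thread a private partial sum and combines the partial sums when the loop ends. With GCC, OpenMP is enabled by the -fopenmp option; the equivalent Intel compiler option depends on the compiler version. If the option is omitted, the directives are ignored and the code runs serially with the same result.

```c
/* x[i] = y[i] + z[i] for i = 0 .. n-1.
   The loop iterations are divided among the OpenMP threads. */
void vector_add(double *x, const double *y, const double *z, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] = y[i] + z[i];
}

/* Returns a[0] + ... + a[n-1].
   reduction(+:sum) gives every thread its own partial sum and
   adds the partial sums together at the end of the loop. */
double vector_sum(const double *a, int n)
{
    double sum = 0.0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Without the reduction clause, all threads would update the shared variable sum at the same time, which is a data race.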

【TIPS】Shared-memory parallel processing with threads

A thread, sometimes called a lightweight process, is a fine-grained unit of execution. Context switches between threads are faster than context switches between processes, which can speed up computation.

Importantly, when using threads on VCC, users must declare the use of all CPU cores as follows before submitting a job request:

#PBS -l cpunum_job=20
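
The following is a minimal sketch of thread-based parallelism with POSIX threads (pthread) in C, assuming up to 20 threads to match the 20 cores requested above. The names threaded_sum, sum_chunk, struct chunk, and MAX_THREADS are hypothetical and chosen for illustration. Each thread sums its own contiguous part of an array, and the calling thread adds up the partial results after joining the threads. With GCC, such code is typically compiled with the -pthread option.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 20   /* assumption: one thread per core of a VCC node */

/* Work description for one thread: sum data[begin .. end-1]. */
struct chunk {
    const double *data;
    size_t begin;
    size_t end;
    double partial;      /* result written by the thread */
};

/* Thread entry point: sums the chunk passed in arg. */
static void *sum_chunk(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        s += c->data[i];
    c->partial = s;
    return NULL;
}

/* Returns data[0] + ... + data[n-1], computed by up to MAX_THREADS threads. */
double threaded_sum(const double *data, size_t n, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct chunk chunks[MAX_THREADS];
    int started[MAX_THREADS];
    double total = 0.0;

    if (nthreads < 1)
        nthreads = 1;
    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    for (int t = 0; t < nthreads; t++) {
        /* contiguous block decomposition of the index range */
        chunks[t].data = data;
        chunks[t].begin = n * t / nthreads;
        chunks[t].end = n * (t + 1) / nthreads;
        chunks[t].partial = 0.0;
        started[t] = (pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]) == 0);
        if (!started[t])
            sum_chunk(&chunks[t]);   /* fall back to the calling thread */
    }

    for (int t = 0; t < nthreads; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += chunks[t].partial;
    }
    return total;
}
```

Because every thread writes only to its own struct chunk, no locking is needed in this sketch. Code in which threads update shared data would need synchronization such as a mutex.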

Inter-node parallelism (distributed memory parallel processing)

The main characteristic of this parallelism is that it uses multiple memory address spaces on different nodes rather than a single shared memory address space. This characteristic is the reason why distributed-memory parallel processing is more difficult than shared-memory parallel processing.

The representative technique for inter-node parallelism on VCC is MPI (Message Passing Interface).

【TIPS】Distributed parallel processing by MPI

MPI (Message Passing Interface) provides a set of libraries and APIs for distributed parallel programming based on message passing. Designed around the communication patterns that commonly occur in distributed-memory parallel processing environments, MPI offers intuitive APIs for point-to-point communication and for collective communication, in which multiple processes take part, such as MPI_Bcast and MPI_Reduce. Under MPI parallelism, developers must write source code with data movement and workload distribution among MPI processes in mind. MPI is therefore somewhat more difficult for beginners. However, once developers reach an intermediate or advanced level and can write source code with the hardware architecture and other characteristics in mind, they will be able to write code that runs faster than code written with HPF (High Performance Fortran). Furthermore, MPI has become a de facto standard used by many computer simulations and analyses. Moreover, because today's computer architectures are increasingly cluster-based, mastering MPI is recommended.
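
The two collective operations named above can be sketched in a short C program. This is a minimal illustration that assumes a working MPI installation; the problem size and the series being summed are arbitrary choices for the example. Rank 0 broadcasts the problem size with MPI_Bcast, every process sums its own share of the terms, and MPI_Reduce combines the partial sums on rank 0.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    long n = 0;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* ID of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (rank == 0)
        n = 1000000;   /* arbitrary problem size, known only to rank 0 at first */

    /* Data movement: rank 0 sends n to every process. */
    MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    /* Workload distribution: this process handles indices lo .. hi-1. */
    long lo = n * rank / size;
    long hi = n * (rank + 1) / size;
    for (long i = lo; i < hi; i++)
        local += 1.0 / ((double)(i + 1) * (double)(i + 1));

    /* Collective communication: add all partial sums on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1/k^2 for k = 1..%ld: %f\n", n, total);

    MPI_Finalize();
    return 0;
}
```

MPI programs are typically compiled with a wrapper compiler supplied by the MPI installation, such as mpicc. The launch command and the number of processes depend on the MPI library and on the job scheduler in use.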

Acceleration using GPU

In VCC, 59 NVIDIA Tesla K20 GPUs are provided as reconfigurable resources. For example, by attaching 2 GPUs to each of 4 VCC nodes, users can accelerate computation with 8 GPUs. CUDA (Compute Unified Device Architecture) is provided as the development environment in our center.

【TIPS】GPU parallel processing with CUDA

CUDA (Compute Unified Device Architecture) is the integrated development environment for GPU computing provided by NVIDIA. It includes libraries, a compiler, a debugger, and other tools based on C. Therefore, those with experience in C can readily take up GPU programming.
More detailed information is available here.
