\documentstyle[12pt,twoside]{article}
\textwidth18cm
\textheight25cm
\oddsidemargin-1cm
\evensidemargin-1cm
\topmargin-2cm
\setlength{\parindent}{0pt}
\sloppy
\begin{document}
\title{\bf Comparison between a {\sf CM2} and a {\sf CM5}}
\author{Hans Herrmann, Stephan Melin and Peter Ossadnik\\ HLRZ\\ KFA-J\"ulich\\ Postfach 1913\\ D-W-5170 J\"ulich\\ Germany}
\maketitle
\section{Architectures}
\subsection{{\sf CM2}}
The {\sf CM2} at GMD (cm2sun2.gmd.de) is a 16k processor machine with 32-bit Weitek floating point accelerators. Each floating point accelerator is shared by 32 processors. During the day the machine is partitioned into two 8k node machines, one partition running under timesharing, the other running in exclusive mode. At night the machine runs as a single 16k node partition in exclusive mode. The clock speed is 7 MHz.
\subsection{{\sf CM5}}
The {\sf CM5} at IPG in Paris (panoramix-0.ipgp.jussieu.fr) is a 128 node machine with vector units, which is partitioned into two 32 node partitions and one 64 node partition. Each node has 4 vector units and one Sparc processor (the same one that is used in the SparcStation 2).
\section{Performance}
\subsection{Performance for a molecular dynamics simulation, using {\sf C*}}
\subsubsection{Program description}
This program simulates a two dimensional dense packing of nearly equally sized spheres (triangular lattice). The particles interact via the Hertz contact law and a velocity dependent dissipation (normal and tangential). The numerical method used is a 5th order predictor-corrector integration, which means that the forces between neighbouring particles have to be calculated at every time step. The neighbourhood of each particle is kept fixed. Most of the computational time is consumed in the force calculation (floating point), and a moderate but not negligible amount of time is consumed by the communications.
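
\medskip
To make explicit why a force evaluation is needed at every step, the structure of one predictor-corrector step can be sketched as follows. This is only an illustration under the assumption of a Gear-type scheme, since the particular variant is not named here; the symbols $r_k$, $c_k$, $\Delta t$, $m$ and $F$ are notation introduced for this sketch. With the scaled derivatives $r_k = (\Delta t^k / k!)\, d^k r / dt^k$ for $k = 0, \dots, 5$, the predictor is a Taylor expansion,
\[ r_k^{p}(t+\Delta t) = \sum_{j=k}^{5} {j \choose k}\, r_j(t) . \]
The forces are then evaluated at the predicted positions and velocities and compared with the predicted acceleration,
\[ \Delta = \frac{\Delta t^2}{2}\, \frac{F(r_0^{p}, r_1^{p})}{m} - r_2^{p}(t+\Delta t) , \]
and the corrector adds a multiple of this difference to each predicted value,
\[ r_k(t+\Delta t) = r_k^{p}(t+\Delta t) + c_k\, \Delta , \]
with fixed coefficients $c_k$ that depend on the chosen scheme. In such a scheme each time step costs one evaluation of all pair forces, which is consistent with the force calculation dominating the runtime.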
\subsubsection{Porting}
The port from the {\sf CM2} version to the {\sf CM5} version turned out to be relatively simple, except that some functionality of the {\sf C*} compiler ({\sf C* Rel. 7.0 Beta}) was still missing and some declarations in the header files were wrong. The workarounds for these limitations proved to be very simple and had no effect on the timings.
\subsubsection{Performance}
The system size is 16384 particles; one cycle consists of 2048 predictor-corrector steps.

\bigskip
\begin{tabular}{|l||l|l|}
\hline
 & {\sf CM2} & {\sf CM5} \\ \hline \hline
compiler & {\sf C* 6.0.3} & {\sf C* 7.1 Beta} \\ \hline
options & {\sf -O} & NA \\ \hline
\end{tabular}

\bigskip
\begin{tabular}{|l||rr|rr|}
\multicolumn{5}{l}{Runtimes per cycle:} \\ \hline
 & \multicolumn{2}{c|}{single precision} & \multicolumn{2}{c|}{double precision} \\ \hline \hline
{\sf CM2} at the GMD & 16k: & 42.15 sec & & \\
 & 8k: & 67.23 sec & & \\ \hline
{\sf CM2} with 64 bit FPU $\star$ & 16k: & 39.76 sec & 16k: & 69.17 sec \\
 & 8k: & 61.17 sec & 8k: & 113.58 sec \\ \hline
{\sf CM5} at the IPG & 64n: & 244.20 sec & 64n: & 290.24 sec \\
 & 32n: & 466.14 sec & 32n: & 557.14 sec \\ \hline
{\sf CM5} at the IPG with VU & 64n: & 92.56 sec & 64n: & 88.84 sec \\
 & 32n: & 123.00 sec & 32n: & 116.76 sec \\ \hline
\multicolumn{5}{l}{$\star$ courtesy of Martin Frick} \\
\multicolumn{5}{l}{The timings were done using the CM\_timer routines.}\\
\end{tabular}
\subsubsection{Interpretation of benchmarks}
When comparing the timings for single and double precision one clearly sees that for the {\sf CM5} there is no penalty for using 64 bit floats; in fact the vector units are more efficient with doubles than with singles. On the {\sf sparc} nodes (without {\sf VU}) singles are slightly more efficient than doubles. On the {\sf CM2} things are different: even when a {\sf CM2} is equipped with 64 bit floating point accelerators, double precision arithmetic is still almost two times slower than single precision (a factor of about 1.7 to 1.9 in the table above).
With 32 bit floating point accelerators, 64 bit arithmetic would have to be emulated in software, which would be at least 20 times slower.

When comparing the {\sf CM2} and the {\sf CM5} one gets approximately the same performance for a 32 node {\sf CM5} with vector units and an 8192 processor {\sf CM2} with 64 bit FPUs, using double precision. A 64 node {\sf CM5} with vector units is needed to achieve roughly the same performance as an 8192 processor {\sf CM2} with 32-Bit FPUs using single precision.

Using a {\sf CM5} without vector units would not make much sense, unless one can use a machine with a very large number of nodes. Extrapolating the timings without vector units, under the assumption that the runtime scales inversely with the number of nodes, a 160 node {\sf CM5} would compare to an 8192 processor {\sf CM2} with 64-Bit FPUs using double precision, and a 224 node {\sf CM5} to an 8192 processor {\sf CM2} with 32-Bit FPUs using single precision.
\subsection{Lattice model to simulate crack growth under the influence of a shock wave}
\subsubsection{Description of the model}
We simulate a triangular network of massive particles connected by elastic springs, whose spring constants are chosen randomly to simulate spatial disorder. At the beginning of the simulation a singular force, like an explosion, is applied to the central sites of the lattice, which results in an outward travelling shock wave. Under the influence of this shock wave the bonds with the largest stresses are broken. This results in a fractal crack network. The method used to solve this dynamical problem is a discrete time integration of Newton's equation with a time step $dt=1$, which resembles a ``cellular automaton''.
\subsubsection{Implementation on the {\sf CM2}}
This model is implemented on a {\sf CM2} in a straightforward way: for each site one determines in parallel the forces exerted by the neighboring sites and updates the site position. This is very elegantly expressed in {\sf CM-Fortran}.
Since a triangular lattice is used, only nearest and next-nearest neighbor communication is required, which is easily done via {\tt cshift} calls.
\subsubsection{Porting to the {\sf CM5}}
The first step of the porting was hampered by the fact that new users on the IPG machine did not have a working programming environment. After all necessary program and library paths had been included, the actual compilation of the {\sf CM-Fortran} program was simple. The command
\begin{verbatim}
cmf -vu -O program.fcm
\end{verbatim}
compiled the old Fortran code without problems.
\subsubsection{Performance}
In the following we show the total execution times for 200 time steps in a $512 \times 512$ system for different machine configurations. Since the {\sf CM-Fortran} compiler on the {\sf CM2} had some trouble generating efficient code, we also show the execution time of a highly optimized {\sf Fortran-PARIS} version of the same program. In the {\sf Fortran-PARIS} version we also used pipelined commands like {\tt cm\_f\_mult\_add}. We have to stress, however, that the pure Fortran program used on the {\sf CM5} is exactly the same program as the one used on the {\sf CM2}. The {\sf CM2} numbers refer to {\sf PARIS} only, i.e. using the {\sf CMF PARIS} compiler.

\medskip
({\sf CM5}: {\sf CMOST 7.2 S 2}) original program version:\\
\begin{tabular}{|l|l|rl|} \hline
{\sf CM5} & 64PN with vector units & 35 & seconds \\
{\sf CM5} & 64PN without vector units & 309 & seconds \\
{\sf CM2} & 8k procs, pure CM-Fortran: & 376 & seconds \\
{\sf CM2} & 8k procs, Fortran-Paris: & 114 & seconds \\ \hline
\end{tabular} \\
All timings were done using single precision.

\medskip
The Fortran program includes the calculation of global sums and spreads along a short lattice axis (only six elements in this dimension) which is laid out {\sf SERIALLY} on one processor ({\tt m(6,512,512); m(:serial,:news,:news)}). Thus in principle no communication is required for these operations.
However, the {\sf CM-Fortran} compiler on the {\sf CM5} does not recognize this fact. The computation can be sped up by almost a factor of two by writing out the sums and spreads along the serial axis element by element:
\begin{verbatim}
s = sum(m(1:6,:,:),dim=1)        => s = m(1,:,:)+...+m(6,:,:)
spread(m(1,:,:),dim=1,ncopies=6) => tmp(1,:,:) = m(1,:,:)
                                    tmp(6,:,:)=m(1,:,:)
\end{verbatim}
Using these optimizations we now give the execution times of the pure {\sf CM-Fortran} code on a {\sf CM2} with a 64 bit FPU under the slicewise and the Paris model, and on a {\sf CM5} with recent versions of the system software:

{\sf CM2} ({\tt cmf -cm2 -O -paris | -slicewise}) optimized version:\\
\begin{tabular}{|c||rl|c|rl|rl|} \hline
 & \multicolumn{3}{c|}{8k} & \multicolumn{4}{c|}{16k} \\ \hline
 & \multicolumn{2}{c|}{ slicewise } & \multicolumn{1}{c|}{ Paris } & \multicolumn{2}{c|}{ slicewise } & \multicolumn{2}{c|}{ Paris } \\ \hline
single & 61.7 & sec & 108.2 sec & 31.7 & sec & 54.6 & sec \\ \hline
double & 92.8 & sec & (not enough memory) & 48.1 & sec & 108.1 & sec \\ \hline
\end{tabular} \\
All these timings were performed by Martin Frick on a {\sf CM2} in Cambridge/Mass.

\medskip
{\sf CM5} ({\tt cmf -cm5 -vu -O}) optimized version:\\
\begin{tabular}{|c||rl|rl|rl|rl|} \hline
 & \multicolumn{4}{c|}{32 Nodes} & \multicolumn{4}{c|}{64 Nodes} \\ \hline
{\sf CMOST} & \multicolumn{2}{c|}{\sf 7.2 S 2} & \multicolumn{2}{c|}{\sf 7.2 beta 1.1-p4} & \multicolumn{2}{c|}{\sf 7.2 S 2} & \multicolumn{2}{c|}{\sf 7.2 beta 1.1-p4} \\ \hline
single & 39 & sec & 31.2 & sec \ddag & 26 & sec & 15.0 & sec \ddag \\ \hline
double & & & 31.1 & sec \ddag & & & 14.8 & sec \ddag \\ \hline
\end{tabular} \\
\ddag These timings were performed by Martin Frick on a {\sf CM5} in Cambridge/Mass.
\subsubsection{Interpretation}
These execution times show that the {\sf CM-Fortran} compiler for the {\sf CM5} is able to produce much better code than the compiler for the {\sf CM2}.
From the original timing data under the old operating system {\sf CMOST 7.2 S 2}, and not taking into account the complication with sums and spreads, we see that the performance of the simple {\sf CM-Fortran} program on a {\sf CM5} with 32PN is comparable to the performance of the highly optimized {\sf PARIS} program on a {\sf CM2} with 16k processors.

\noindent The newest version of the system software, {\sf CMOST 7.2 beta 1.1-p4}, leads to a speedup of about 25\% on 32 nodes and about 75\% on 64 nodes.
\subsection{Boltzmann lattice gas for shear thinning}
\subsubsection{Description of the model}
We simulate the motion of a fluid on a triangular lattice by means of a Boltzmann lattice gas. On each site of the lattice one has six real variables describing the density of particles going into each of the six lattice directions. After each propagation step one has a ``collision'' step in which the six densities at each site are updated according to a rule that assures momentum and mass conservation. This method was developed by E. Flekk\o y to include the effects of shear thinning and eventually plug flow in non-Newtonian fluids.
\subsubsection{Implementation on the {\sf CM2}}
The program was written by E. Flekk\o y in {\sf cmf} using only one {\sf PARIS} statement. For convenience {\tt cshift} was used with variable shift length. By replacing parallel constructs on a serial axis and changing these {\tt cshifts}, M. Frick reduced the runtime of the program by about 30\%.
\subsubsection{Porting to the {\sf CM5}}
The first attempt to run the program in November 1992 failed, and we were told by AE R. Bourbonnais that the cause was probably an error in the {\sf cmf} compiler of the {\sf CM5} at IPG. Some months later the program ran without problems (of course after removing the {\sf PARIS} statements).
\subsubsection{Performance}
Lattice of size $1024 \times 1024$, $50$ iteration steps. I/O routines were removed for the measurement.

\medskip
{\sf CM2} 8k 32-Bit FPU ({\tt cmf -O}):\\
\begin{tabular}{|l||rl|} \hline
 & \multicolumn{2}{c|}{single} \\ \hline
old version & 77.64 & sec \\
optimized version & 53.21 & sec \\ \hline
\end{tabular}

\medskip
{\sf CM2} 64-Bit FPU ({\tt cmf -O -slicewise}):\\
\begin{tabular}{|r||rl|rl|} \hline
 & \multicolumn{2}{c|}{single} & \multicolumn{2}{c|}{double} \\ \hline
8k & 47.7 & sec & 62.8 & sec \dag \\
16k & 24.0 & sec & 31.4 & sec \\ \hline
\end{tabular} \\
\dag Please note that this value has been extrapolated.

\medskip
{\sf CM5} ({\tt cmf -cm5 -vu -O}) optimized version:\\
\begin{tabular}{|c||rl|rl|rl|rl|} \hline
 & \multicolumn{4}{c|}{32 Nodes} & \multicolumn{4}{c|}{64 Nodes} \\ \hline
{\sf CMOST} & \multicolumn{2}{c|}{\sf 7.2 beta 1} & \multicolumn{2}{c|}{\sf 7.2 beta 1.1-p4} & \multicolumn{2}{c|}{\sf 7.2 beta 1} & \multicolumn{2}{c|}{\sf 7.2 beta 1.1-p4} \\ \hline
single & 30.5 & sec & 26.7 & sec \ddag & 15.1 & sec & 13.2 & sec \ddag \\ \hline
double & 24.1 & sec & 21.4 & sec \ddag & 12.1 & sec & 10.6 & sec \ddag \\ \hline
\end{tabular} \\
\ddag These timings were performed on a {\sf CM5} in Cambridge/Mass.
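
\medskip
The ratios quoted for this code can be reproduced from the runtimes listed so far. The worked arithmetic below assumes that a performance ratio $A/B$ means the runtime of $B$ divided by the runtime of $A$; this convention is not stated explicitly, but it matches the quoted values. The 32 node and 8k entries are used wherever two partition sizes are available, and all runtimes are in seconds:\\
\begin{tabular}{ll}
{\sf CM5} ({\sf CMOST 7.2 beta 1.1-p4}), double / single: & $26.7 / 21.4 \approx 1.25$ \\
{\sf CM2} with 64-Bit FPU, double / single: & $47.7 / 62.8 \approx 0.76$ \\
{\sf CM5}, {\sf CMOST 7.2 beta 1.1-p4} / {\sf 7.2 beta 1}, single: & $30.5 / 26.7 \approx 1.14$ \\
{\sf CM5} 32PN / {\sf CM2} 8k with 32-Bit FPU, single: & $53.21 / 26.7 \approx 1.99$ \\
{\sf CM5} 64PN / {\sf CM2} 16k with 64-Bit FPU, double: & $31.4 / 10.6 \approx 2.96$ \\
\end{tabular}
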
\medskip The performance ratios for the present code are:\\ \begin{tabular}{|cl|c|} \hline & {\sf CM5} & 1.25 \\ \raisebox{1.5ex}[-1.5ex]{double / single} & {\sf CM2} with 64-Bit FPU & 0.76 \\ \hline \multicolumn{2}{|c|}{\sf CM5 CMOST 7.2 beta 1.1-p4 / CM5 CMOST 7.2 beta 1} & 1.14 \\ \hline \end{tabular} \medskip Comparing the {\sf CM5} ({\sf CMOST 7.2 beta 1.1-p4}) and the {\sf CM2}:\\ \begin{tabular}{|lc|c|} \hline single precision: & {\sf CM5} 32PN / {\sf CM2} 8k with 32-Bit FPU & 2.0 \\ double precision: & {\sf CM5} 64PN / {\sf CM2} 16k with 64-Bit FPU & 3.0 \\ \hline \end{tabular} \section{Summary} As a summary we compare in a short form the performance between the different applications on the {\sf CM2} and the {\sf CM5}: \medskip For single precision we get ({\sf CM2} with 32-Bit FPU): \\ \begin{tabular}{|l|rcc|c|} \hline Molecular dynamics & $0.7$ & $*$ & 8k {\sf CM2} = 64 PN {\sf CM5} & {\sf C*} \\ Shock waves & $3.3$ & $*$ & 8k {\sf CM2} = 64 PN {\sf CM5} & {\sf cmf} \\ Boltzmann & $4.0$ & $*$ & 8k {\sf CM2} = 64 PN {\sf CM5} & {\sf cmf} \\ \hline \end{tabular} \\ For double precision we get ({\sf CM2} with 64-Bit FPU): \\ \begin{tabular}{|l|rcc|c|} \hline Molecular dynamics & $1.3$ & $*$ & 8k {\sf CM2} = 64 PN {\sf CM5} & {\sf C*} \\ Shock waves & $6.3$ & $*$ & 8k {\sf CM2} = 64 PN {\sf CM5} & {\sf cmf, slicewise} \\ Boltzmann & $5.9$ & $*$ & 8k {\sf CM2} = 64 PN {\sf CM5} & {\sf cmf} \\ \hline \end{tabular} \section{Acknowledgements} We would like to thank Martin Frick from {\sf Thinking Machines Corp.} for his support while porting the programs from the {\sf CM2} to the {\sf CM5} and doing the timings on the {\sf CM2} with 64-Bit FPU and the {\sf CM5} in Cambridge/Mass.\\ We also would like to thank the GMD at St.Augustin for letting us use their {\sf CM2} and the IPG in Paris for letting us use their {\sf CM5}.\\ \bigskip We have used the new {\sf CM5} accepting that the following will be added to all benchmarking reports:\\ {\em \footnotesize 
\noindent Statement from Thinking Machines Corporation:\\ {\tt ``These results are based upon a test version of the software where the emphasis was on providing functionality and the tools necessary to begin testing the {\sf CM5} with vector units. This software release has not had the benefit of optimization or performance tuning and, consequently, is not necessarily representative of the performance of the full version of this software.'' } } \end{document}