Introduction to Parallel Programming

Overview of OpenMP and MPI

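To solve large computational problems quickly, it is necessary to take advantage of multiple cores on a CPU (central processing unit) and of multiple CPUs. Most programs written until now are sequential, and compilers will not typically generate parallel executables automatically, so programmers need to modify the original serial code to take advantage of the extra processing power. Two standards which specify what libraries that allow for parallel programming should do are OpenMP and MPI (the Message Passing Interface). In this section, we cover the minimal amount of information required to understand, run and modify the programs in this tutorial. More detailed tutorials can be found at https://computing.llnl.gov/tutorials/ and at http://www.citutor.org.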

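OpenMP is used for parallel programming on shared-memory architectures, in which each compute process has a global view of memory. It allows one to incrementally parallelize an existing Fortran, C or C++ code by adding directives to the original code, which makes it easy to use. However, some care is required to obtain good performance with OpenMP: it is straightforward to add directives to a serial code, but it takes careful thought to create a program which shows improved performance and still gives correct results when run in parallel. For the numerical solution of multidimensional partial differential equations on regular grids, it is easy to perform efficient and effective loop-based parallelism, so a complete understanding of all the features of OpenMP is not required. OpenMP typically allows one to use tens of computational cores, and in particular makes it possible to take advantage of multicore laptops, desktops and workstations.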

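MPI is used for parallel programming on distributed-memory architectures, in which each compute process has access only to its own local memory and must explicitly receive any data that resides in the memory of another process, which in turn has to send it. MPI is a library which allows one to parallelize Fortran, C and C++ programs by adding function calls that explicitly move data from one process to another. Careful thought is required when converting a serial program to a parallel MPI program, because the data needs to be decomposed onto the different processes, so it is usually difficult to parallelize a program incrementally with MPI. The best way to parallelize a program with MPI is problem dependent. When solving large problems, there is typically not enough memory on each process to simply replicate all the data. Thus one wants to split up the data (known as domain decomposition) in such a way as to minimize the amount of message passing required to perform the computation correctly. Programming this can be rather complicated and time consuming. Fortunately, by using the 2DECOMP&FFT library^[1], which is written on top of MPI, we can avoid having to program many of the data passing operations when writing Fourier spectral codes, and still benefit from being able to solve partial differential equations on up to $O(10^{5})$ processor cores.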
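
To make the idea of domain decomposition more concrete, the following is a minimal Python sketch of a one-dimensional block decomposition, in which a list of grid indices is split as evenly as possible among processes. It is only an illustration: it does not use MPI or 2DECOMP&FFT, and the function name `block_range` and the sizes (10 grid points, 4 processes) are hypothetical choices made for this example.

```python
def block_range(n_points, n_procs, rank):
    """Return the half-open index range [start, stop) owned by `rank`."""
    base, extra = divmod(n_points, n_procs)
    # The first `extra` ranks each take one additional point, so that the
    # block sizes differ by at most one.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


n_points, n_procs = 10, 4  # hypothetical sizes, chosen for illustration
ranges = [block_range(n_points, n_procs, r) for r in range(n_procs)]
for r, (start, stop) in enumerate(ranges):
    print(f"rank {r} owns indices {start}..{stop - 1}")
```

Each process then stores only its own block, and any value it needs from a neighbouring block has to be communicated explicitly.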

OpenMP

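Please read the tutorial at https://computing.llnl.gov/tutorials/openMP/, then answer the following questions.

OpenMP Exercises

What is OpenMP?
Download a copy of the latest OpenMP specification from www.openmp.org. What is the version number of the latest specification?
Explain what each of the following OpenMP directives does: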
1. !$OMP PARALLEL
2. !$OMP END PARALLEL
3. !$OMP PARALLEL DO
4. !$OMP END PARALLEL DO
5. !$OMP BARRIER
6. !$OMP MASTER
7. !$OMP END MASTER

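Try to understand and then run the "Hello World" program in Listing A using 1, 2, 6 and 12 threads. Put the output of each run in your solutions. The output will be in a file with a name of the form
helloworld.o**********
where the last entries above are digits corresponding to the number of the run. Listing B contains an example Makefile for compiling this program on Flux, and Listing C contains an example submission script. To change the number of OpenMP threads the program runs on from 2 to 6, change
ppn=2
to
ppn=6
and also change the value of the OMP_NUM_THREADS variable from
OMP_NUM_THREADS=2
to
OMP_NUM_THREADS=6
On Flux there is a maximum of 12 cores per node, so for most applications the largest number of threads worth using is 12.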

( A)

A Fortran program, taken from http://en.wikipedia.org/wiki/OpenMP, which demonstrates parallelism using OpenMP

!--------------------------------------------------------------------
!
!
! PURPOSE
!
! This program uses OpenMP to print hello world from all available
! threads
!
! .. Parameters ..
!
! .. Scalars ..
! id = thread id
! nthreads = total number of threads
!
! .. Arrays ..
!
! .. Vectors ..
!
! REFERENCES
! http://en.wikipedia.org/wiki/OpenMP
!
! ACKNOWLEDGEMENTS
! The program below was modified from one available at the internet
! address in the references. This internet address was last checked
! on 30 December 2011
!
! ACCURACY
!
! ERROR INDICATORS AND WARNINGS
!
! FURTHER COMMENTS
!
!--------------------------------------------------------------------
! External routines required
!
! External libraries required
! OpenMP library
PROGRAM hello90
USE omp_lib
IMPLICIT NONE
INTEGER:: id, nthreads
!$OMP PARALLEL PRIVATE(id)
id = omp_get_thread_num()
nthreads = omp_get_num_threads()
PRINT *, 'Hello World from thread', id
!$OMP BARRIER
IF ( id == 0 ) THEN
PRINT*, 'There are', nthreads, 'threads'
END IF
!$OMP END PARALLEL
END PROGRAM

( B)

An example Makefile for compiling the helloworld program in Listing A

# define the compiler
COMPILER = ifort
# compilation settings, optimization, precision, parallelization
# (newer Intel compilers spell the OpenMP flag -qopenmp)
FLAGS = -O0 -openmp

# libraries
LIBS =
# source list for main program
SOURCES = helloworld.f90

# recipe lines below must start with a tab character
test: $(SOURCES)
	${COMPILER} -o helloworld $(FLAGS) $(SOURCES)

clean:
	rm -f *.o

clobber:
	rm -f helloworld

( C)

An example submission script for use on Flux

#!/bin/bash
#PBS -N helloworld
#PBS -l nodes=1:ppn=2,walltime=00:02:00
#PBS -q flux
#PBS -l qos=math471f11_flux
#PBS -A math471f11_flux
#PBS -M your_uniqname@umich.edu
#PBS -m abe
#PBS -V
#
# Create a local directory to run and copy your files to local.
# Let PBS handle your output
mkdir /tmp/${PBS_JOBID}
cp ${HOME}/ParallelMethods/helloworldOMP/helloworld /tmp/${PBS_JOBID}/helloworld
cd /tmp/${PBS_JOBID}

export OMP_NUM_THREADS=2
./helloworld

#Clean up your files
cd ${HOME}/ParallelMethods/helloworldOMP
/bin/rm -rf /tmp/${PBS_JOBID}

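Add OpenMP directives to the loops in the two-dimensional heat equation solver. Run the resulting program with 1, 3, 6 and 12 threads and record the time it takes the program to finish. Make a plot of the final iterate.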

MPI

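A copy of the current MPI standard can be found at http://www.mpi-forum.org/. It allows for parallelization of Fortran, C and C++ programs. There are newer parallel programming languages, such as Co-Array Fortran (CAF) and Unified Parallel C (UPC), which allow the programmer to view memory as a single addressable space even on a distributed-memory machine. However, computer hardware limitations imply that many of the programming concepts used when writing MPI programs will also be required to write programs in CAF and UPC. Compiler technology for these languages is also not as well developed as compiler technology for older languages such as Fortran and C, so at present Fortran and C dominate high performance computing. An introduction to the essential concepts required for writing and using MPI programs can be found at http://www.shodor.org/refdesk/Resources/Tutorials/. More information on MPI can be found in Gropp, Lusk and Skjellum^[2], in Gropp, Lusk and Thakur^[3] and at https://computing.llnl.gov/tutorials/mpi/. There are many resources available online, but once the basic concepts have been mastered, what is most useful is an index of MPI commands. A search engine will usually turn up sources of such listings, but we have found the following site useful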

MPI Exercises

What does MPI stand for?
Please read the tutorials at http://www.shodor.org/refdesk/Resources/Tutorials/BasicMPI/ and at https://computing.llnl.gov/tutorials/mpi/, then explain what the following commands do:
- USE mpi or INCLUDE 'mpif.h'
- MPI_INIT
- MPI_COMM_SIZE
- MPI_COMM_RANK
- MPI_FINALIZE
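What is the version number of the current MPI standard?
Try to understand the Hello World program in Listing D. Explain how it differs from Listing A. Run the program in Listing D on 1, 2, 6, 12 and 24 MPI processes^[4]. Put the output of each run in your solutions. The output will be in a file with a name of the form
helloworld.o**********
where the last entries above are digits corresponding to the number of the run. An example makefile for compiling this program on Flux is in Listing E, and an example submission script is in Listing F. To change the number of MPI processes that the program runs on, for example from 2 to 6, change
ppn=2
to
ppn=6
and also change the submission script from
mpirun -np 2 ./helloworld
to
mpirun -np 6 ./helloworld.

On Flux there is a maximum of 12 cores per node, so if more than 12 MPI processes are required, the number of nodes also needs to be changed. The total number of cores required is equal to the number of nodes multiplied by the number of processes per node. Thus, to use 24 processes, change
nodes=1:ppn=2
to
nodes=2:ppn=12
and also change the submission script from
mpirun -np 2 ./helloworld
to
mpirun -np 24 ./helloworld.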
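
The rule that the total number of cores equals the number of nodes multiplied by the number of processes per node can be sketched in a few lines of Python. This is an illustration only: the helper name `pbs_request` is hypothetical, and the way it rounds process counts that are not a multiple of 12 is an assumption of this sketch, not a Flux policy.

```python
import math

CORES_PER_NODE = 12  # maximum number of cores per node on Flux, as stated in the text


def pbs_request(n_procs, cores_per_node=CORES_PER_NODE):
    """Return (nodes, ppn) such that nodes * ppn >= n_procs and ppn <= cores_per_node."""
    nodes = math.ceil(n_procs / cores_per_node)
    ppn = math.ceil(n_procs / nodes)
    return nodes, ppn


for n in (2, 6, 12, 24):
    nodes, ppn = pbs_request(n)
    print(f"#PBS -l nodes={nodes}:ppn={ppn}   ->   mpirun -np {n} ./helloworld")
```

For 24 processes this reproduces the request nodes=2:ppn=12 used above.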

( D)

A Fortran program which demonstrates parallelism using MPI

!--------------------------------------------------------------------
!
!
! PURPOSE
!
! This program uses MPI to print hello world from all available
! processes
!
! .. Parameters ..
!
! .. Scalars ..
! myid = process id
! numprocs = total number of MPI processes
! ierr = error code
!
! .. Arrays ..
!
! .. Vectors ..
!
! REFERENCES
! http://en.wikipedia.org/wiki/OpenMP
!
! ACKNOWLEDGEMENTS
! The program below was modified from one available at the internet
! address in the references. This internet address was last checked
! on 30 December 2011
!
! ACCURACY
!
! ERROR INDICATORS AND WARNINGS
!
! FURTHER COMMENTS
!
!--------------------------------------------------------------------
! External routines required
!
! External libraries required
! MPI library
PROGRAM hello90
USE MPI
IMPLICIT NONE
INTEGER(kind=4) :: myid, numprocs, ierr

CALL MPI_INIT(ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

PRINT*, 'Hello World from process', myid
CALL MPI_BARRIER(MPI_COMM_WORLD,ierr)
IF ( myid == 0 ) THEN
PRINT*, 'There are ', numprocs, ' MPI processes'
END IF
CALL MPI_FINALIZE(ierr)	
END PROGRAM

( E)

An example makefile for compiling the helloworld program in Listing D

# define the compiler
COMPILER = mpif90
# compilation settings, optimization, precision, parallelization
FLAGS = -O0

# libraries
LIBS =
# source list for main program
SOURCES = helloworld.f90

# recipe lines below must start with a tab character
test: $(SOURCES)
	${COMPILER} -o helloworld $(FLAGS) $(SOURCES)

clean:
	rm -f *.o

clobber:
	rm -f helloworld

( F)

An example submission script for use on Flux

#!/bin/bash
#PBS -N helloworld
#PBS -l nodes=1:ppn=2,walltime=00:02:00
#PBS -q flux
#PBS -l qos=math471f11_flux
#PBS -A math471f11_flux
#PBS -M your_uniqname@umich.edu
#PBS -m abe
#PBS -V
#
# Create a local directory to run and copy your files to local.
# Let PBS handle your output
mkdir /tmp/${PBS_JOBID}
cp ${HOME}/ParallelMethods/helloworldMPI/helloworld /tmp/${PBS_JOBID}/helloworld
cd /tmp/${PBS_JOBID}

mpirun -np 2 ./helloworld

#Clean up your files
cd ${HOME}/ParallelMethods/helloworldMPI
/bin/rm -rf /tmp/${PBS_JOBID}

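A First Parallel Program: Monte Carlo Integration

To introduce the basics of parallel programming in a context that is a little more complicated than Hello World, we will consider Monte Carlo integration. We review important concepts from probability and Riemann integration, and then give example algorithms and explain why parallelization may be helpful.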

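Probability

A function $f:U\subset \mathbb {R} ^{2}\rightarrow \mathbb {R} _{+}$ is a probability density function if $\int \int _{U}f\mathrm {d} A=1$.

If $f$ is a probability density function defined on the set $U\subset \mathbb {R} ^{2}$ , then the probability of events in the set $W\subset U$ occurring is $P(W)=\int \int _{W}f\mathrm {d} A.$

The joint density for it to snow $x$ inches tomorrow and for Kelly to win $y$ dollars in the lottery tomorrow is given by $f={\frac {c}{(1+x)(100+y)}}$ for $(x,y)\in [0,100]\times [0,100]$ , and $f=0$ otherwise. Find $c$ .

Suppose $X$ is a random variable with probability density function $f_{1}(x)$ and $Y$ is a random variable with probability density function $f_{2}(y)$ . Then $X$ and $Y$ are **independent random variables** if their joint density function is $f(x,y)=f_{1}(x)f_{2}(y).$

Whether or not it snows tomorrow and whether or not Kelly wins the lottery tomorrow are independent random variables.

If $f(x,y)$ is a probability density function for the random variables $X$ and $Y$ , the **X mean** is $\mu _{1}={\bar {X}}=\int \int xf\mathrm {d} A$ , and the **Y mean** is $\mu _{2}={\bar {Y}}=\int \int yf\mathrm {d} A.$

The X mean and the Y mean are the expected values of X and Y, respectively.

If $f(x,y)$ is a probability density function for the random variables $X$ and $Y$ , then the **X variance** is $\sigma _{1}^{2}={\overline {(X-{\bar {X}})^{2}}}=\int \int (x-{\bar {X}})^{2}f\mathrm {d} A$ , and the **Y variance** is $\sigma _{2}^{2}={\overline {(Y-{\bar {Y}})^{2}}}=\int \int (y-{\bar {Y}})^{2}f\mathrm {d} A.$

The standard deviation is defined to be the square root of the variance.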
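
The definitions above can be checked numerically. The following Python sketch is an illustration only: it uses a simple midpoint rule (the helper name `midpoint` and the number of sub-intervals are choices made here) on the snowfall and lottery density from the example above. Because that density is a product of a function of $x$ and a function of $y$, the double integrals reduce to products of one-dimensional integrals.

```python
import math


def midpoint(g, a, b, n=100_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))


# Normalization: c * (integral of 1/(1+x)) * (integral of 1/(100+y)) = 1
ix = midpoint(lambda x: 1.0 / (1.0 + x), 0.0, 100.0)    # close to ln(101)
iy = midpoint(lambda y: 1.0 / (100.0 + y), 0.0, 100.0)  # close to ln(2)
c = 1.0 / (ix * iy)

# X mean: double integral of x * f over [0,100] x [0,100]
mean_x = c * iy * midpoint(lambda x: x / (1.0 + x), 0.0, 100.0)

print("c      =", c, " exact:", 1.0 / (math.log(101.0) * math.log(2.0)))
print("mean_x =", mean_x)
```

The same helper can be reused for the Y mean and for the variances by changing the integrand.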

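Find an expression for the probability that it snows more than 1.1 times the expected snowfall and that Kelly also wins more than 1.2 times the expected amount in the lottery.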

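Exercises

A class is graded on a curve. It is assumed that the class is a representative sample of the population, and that the probability density function for the numerical score $x$ is given by $f(x)=C\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right).$ For simplicity we assume that $x$ can take values from $-\infty$ to $\infty$ , though in actual fact the exam is scored from $0$ to $100$ .
1. Determine $C$ using results from your previous homework.
2. Suppose there are 240 students in the class and that the class mean and standard deviation are not reported. As an enterprising student, you poll 60 of your fellow students (we shall suppose they are selected randomly). You find that the mean for these 60 students is 55% and the standard deviation is 10%. Use the Student's t-distribution http://en.wikipedia.org/wiki/Student%27s_t-distribution to estimate the 90% confidence interval for the sample mean. Sketch a graph of the t-distribution probability density function and shade the region that corresponds to the 90% confidence interval for the sample mean.^[5]
**Remark** Fortunately, all the students are hard working, so although a negative score is possible, it occurs with extremely low probability, and we neglect it to simplify the calculation above.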

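Riemann Integration

Recall that we can approximate integrals by Riemann sums. There are many integrals one cannot evaluate analytically, but for which a numerical answer is required. In this section, we explore a simple way of doing this on a computer. Suppose we want to find $I2d=\int _{0}^{1}\int _{0}^{4}\left(x^{2}+2y^{2}\right)\mathrm {d} y\mathrm {d} x.$ If we do this analytically we find $I2d=44.$ Suppose we have forgotten how to integrate, and so we do this numerically. We can do so using the following Matlab code.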

( G)

A Matlab program which approximates an integral by a Riemann sum

% A program to approximate an integral

clear all; format compact; format short;

nx=1000; % number of points in x
xend=1; % last discretization point
xstart=0; % first discretization point
dx=(xend-xstart)/(nx-1); % size of each x sub-interval

ny=4000; % number of points in y
yend=4; % last discretization point
ystart=0; % first discretization point
dy=(yend-ystart)/(ny-1); % size of each y sub-interval

% create vectors with points for x and y
for i=1:nx
    x(i)=xstart+(i-1)*dx;
end
for j=1:ny
    y(j)=ystart+(j-1)*dy;
end

% Approximate the integral by a sum
I2d=0;
for i=1:nx
    for j=1:ny
        I2d=I2d+(x(i)^2+2*y(j)^2)*dy*dx;
    end
end
% print out final answer
I2d
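
For readers who prefer Python, the same calculation can be sketched as follows. This is not a translation of the Matlab listing: as an assumption of this sketch, the function is sampled at the midpoint of each sub-rectangle rather than at the grid points, and the function name `riemann_2d` and the grid sizes are illustrative choices.

```python
def riemann_2d(f, x0, x1, y0, y1, nx, ny):
    """Midpoint Riemann sum of f over the rectangle [x0, x1] x [y0, y1]."""
    dx = (x1 - x0) / nx
    dy = (y1 - y0) / ny
    total = 0.0
    for i in range(nx):
        x = x0 + (i + 0.5) * dx  # midpoint of the i-th x sub-interval
        for j in range(ny):
            y = y0 + (j + 0.5) * dy  # midpoint of the j-th y sub-interval
            total += f(x, y)
    return total * dx * dy


approx = riemann_2d(lambda x, y: x**2 + 2 * y**2, 0.0, 1.0, 0.0, 4.0, 200, 800)
print("approximation:", approx, " exact value: 44")
```

Changing `nx` and `ny` gives a quick way to see how the approximation behaves as the grid is refined.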

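We can do something similar in three dimensions. Suppose we want to calculate $I3d=\int _{0}^{1}\int _{0}^{1}\int _{0}^{4}\left(x^{2}+2y^{2}+3z^{2}\right)\mathrm {d} z\mathrm {d} y\mathrm {d} x.$ Doing this analytically, we find $I3d=68.$

Exercises

Modify the Matlab code to perform the three-dimensional integral.
Try to determine how the accuracy of either the two- or three-dimensional method varies as the number of sub-intervals is changed.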

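Monte Carlo Integration

It is possible to extend the above integration schemes to integrals in higher and higher dimensions. This can become computationally intensive, so an alternative integration method based on probability is often used. The method we will discuss is called the Monte Carlo method. The description that follows is based on Chapter 3 of Michael Corral's Vector Calculus^[6], available at http://www.mecmath.net/, where Java and Sage programs for doing Monte Carlo integration can also be found. The idea behind Monte Carlo integration is based on the concept of the average value of a function, which you encountered in single-variable calculus. Recall that for a continuous function $f(x)$ , the average value ${\bar {f}}$ of $f$ over an interval $\lbrack a,b\rbrack$ is defined as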

{\bar {f}}~=~{\frac {1}{b-a}}\int _{a}^{b}f(x)\,dx~.

( 1)

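The quantity $b-a$ is the length of the interval $\lbrack a,b\rbrack$ , which can be thought of as the "volume" of the interval. Applying the same reasoning to functions of two or three variables, we define the **average value** of $f(x,y)$ over a region $R$ to be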

{\bar {f}}~=~{\frac {1}{A(R)}}\iint \limits _{R}f(x,y)\,dA~,

( 2)

where $A(R)$ is the area of the region $R$ , and we define the **average value** of $f(x,y,z)$ over a solid $S$ to be

{\bar {f}}~=~{\frac {1}{V(S)}}\iiint \limits _{S}f(x,y,z)\,dV~,

( 3)

where $V(S)$ is the volume of the solid $S$ . Thus, for example, we have

\iint \limits _{R}f(x,y)\,dA~=~A(R){\bar {f}}~.

( 4)

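The average value of $f(x,y)$ over $R$ can be thought of as representing the sum of all the values of $f$ divided by the number of points in $R$ . Unfortunately there are infinitely many (in fact, uncountably many) points in any region, that is, they cannot be listed in a discrete sequence. But what if we took a very large number $N$ of random points in the region $R$ (which can be generated by a computer), took the average of the values of $f$ at those points, and used that average as the value of ${\bar {f}}$ ? This is exactly what the Monte Carlo method does. So in Equation 4 the approximation we get is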

\iint \limits _{R}f(x,y)\,dA~\approx ~A(R){\bar {f}}\pm A(R){\sqrt {\frac {{\overline {f^{2}}}-({\bar {f}})^{2}}{N}}}~,

( 5)

where

{\bar {f}}~=~{\frac {\sum _{i=1}^{N}f(x_{i},y_{i})}{N}}\quad {\text{and}}\quad {\overline {f^{2}}}~=~{\frac {\sum _{i=1}^{N}(f(x_{i},y_{i}))^{2}}{N}}~,

( 6)

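with the sums taken over the $N$ random points $(x_{1},y_{1})$ , $\ldots$ , $(x_{N},y_{N})$ . The $\pm$ "error term" in Equation 5 does not really provide hard bounds on the approximation. It represents a single standard deviation from the expected value of the integral. That is, it provides a likely bound on the error. Due to its use of random points, the Monte Carlo method is an example of a probabilistic method (as opposed to deterministic methods such as the Riemann sum approximation, which use a specific formula for generating points).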
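
Equations 5 and 6 translate almost line by line into code. The Python sketch below is one possible reading of them, applied to the integral $I2d=44$ from the Riemann integration section. The function name `monte_carlo_2d`, the fixed seed and the number of points are assumptions made for this illustration.

```python
import math
import random


def monte_carlo_2d(f, x0, x1, y0, y1, n, rng):
    """Return (estimate, error) for the integral of f over a rectangle, as in Equation 5."""
    area = (x1 - x0) * (y1 - y0)
    s = 0.0   # running sum of f
    s2 = 0.0  # running sum of f^2
    for _ in range(n):
        x = x0 + (x1 - x0) * rng.random()
        y = y0 + (y1 - y0) * rng.random()
        v = f(x, y)
        s += v
        s2 += v * v
    fbar = s / n    # average of f     (Equation 6)
    f2bar = s2 / n  # average of f^2   (Equation 6)
    estimate = area * fbar
    # max(..., 0.0) guards against a tiny negative value caused by rounding
    error = area * math.sqrt(max(f2bar - fbar**2, 0.0) / n)
    return estimate, error


rng = random.Random(12345)  # fixed seed so that the run is repeatable
estimate, error = monte_carlo_2d(lambda x, y: x**2 + 2 * y**2,
                                 0.0, 1.0, 0.0, 4.0, 65536, rng)
print(f"{estimate:.4f} +/- {error:.4f}")
```

Note that the averages in Equation 6 are formed before anything is multiplied by the area, and the area is applied only once in each term of Equation 5.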

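For example, we can use the formula in Equation 5 to approximate the volume $V$ under the surface $z=x^{2}+2y^{2}$ over the rectangle $R=(0,1)\times (0,4)$ . Recall that the actual volume is $44$ . Below is Matlab code that calculates the volume using Monte Carlo integration.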

( H)

A Matlab program which demonstrates how to use the Monte Carlo method to calculate the volume below $z=x^{2}+2y^{2}$ , with $(x,y)\in (0,1)\times (0,4)$ .

% A program to approximate an integral using the Monte Carlo method

% This program can be made much faster by using Matlab's matrix and vector
% operations, however to allow easy translation to other languages we have
% made it as simple as possible.

Numpoints=65536; % number of random points

I2d=0; % running sum of f
I2dsquare=0; % running sum of f^2
for n=1:Numpoints
    % generate random numbers drawn from uniform distributions on (0,1)
    % for x and on (0,4) for y
    x=rand(1);
    y=rand(1)*4;
    I2d=I2d+x^2+2*y^2;
    I2dsquare=I2dsquare+(x^2+2*y^2)^2;
end
% we scale the integral by the total area and divide by the number of
% points used
I2d=I2d*4/Numpoints
% we also output an estimated error
% (I2d and I2dsquare are both already scaled by the area at this point, so
% this value is not identical to the error term of Equation 5, which is
% built from the unscaled averages of f and f^2)
I2dsquare=I2dsquare*4/Numpoints;
EstimError=4*sqrt( (I2d^2-I2dsquare)/Numpoints)

The results of running this program with various numbers of random points are shown below.

N = 16: 41.3026 +/- 30.9791
N = 256: 47.1855 +/- 9.0386
N = 4096: 43.4527 +/- 2.0280
N = 65536: 44.0026 +/- 0.5151

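As you can see, the approximation is fairly good. As $N\to \infty$ , it can be shown that the Monte Carlo approximation converges to the actual volume, with an error that decreases on the order of $O(1/{\sqrt {N}})$ .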
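
The dependence of the error on $N$ can be observed directly. In the Python sketch below (an illustration only, with a hypothetical function name `mc_error` and an arbitrary seed), the error term of Equation 5 is evaluated for the same integrand with $N$, $4N$ and $16N$ points. If the error term is proportional to $1/{\sqrt {N}}$ , each fourfold increase in $N$ should roughly halve it. Because the error term here is formed from the unscaled averages exactly as in Equation 5, its values are not directly comparable with the $\pm$ column of the table above.

```python
import math
import random


def mc_error(n, rng):
    """Error term of Equation 5 for f = x^2 + 2y^2 on (0,1) x (0,4), using n points."""
    s = s2 = 0.0
    for _ in range(n):
        x = rng.random()
        y = 4.0 * rng.random()
        v = x * x + 2.0 * y * y
        s += v
        s2 += v * v
    fbar, f2bar = s / n, s2 / n
    return 4.0 * math.sqrt((f2bar - fbar * fbar) / n)  # area of the rectangle is 4


rng = random.Random(2024)  # arbitrary fixed seed
errors = {n: mc_error(n, rng) for n in (1024, 4096, 16384)}
for n, err in errors.items():
    # err * sqrt(n) should stay roughly constant if err is proportional to 1/sqrt(n)
    print(n, round(err, 4), round(err * math.sqrt(n), 2))
```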

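In the above example the region $R$ was a rectangle. To use the Monte Carlo method for a nonrectangular (bounded) region $R$ , only a slight modification is needed. Pick a rectangle ${\tilde {R}}$ that encloses $R$ , and generate random points in that rectangle as before. Then use those points in the calculation of ${\bar {f}}$ only if they are inside $R$ : points outside $R$ contribute nothing to the sums, while $N$ still counts every point generated. There is no need to calculate the area of $R$ for Equation 5 in this case, since the exclusion of points not inside $R$ allows you to use the area of the rectangle ${\tilde {R}}$ instead, similar to before.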

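For instance, it can be shown that the volume under the surface $z=1$ over the nonrectangular region $R=\left\{(x,y):0\leq x^{2}+y^{2}\leq 1\right\}$ is $\pi$ . Since the rectangle ${\tilde {R}}=[-1,1]\times [-1,1]$ contains $R$ , we can use a program similar to the one used before, the largest change being a check to see whether $y^{2}+x^{2}\leq 1$ for a random point $(x,y)$ in $[-1,1]\times [-1,1]$ . A Matlab code listing which demonstrates this is below.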

( I)

A Matlab program which demonstrates how to use the Monte Carlo method to calculate the area of an irregular region and also to calculate $\pi$ .

% A program to approximate an integral using the Monte Carlo method

% This program can be made much faster by using Matlab's matrix and vector
% operations, however to allow easy translation to other languages we have
% made it as simple as possible.

Numpoints=256; % number of random points

I2d=0; % running sum of f
I2dsquare=0; % running sum of f^2 (equal to the sum of f, since f is 0 or 1)
for n=1:Numpoints
    % generate random number drawn from a uniform distribution on (0,1) and
    % scale this to (-1,1)
    x=2*rand(1)-1;
    y=2*rand(1)-1;
    if ((x^2+y^2) <1)
        I2d=I2d+1;
        I2dsquare=I2dsquare+1;
    end
end
% We scale the integral by the total area and divide by the number of
% points used
I2d=I2d*4/Numpoints
% we also output an estimated error
% (I2d and I2dsquare are both already scaled by the area at this point, so
% this value is not identical to the error term of Equation 5, which is
% built from the unscaled averages of f and f^2)
I2dsquare=I2dsquare*4/Numpoints;
EstimError=4*sqrt( (I2d^2-I2dsquare)/Numpoints)

The results of running the program with various numbers of random points are shown below.

N = 16: 3.5000 +/- 2.9580
N = 256: 3.2031 +/- 0.6641
N = 4096: 3.1689 +/- 0.1639
N = 65536: 3.1493 +/- 0.0407
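
The same calculation can be sketched in Python with the error term taken directly from Equation 5. Since $f$ is 1 inside $R$ and 0 outside, $f^{2}=f$ , so both averages in Equation 6 equal the fraction of points that fall inside $R$ . The function name `estimate_pi` and the seed are assumptions of this illustration, and the error values it prints follow Equation 5, so they are not expected to match the $\pm$ column above.

```python
import math
import random


def estimate_pi(n, rng):
    """Monte Carlo estimate of the area of the unit disc, with the Equation 5 error term."""
    hits = 0
    for _ in range(n):
        # uniform random point in the enclosing square [-1,1] x [-1,1]
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:
            hits += 1
    p = hits / n  # average of f, and also of f^2, over all n points
    area = 4.0    # area of the enclosing square
    estimate = area * p
    error = area * math.sqrt((p - p * p) / n)
    return estimate, error


rng = random.Random(314159)  # arbitrary fixed seed
pi_estimate, pi_error = estimate_pi(65536, rng)
print(f"{pi_estimate:.4f} +/- {pi_error:.4f}")
```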

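To use the Monte Carlo method to evaluate triple integrals, you will need to generate random triples $(x,y,z)$ in a parallelepiped, instead of random pairs $(x,y)$ in a rectangle, and use the volume of the parallelepiped instead of the area of a rectangle in Equation 5. For a more detailed discussion of numerical integration methods, please take a more advanced course in mathematics.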
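
As a sketch of this three-dimensional extension, the Python code below applies the same idea to the integral $I3d=68$ from the Riemann integration section, drawing random triples from the box $[0,1]\times [0,1]\times [0,4]$ . The function name `monte_carlo_3d`, the seed and the number of points are illustrative assumptions.

```python
import math
import random


def monte_carlo_3d(f, bounds, n, rng):
    """Return (estimate, error) for the integral of f over a rectangular box."""
    (x0, x1), (y0, y1), (z0, z1) = bounds
    volume = (x1 - x0) * (y1 - y0) * (z1 - z0)
    s = s2 = 0.0
    for _ in range(n):
        x = x0 + (x1 - x0) * rng.random()
        y = y0 + (y1 - y0) * rng.random()
        z = z0 + (z1 - z0) * rng.random()
        v = f(x, y, z)
        s += v
        s2 += v * v
    fbar, f2bar = s / n, s2 / n
    # Equation 5 with the volume of the box in place of the area of the rectangle
    return volume * fbar, volume * math.sqrt(max(f2bar - fbar**2, 0.0) / n)


rng = random.Random(68)  # arbitrary fixed seed
i3d, i3d_error = monte_carlo_3d(lambda x, y, z: x**2 + 2 * y**2 + 3 * z**2,
                                ((0.0, 1.0), (0.0, 1.0), (0.0, 4.0)), 100_000, rng)
print(f"{i3d:.3f} +/- {i3d_error:.3f}   (exact value: 68)")
```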

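Exercises

Write a program that uses the Monte Carlo method to approximate the double integral $\iint \limits _{R}e^{xy}\,dA$ , where $R=\lbrack 0,1\rbrack \times \lbrack 0,1\rbrack$ . Show the program output for $N=10$ , $100$ , $1000$ , $10000$ , $100000$ and $1000000$ random points.
Write a program that uses the Monte Carlo method to approximate the triple integral $\iiint \limits _{S}e^{xyz}\,dV$ , where $S=\lbrack 0,1\rbrack \times \lbrack 0,1\rbrack \times \lbrack 0,1\rbrack$ . Show the program output for $N=10$ , $100$ , $1000$ , $10000$ , $100000$ and $1000000$ random points.
Use the Monte Carlo method to approximate the volume of a sphere of radius $1$ .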

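Parallel Monte Carlo Integration

As you may have noticed, the algorithms are simple, but can require very many points to become accurate. It is therefore useful to run these algorithms on a parallel computer. We will demonstrate a parallel Monte Carlo calculation of $\pi$ . Before we can do this, we need to learn how to use a parallel computer^[7].

We now examine Fortran programs for calculating $\pi$ . These programs are taken from http://chpc.wustl.edu/mpi-fortran.html, where further explanation can be found. The original source of these programs appears to be Gropp, Lusk and Skjellum^[8].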

Serial

( I)

A serial Fortran program which demonstrates how to calculate $\pi$ using a Monte Carlo method

!--------------------------------------------------------------------
!
!
! PURPOSE
!
! This program uses a Monte Carlo method to calculate pi
!
! .. Parameters ..
! npts = total number of Monte Carlo points
! xmin = lower bound for integration region
! xmax = upper bound for integration region
! .. Scalars ..
! i = loop counter
! f = average value from summation
! sum = total sum
! randnum = random number generated from (0,1) uniform
! distribution
! x = current Monte Carlo location
! .. Arrays ..
!
! .. Vectors ..
!
! REFERENCES
! http://chpc.wustl.edu/mpi-fortran.html
! Gropp, Lusk and Skjellum, "Using MPI" MIT press (1999)
!
! ACKNOWLEDGEMENTS
! The program below was modified from one available at the internet
! address in the references. This internet address was last checked
! on 30 March 2012
!
! ACCURACY
!
! ERROR INDICATORS AND WARNINGS
!
! FURTHER COMMENTS
!
!--------------------------------------------------------------------
! External routines required
!
! External libraries required
! None
  PROGRAM monte_carlo
    IMPLICIT NONE

    INTEGER(kind=8), PARAMETER :: npts = 1e10
    REAL(kind=8), PARAMETER :: xmin=0.0d0,xmax=1.0d0
    INTEGER(kind=8) :: i
    REAL(kind=8) :: f,sum, randnum,x

    ! set initial sum to zero before accumulating
    sum = 0.0d0
    DO i=1,npts
      CALL random_number(randnum)
      x = (xmax-xmin)*randnum + xmin
      sum = sum + 4.0d0/(1.0d0 + x**2)
    END DO
    f = sum/npts
    PRINT*,'PI calculated with ',npts,' points = ',f

    STOP
  END PROGRAM monte_carlo

( J)

An example makefile for compiling the program in Listing I

# define the compiler
COMPILER = mpif90
# compilation settings, optimization, precision, parallelization
FLAGS = -O0

# libraries
LIBS =
# source list for main program
SOURCES = montecarloserial.f90

# recipe lines below must start with a tab character
test: $(SOURCES)
	${COMPILER} -o montecarloserial $(FLAGS) $(SOURCES)

clean:
	rm -f *.o

clobber:
	rm -f montecarloserial

( K)

An example submission script for use on Trestles, located at the San Diego Supercomputer Center

#!/bin/bash
# the queue to be used.
#PBS -q shared
# specify your project allocation
#PBS -A mia122
# number of nodes and number of processors per node requested
#PBS -l nodes=1:ppn=1
# requested Wall-clock time.
#PBS -l walltime=00:05:00
# name of the standard out file to be "output-file".
#PBS -o job_output
# name of the job
#PBS -N MCserial
# Email address to send a notification to, change "youremail" appropriately
#PBS -M youremail@umich.edu
# send a notification for job abort, begin and end
#PBS -m abe
#PBS -V
cd $PBS_O_WORKDIR #change to the working directory
mpirun_rsh -np 1 -hostfile $PBS_NODEFILE montecarloserial

Parallel

( L)

A parallel Fortran MPI program to calculate $\pi$ .

!--------------------------------------------------------------------
!
!
! PURPOSE
!
! This program uses MPI to do a parallel monte carlo calculation of pi
!
! .. Parameters ..
! npts = total number of Monte Carlo points
! xmin = lower bound for integration region
! xmax = upper bound for integration region
! .. Scalars ..
! mynpts = this processes number of Monte Carlo points
! myid = process id
! nprocs = total number of MPI processes
! ierr = error code
! i = loop counter
! f = average value from summation
! sum = total sum
! mysum = sum on this process
! randnum = random number generated from (0,1) uniform
! distribution
! x = current Monte Carlo location
! start = simulation start time
! finish = simulation end time
! .. Arrays ..
!
! .. Vectors ..
!
! REFERENCES
! http://chpc.wustl.edu/mpi-fortran.html
! Gropp, Lusk and Skjellum, "Using MPI" MIT press (1999)
!
! ACKNOWLEDGEMENTS
! The program below was modified from one available at the internet
! address in the references. This internet address was last checked
! on 30 March 2012
!
! ACCURACY
!
! ERROR INDICATORS AND WARNINGS
!
! FURTHER COMMENTS
!
!--------------------------------------------------------------------
! External routines required
!
! External libraries required
! MPI library
    PROGRAM monte_carlo_mpi
    USE MPI
    IMPLICIT NONE

    INTEGER(kind=8), PARAMETER :: npts = 1e10
    REAL(kind=8), PARAMETER :: xmin=0.0d0,xmax=1.0d0
    INTEGER(kind=8) :: mynpts
    INTEGER(kind=4) :: ierr, myid, nprocs
    INTEGER(kind=8) :: i
    REAL(kind=8) :: f,sum,mysum,randnum
    REAL(kind=8) :: x, start, finish
    
    ! Initialize MPI
    CALL MPI_INIT(ierr)
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
    start=MPI_WTIME()

! Calculate the number of points each MPI process needs to generate
    IF (myid .eq. 0) THEN
mynpts = npts - (nprocs-1)*(npts/nprocs)
    ELSE
mynpts = npts/nprocs
    ENDIF
    
    ! set initial sum to zero
    mysum = 0.0d0
! use loop on local process to generate portion of Monte Carlo integral
    DO i=1,mynpts
      CALL random_number(randnum)
      x = (xmax-xmin)*randnum + xmin
      mysum = mysum + 4.0d0/(1.0d0 + x**2)
    ENDDO

! Do a reduction and sum the results from all processes
    CALL MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,&
          0,MPI_COMM_WORLD,ierr)
    finish=MPI_WTIME()

    ! Get one process to output the result and running time
    IF (myid .eq. 0) THEN
f = sum/npts
         PRINT*,'PI calculated with ',npts,' points = ',f
         PRINT*,'Program took ', finish-start, ' for Time stepping'
    ENDIF

CALL MPI_FINALIZE(ierr)

    STOP
END PROGRAM
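
The structure of the parallel program (split the points among the processes, let each process form a partial sum, then add the partial sums) can be illustrated without MPI. The Python sketch below is a serial simulation only: the "ranks" are visited one after another in a loop, the final `sum` plays the role of MPI_REDUCE with MPI_SUM, and the function names and the much smaller `npts` are choices made for this illustration. Giving each simulated rank its own seed is also an assumption of the sketch. In the Fortran program the default seed used by random_number is processor dependent, so whether different MPI processes draw different sequences depends on the compiler, and distinct seeds alone do not guarantee statistically independent streams.

```python
import math
import random


def points_for_rank(npts, nprocs, rank):
    """Mirror the Fortran logic: rank 0 also takes the remainder of the division."""
    if rank == 0:
        return npts - (nprocs - 1) * (npts // nprocs)
    return npts // nprocs


def partial_sum(n, seed):
    """Sum of 4/(1+x^2) at n uniform random points in (0,1), as one process would compute."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.random()
        total += 4.0 / (1.0 + x * x)
    return total


npts, nprocs = 200_000, 6  # illustrative sizes, far smaller than in the Fortran program
counts = [points_for_rank(npts, nprocs, r) for r in range(nprocs)]
partials = [partial_sum(counts[r], seed=1000 + r) for r in range(nprocs)]
pi_parallel = sum(partials) / npts  # the reduction step followed by the final average
print(counts, pi_parallel, abs(pi_parallel - math.pi))
```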

( M)

An example makefile for compiling the program in Listing L

# define the compiler
COMPILER = mpif90
# compilation settings, optimization, precision, parallelization
FLAGS = -O0

# libraries
LIBS =
# source list for main program
SOURCES = montecarloparallel.f90

# recipe lines below must start with a tab character
test: $(SOURCES)
	${COMPILER} -o montecarloparallel $(FLAGS) $(SOURCES)

clean:
	rm -f *.o

clobber:
	rm -f montecarloparallel

( N)

An example submission script for use on Trestles, located at the San Diego Supercomputer Center

#!/bin/bash
# the queue to be used.
#PBS -q normal
# specify your project allocation
#PBS -A mia122
# number of nodes and number of processors per node requested
#PBS -l nodes=1:ppn=32
# requested Wall-clock time.
#PBS -l walltime=00:05:00
# name of the standard out file to be "output-file".
#PBS -o job_output
# name of the job, you may want to change this so it is unique to you
#PBS -N MPI_MCPARALLEL
# Email address to send a notification to, change "youremail" appropriately
#PBS -M youremail@umich.edu
# send a notification for job abort, begin and end
#PBS -m abe
#PBS -V

# change to the job submission directory
cd $PBS_O_WORKDIR
# Run the job
mpirun_rsh -np 32 -hostfile $PBS_NODEFILE montecarloparallel

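Exercises

Explain why using the Monte Carlo method to evaluate $\int _{0}^{1}{\frac {1}{1+x^{2}}}\mathrm {d} x$ allows you to find $\pi$ , and explain in your own words what the serial and parallel programs do.
Find the time it takes to run the parallel Monte Carlo program on 32, 64, 128, 256 and 512 cores.
Use a parallel Monte Carlo integration program to evaluate $\iint x^{2}+y^{6}+\exp(xy)\cos(y\exp(x))\mathrm {d} A$ over the unit circle.
Use a parallel Monte Carlo integration program to approximate the volume of the ellipsoid ${\frac {x^{2}}{9}}+{\frac {y^{2}}{4}}+{\frac {z^{2}}{1}}=1$ . Use either OpenMP or MPI.
Write parallel programs to find the volume of the four-dimensional sphere $1\geq \sum _{i=1}^{4}x_{i}^{2}.$ Try both Monte Carlo and Riemann sum techniques. Use either OpenMP or MPI.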

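Notes

↑ 2decomp&fft
↑ Gropp, Lusk and Skjellum (1999)
↑ Gropp, Lusk and Thakur (1999)
↑ It is possible to run this program on many more than 24 processes, but the output then becomes quite excessive.
↑ The Student's t-distribution is implemented in many numerical packages, such as Maple, Mathematica, Matlab, R and Sage, so if you need it to obtain numerical results, it is helpful to use one of these packages.
↑ Corral (2011)
↑ Many computers and mobile telephones produced today have two or more cores and so can be considered parallel, but here we mean computers that have hundreds of cores or more.
↑ Gropp, Lusk and Skjellum (1999)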

References

Corral, M. (2011). Vector Calculus.

Gropp, W.; Lusk, E.; Skjellum, A. (1999). Using MPI. MIT Press.

Gropp, W.; Lusk, E.; Thakur, R. (1999). Using MPI-2. MIT Press.

Li, N. "2decomp&fft".
