"Your current attempt runs in about O(N^2). So even if you had enough memory, it won't finish in your lifetime." Alexander Yee^[1]

   "premature optimization is the root of all evil" - Donald Knuth in paper "Structured Programming With Go To Statements"

           "good algorithms beat optimized code" Bruce Dawson ( Fractal extreme )

软件优化

代码剖析
基准测试 ^[2]

如何询问优化问题 ?

"如果你已经实际剖析了代码，有具体的代码片段，以便每个人都可以运行相同的代码来查看其性能，并且你在某个地方发布了这个库，可以查看（GitHub、Bitbucket 等），那么我不一定认为在代码审查中进行讨论有问题。包括你选择的分析器中的结果，并识别瓶颈，将有助于保持主题的集中性。
如果你刚开始编写代码，但已经剖析了一小段代码，并展示了异常的性能，那么在 Stack Overflow 上提问是可接受的。包括你选择的分析器中的结果，并识别瓶颈，将有助于保持主题的集中性。
如果你只是让别人帮你优化代码，那就不要在任何地方提问。得到一个代码转储并被要求找到性能瓶颈，这只是我在专业工作中才会做的事，而不是在我的空闲时间。" (Makoto) ^[3]

另请参阅 : 代码审查 ^[4]

内部循环

维基百科中的内部循环 ^[5]

数值计算的优化

数字类型

double 比 float 快
"在我的测试（使用扰动）中，在 Fraktaler-3 中，x87 长双精度浮点数在 AMD 2700X CPU 上比双精度浮点数慢大约 4.2 倍。AMD RX 580 GPU 上的 floatexp（float + int32）比 AMD 2700X CPU 上的 x87 长双精度浮点数快大约 2.3 倍，因此根据硬件的不同，x87 长双精度浮点数可能没有必要..."fractalforums.org: from-java-to-cplusplus

距离

距离近似
科比 1/4 定理 ^[6]

分解

四叉树分解

"I did a web-based quad-tree Mandelbrot once.^[7]  I labelled the quadrants (eg a,b,c,d) the tile files were in directories (something like a.png b.png a/a.png a/b.png a/c/a/b.png), along with an sqlite database that said which tiles  were boring (all same colour), along with their period (0 for 100% exterior, 1 for 100% inside the main cardioid, 2 for 100% inside the main circle, etc).  The client was really simple, just a web page with tile paths; the server  would do a redirect to the relevant boring tile 0.png 1.png 2.png etc if any parent of the tile was boring, otherwise it'd try to serve the tile.png.  Meanwhile there was another process that just generated the tiles and populated the database.  I can't remember if I made 404 tile not found errors bump the priority of the tile for the renderer, but maybe.

Eventually i abandoned the project because recalculating from scratch is typically faster and more flexible for shallow zooms, and for deep zooms the storage is prohibitive." Claude Heiland-Allen ^[8]

对称性

镜像对称

参数平面和曼德布洛特集合在平面中关于 x 轴对称 :^[9]

colour(x,y) = colour(x,-y)

这里有一个 C 代码示例，展示了如何将对称图像（其轴位于其高度的中间）分成 3 部分

对称轴
2 个对称部分 : 轴上和轴下

// fill array using mirror symmetry of image 
// uses global var :  ...
int FillArray(unsigned char data[] )
{
   
 unsigned char Color; // gray from 0 to 255

// draw axis of symmetry
iy = iyAxisOfSymmetry; 
for(ix=ixMin;ix<=ixMax;++ix) PlotPoint(ix, iy, GiveColor(ix, iy));

// draw symmetric parts : above and below axis 
for(iyAbove = iyAboveMin; iyAbove<=iyAboveMax; ++iyAbove) 
  for(ix=ixMin; ix<=ixMax; ++ix)

  { // above axis compute color and save it to the array
    iy = iyAxisOfSymmetry + iyAbove;
    Color = GiveColor(ix, iy);
    PlotPoint(ix, iy, Color ); 
    // below the axis only copy Color the same as above without computing it 
    PlotPoint(ix, iyAxisOfSymmetry - iyAbove , Color ); 
   }   
 return 0;
}

另请参阅用于一般情况的 Pascal 代码 (Delphi) ^[10]

旋转对称

动力学平面和朱利亚集合具有旋转对称性。它可以用来加速计算

// fill array using symmetry of image 
// uses global var :  ...
int FillArraySymmetric(unsigned char data[] )
{
   
 unsigned char Color; // gray from 0 to 255

printf("axis of symmetry \n"); 
iy = iyAxisOfSymmetry;

for(ix=ixMin;ix<=ixMax;++ix) PlotPoint(ix, iy, GiveColor(ix, iy));

// above and below axis 
for(iyAbove = iyAboveMin; iyAbove<=iyAboveMax; ++iyAbove) 
  {printf(" %d from %d\n", iyAbove, iyAboveMax); //info 
  for(ix=ixMin; ix<=ixMax; ++ix)

  { // above axis compute color and save it to the array
    iy = iyAxisOfSymmetry + iyAbove;
    Color = GiveColor(ix, iy);
    PlotPoint(ix, iy, Color ); 
    // below the axis only copy Color the same as above without computing it 
    PlotPoint(ixMax-ix, iyAxisOfSymmetry - iyAbove , Color ); 
   } 
}  
 return 0;
}

复数

不显式定义复数或不使用复数类型的计算通常更快。

并行计算

在曼德布洛特和朱利亚集合中，每个点/像素都是独立计算的，因此可以轻松地将其分解成更小的任务并同时执行。 ^[11]

并行计算类型 : 维基百科描述 ^[12]

多核（计算）
- Parallella 和 OpenCl ^[13]
- 英特尔 MIC ^[14]
- MPI ^[15]
- OpenMP ^[16] ^[17] ^[18]
通用图形处理器计算 (GPGPU)
- OpenCl ^[19]
- CUDA ^[20]
- WebGl ^[21] ^[22]

向量化

C

gcc 向量扩展 ^[23]^[24]^[25]^[26]
英特尔® 高级向量扩展 ^[27]^[28]

c++

Eigen 中的 Mandelbrot - 显式矢量化

R

矢量化 ^[29]^[30]

以下是上面图片的 R 代码，包含注释：^[31]

## based on the R code by Jarek Tuszynski
## http://commons.wikimedia.org/wiki/File:Mandelbrot_Creation_Animation_%28800x600%29.gif

iMax=20; # number of frames and iterations

## world coordinate ( float )
cxmin = -2.2;
cxmax = 1.0;
cymax = 1.2;
cymin = -1.2;

## screen coordinate ( integer)

dx=800; dy=600 ;           # define grid size

## sequences of  values = rows and columns
cxseq = seq(cxmin, cxmax, length.out=dx);
cyseq = seq(cymin, cymax, length.out=dy);
## sequences of  values = rows * columns
cxrseq = rep(cxseq, each = dy);
cyrseq = rep(cyseq,  dx);
C = complex( real=cxrseq, imag = cyrseq); # complex vector
C = matrix(C,dy,dx)       # convert from vector to matrix
# Now C is a matrix of c points coordinates (cx,cy)

# allocate memory for all the frames
F = array(0, dim = c(dy,dx,iMax)) # dy*dx*iMax array filled with zeros

Z = 0                     # initialize Z to zero
for (i in 1:iMax)         # perform iMax iterations
{
  Z = Z^2+C               # iteration; uses n point to compute n+1 point
  
  F[,,i]  = exp(-abs(Z))              # an omitted index is used to represent an entire column or row
     # store magnitude of the complex number, scale it using an exponential function to emphasize fine detail, 
}

# color and save F array to the file 
library(caTools)          # load library with write.gif function
# mapped to color palette jetColors . Dark red is a very low number, dark blue is a very high number
jetColors = colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
write.gif(F, "Mandelbrot.gif", col=jetColors, delay=100, transparent=0)

多核

"我总是创建与我拥有的核心数量相同的线程。超过这个数量，你的系统会花费太多时间进行任务切换" - Duncan C^[32]^[33]

使用 OpenMP

// fill array using symmetry of image 
// uses global var :  ...
int FillArraySymmetric(unsigned char data[] )
{
   
 unsigned char Color; // gray from 0 to 255

printf("axis of symmetry \n"); 
iy = iyAxisOfSymmetry; 
#pragma omp parallel for schedule(dynamic) private(ix,Color) shared(ixMin,ixMax, iyAxisOfSymmetry)
for(ix=ixMin;ix<=ixMax;++ix) PlotPoint(ix, iy, GiveColor(ix, iy));

/*
The use of ‘shared(variable, variable2) specifies that these variables should be shared among all the threads.
The use of ‘private(variable, variable2)’ specifies that these variables should have a separate instance in each thread.
*/

#pragma omp parallel for schedule(dynamic) private(iyAbove,ix,iy,Color) shared(iyAboveMin, iyAboveMax,ixMin,ixMax, iyAxisOfSymmetry)

// above and below axis 
for(iyAbove = iyAboveMin; iyAbove<=iyAboveMax; ++iyAbove) 
  {printf(" %d from %d\n", iyAbove, iyAboveMax); //info 
  for(ix=ixMin; ix<=ixMax; ++ix)

  { // above axis compute color and save it to the array
    iy = iyAxisOfSymmetry + iyAbove;
    Color = GiveColor(ix, iy);
    PlotPoint(ix, iy, Color ); 
    // below the axis only copy Color the same as above without computing it 
    PlotPoint(ixMax-ix, iyAxisOfSymmetry - iyAbove , Color ); 
   } 
}  
 return 0;
}

单元

"索尼 PlayStation 3 是消费市场上最便宜的并行计算机之一。" ^[34]^[35]

GPGPU

CUDA
- 适用于 CUDA 的 Yellow Dog Linux^[36]
- par4all ^[37]
OpenCl
GLSL - GLSL 编辑工具
- Fragmentarium : 用于使用基于 GPU（GLSL）的像素图形构建的 C++、OpenGL/GLSL 和 Qt 4 的游乐场。
- 着色器编辑器使用 kickjs
- GLSL 沙盒由 Ricardo Cabello Miguel 使用 three.js
- jsfiddle 和示例
- WebGl : 通过 html 使用 GLSL。它是一个基于 OpenGL ES 2.0 API 的 3D 图形 API，开发人员可以使用它来创建完整的 3D 网络应用程序。
  - Inigo Quilez、Pol Jeremias 的 shadertoy : 一个基于浏览器的片段着色器编辑器，用于创建 2D 效果。
    - shadertoy 如何运作？
  - Chrome 实验
  - WebGL 游乐场 - 新的不工作！！
  - 示例

Canvas : 支持浏览器的列表

效果

在 2 核 CPU 上的效果（每个核心 1 个线程）

方法	时间 [分钟]	相对速度
CPU & 无优化	80	1
CPU & 对称性	39	2
CPU & 对称性 & OpenMP	24	3
GPU	? (待办)

GPU 应该是

比 CPU 快 62 倍 ^[38]
快 760 倍 ^[39]

扰动算法与级数逼近（在 e100 附近的缩放级别快 100 倍！）^[40]

使用 WinXP Intel 2.9 GHz CPU（使用 1 个 CPU）和 GTX 480 GPU，我使用 1000x1000 图和 1000 次最大迭代得到以下结果：^[41]

类型	时间	描述
gpu	0.07 秒	gpu 是 GPU 上的纯 CUDA 解决方案
gpuarray	3.4 秒	gpuarray 在 GPU 上使用 Python 中类似 numpy 的 CUDA 包装器
numpy	43.4 秒	numpy 是 CPU 上的纯 Numpy（基于 C）解决方案
python（串行）	1605.6 秒	python 是 CPU 上的纯 Python 解决方案，使用 numpy 数组

参考文献

[1] stackoverflow 问题：如何以最快的方式计算 e 到 2 万亿位

[2] 计算机语言基准测试游戏

[3] Stack Overflow 元数据：可以问关于代码优化的帮助吗？

[4] stack exchange：代码审查

[5] 维基百科中的内循环

[6] 使用距离估计进行自适应过采样，作者：Claude Heiland-Allen

[7] 距离估计，作者：Claude Heiland-Allen

[8] ractalforums.org：使用四叉树快速创建超分辨率分形图像

[9] 维基百科：反射对称

[10] raktal.republika.pl 上关于 x 轴的镜像对称

[11] 令人尴尬的并行问题

[12] 维基百科：并行计算机 - 并行计算机的类别

[13] 使用 OpenCl 和 Parallella 的 Mandelbrot

[14] 维基百科：英特尔 MIC

[15] 并行 Mandelbrot 集（使用消息传递接口 (MPI) 库的 C 代码），作者：Omar U. Florez

[16] Joel Yliluoma 编写的《OpenMP 指南：C++ 的简单多线程编程》

[17] rogle 编写的《使用 OpenMP 的并行 Mandelbrot》

[18] udiusmaximus : 指数映射和 OpenMP

[19] ythreads 编写的《第 1 部分：OpenCL™ – 可移植并行性》

[20] Loren Shure 编写的《Matlab 中的 GPU 上的 Mandelbrot 集》

[21] Felix Woitzel 编写的《使用 webgl 的渐进 Julia 分形》

[22] roku.com 上的 glsl 沙箱

[23] : 向量扩展

[24] rt hubert 编写的《使用 gcc 向量支持的第一个代码示例》

[25] EJRH Edmund Horner 编写的《使用 SIMD 的 Mandelbrot 计算》

[26] 计算机语言基准测试游戏

[27] 英特尔高级矢量扩展

[28] Tim Horton 编写的《ICC 和 Mandelbrot》

[29] 向量化（并行计算）

[30] Matlab - 代码向量化指南

[31] J.D. Long 编写的《R 中的 Mandelbrot 集》

[32] /programming/what-is-the-general-approach-to-threading-for-2d-plotting/ Fractal 论坛讨论 : 2D 绘图的线程通用方法是什么

[33] ractalforums 讨论 : 以合理的价格获得最强大的计算机？PHPSESSID=7db41da33f499eb25ef85f46866438f2

[34] 针对特定硬件的算法 -德国萨尔大学数学图像分析组

[35] Jenks 并行编程博客

[36] 在 Cell BE 处理器上编程高性能应用程序，第 1 部分：PLAYSTATION 3 上的 Linux 简介

[37] r4all

[38] David Bucciarelli 编写的《MandelCPU 与 MandelGPU》

[39] thworks : 说明三种 GPU 计算 Mandelbrot 集的方法

[40] Bernard Geiger 编写的《Kalles Fraktaler 2》

[41] Ian Ozsvald ianozsvald.com 2010 年 7 月编写的《使用 GPU、串行 numpy 和更快的 numpy 计算的 Mandelbrot，用于展示 CPU 和 GPU 计算之间的速度差异》

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[41]