Madgwick.xyz

August 31, 2018 (2018-08-31)
Updated: December 11, 2022 (2022-12-11)

Calculating Pi in Parallel on GPU and CPU using the Bailey–Borwein–Plouffe formula.

Remember the benchmarking program Super Pi? It is a prime example of a benchmark that fell out of use because it was single-threaded in an increasingly multi-threaded era. Every modern CPU released in the last few years has at least 2 cores, and the latest enthusiast chips reach 32 cores and 64 threads (AMD Threadripper 2990WX). While single-thread performance does improve with each generation, making real use of the most powerful processors requires parallel programming.

GPUs, on the other hand, have many small compute processors which, taken together, are often more powerful than a CPU. But each individual GPU thread is slow on its own, so GPUs are useless for any task that cannot scale to use all of the available parallelism.

The video below is a demonstration of my implementation of the Bailey–Borwein–Plouffe formula [1].

This is able to calculate any specified hexadecimal place of Pi, within certain limits imposed by computational precision. It is an excellent candidate for parallel implementation, and the two programs I wrote, one for the CPU and one for the GPU, scale to almost any number of threads. For calculating the 10,000,000th hexadecimal place, 40 million threads could be used in theory, although based on my experience the overheads mean that around 40,000 is a better choice.

My GPU implementation uses 700 threads while the CPU implementation uses only 8, yet the CPU is not far behind. This is because the individual GPU threads are much slower. GPUs perform at their best when working on shorter data types. The double precision (fp64) performance of my R9 290 is rated at 600 GFLOPS [3], yet its single precision (fp32, float in C) performance is said to be 4.8 TFLOPS, a ratio of 1:8. Despite being much newer, the vast majority of current GPUs have ratios of 1:16 (AMD) or 1:32 (Nvidia), crippling their double precision performance. Only workstation-class cards have meaningful double precision performance.

For reasons better explained by D. Bailey [2] in his excellent paper on the formula, fp64 is needed to estimate hexadecimal places up to about 10 million. With 128-bit precision, places up to 100 trillion can be estimated. However, CPUs only handle up to 80-bit precision (aka long double) natively in hardware, and GPUs only fp64. Libraries can be used to work with 128-bit precision, but as these are not implemented natively in hardware they are as much as 100 times slower than 64-bit.

In time I will add more to this page about the formula itself and my implementation, including download links once I have implemented the GPU algorithm in CUDA and compiled Windows 10 versions.

Double Precision GPU Performance
Year  Brand   Name                    FP64          FP32          Ratio  Launch Price (USD)
2012  Nvidia  GTX 680                 135 GFLOPS    3250 GFLOPS   1:24   499
2012  Nvidia  Tesla K20X              1312 GFLOPS   3935 GFLOPS   1:3    7699
2012  AMD     Radeon 7970 GHz         1075 GFLOPS   4301 GFLOPS   1:4    499
2013  Nvidia  GTX 780 Ti              222 GFLOPS    5345 GFLOPS   1:24   699
2013  Nvidia  Quadro K6000            1732 GFLOPS   5196 GFLOPS   1:3    5265
2013  AMD     Radeon R9 290X          704 GFLOPS    5632 GFLOPS   1:8    549
2013  AMD     Radeon 7990             2048 GFLOPS   8192 GFLOPS   1:4    999
2014  Nvidia  GTX 980                 155 GFLOPS    4981 GFLOPS   1:32   549
2014  Nvidia  GTX Titan Black         1882 GFLOPS   5645 GFLOPS   1:3    999
2014  AMD     FirePro W9100           2619 GFLOPS   5238 GFLOPS   1:2    3999
2015  Nvidia  GTX 980 Ti              189 GFLOPS    6060 GFLOPS   1:32   649
2015  AMD     Radeon R9 Fury X        538 GFLOPS    8602 GFLOPS   1:16   649
2015  AMD     FirePro S9170           2619 GFLOPS   5238 GFLOPS   1:2    3999
2016  Nvidia  GTX 1080                277 GFLOPS    8873 GFLOPS   1:32   599
2016  Nvidia  Tesla P100              4763 GFLOPS   9526 GFLOPS   1:2    4599
2016  AMD     Radeon Pro Duo          1024 GFLOPS   16384 GFLOPS  1:16   1499
2017  Nvidia  GTX 1080 Ti             354 GFLOPS    11340 GFLOPS  1:32   699
2017  Nvidia  GTX Titan V             7450 GFLOPS   14899 GFLOPS  1:2    2999
2017  AMD     Radeon RX Vega 64       786 GFLOPS    12583 GFLOPS  1:16   499
2018  Nvidia  RTX 2080 Ti             420 GFLOPS    13448 GFLOPS  1:32   999
2018  Nvidia  Quadro GV100            8335 GFLOPS   16671 GFLOPS  1:2    8999
2018  AMD     Radeon Instinct MI60    7373 GFLOPS   14746 GFLOPS  1:2    Not Publicly Known
2019  AMD     Radeon VII              3360 GFLOPS   13440 GFLOPS  1:4    699
2020  Nvidia  RTX 3090                556 GFLOPS    35580 GFLOPS  1:64   1499
2020  Nvidia  A100 PCIe               9746 GFLOPS   19490 GFLOPS  1:2    9999
2020  AMD     Radeon RX 6900 XT       1440 GFLOPS   23040 GFLOPS  1:16   999
2020  AMD     Radeon Instinct MI100   11540 GFLOPS  23070 GFLOPS  1:2    6400
2021  Nvidia  RTX 3080 Ti             533 GFLOPS    34100 GFLOPS  1:64   1199
2021  AMD     Radeon Instinct MI250X  47870 GFLOPS  47870 GFLOPS  1:1    Not Publicly Known
2022  Nvidia  RTX 4090                1290 GFLOPS   82580 GFLOPS  1:64   1599
2022  Nvidia  H100 PCIe               12040 GFLOPS  24080 GFLOPS  1:2    ~30000
2022  AMD     Radeon RX 7900 XTX      3838 GFLOPS   61420 GFLOPS  1:16   999
References