Madgwick.xyz

August 31, 2018 (2018-08-31)
Updated: December 11, 2022 (2022-12-11)

Calculating Pi in Parallel on GPU and CPU using the Bailey–Borwein–Plouffe formula.

Remember the benchmarking program Super Pi? It is a prime example of a benchmark that fell out of use because it was single-threaded in an increasingly multi-threaded era. Every modern CPU released in the last few years has at least 2 cores, and the latest enthusiast chips reach 32 cores and 64 threads (AMD Threadripper 2990WX). While single-thread performance does improve with each generation, making real use of the most powerful processors requires parallel programming.

GPUs, on the other hand, have many small compute processors which, taken together, are often more powerful than a CPU. But each individual GPU thread is slow on its own, so GPUs are useless for any task that cannot scale to use all of the available parallelism.

The video below is a demonstration of my implementation of the Bailey–Borwein–Plouffe formula [1].

This is able to calculate any specified hexadecimal place of Pi, within certain limits imposed by computational precision. It is an excellent candidate for parallel implementation, and the two programs I wrote, one for the CPU and one for the GPU, scale to almost any number of threads. For calculating the 10,000,000th hexadecimal place, 40 million threads could be used in theory, although based on my experience the overheads mean that around 40,000 is a better choice.

My GPU implementation uses 700 threads while the CPU implementation uses only 8, yet the CPU is not far behind. This is because the individual GPU threads are much slower. GPUs perform at their best when working on shorter data types. The double precision (fp64) performance of my R9 290 is rated at 600 GFLOPS [3], yet its single precision (fp32, float in C) performance is said to be 4.8 TFLOPS, a ratio of 1:8. Despite being much newer, the vast majority of current GPUs have ratios of 1:16 (AMD) or 1:32 (Nvidia), crippling their double precision performance. Only workstation-class cards have meaningful double precision performance.

For reasons better explained by D. Bailey [2] in his excellent paper on the formula, fp64 is needed to estimate hexadecimal places up to about 10 million. With 128-bit precision, places up to 100 trillion can be estimated. However, CPUs only handle up to 80-bit precision (aka long double) natively in hardware, and GPUs only fp64. Libraries can be used to work with 128-bit precision, but as these are not implemented natively in hardware they are as much as 100 times slower than 64-bit.

In time I will add more to this page about the formula itself and my implementation, including download links once I have implemented the GPU algorithm in CUDA and compiled Windows 10 versions.

Double Precision GPU Performance
Year  Brand   Name                    FP64          FP32          Ratio  Launch Price (USD)
2012  Nvidia  GTX 680                 135 GFLOPS    3250 GFLOPS   1:24   499
2012  Nvidia  Tesla K20X              1312 GFLOPS   3935 GFLOPS   1:3    7699
2012  AMD     Radeon 7970 GHz         1075 GFLOPS   4301 GFLOPS   1:4    499
2013  Nvidia  GTX 780 Ti              222 GFLOPS    5345 GFLOPS   1:24   699
2013  Nvidia  Quadro K6000            1732 GFLOPS   5196 GFLOPS   1:3    5265
2013  AMD     Radeon R9 290X          704 GFLOPS    5632 GFLOPS   1:8    549
2013  AMD     Radeon 7990             2048 GFLOPS   8192 GFLOPS   1:4    999
2014  Nvidia  GTX 980                 155 GFLOPS    4981 GFLOPS   1:32   549
2014  Nvidia  GTX Titan Black         1882 GFLOPS   5645 GFLOPS   1:3    999
2014  AMD     FirePro W9100           2619 GFLOPS   5238 GFLOPS   1:2    3999
2015  Nvidia  GTX 980 Ti              189 GFLOPS    6060 GFLOPS   1:32   649
2015  AMD     Radeon R9 Fury X        538 GFLOPS    8602 GFLOPS   1:16   649
2015  AMD     FirePro S9170           2619 GFLOPS   5238 GFLOPS   1:2    3999
2016  Nvidia  GTX 1080                277 GFLOPS    8873 GFLOPS   1:32   599
2016  Nvidia  Tesla P100              4763 GFLOPS   9526 GFLOPS   1:2    4599
2016  AMD     Radeon Pro Duo          1024 GFLOPS   16384 GFLOPS  1:16   1499
2017  Nvidia  GTX 1080 Ti             354 GFLOPS    11340 GFLOPS  1:32   699
2017  Nvidia  GTX Titan V             7450 GFLOPS   14899 GFLOPS  1:2    2999
2017  AMD     Radeon RX Vega 64       786 GFLOPS    12583 GFLOPS  1:16   499
2018  Nvidia  RTX 2080 Ti             420 GFLOPS    13448 GFLOPS  1:32   999
2018  Nvidia  Quadro GV100            8335 GFLOPS   16671 GFLOPS  1:2    8999
2018  AMD     Radeon Instinct MI60    7373 GFLOPS   14746 GFLOPS  1:2    Not Publicly Known
2019  AMD     Radeon VII              3360 GFLOPS   13440 GFLOPS  1:4    699
2020  Nvidia  RTX 3090                556 GFLOPS    35580 GFLOPS  1:64   1499
2020  Nvidia  A100 PCIe               9746 GFLOPS   19490 GFLOPS  1:2    9999
2020  AMD     Radeon RX 6900 XT       1440 GFLOPS   23040 GFLOPS  1:16   999
2020  AMD     Radeon Instinct MI100   11540 GFLOPS  23070 GFLOPS  1:2    6400
2021  Nvidia  RTX 3080 Ti             533 GFLOPS    34100 GFLOPS  1:64   1199
2021  AMD     Radeon Instinct MI250X  47870 GFLOPS  47870 GFLOPS  1:1    Not Publicly Known
2022  Nvidia  RTX 4090                1290 GFLOPS   82580 GFLOPS  1:64   1599
2022  Nvidia  H100 PCIe               12040 GFLOPS  24080 GFLOPS  1:2    ~30000
2022  AMD     Radeon RX 7900 XTX      3838 GFLOPS   61420 GFLOPS  1:16   999
References