Madgwick.xyz

July 22, 2018 (2018-08-22)
Updated: December 23, 2018 (2018-12-23)

OLD VERSION A quick look at Compute Languages

This is the old version. Please see the new version. This is just a quick look at the compute platforms available and the PC hardware they support, put into one place (mostly for my own reference). I use Tensorflow as an example of how this situation plays out for a piece of software that utilises compute. A variety of compute platforms exist, targeting different hardware with differing degrees of development, documentation and adoption.

The two most well-known and longest-developed platforms are CUDA and OpenCL. CUDA is developed by Nvidia exclusively for their devices, while OpenCL is developed by the Khronos group and is an open, royalty-free standard which can be implemented by any hardware manufacturer.

CUDA

Since its introduction in 2007, CUDA has been continuously developed by Nvidia. There are two different version numbers to pay attention to. Compute Capability refers to the instruction level that can be used on a particular device; with each new generation of devices comes a newer version supporting extra operations that don't exist on earlier devices. The CUDA development toolkit itself has a separate version number. Earlier versions of the toolkit supported all devices regardless of their Compute Capability. However, since toolkit version 7, support for the oldest devices has gradually been removed once those devices became legacy, which is when they stop receiving updated drivers.
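For reference, the Compute Capability of an installed card can be queried through the CUDA runtime API. A minimal sketch (plain host code, assuming the CUDA toolkit is installed; build with nvcc):

```cpp
// Query the Compute Capability of each CUDA device via the runtime API.
// Build with: nvcc query_cc.cpp -o query_cc
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA devices found.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor together form the Compute Capability,
        // e.g. 6.1 for a Pascal-generation card.
        std::printf("Device %d: %s (Compute Capability %d.%d)\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```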

CUDA Support
Codename | Compute Capability | GPU Number(s) | Latest Driver? | Latest CUDA Toolkit?
---------|--------------------|---------------|----------------|---------------------
Tesla | 1.0 | Early Geforce 8800 | No (Legacy Driver 342.01) | No (Toolkit version 6.5)
Tesla | 1.1, 1.2 | Older Geforce 8xxx and 9xxx, GT/GTX 2xx and 3xx except below | No (Legacy Driver 342.01) | No (Toolkit version 6.5)
Tesla | 1.3 | GTX 295, 285, 280, 275, 260 | No (Legacy Driver 342.01) | No (Toolkit version 6.5)
Fermi | 2.0 | GTX 480, 470, 465, 590, 580, 570 | Yes | No (Toolkit version 8.0)
Fermi | 2.1 | GTX 460, GTS 450, GT 4xx, GT 6xx | Yes | No (Toolkit version 8.0)
Kepler | 3.0 | GTX 690, 680, 670, 660, 650, 770, 760, GT 740 | Yes | Yes
Kepler | 3.5 | GTX Titan(s), 780 Ti, 780 | Yes | Yes
Maxwell | 5.0 | GTX 750 Ti, GTX 750 | Yes | Yes
Maxwell | 5.2 | GTX Titan X, 980 Ti, 980, 970, 960, 950 | Yes | Yes
Pascal | 6.1 | GTX Titan(s), 1080 Ti, 1080, 1070 (Ti), 1060, 1050, GT 1030 | Yes | Yes
Volta | 7.0 | GTX Titan V | Yes | Yes
Turing | 7.5 | RTX Titan, 2080 Ti, 2080, 2070, 2060 | Yes | Yes

OpenCL and SYCL

Released two years after CUDA, OpenCL aims to be an open compute standard that any vendor can implement rather than being tied to a specific one. Versioning is simpler in OpenCL because development is not as fast-paced as CUDA's: version 1.2 was finalised in 2011, and the latest version, 2.2, is still in development and not supported by many vendors. Because compute kernels are compiled at runtime by the driver, OpenCL, unlike CUDA, doesn't need its own offline compiler; only C header and library files are needed to start development, not a whole toolkit.
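The runtime-compilation model is easiest to see in code. Below is a minimal sketch of an OpenCL 1.2 host program in C++: the kernel is just a string handed to the driver, which compiles it at run time (most error checking omitted for brevity):

```cpp
// Minimal OpenCL host program: the kernel is plain source text that the
// driver compiles at runtime, so no vendor-specific offline compiler is needed.
// Build with something like: g++ vecscale.cpp -lOpenCL
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSource =
    "__kernel void scale(__global float* data, float factor) {"
    "    size_t i = get_global_id(0);"
    "    data[i] *= factor;"
    "}";

int main() {
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr) != CL_SUCCESS) return 1;

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Runtime compilation: hand the source string to the driver.
    cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
    clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "scale", nullptr);

    std::vector<float> host(1024, 1.0f);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                host.size() * sizeof(float), host.data(), nullptr);
    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &factor);

    size_t global = host.size();
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, host.size() * sizeof(float),
                        host.data(), 0, nullptr, nullptr);
    std::printf("host[0] = %f\n", host[0]); // expect 2.0

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(program);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```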

SYCL1 is a higher-level abstraction layer for OpenCL, intended to make development more efficient than working with the lower-level OpenCL API directly. However, it doesn't appear to have seen as much adoption as OpenCL itself.
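For comparison, the same buffer-scaling operation written against the SYCL 1.2.1 interface looks roughly like this (a sketch only; exact headers and selector names depend on the SYCL implementation used):

```cpp
// The scale-a-buffer operation expressed in SYCL: buffers, accessors and a
// lambda kernel replace the manual context/queue/program management of OpenCL.
#include <CL/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
    std::vector<float> data(1024, 1.0f);
    {
        cl::sycl::queue q{cl::sycl::default_selector{}};
        cl::sycl::buffer<float, 1> buf{data.data(), cl::sycl::range<1>(data.size())};
        q.submit([&](cl::sycl::handler& cgh) {
            auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
            cgh.parallel_for<class scale_kernel>(
                cl::sycl::range<1>(data.size()),
                [=](cl::sycl::id<1> i) { acc[i] *= 2.0f; });
        });
    } // the buffer destructor copies results back into 'data'
    std::printf("data[0] = %f\n", data[0]); // expect 2.0
    return 0;
}
```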

OpenCL Support
OpenCL Version | Nvidia | AMD | Intel
---------------|--------|-----|------
1.0 | None | HD 48xx, HD 46xx | None
1.1 | Geforce 8xxx, 9xxx, GTX 2xx, 4xx, 5xx | HD 5xxx | None
1.2 | GTX 6xx, 7xx, 9xx, 10xx, Titan(s) | HD 6xxx, 7xxx | Core ix-3xxx (Ivy Bridge), ix-4xxx (Haswell)
2.0 | None (see notes) | HD 7790, HD 8xxx, R7/R9 2xx, R9 3xx, RX 4xx, 5xx, Fury(X), Vega 56/64, Instinct | None
2.1 | None | None (see notes) | Core ix-5xxx (Broadwell), ix-6xxx (Skylake), ix-7xxx (Kaby Lake), ix-8xxx (Coffee Lake)

Vulkan

The Vulkan API, also developed by the Khronos group, is an open standard used as a replacement for DirectX/OpenGL in 3D game engines. It can also be used for compute tasks9; the disadvantage is that it is a very low-level API. An advantage is that, because it is used in game engines, it is equally well supported by Nvidia and AMD.
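As a rough illustration, the sketch below only enumerates Vulkan devices and checks for a compute-capable queue family; an actual compute dispatch would additionally need a logical device, pipeline, descriptor sets and command buffers, which is where the low-level nature shows:

```cpp
// Enumerate Vulkan physical devices and report whether they expose a
// compute-capable queue family. Build with something like: g++ vk_query.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_0;

    VkInstanceCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        std::printf("No Vulkan driver available.\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);

        uint32_t qCount = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(dev, &qCount, nullptr);
        std::vector<VkQueueFamilyProperties> families(qCount);
        vkGetPhysicalDeviceQueueFamilyProperties(dev, &qCount, families.data());

        bool hasCompute = false;
        for (const auto& f : families)
            if (f.queueFlags & VK_QUEUE_COMPUTE_BIT) hasCompute = true;

        std::printf("%s: compute queue %s\n", props.deviceName,
                    hasCompute ? "available" : "not available");
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```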

A complex situation with AMD

Historically, AMD promoted OpenCL for compute applications on its GPUs. More recently this has evolved: AMD has released an entirely new open-source driver and software stack called ROCm. This includes two high-level languages, HC and HIP. HC (Heterogeneous Compute)10 is a C++ API similar to C++ AMP; it targets only AMD devices. HIP (Heterogeneous Computing Interface for Portability)11 is an API which is intentionally very similar to CUDA. It is intended to simplify the process of converting CUDA programs to run on AMD devices. AMD supply a tool to "hipify" existing CUDA code, converting it to HIP, which can then be compiled for and run on AMD devices. AMD's compiler also allows Nvidia devices to be targeted by HIP code. This is possible because HIP acts as an interfacing layer: when targeting AMD devices, the compiler (called hcc) compiles HIP directly for AMD devices, while when targeting Nvidia devices the HIP code is converted back to CUDA and passed to Nvidia's compiler (nvcc), which then compiles it as usual. The diagram below is a very simplified representation of the interactions between CUDA, HIP, HCC and NVCC.

[Diagram: simplified relationship between CUDA, HIP, HCC and NVCC]
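To give a feel for how closely HIP mirrors CUDA, here is a small sketch of a HIP program (assuming the ROCm/HIP toolchain is installed; the same code with hip* calls swapped for cuda* ones is essentially the original CUDA version):

```cpp
// A minimal HIP program: allocation, copy, kernel launch and copy back mirror
// the CUDA runtime API almost one-to-one (cudaMalloc -> hipMalloc, etc.).
// Build with: hipcc scale.cpp -o scale
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scale(float* data, float factor, int n) {
    int i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    std::vector<float> host(n, 1.0f);

    float* dev = nullptr;
    hipMalloc(reinterpret_cast<void**>(&dev), n * sizeof(float));
    hipMemcpy(dev, host.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Portable launch macro instead of the CUDA <<<grid, block>>> syntax.
    hipLaunchKernelGGL(scale, dim3(n / 256), dim3(256), 0, 0, dev, 2.0f, n);

    hipMemcpy(host.data(), dev, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dev);

    std::printf("host[0] = %f\n", host[0]); // expect 2.0
    return 0;
}
```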
ROCm Driver Support
Name | Codename | Example Cards | HC Support | OpenCL Support | Notes
-----|----------|---------------|------------|----------------|------
gfx701 | Hawaii | R9 290(X), R9 390(X) | Yes | Yes | Experimental support, no CPU or PCIe requirements
gfx802 | Tonga | R9 285, R9 385, R9 380(X) | No | Yes | Supported by driver but NOT by the HCC compiler (see note)
gfx803 | Fiji | R9 Fury(X), R9 Nano | Yes | Yes | Requires CPU with PCIe Gen3 + PCIe Atomics
gfx803 | Polaris 10/11 | RX 480–460, RX 580, 570, 560 (some) | Yes | Yes | Requires CPU with PCIe Gen3 + PCIe Atomics
gfx901 | Vega 10 | Vega FE, RX Vega 64, RX Vega 56 | Yes | Yes | Requires CPU with PCIe Gen3 + PCIe Atomics (can be configured for PCIe Gen 2 w/out Atomics)
gfx906 | Vega 20 | AMD Radeon Instinct MI60, MI50 | Yes | Yes | Requires CPU with PCIe Gen3 + PCIe Atomics
gfx1010 | Navi 10 | TBD AMD GPUs | Unknown | Unknown | TBD

Tensorflow as an example

Tensorflow4 is a well-known platform for machine learning and deep learning. It works as a Python library and aims to be simple to use by abstracting low-level implementations away. It supports CPUs and GPUs as back ends. The downloadable binaries (installed through Python pip) support standard CPUs and Nvidia GPUs with at least Compute Capability 3.5 (older versions supported CC 3.0). Nvidia support requires the CUDA toolkit and the cuDNN library to be installed.

There is no out-of-the-box support for AMD graphics cards. AMD have created a separate port5 of Tensorflow which supports some of their latest GPUs. This port requires the ROCm runtime (explained above) to be installed and a supported GPU in order to function. Benchmarks show performance to be similar to Nvidia cards.6 The main disadvantages are that fewer cards are supported and that the port is a version behind the latest Tensorflow release.

There is also an implementation of Tensorflow which uses OpenCL and SYCL.7 It requires OpenCL 1.2 and an extension called cl_khr_spir, which means that in theory it is supported by most AMD cards. However, from what I can tell, its performance is inferior8 to AMD's port, which has an optimised implementation.
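Whether a given device meets that requirement can be checked from the OpenCL device info; a small sketch querying the reported OpenCL version and extension list of the first GPU device found:

```cpp
// Check whether the first OpenCL GPU reports OpenCL 1.2+ and the cl_khr_spir
// extension, which the OpenCL/SYCL Tensorflow port relies on.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) {
        std::printf("No OpenCL GPU device found.\n");
        return 1;
    }

    char version[256] = {0};
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, nullptr);

    size_t extSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, 0, nullptr, &extSize);
    std::vector<char> extensions(extSize);
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, extSize, extensions.data(), nullptr);

    bool hasSpir = std::strstr(extensions.data(), "cl_khr_spir") != nullptr;
    std::printf("%s, cl_khr_spir: %s\n", version, hasSpir ? "yes" : "no");
    return 0;
}
```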

Tensorflow and AMD ROCm Tensorflow GPUs
Vendor | Tensorflow Supported GPUs
-------|---------------------------
Nvidia | GTX 780 Ti, 780, 750 (Ti), 9xx, 10xx, 20xx, Titan(s)
AMD | R9 Fury(X), R9 Nano, RX 480–460, Vega FE, RX Vega 64, Vega 56, RX 580–560, Instinct
References