A quick look at Compute Languages

This is a quick look at the compute platforms available and the PC hardware they support, collected in one place (mostly for my own reference). I use Tensorflow as an example of how this situation plays out for a piece of software that utilises compute. There are a variety of compute platforms available, targeting different hardware, with differing degrees of development, documentation and adoption.

The two best-known and longest-developed platforms are CUDA and OpenCL. CUDA is developed by Nvidia exclusively for their devices, while OpenCL was developed by the Khronos group. OpenCL is an open and royalty-free standard, which can be implemented by any hardware manufacturer. OpenCL still appears to be maintained, but the final versions of the standard look unlikely ever to be implemented. The Khronos group now seems to be focusing its efforts on integrating all its platforms with SPIR-V.

CUDA

Since its introduction in 2007, CUDA has been continuously developed by Nvidia. There are two different version numbers to pay attention to. Compute Capability refers to the instruction level which can be used on a particular device. Each new generation of devices comes with a newer Compute Capability version, supporting extra operations which don’t exist on earlier devices. The CUDA development toolkit itself has a separate version numbering system. Earlier versions of the toolkit supported all devices regardless of their Compute Capability version. However, since toolkit version 7, support for the oldest devices has been gradually removed as those devices became legacy, that is, when they stopped receiving updated drivers.

CUDA Support
Codename	Compute Capability	GPU Names	Supported by Latest Driver?	Supported by Latest CUDA Toolkit?
Tesla	1.0	Early Geforce 8800	No (Legacy Driver 342.01)	No (Toolkit version 6.5)
Tesla	1.1, 1.2	Older Geforce 8xxx and 9xxx, GT/GTX 2xx and 3xx, except those listed below	No (Legacy Driver 342.01)	No (Toolkit version 6.5)
Tesla	1.3	GTX 295, 285, 280, 275, 260	No (Legacy Driver 342.01)	No (Toolkit version 6.5)
Fermi	2.0	GTX 480, 470, 465, 590, 580, 570	No (Legacy Driver 391.35)	No (Toolkit version 8.0)
Fermi	2.1	GTX 460, GTS 450, GT 4xx, GT 6xx	No (Legacy Driver 391.35)	No (Toolkit version 8.0)
Kepler	3.0	GTX 690, 680, 670, 660, 650, 770, 760, GT 740	Yes	Yes
Kepler	3.5	GTX Titan(s), 780 Ti, 780	Yes	Yes
Maxwell	5.0	GTX 750 Ti, GTX 750	Yes	Yes
Maxwell	5.2	GTX Titan X, 980 Ti, 980, 970, 960, 950	Yes	Yes
Pascal	6.1	GTX Titan(s), 1080 Ti, 1080, 1070 (Ti), 1060, 1050, GT 1030	Yes	Yes
Volta	7.0	GTX Titan V	Yes	Yes
Turing	7.5	RTX Titan, 2080 Ti, 2080, 2070, 2060, 1660 Ti, 1660	Yes	Yes
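
The toolkit cut-offs in the table above can be written down as a small lookup. This is only a sketch of the table's contents, not anything Nvidia ships: the function and dictionary names are made up for illustration, and "still supported" means supported at the time this table was put together.

```python
from typing import Optional, Tuple

# Last CUDA toolkit version that supports each Compute Capability major
# version, copied from the table above. Majors not listed here were still
# supported by the latest toolkit when the table was written.
LAST_TOOLKIT_BY_MAJOR = {
    1: "6.5",  # Tesla
    2: "8.0",  # Fermi
}


def parse_compute_capability(text: str) -> Tuple[int, int]:
    """Split a string such as "3.5" into (major, minor) integers."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def last_supported_toolkit(compute_capability: str) -> Optional[str]:
    """Return the last toolkit version for a device, or None if the device
    was still supported by the latest toolkit in the table."""
    major, _minor = parse_compute_capability(compute_capability)
    if major < 1:
        raise ValueError("Compute Capability versions start at 1.0")
    return LAST_TOOLKIT_BY_MAJOR.get(major)
```

For example, `last_supported_toolkit("2.1")` gives `"8.0"` for a GTX 460, while a Turing card at `"7.5"` gives `None`.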

OpenCL

Released 2 years after CUDA, OpenCL aimed to be an open compute standard, not tied to any vendor, for any device which could implement it. Versioning is simpler in OpenCL, as development has been slower than CUDA's. The widely used Version 1.2 was finalised in 2011 and the last Version was 2.2. Due to the way that compute kernels are implemented, OpenCL, unlike CUDA, doesn’t need its own compiler. Only C header and library files are needed to start development, not a whole toolkit. The OpenCL kernel code is passed to the device driver, where it is compiled. This approach allows the same program to work with more than one vendor without needing code changes or architecture-specific builds.
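
To make the "kernel source is handed to the driver" point concrete, here is a rough sketch using the third-party pyopencl and numpy packages (my choice for illustration; neither is required by OpenCL itself). The kernel is an ordinary string until `build()` is called, at which point the installed driver compiles it for whichever device was picked. The function name and the vector-add kernel are made up for this example, and it needs a working OpenCL driver to actually run.

```python
# OpenCL C kernel kept as plain text: nothing compiles it until it is
# handed to the driver at run time.
KERNEL_SOURCE = """
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""


def vector_add_on_device(a_values, b_values):
    """Add two equal-length lists of floats on an OpenCL device."""
    # Imported here so that this file still loads on machines without
    # pyopencl or numpy installed (both are assumptions of this sketch).
    import numpy as np
    import pyopencl as cl

    a = np.asarray(a_values, dtype=np.float32)
    b = np.asarray(b_values, dtype=np.float32)
    out = np.empty_like(a)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    # The vendor's driver compiles the source for the selected device here.
    program = cl.Program(ctx, KERNEL_SOURCE).build()

    flags = cl.mem_flags
    a_buf = cl.Buffer(ctx, flags.READ_ONLY | flags.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, flags.READ_ONLY | flags.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, flags.WRITE_ONLY, out.nbytes)

    # One work-item per array element.
    program.vector_add(queue, a.shape, None, a_buf, b_buf, out_buf)
    cl.enqueue_copy(queue, out, out_buf)
    return out.tolist()
```

The same Python file and kernel string should work unchanged on any vendor's OpenCL driver, which is the portability argument made above.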

SYCL was originally intended as an abstraction layer for OpenCL¹; however, as explained below, it has moved on since then. The OpenCL element of SYCL is more of a backwards compatibility feature and is not likely to see adoption (unless GPU vendors add support for OpenCL 2.2).

The current OpenCL looks set to be superseded by a newer and more flexible OpenCL-Next¹⁴. This will certainly work within the SPIR-V ecosystem described below.

OpenCL Support
OpenCL	Nvidia	AMD	Intel
1.0	None	HD 48xx, HD 46xx	None
1.1	Geforce 8xxx, 9xxx, GTX 2xx, 4xx, 5xx	HD 5xxx	None
1.2	GTX 6xx, 7xx, 9xx, 10xx, Titan(s)	HD 6xxx, 7xxx	Core ix-3xxx (Ivy Bridge), ix-4xxx (Haswell)
2.0	None (see notes)	HD 7790, HD 8xxx, R7/9 2xx, R9 3xx, RX Series and all later AMD GPUs	None
2.1	None	None (see notes)	Core ix-5xxx (Broadwell) and all later Intel CPU Platforms

SPIR-V (Vulkan, SYCL)

The Khronos group has developed a new ecosystem, and at the centre of it is SPIR-V¹⁷ (Standard Portable Intermediate Representation). This IR acts as an intermediary between various languages (shaders for games, compute etc.) and driver runtimes (OpenCL 2.2, Vulkan, OpenGL 4.6). Other than game-orientated shaders, this is all very much a work in progress. Khronos makes the standard, not the necessary compilers. It appears that in the long term (if implemented) it will be possible to take, say, OpenCL code, compile it to SPIR-V and then run it on a supported driver runtime. This would bring a huge advantage in flexibility, because it would decouple the various programming languages from the GPU compiler. So long as a language can be compiled to SPIR-V, it can be run on GPUs (and also CPUs and ASICs). Currently SPIR-V is working well for shaders, but compute support and compiler support are still being worked on.
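
Since SPIR-V is a binary format rather than source text, whatever a front-end compiler produces is a stream of 32-bit words that any driver runtime can inspect. As a small illustration, the sketch below reads the five-word module header. The magic number and header layout come from the SPIR-V specification rather than from anything in this article; the function name is made up, and the code assumes a little-endian module for simplicity.

```python
import struct

# First word of every SPIR-V module, as defined by the SPIR-V specification.
SPIRV_MAGIC = 0x07230203


def read_spirv_header(data: bytes) -> dict:
    """Decode the five-word header at the start of a SPIR-V module."""
    if len(data) < 20:
        raise ValueError("a SPIR-V header is five 32-bit words (20 bytes)")
    # Assumption: the module is stored little-endian.
    magic, version, generator, bound, _reserved = struct.unpack_from("<5I", data)
    if magic != SPIRV_MAGIC:
        raise ValueError("not a little-endian SPIR-V module")
    return {
        # The version word packs the major and minor numbers into bytes 2 and 1.
        "version": ((version >> 16) & 0xFF, (version >> 8) & 0xFF),
        "generator": generator,
        "id_bound": bound,
    }
```

Nothing in the header says which language the module was compiled from, which is the decoupling described above.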

Vulkan

The Vulkan API, also developed by the Khronos group, is an open standard which is used as a replacement for DirectX/OpenGL in 3D game engines. It can also be used for compute tasks⁹; the disadvantage is that it’s a very low-level API. An advantage is that, because it is used in game engines, it is equally well supported by Nvidia and AMD. Running SPIR-V compute code on Vulkan will allow a much broader range of languages to be supported on GPUs (see above). This is not possible yet, but it should be in future.

SYCL

Currently Intel are developing support for SYCL in the LLVM/Clang compiler¹⁵. There is also a separate project (hipSYCL)¹⁶ which allows SYCL code to be compiled straight to GPU machine code for AMD (via ROCm) and Nvidia GPUs. This doesn’t use SPIR-V, and so it will potentially be superseded once SPIR-V is fully supported for compute.

A complex situation with AMD

Historically, AMD promoted OpenCL for compute applications on its GPUs. This changed a few years ago when AMD released an entirely new open source Linux driver and software stack called ROCm. This originally included two high-level languages, HC and HIP. For the first time, this allowed some popular GPU-accelerated libraries to be run on AMD hardware. HC (Heterogeneous Compute)¹⁰ (now deprecated) was a C++ API similar to C++ AMP, and it targeted only AMD devices. Although it was easy to use and had some innovative features, it saw little use and had poor documentation, which is probably why it was dropped.

HIP (Heterogeneous Computing Interface for Portability)¹¹ is an API which is intentionally very similar to CUDA. It is intended to simplify the process of converting CUDA programs to run on AMD devices. AMD supply a tool to “hipify” existing CUDA code, converting it to HIP, which can then be compiled for and run on AMD devices. AMD’s compiler also allows Nvidia devices to be targeted by HIP code. This is possible because internally HIP code can operate as an interfacing layer. When targeting AMD devices, the compiler (called hcc) compiles HIP directly for AMD devices. When targeting Nvidia devices, hcc converts the HIP code back to CUDA and then passes it to Nvidia’s compiler (nvcc), which then compiles it as usual. The diagram below is a very simplified representation of the interactions between CUDA, HIP, HCC and NVCC.
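
Because HIP mirrors CUDA so closely, much of the "hipify" step amounts to renaming. The toy sketch below shows that idea with a handful of CUDA runtime names and their HIP counterparts. It is not AMD's tool and covers only the few names listed; the function name and the regular expression approach are just for illustration.

```python
import re

# A few CUDA runtime names and their HIP equivalents (far from exhaustive).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Match whole identifiers only, trying longer names first so that
# cudaMemcpyHostToDevice is never partially matched as cudaMemcpy.
_NAMES = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in _NAMES)
    + r")(?![A-Za-z0-9_])"
)


def toy_hipify(cuda_source: str) -> str:
    """Rename the known CUDA identifiers in a source string to HIP ones."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(1)], cuda_source)
```

Running it over a line such as `cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);` gives `hipMemcpy(d, h, n, hipMemcpyHostToDevice);`, leaving everything else in the line untouched.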

ROCm Driver Support
Internal Code	Codename	Example Cards	ROCm Support	Notes
gfx701	Hawaii	R9 290(X), R9 390(X)	Yes (before 2.0), No (after 2.0)	Experimental support in at least ROCm versions < 2.0
gfx802	Tonga	R9 285, R9 385, R9 380(X)	No	Support was never completed (see footnote)
gfx803	Fiji	R9 Fury(X), R9 Nano	Yes	Requires CPU with PCIe Gen3 + PCIe Atomics. Some low-end cards and mobile chips are unsupported.
gfx803	Polaris 10/11	RX 480/470, RX 580/570, RX 560 (some)	Yes	Requires CPU with PCIe Gen3 + PCIe Atomics. Some low-end cards and mobile chips are unsupported.
gfx803	Polaris 20	RX 590	Yes	Requires CPU with PCIe Gen3 + PCIe Atomics. Some low-end cards and mobile chips are unsupported.
gfx901	Vega 10	Vega FE, RX Vega 64/56, MI 25	Yes	All chips fully supported
gfx906	Vega 20	Radeon VII, Instinct MI 60, MI 50	Yes	All chips fully supported
gfx908	Vega 20/30?	Instinct MI 120, MI 100		AKA Arcturus, release TBD
gfx1010	Navi 10	RX 5700 (XT)	No	Unknown if ROCm support will be added in future

Tensorflow as an example

Tensorflow⁴ is a well-known platform for machine learning and deep learning. It works as a library for Python and aims to be simple to use by abstracting low-level implementation details away. It supports CPUs and GPUs as its back end. The downloadable binaries (installed through Python's pip) support standard CPUs and Nvidia GPUs which have at least Compute Capability 3.5; older versions supported CC 3.0. Nvidia support requires the CUDA toolkit and the cuDNN library to be installed.
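
A quick way to see how this plays out on a given machine is to check the card against the 3.5 minimum and then ask Tensorflow which GPUs it can actually see. This is a minimal sketch: the helper names are made up, and the device query assumes a Tensorflow release that provides `tf.config.list_physical_devices` (2.1 or later, as far as I know).

```python
# Minimum Compute Capability for the downloadable binaries, as stated above.
MIN_COMPUTE_CAPABILITY = (3, 5)


def meets_tensorflow_minimum(compute_capability: str) -> bool:
    """Compare a string such as "6.1" against the 3.5 minimum."""
    major, minor = compute_capability.strip().split(".")
    return (int(major), int(minor)) >= MIN_COMPUTE_CAPABILITY


def visible_gpus() -> list:
    """Return the names of the GPUs Tensorflow can see, or an empty list
    if Tensorflow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]
```

An empty list from `visible_gpus()` on a machine with a supported card usually points at a missing CUDA toolkit or cuDNN install rather than at the card itself.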

AMD have created a port⁵ of Tensorflow which supports their latest GPUs. This port requires the ROCm runtime (explained above) to be installed, and a supported GPU to function. Benchmarks show performance to be similar to Nvidia cards.⁶ A disadvantage is that fewer cards are supported and the version of Tensorflow is behind the latest release.

Before AMD's ROCm port, there was a third-party implementation of Tensorflow which used OpenCL and SYCL.⁷ This required OpenCL 1.2 and an extension called cl_khr_spir. In theory it was supported by most AMD cards. However, AMD dropped support for SPIR in modern drivers¹³, and so this approach no longer functions. In any case, from what I could tell, the performance was inferior⁸ to AMD's ROCm implementation, which is being continuously optimised and developed.

Tensorflow and AMD ROCm Tensorflow GPUs
Vendor	Tensorflow Supported GPUs
Nvidia	GTX 780 Ti, 780, 750 (Ti), GTX 900 series and all later Nvidia GPUs
AMD	All GPUs that are fully supported by the ROCm driver (see above table)

Madgwick.xyz