Of late, I’ve been playing around with the computer language Futhark, and would like to share a little of what I have learned.
The aim of the language is to make it easy for standard humans to write high-performance code for GPUs, those ultra-fast graphics processing units beloved by groups as different as gamers and researchers in fields such as machine learning. Futhark, which has an active development team based at the University of Copenhagen, is the brainchild of Troels Henriksen and is the subject of his 2017 PhD thesis.
Futhark’s contribution is significant because while GPUs, with hundreds of cores, are amazingly fast, writing good code for them is difficult.
Channel the power
To make full use of the power of a GPU, the code must be written so that it is parallelizable, allowing most or all of the cores to work on the computation at once. This is tricky business, and at present the best-performing GPU code is typically hand-crafted by experts in the field. This is where Futhark comes in, providing both a language and an optimizing compiler. You, the programmer, express your computational ideas in Futhark using bulk operations on arrays and high-level constructs such as reduce. The compiler translates your program text into GPU code that makes full use of the hardware. As Henriksen says,
Because it’s nicer than writing CUDA or OpenCL by hand!
Does it work?
This all sounds very appealing. But does Futhark fulfill its promise? The benchmarks displayed below, taken from a video of Henriksen speaking at BornHack in 2018, tell the story. The bars (blue for NVIDIA and red for AMD) show the speed-up of compiled Futhark code relative to hand-crafted expert code. Most speedups are greater than 1, and many are much greater, which is good, while a few are less than 1, which is not so good. On the whole, the performance is excellent, and it will likely improve as the project, which is still young, matures.
Why does it work?
Why does the Futhark compiler produce such good code? The fundamental answer is that the source text on which the compiler operates is written in a functional language in the tradition of ML, Elm, and Haskell. (Indeed, the Futhark compiler is itself written in Haskell.) This means that the code expresses the overall intention of the computation using high-level operators like map and reduce, which are inherently parallelizable. These constructs are things that the compiler “knows” and can “reason about,” giving it the information it needs to produce optimized output. Of course, the devil is in the details: there is a large distance between “inherently parallelizable” and “actually parallelized.”
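To make the "inherently parallelizable" point concrete, here is a minimal Python sketch (an illustration only, not Futhark's actual compilation strategy): when the combining operator is associative, a reduce can be split into chunks, each chunk reduced independently (as separate GPU cores might do), and the partial results combined afterward.

```python
from functools import reduce
import operator

def parallel_style_reduce(op, identity, xs, chunks=4):
    """Reduce each chunk independently, then combine the partial results.
    This gives the same answer as a sequential reduce whenever `op` is
    associative with identity element `identity`."""
    n = len(xs)
    size = max(1, (n + chunks - 1) // chunks)
    partials = [reduce(op, xs[i:i + size], identity)
                for i in range(0, n, size)]
    return reduce(op, partials, identity)

xs = list(range(1, 101))
# Addition is associative, so the chunked and sequential reductions agree.
print(parallel_style_reduce(operator.add, 0, xs))  # 5050, same as sum(xs)
```

The chunked structure is what lets many workers proceed at once; associativity is what guarantees they all agree on the answer.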
Another reason that the compiler is so good is that it has a sound theoretical foundation, e.g., the second-order array combinators (SOACs) around which the language is built.
Futhark, therefore, provides a language common to both the human programmer and the compiler by means of which both can reason about computation. Few things have more practical impact than good theory!
Let’s look at a few snippets of Futhark code. Below is an example taken from futhark-lang.org. It computes the average of an array of numbers, xs. The fragment (xs: []f64) tells the compiler that the argument of the average function is an array of 64-bit floating-point numbers. The fragment reduce (+) 0.0 xs tells us that Futhark is a functional language; the word reduce is the tipoff. What about the last fragment? The expression (length xs) computes the length of the array, and the function r64 converts that length from an integer to a 64-bit float. Types are a big deal in Futhark: there are no automatic conversions. You have to be clear about what you are doing so that the compiler can do its job.

let average (xs: []f64) = reduce (+) 0.0 xs / r64 (length xs)
Futhark has a repl in which you can experiment with code that you are writing:
> let average (xs: []f64) = reduce (+) 0.0 xs / r64 (length xs)
> average [1, 2, 3, 4, 2, 1]
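For readers who don’t have Futhark installed, the same computation can be sketched in Python with functools.reduce (just an analogy; Python’s reduce is sequential, with no compiler magic behind it):

```python
from functools import reduce

def average(xs):
    # Analogue of the Futhark: reduce (+) 0.0 xs / r64 (length xs)
    return reduce(lambda acc, x: acc + x, xs, 0.0) / len(xs)

print(average([1, 2, 3, 4, 2, 1]))  # 2.1666666666666665
```

Note that the explicit conversion r64 has no Python counterpart: Python silently promotes the integer length to a float, which is exactly the kind of automatic conversion Futhark refuses to do.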
The primary reference is futhark-lang.org. There are a few videos on YouTube that are worth watching. Google “futhark gpu” to find them. Just “futhark” won’t work, because you will mostly get links to sites about runic inscriptions.
An excellent short paper on the theoretical foundations of Futhark is Futhark: Purely Functional GPU-Programming with Nested Parallelism and In-Place Array Updates by Troels Henriksen, Niels G. W. Serup, Martin Elsman, Fritz Henglein, and Cosmin E. Oancea.
The talk by Guy Blelloch on Functional Parallel Computing is, among other things, a beautiful analysis of a cost model for the untyped lambda calculus. See this link for a discussion of the cost model for Futhark.
I’ve written a few notes on my own experience with Futhark, and plan to add to them as I understand the language better. Here is a little repo with Futhark code for a Monte Carlo computation of π, and this video shows how Elm, Python, and Futhark work together to produce a simulation of heat conduction.
A Remark on the Associative Law (Note added)
In writing GPU code, there is a little pitfall (into which I blundered) that I would like to mention. As a little exercise, I decided to write a Futhark program to compute the sum h(n) = 1 + 1/2 + 1/3 + … + 1/n. Here was my first try:
let harmonic (n:i32): f32 =
reduce (\acc n -> acc + 1.0/n) 0.0
(map f32.i32 (range 1 (n+1) 1))
Maybe not great code, but it seemed like it should do the job. Indeed, it worked fine when I compiled it to C for the CPU on my laptop. But, surprise surprise! It gave the wrong answer when compiled to OpenCL for the GPU. What the heck?! The problem, it turns out, was a lack of understanding of and respect for the associative law. The Futhark compiler rearranges things behind the scenes in order to parallelize the code. Implicit in this rearrangement is the assumption that the function which appears as the first argument of reduce is associative.
Let’s be more precise. The type signature for
reduce in Futhark is
reduce : (a -> a -> a) -> a -> []a -> a
which is deployed in function calls like
reduce f startingValue inputList. For parallelization to be valid, the function f must be associative, meaning that
(A)   f(x, f(y, z)) = f(f(x, y), z)
Written in this prefix form, it may not look like the associative law we remember from elementary school. To connect with that experience, define an operator x * y = f(x, y). Equation (A) can then be rewritten as x * (y * z) = (x * y) * z. The function I used was f(x, y) = x + 1/y. It is not associative. Bad!
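A quick numerical check, in Python for convenience, shows just how badly this f fails the associativity test:

```python
def f(x, y):
    # The accumulator function from my buggy harmonic-sum attempt
    return x + 1.0 / y

x, y, z = 1.0, 2.0, 4.0
left = f(f(x, y), z)   # (x + 1/y) + 1/z = 1.75
right = f(x, f(y, z))  # x + 1/(y + 1/z) = 1.4444...
print(left, right)     # the two groupings disagree: f is not associative
```

So any rearrangement the compiler performs can change the result, which is exactly what happened on the GPU.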
As Troels Henriksen pointed out to me, the correct way to sum the series is:
let harmonic (n:i32): f32 = reduce (+) 0.0
(map (1/) (map f32.i32 (range 1 (n+1) 1)))
In this formulation, the familiar addition function
(+) is associative, so all is good.
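The same map-then-reduce pattern can be checked in Python: take the reciprocals first, then combine them with the associative (+). However the work is grouped into chunks, the result agrees up to floating-point rounding. (Strictly speaking, float addition is only approximately associative, a compromise that essentially every parallel reduction of floats accepts.)

```python
def harmonic(n, chunks=1):
    # map (1/) over 1..n, then reduce with the associative (+)
    recips = [1.0 / k for k in range(1, n + 1)]
    size = max(1, (len(recips) + chunks - 1) // chunks)
    partials = [sum(recips[i:i + size]) for i in range(0, len(recips), size)]
    return sum(partials)

# Grouping no longer matters (up to rounding): both are about 7.48547
print(harmonic(1000), harmonic(1000, chunks=8))
```

This mirrors why the corrected Futhark version is safe for the compiler to parallelize any way it likes.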