# performance – Reasonably efficient pure-functional matrix product in Haskell?


As efficient as, say, Java. For concreteness, let's assume I'm talking about a simple triple loop, single precision, contiguous column-major layout (`float[]`, not `float[][]`), matrices of size 1000×1000, and a single-core CPU. (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark.)

So something like

```java
public class MatrixProd {
    static float[] matProd(float[] a, int ra, int ca, float[] b, int rb, int cb) {
        if (ca != rb) {
            throw new IllegalArgumentException("Matrices not fitting");
        }
        float[] c = new float[ra*cb];
        for(int i = 0; i < ra; ++i) {
            for(int j = 0; j < cb; ++j) {
                float sum = 0;
                for(int k = 0; k < ca; ++k) {
                    sum += a[i*ca+k]*b[k*cb+j];
                }
                c[i*cb+j] = sum;
            }
        }
        return c;
    }

    static float[] mkMat(int rs, int cs, float x, float d) {
        float[] arr = new float[rs*cs];
        for(int i = 0; i < rs; ++i) {
            for(int j = 0; j < cs; ++j) {
                arr[i*cs+j] = x;
                x += d;
            }
        }
        return arr;
    }

    public static void main(String[] args) {
        int sz = 100;
        float strt = -32, del = 0.0625f;
        if (args.length > 0) {
            sz = Integer.parseInt(args[0]);
        }
        if (args.length > 1) {
            strt = Float.parseFloat(args[1]);
        }
        if (args.length > 2) {
            del = Float.parseFloat(args[2]);
        }
        float[] a = mkMat(sz,sz,strt,del);
        float[] b = mkMat(sz,sz,strt-16,del);
        System.out.println(a[sz*sz-1]);
        System.out.println(b[sz*sz-1]);
        long t0 = System.currentTimeMillis();
        float[] c = matProd(a,sz,sz,b,sz,sz);
        System.out.println(c[sz*sz-1]);
        long t1 = System.currentTimeMillis();
        double dur = (t1-t0)*1e-3;
        System.out.println(dur);
    }
}
```

I suppose? (I hadn't read the specs properly before coding, so the layout is row-major, but since the access pattern is the same, that doesn't make a difference the way mixing layouts would, so I'll assume that's okay.)

I haven't spent any time thinking about a clever algorithm or low-level optimisation tricks (I wouldn't achieve much in Java with those anyway). I just wrote the simple loop, because

> I don't want this to sound like a challenge, but note that Java can satisfy all of the above easily

And that's what Java gives easily, so I'll take that.

> (If you are getting 0.5-2 floating point operations per cycle, you are probably in the ballpark)

Nowhere near, I'm afraid, neither in Java nor in Haskell. There are too many cache misses to reach that throughput with the simple triple loop. (A 1000×1000 product takes about 2·10⁹ floating point operations; at the roughly 8 seconds measured below, that is on the order of 0.1 operations per cycle on a typical desktop CPU.)

Doing the same in Haskell, again without thinking about being clever, a plain straightforward triple loop:

```haskell
{-# LANGUAGE BangPatterns #-}
module MatProd where

import Data.Array.ST
import Data.Array.Unboxed

matProd :: UArray Int Float -> Int -> Int -> UArray Int Float -> Int -> Int -> UArray Int Float
matProd a ra ca b rb cb =
    let (al,ah)     = bounds a
        (bl,bh)     = bounds b
        {-# INLINE getA #-}
        getA i j    = a!(i*ca + j)
        {-# INLINE getB #-}
        getB i j    = b!(i*cb + j)
        {-# INLINE idx #-}
        idx i j     = i*cb + j
    in if al /= 0 || ah+1 /= ra*ca || bl /= 0 || bh+1 /= rb*cb || ca /= rb
         then error $ "Matrices not fitting: " ++ show (ra,ca,al,ah,rb,cb,bl,bh)
         else runSTUArray $ do
            arr <- newArray (0,ra*cb-1) 0
            let outer i j
                    | ra <= i   = return arr
                    | cb <= j   = outer (i+1) 0
                    | otherwise = do
                        !x <- inner i j 0 0
                        writeArray arr (idx i j) x
                        outer i (j+1)
                inner i j k !y
                    | ca <= k   = return y
                    | otherwise = inner i j (k+1) (y + getA i k * getB k j)
            outer 0 0

mkMat :: Int -> Int -> Float -> Float -> UArray Int Float
mkMat rs cs x d = runSTUArray $ do
    let !r = rs - 1
        !c = cs - 1
        {-# INLINE idx #-}
        idx i j = cs*i + j
    arr <- newArray (0,rs*cs-1) 0
    let outer i j y
            | r < i     = return arr
            | c < j     = outer (i+1) 0 y
            | otherwise = do
                writeArray arr (idx i j) y
                outer i (j+1) (y + d)
    outer 0 0 x
```

and the calling module

```haskell
module Main (main) where

import System.Environment (getArgs)
import Data.Array.Unboxed

import System.CPUTime
import Text.Printf

import MatProd

main :: IO ()
main = do
    args <- getArgs
    let (sz, strt, del) = case args of
                            (a:b:_) -> (read a, read b, 0.0625)
                            (a:_)   -> (read a, -32, 0.0625)
                            _       -> (100, -32, 0.0625)
        a = mkMat sz sz strt del
        b = mkMat sz sz (strt - 16) del
    print (a!(sz*sz-1))
    print (b!(sz*sz-1))
    t0 <- getCPUTime
    let c = matProd a sz sz b sz sz
    print $ c!(sz*sz-1)
    t1 <- getCPUTime
    printf "%.6f\n" (fromInteger (t1-t0)*1e-12 :: Double)
```

So we're doing almost exactly the same things in both languages. Compile the Haskell with `-O2`, the Java with `javac`:

```
$ java MatrixProd 1000 -13.7 0.013
12915.623
12899.999
8.3592897E10
8.193
$ ./vmmult 1000 -13.7 0.013
12915.623
12899.999
8.35929e10
8.558699
```

And the resulting times are quite close.

And if we compile the Java code to native, with `gcj -O3 -Wall -Wextra --main=MatrixProd -fno-bounds-check -fno-store-check -o jmatProd MatrixProd.java`,

```
$ ./jmatProd 1000 -13.7 0.013
12915.623
12899.999
8.3592896512E10
8.215
```

there's still no big difference.

As a special bonus, the same algorithm in C (gcc -O3):

```
$ ./cmatProd 1000 -13.7 0.013
12915.623047
12899.999023
8.35929e+10
8.079759
```

So this reveals no fundamental difference between straightforward Java and straightforward Haskell when it comes to computationally intensive tasks using floating point numbers (when dealing with integer arithmetic on medium to large numbers, the use of GMP by GHC makes Haskell outperform Java's `BigInteger` by a huge margin for many tasks, but that is of course a library issue, not a language one), and both are close to C with this algorithm.

In all fairness, though, that is because the access pattern causes a cache-miss every other nanosecond, so in all three languages this computation is memory-bound.

If we improve the access pattern by multiplying a row-major matrix with a column-major matrix, everything becomes faster: the gcc-compiled C finishes in 1.18 s, Java takes 1.23 s, and the ghc-compiled Haskell takes around 5.8 s, which can be reduced to 3 seconds by using the LLVM backend.
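That improved access pattern can be sketched in plain Java (my own illustration, not the code that produced the timings above): transpose `b` once up front, so the innermost loop walks both operands contiguously instead of striding through `b` column by column.

```java
public class MatProdT {
    // Transpose b once so both operands are read sequentially in the
    // innermost loop (row-major a, row-major bT): far fewer cache misses.
    static float[] matProdTransposed(float[] a, int ra, int ca, float[] b, int cb) {
        float[] bt = new float[ca * cb];          // bT has cb rows of length ca
        for (int k = 0; k < ca; ++k)
            for (int j = 0; j < cb; ++j)
                bt[j * ca + k] = b[k * cb + j];
        float[] c = new float[ra * cb];
        for (int i = 0; i < ra; ++i)
            for (int j = 0; j < cb; ++j) {
                float sum = 0;
                for (int k = 0; k < ca; ++k)
                    sum += a[i * ca + k] * bt[j * ca + k];  // both strides are 1
                c[i * cb + j] = sum;
            }
        return c;
    }

    public static void main(String[] args) {
        // 2x2 sanity check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
        float[] a = {1, 2, 3, 4}, b = {5, 6, 7, 8};
        float[] c = matProdTransposed(a, 2, 2, b, 2);
        System.out.println(c[0] + " " + c[1] + " " + c[2] + " " + c[3]);
        // prints 19.0 22.0 43.0 50.0
    }
}
```

The O(n²) transpose is amortised away by the O(n³) multiplication; the same trick carries over verbatim to the Haskell and C versions.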

Here, the range check done by the array library really hurts. Using unchecked array access (as one should, after checking for bugs, since the checks are already done in the code controlling the loops), GHC's native backend finishes in 2.4 s; going via the LLVM backend lets the computation finish in 1.55 s, which is decent, although significantly slower than both C and Java. Using the primitives from `GHC.Prim` instead of the array library, the LLVM backend produces code that runs in 1.16 s (again without bounds-checking on each access, but it can in this case easily be proved beforehand that only valid indices are produced during the computation, so no memory-safety is sacrificed here¹; checking each access brings the time up to 1.96 s, still significantly better than the bounds checking of the array library).

Bottom line: GHC needs (much) faster branching for the bounds-checking, and there's room for improvement in the optimiser, but in principle, Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity; we're just not there yet. For the time being, one has to decide how much of which point one is willing to sacrifice.

¹ Yes, that's a special case; in general, omitting the bounds-check does sacrifice memory-safety, or it is at least harder to prove that it doesn't.

There are two angles from which to attack this problem.

1. Research, along these lines, is ongoing. Now, there are plenty of Haskell programmers who are smarter than me, a fact I am constantly reminded of and humbled by. One of them may come by and correct me, but I don't know of any simple way to compose safe Haskell primitives into a top-of-the-line matrix multiplication routine. The papers that you talk about sound like a good start.

> However, I'm not a computer science researcher. I wonder if it's possible to keep simple things simple in Haskell.

If you cite those papers, maybe we could help decipher them.

2. Software engineering, along these lines, is well understood, straightforward, and even easy. A savvy Haskell coder would use a thin wrapper around BLAS, or look for such a wrapper on Hackage.

Deciphering cutting-edge research is an ongoing process that shifts knowledge from the researchers to the engineers. It was a computer science researcher, C.A.R. Hoare, who first discovered quicksort and published a paper about it. Today, it is a rare computer science graduate who can't personally implement quicksort from memory (at least among those who graduated recently).

## Bit of history

Almost this exact question has been asked a few times before in history.

1. Is it possible to write matrix arithmetic in Fortran that is as fast as assembly?

2. Is it possible to write matrix arithmetic in C that is as fast as Fortran?

3. Is it possible to write matrix arithmetic in Java that is as fast as C?

4. Is it possible to write matrix arithmetic in Haskell that is as fast as Java?

So far, the answer has always been "not yet," followed by "close enough." The advances that make this possible come from improvements in how we write code, improvements to compilers, and improvements in the programming languages themselves.

As a specific example, C was not able to surpass Fortran in many real-world applications until C99 compilers became widespread in the past decade. In Fortran, different arrays are assumed to have distinct storage from each other, whereas in C this is not generally the case. Fortran compilers were therefore permitted to make optimizations that C compilers could not. Well, not until C99 came out and you could add the `restrict` qualifier to your code.

The Fortran programmers waited. Eventually processors became complex enough that writing good assembly became more difficult, and compilers became sophisticated enough that the Fortran was fast.

Then C programmers waited until the 2000s for the ability to write code that matched Fortran. Until that point, they used libraries written in Fortran or assembler (or both), or put up with the reduced speed.

The Java programmers, likewise, had to wait for JIT compilers and for specific optimizations to appear. JIT compilers were originally an esoteric research concept before they became a part of daily life. Bounds-checking optimization was also necessary in order to avoid a test and branch on every array access.

## Back to Haskell

So, it is clear that Haskell programmers are waiting, just like the Java, C, and Fortran programmers before them. What are we waiting for?

• Maybe we're just waiting for someone to write the code and show us how it's done.

• Maybe we're waiting for the compilers to get better.

• Maybe we're waiting for an update to the Haskell language itself.

And maybe we're waiting for some combination of the above.

Purity and monads get conflated a lot in Haskell, because in Haskell impure functions always use the `IO` monad. The `State` monad, for example, is 100% pure. So when you say "pure" and "type signature does not use the `State` monad", those are actually completely independent and separate requirements.
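To see why `State` is pure, here is a minimal sketch in Java 16+ (an invented illustration, not Haskell's actual `State` implementation): a "stateful" step is just a pure function from the old state to a result paired with the new state, and sequencing steps just threads that value along.

```java
public class StateDemo {
    // A result paired with the state after the step; nothing is mutated.
    record Step(int value, int newState) {}

    // A "stateful" counter as a pure function: the same input state always
    // yields the same result and new state, with no observable side effects.
    static Step tick(int state) {
        return new Step(state, state + 1);
    }

    public static void main(String[] args) {
        Step s1 = tick(0);              // yields 0, threads state 1 onward
        Step s2 = tick(s1.newState());  // yields 1, threads state 2 onward
        System.out.println(s1.value() + " " + s2.value() + " " + s2.newState());
        // prints 0 1 2
    }
}
```

The `State` monad is just this state-threading pattern packaged up so you don't have to pass the state around by hand.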

However, you can also use the `IO` monad in the implementation of pure functions, and in fact, it's quite easy:

```haskell
import System.IO.Unsafe (unsafePerformIO)

addSix :: Int -> Int
addSix n = unsafePerformIO $ return (n + 6)
```

Okay, yes, that's a stupid function, but it is pure. It's even obviously pure. The test for purity is twofold:

1. Does it give the same result for the same inputs? Yes.

2. Does it produce any semantically significant side effects? No.

The reason we like purity is that pure functions are easier to compose and manipulate than impure functions. How they're implemented doesn't matter as much. I don't know if you're aware of this, but `Integer` and `ByteString` are both basically wrappers around impure C functions, even though the interface is pure. (There's work on a new implementation of `Integer`; I don't know how far along it is.)
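The same "pure interface, impure implementation" idea can be sketched in Java (an invented example; this is not how `Integer` or `ByteString` are actually implemented): a function that mutates a private cache internally yet is observationally pure, because callers can never detect the mutation.

```java
import java.util.concurrent.ConcurrentHashMap;

public class PureInterface {
    // Impure implementation detail: a mutable cache, invisible to callers.
    private static final ConcurrentHashMap<Integer, Long> cache =
            new ConcurrentHashMap<>();

    // Observationally pure: the same argument always gives the same result,
    // and the internal mutation cannot be seen through the public interface.
    static long square(int n) {
        return cache.computeIfAbsent(n, k -> (long) k * k);
    }

    public static void main(String[] args) {
        System.out.println(square(12));  // prints 144 (computed)
        System.out.println(square(12));  // prints 144 (from the cache)
    }
}
```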

> The question is whether Haskell's approach (purity encoded in the type system) is compatible with efficiency, memory-safety and simplicity.

The answer to that part is "yes", since we can take simple functions from BLAS and put them in a pure, type-safe wrapper. The wrapper's type encodes the safety of the function, even though the Haskell compiler is unable to prove that the function's implementation is pure. Our use of `unsafePerformIO` in its implementation is both an acknowledgement that we have proven the purity of the function, and a concession that we couldn't figure out a way to express that proof in Haskell's type system.

But the answer is also "not yet", since I don't know how to implement the function entirely in Haskell as such.

Research in this area is ongoing. People are looking at proof systems like Coq and new languages like Agda, as well as developments in GHC itself, in order to see what kind of type system we'd need to prove that high-performance BLAS routines can be used safely. These tools can also be used with other languages like Java; for example, you could write a proof in Coq that your Java implementation is pure.

I apologize for the "yes and no" answer, but no other answer would recognize both the contributions of engineers (who care about "yes") and researchers (who care about "not yet").

P.S. Please cite the papers.


Like Java, Haskell is not the best language for writing numerical code.

Haskell's numeric-heavy code generation is… average. It hasn't had the years of research behind it that the likes of Intel and GCC have.

What Haskell gives you instead is a way to cleanly interface your fast code with the rest of your application. Remember that 3% of code is responsible for 97% of your application's running time.[1]

With Haskell, you can call these highly optimized functions in a way that interfaces extremely nicely with the rest of your code, via the very nice C Foreign Function Interface. In fact, if you so desired, you could write your numeric code in the assembly language of your architecture and get even more performance! Dipping into C for performance-heavy parts of your application isn't a bug – it's a feature.

But I digress.

By having these highly optimized functions isolated, with an interface similar to the rest of your Haskell code, you can perform high-level optimizations with Haskell's very powerful rewrite rules, which allow you to write rules such as `reverse . reverse == id` that automagically reduce complex expressions at compile time.[2] This leads to extremely fast, purely functional, and easy-to-use libraries like `Data.Text`[3] and `Data.Vector`.

By combining high and low levels of optimization, we end up with a much more optimized implementation, with each half (C/asm, and Haskell) relatively easy to read. The low-level optimization is done in its native tongue (C or assembly), the high-level optimization gets a special DSL (Haskell rewrite rules), and the rest of the code is completely oblivious to it.

In conclusion, yes, Haskell can be faster than Java. But it cheats by going through C for the raw FLOPS. This is much harder to do in Java (which also has a much higher overhead for its FFI), so it's avoided; in Haskell, it's natural. If your application spends an exorbitant amount of time doing numeric calculations, then maybe instead of Haskell or Java you should look at Fortran for your needs. If your application spends a large portion of its time in a tiny piece of performance-sensitive code, then the Haskell FFI is your best bet. If your application doesn't spend any time in numeric code… then use whatever you like. =)

Haskell (like Java, for that matter) isn't Fortran.

[1] These numbers were made up, but you get my point.

Now that that's out of the way, to answer your actual question:

No, it's not currently smart to write your matrix multiplications in Haskell. At the moment, REPA is the canonical way to do this. The implementation partially breaks memory safety (it uses `unsafeSlice`), but the broken memory safety is isolated to that function, actually very safe (though not easily verified by the compiler), and easy to remove if things go wrong (replace `unsafeSlice` with `slice`).

But this is Haskell! Very rarely are the performance characteristics of a function to be taken in isolation. That can be a bad thing (in the case of space leaks) or a very, very good thing.

Since the matrix multiplication algorithm used is naive, it will perform worse in a raw benchmark. But our code rarely looks like benchmarks.

What if you were a scientist with millions of data points who wants to multiply huge matrices?

For those people, we have `mmultP`. This performs matrix multiplication, but is data-parallel, and subject to REPA's nested data parallelism. Also note that the code is essentially unchanged from the sequential version.

For those people who don't multiply huge matrices, and instead multiply lots of little ones, there tends to be other code interacting with those matrices: possibly cutting them up into column vectors and taking their dot products, maybe finding their eigenvalues, maybe something else entirely. Unlike C, Haskell knows that although you like to solve problems in isolation, the most efficient solution usually isn't found there.

Like `ByteString`, `Text`, and `Vector`, REPA arrays are subject to fusion.[2] You should actually read [2], by the way – it's a very well written paper. This, combined with aggressive inlining of relevant code and REPA's highly parallel nature, allows us to express these high-level mathematical concepts with very advanced high-level optimizations behind the scenes.

Although a method of writing an efficient matrix multiplication in pure functional languages isn't currently known, we can come somewhat close (no automatic vectorization, a few excessive dereferences to get to the actual data, etc.), but nothing near what IFORT or GCC can do. But programs don't exist on an island, and making the island as a whole perform well is much, much easier in Haskell than in Java.

Actually, the best way to do this is by using the GPU. There are a few GPU DSLs available for Haskell which make this possible to do natively. They're really neat!