from Guide to Hacking on Sep 24, 2023

How to use Apple GPUs from Python

Most code samples for using Apple GPUs require Xcode, Objective C, or a combination of both. This post distills the process down to a single file, to demonstrate how you can leverage Apple GPUs from the comfort of Python.

Say you want to write a custom matrix multiply. This matrix multiply could leverage your custom sparse matrix format, operate on a custom data type, or fuse other operations in. Whatever the optimization, you now have a reason to write a custom kernel.

On Nvidia GPUs, it's rather "straightforward"1 to write custom kernels — namely, use CUDA. The toolset for developing and optimizing CUDA kernels is well known, and there are a large number of projects that build and integrate kernels of their own.

However, on Apple GPUs, it's much less straightforward to do this. There are of course plenty of tutorials online, but these demos usually require Xcode, Objective C, or a combination of both. What if I'm working in Python, and don't want to depend on a GUI like Xcode just to run a simple script? In short, what is the simplest "Hello world" for interfacing with custom kernels on Apple GPUs?

Getting set up

Create a new directory to house your project. I will create one on my desktop.

mkdir ~/Desktop/metal
cd ~/Desktop/metal

In this new directory, create a new virtual environment.

python -m venv env
source env/bin/activate

We now need to install pyobjc, a set of Python bindings for many of macOS's built-in frameworks, including the Metal API that we'll use to run custom kernels on Apple GPUs.

pip install pyobjc
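Before going further, you can sanity-check the install by importing the bindings and asking for the default GPU. A minimal check, assuming a Metal-capable Mac (the printed device name will vary by machine):

import Metal

device = Metal.MTLCreateSystemDefaultDevice()
assert device is not None, "No Metal-capable GPU found"
print(device.name())  # device name varies, e.g. an Apple M-series chip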

Let's now dive right into the code.

Step 0: Write a Metal kernel

We won't dive too deeply into kernel writing, but we'll scratch the surface here to give you an idea. At their core, kernels are written in the Metal Shading Language2 (MSL), which is very similar to C++.

Here's a kernel that takes in an array of integers, and adds 2 to every item in that array, in-place. Create a file called add2.metal and write the following.

metal/add2.metal

/* import metal library, like `from metal import *` */
#include <metal_stdlib>
using namespace metal;

/**

Define a function called `add2_kernel` that doesn't return
a value. Instead, we modify the input array in-place.

:param in uint8_t*: An array of integers, as input.
:param id uint: The index of the current thread. We use this
    to index into the input array. In other words, we
    parallelize by assigning each thread to a different
    element of the input array.

*/
kernel void add2_kernel(device uint8_t *in  [[ buffer(0) ]],
                        uint id [[ thread_position_in_grid ]]) {
    in[id] = in[id] + 2;    /* add 2 in-place */
}

Here's another kernel that takes in a single array of floats and computes the log of every element. This time, the operation does not happen in-place. Instead, results are written to an output array. Create a file called log.metal and write the following.

metal/log.metal

#include <metal_stdlib>
using namespace metal;

/**

Define a function called `log_kernel` that doesn't return
a value. Instead, we write results to an output array.

:param in float*: An array of floats, as input.
:param out float*: An array of floats, which will contain
    our output.
:param id uint: The index of the current thread. We use this
    to index into the input array. In other words, we
    parallelize by assigning each thread to a different
    element of the input array.

*/
kernel void log_kernel(device float *in  [[ buffer(0) ]],
                       device float *out [[ buffer(1) ]],
                       uint id [[ thread_position_in_grid ]]
) {
    out[id] = log(in[id]);  /* log each element *not in-place */
}
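To build intuition for what this kernel computes, here's the same operation expressed as plain Python; on the GPU, each loop iteration instead runs on its own thread (a conceptual sketch only, not code we'll run):

import math

def log_kernel_reference(in_array, out_array):
    for id in range(len(in_array)):  # on the GPU, each id is its own thread
        out_array[id] = math.log(in_array[id])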

For this step, we'll import and use this log kernel from the comfort of Python. To do so, start by downloading a file with utilities you'll need.

wget https://raw.githubusercontent.com/alvinwan/guide-to-hacking/main/metal/v0-hello-world/utils.py -O utils.py

Now, create a short and simple script to load the kernel and run it. Create a file demo.py with the following contents.

metal/demo.py

from utils import load
import numpy as np

# create an array, with a single random float value
input_array = np.random.random(1).astype(np.float32)

# load kernel from file, as a runnable python function
log = load('log.metal', function_name='log_kernel')

# run kernel on input array above
output_array = log(input_array)

# check output is correct
error = np.abs(output_array - np.log(input_array)).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")

Now, run your hello world script.

python demo.py

The above script will load and execute the Metal kernel, then compare the kernel output to a numpy reference. The comparison should succeed and give the following success message.

✅ Reference matches output!

This completes your very first "Hello world" kernel on an Apple GPU.

How to run Metal from Python

There is no official Python API for accessing and running Metal kernels. However, staying in Python-land is still possible by jumping through a few hoops.

In short, we use Objective C's Metal API, which is exposed via Python bindings by the pyobjc library. The bindings are generated rather than manually defined, so you can expect a one-to-one translation between Python bindings and the original Objective C API.

For example, take the first few lines of sample Objective C code, found in the official "Performing Calculations on a GPU" tutorial.

id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];

Translating this into Python is fairly straightforward — we can read off each line and write the same code with Python syntax.

from Metal import *
device = MTLCreateSystemDefaultDevice()
defaultLibrary = device.newDefaultLibrary()
addFunction = defaultLibrary.newFunctionWithName_("add_arrays")

However, notice that the last method name wasn't a perfect translation: the Python method newFunctionWithName_ features a trailing underscore. This is because pyobjc replaces every colon in the Objective C selector (newFunctionWithName:) with an underscore. Here's another example. In Objective C, we have the following.

_mAddFunctionPSO = [device newComputePipelineStateWithFunction: addFunction error:&error];

The corresponding selector in the official documentation is newComputePipelineStateWithFunction:error:, so the corresponding Python binding is called newComputePipelineStateWithFunction_error_. This gives us the following Python translation.

_mAddFunctionPSO = device.newComputePipelineStateWithFunction_error_(addFunction, None)
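In other words, the mapping from an Objective C selector to its pyobjc name is purely mechanical. A tiny helper illustrates the rule (selector_to_pyobjc is our own illustrative name, not part of pyobjc):

def selector_to_pyobjc(selector):
    """Convert an Objective C selector into its pyobjc method name."""
    return selector.replace(":", "_")

assert selector_to_pyobjc("newFunctionWithName:") == "newFunctionWithName_"
assert selector_to_pyobjc("newComputePipelineStateWithFunction:error:") == \
    "newComputePipelineStateWithFunction_error_"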

Using the above, you should now be able to adapt any Metal API resource written in Objective C to your advantage. In this post, we'll cover the basics so you don't need to go digging and translating yourself.

Step 1: Write a launcher from scratch

Above, we used some scaffolding code to get up and running. Let's now remove that crutch and dive into how we load and execute Metal kernels from Python. Create a new file called run.py, with the following contents — start by importing the Metal API and C datatypes.

metal/run.py

import Metal
import ctypes
# Load the Metal kernel. Kernel adds 2 to input, in-place.
dev = Metal.MTLCreateSystemDefaultDevice()  # Get GPU
src = open('add2.metal').read()  # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)

Then, access the default GPU available to the device, and load the Metal kernel.

metal/run.py

import Metal
import ctypes

# Load the Metal kernel. Kernel adds 2 to input, in-place.
dev = Metal.MTLCreateSystemDefaultDevice()  # Get GPU
src = open('add2.metal').read()  # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)

Now, we have the Python representation of our kernel in func. There are three steps we need to follow to execute the kernel.

  1. Prepare input arguments.
  2. Construct a command that determines how to run the kernel.
  3. Execute the command.

Let's start with the first step, where we prepare a buffer for the first and only input argument to the kernel. Our array in this case holds just one value, and since that value is a 1-byte uint8_t, the buffer occupies just 1 byte.

metal/run.py

dev = Metal.MTLCreateSystemDefaultDevice()  # Get GPU
src = open('add2.metal').read()  # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

For the second step, we follow a few sub-steps: create a compute pipeline state from our kernel function, set our input buffer as the kernel's first argument, and specify how many threads to launch.

metal/run.py

# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start
pso = dev.newComputePipelineStateWithFunction_error_(func, None)[0]
computeEncoder.setComputePipelineState_(pso)  # set kernel to call
computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0)  # arg1
grd = grp = Metal.MTLSizeMake(1, 1, 1)  # 1 thread globally
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding()  # end

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

This completes our command definition. For our third step, we now execute the command.

metal/run.py

computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0)  # arg1
grd = grp = Metal.MTLSizeMake(1, 1, 1)  # 1 thread globally
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding()  # end

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Input was 0, kernel adds 2, so output is 2.
buffer = input_buffer.contents().as_buffer(1)  # get buffer
item = ctypes.c_uint8.from_buffer(buffer)  # cast to uint8
assert item.value == 2, "❌ Output does not match reference!"

This completes our kernel execution. Let's now add a few lines of code to verify the results.

metal/run.py

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Input was 0, kernel adds 2, so output is 2.
buffer = input_buffer.contents().as_buffer(1)  # get buffer
item = ctypes.c_uint8.from_buffer(buffer)  # cast to uint8
assert item.value == 2, "❌ Output does not match reference!"
print("✅ Reference matches output!")

This completes our launcher from scratch. Now, run your file to load, execute, and verify the Metal kernel.

python run.py

The comparison should succeed and give the following success message.

✅ Reference matches output!

This now completes your very first kernel launcher from scratch. For our next step, we'll initialize the input argument to a more useful value, other than a single zero.

Step 2: Populate inputs

Above, we left the input unpopulated, using its default value of zero and an array of just length one. Let's now initialize the input more reasonably, with more values, and non-zero ones. Since we'll generate random values, add import random to the top of run.py. Then, modify our buffer to be longer. In particular, we'll set up a 1024-length buffer for 1024 1-byte values.

metal/run.py

lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialize to random integers.
storage = Metal.MTLResourceStorageModeShared
N = 1024  # our input will have 1024 integers
input_buffer = dev.newBufferWithLength_options_(N, storage)
input_contents = input_buffer.contents().as_buffer(N)
random_integers = [random.randint(0, 253) for _ in range(N)]
for i in range(N):  # copy random values into buffer
    input_contents[i] = random_integers[i]

Next, grab a modifiable version of the input buffer's contents, and populate it with random integers.

metal/run.py

# Create input buffer. Initialize to random integers.
storage = Metal.MTLResourceStorageModeShared
N = 1024  # our input will have 1024 integers
input_buffer = dev.newBufferWithLength_options_(N, storage)
input_contents = input_buffer.contents().as_buffer(N)
random_integers = [random.randint(0, 253) for _ in range(N)]
for i in range(N):  # copy random values into buffer
    input_contents[i] = random_integers[i]

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

Now that our input is much longer, also update how our kernel parallelizes over the input. Previously, we asked the kernel to work with just one thread. Now, with 1024 elements, there are 1024 total "jobs" that need to be executed; this is the number of threads per grid. We'll also dispatch threads in groups of 32, the threadgroup size.

metal/run.py

commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start
pso = dev.newComputePipelineStateWithFunction_error_(func, None)[0]
computeEncoder.setComputePipelineState_(pso)  # set kernel to call
computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0)  # arg1
grp = Metal.MTLSizeMake(32, 1, 1)    # 32 threads per group
grd = Metal.MTLSizeMake(1024, 1, 1)  # 1024 threads per grid
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding()  # end

# Execute the 'command' we defined above
commandBuffer.commit()
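As an aside, 32 threads per group is a reasonable but hard-coded choice. If you'd rather derive it, the pipeline state object reports per-device limits you can query instead (a sketch, assuming the pso variable from the snippet above):

# Query per-device limits instead of hard-coding the threadgroup size.
max_threads = pso.maxTotalThreadsPerThreadgroup()  # e.g. 1024
simd_width = pso.threadExecutionWidth()            # e.g. 32
grp = Metal.MTLSizeMake(min(1024, max_threads), 1, 1)
grd = Metal.MTLSizeMake(1024, 1, 1)  # one thread per element, as before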

Finally, update our verification code at the end of the file to check all elements in the input array, instead of just the first item.

metal/run.py

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output_metal = list(input_contents)
output_python = [x + 2 for x in random_integers]
assert output_metal == output_python, "❌ Output does not match reference!"
print("✅ Reference matches output!")

This concludes a more practical kernel, which takes in randomized integer values, and produces a verifiably correct output. For our next step, we'll update our input data type to be floating point instead of integer.

Step 3: Working with floating point

Our kernel above operates only on integers, but real-world use cases for custom GPU kernels involve floating point numbers as well. As a result, let's update our kernel to use floating point. Update the method signature, and change the integer 2 into a float 2.0.

metal/add2.metal

    to index into the input array. In other words, we
    parallelize by assigning each thread to a different
    element of the input array.

*/
kernel void add2_kernel(device float *in  [[ buffer(0) ]],
                        uint id [[ thread_position_in_grid ]]) {
    in[id] = in[id] + 2.0;    /* add 2 in-place */
}

When switching from integers to floating point, there are two core issues:

  1. Floating point numbers generally take several bytes. FP16, the smallest popularly-supported bitwidth, requires 2 bytes.
  2. Floating point can't be "read directly". This is in contrast to integers, where you can quite easily read off the value of your base-2 number. With floating point, some arithmetic is needed to translate the mantissa and exponent bits into a value.

These two challenges above mean that we need to be careful. Previously, we had a number of simplifications, since each value was exactly a byte long. Now, we need to keep the notions of array length and byte length separate.
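To make the second point concrete, Python's built-in struct module shows how a single fp32 value occupies 4 bytes that aren't meaningful on their own:

import struct

raw = struct.pack('<f', 2.5)        # one fp32 value, as 4 little-endian bytes
print(list(raw))                    # [0, 0, 32, 64]: not readable directly
print(struct.unpack('<f', raw)[0])  # 2.5, recovered by decoding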

metal/run.py

lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialize to random floats.
storage = Metal.MTLResourceStorageModeShared
N = 1024   # our input will have 1024 floats
B = N * 4  # fp32 is 4 bytes per value
input_buffer = dev.newBufferWithLength_options_(B, storage)
input_contents = input_buffer.contents().as_buffer(B)
input_array = (ctypes.c_float * N).from_buffer(input_contents)
input_list = [random.random() for _ in range(N)]  # generate
input_array[:] = input_list  # copy random values into buffer

Next, before updating our input buffer, we need to cast the contents to floating point. For this example, we'll use FP32. Like before, we then generate random values and populate the input buffer with our generated values.

metal/run.py

storage = Metal.MTLResourceStorageModeShared
N = 1024   # our input will have 1024 floats
B = N * 4  # fp32 is 4 bytes per value
input_buffer = dev.newBufferWithLength_options_(B, storage)
input_contents = input_buffer.contents().as_buffer(B)
input_array = (ctypes.c_float * N).from_buffer(input_contents)
input_list = [random.random() for _ in range(N)]  # generate
input_array[:] = input_list  # copy random values into buffer

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

Since our ground truth and kernel-computed outputs are both now floating point, we also need to update the comparison in our verification step. In short, instead of checking for exact matches, we allow a tiny epsilon tolerance, such as 1e-5.

metal/run.py

commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output_metal = list(input_array)
output_python = [x + 2 for x in input_list]
assert all(abs(a - b) < 1e-5 for a, b in zip(output_metal, output_python)), \
    "❌ Output does not match reference!"
print("✅ Reference matches output!")

This completes our modifications for floating point. Now, run your file.

python run.py

Just like every time before, this should give you the following message, indicating success.

✅ Reference matches output!

This completes our latest iteration of the Metal kernel. You now have a working Python integration with Metal, via Python bindings for the Objective C Metal API.

Bonus: Interoperating with Numpy

Most of you reading this use NumPy or PyTorch, or some other library for linear algebra and vector math. Let's see a quick and dirty example of using Metal with NumPy. To start, add import numpy as np to the top of run.py. Then, create a random array of values using NumPy, and define a new buffer that reads directly from the generated array.

metal/run.py

lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialize to random floats.
storage = Metal.MTLResourceStorageModeShared
input_array = np.random.random(1024).astype(np.float32)
input_buffer = dev.newBufferWithBytes_length_options_(
    input_array, input_array.nbytes, storage)

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

Next, after your kernel has finished executing, grab a readable view of the input buffer's contents, then cast that buffer into a floating point NumPy array.

metal/run.py

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output = input_buffer.contents().as_buffer(input_array.nbytes)
output_array = np.frombuffer(output, dtype=np.float32)
error = np.abs((input_array + 2.0) - output_array).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")

Finally, in your verification code, use vector math to bound all the errors, instead of manually looping over both vectors.

metal/run.py

commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output = input_buffer.contents().as_buffer(input_array.nbytes)
output_array = np.frombuffer(output, dtype=np.float32)
error = np.abs((input_array + 2.0) - output_array).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")

This finally completes your NumPy-friendly Metal kernel. Let's run your script.

python run.py

Just like every time before, this should give you the following message, indicating success.

✅ Reference matches output!

This completes our final version of the Metal kernel, fully interoperable with NumPy. You can use a similar technique to interoperate with PyTorch as well.
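For instance, a contiguous CPU tensor in PyTorch exposes its memory as a NumPy array via .numpy(), at which point the same buffer-creation call applies. A sketch, assuming the dev and storage variables from run.py:

import torch

x = torch.rand(1024, dtype=torch.float32)  # a CPU tensor
x_np = x.numpy()  # zero-copy NumPy view of the tensor's memory
input_buffer = dev.newBufferWithBytes_length_options_(
    x_np, x_np.nbytes, storage)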

Conclusion

In short, executing Metal kernels directly from Python is possible because Python bindings exist for Objective C's Metal API. We leaned on these bindings throughout this post, building up our launcher incrementally until we ended up with a generic, NumPy-compatible script.

You can now use this script as a boilerplate, to accelerate your computational workloads and test actual, on-device inference speeds with custom Metal kernels on Apple GPUs.





  1. Of course, writing and optimizing CUDA kernels is no easy task. This is just to say that the pipeline for doing so is well established. 

  2. You can learn more about the Metal Shading Language (MSL) in the official Apple Metal specification. This document includes all the information you should need for writing kernels. The Wikipedia article on the Metal API may also be helpful.