from Guide to Hacking on Sep 24, 2023

How to use Apple GPUs from Python

Most code samples for using Apple GPUs require Xcode, Objective C, or a combination of both. This post distills the process down to a single file, to demonstrate how you can leverage Apple GPUs from the comfort of Python.

Say you want to write a custom matrix multiply. This matrix multiply could leverage your custom sparse matrix format, operate on a custom data type, or fuse other operations in. Whatever the optimization, you now have a reason to write a custom kernel.

On Nvidia GPUs, it's rather "straightforward"1 to write custom kernels — namely, use CUDA. The toolset for developing and optimizing CUDA kernels is well known, and there are a large number of projects that build and integrate kernels of their own.

However, on Apple GPUs, it's much less straightforward to do this. There are of course plenty of tutorials online, but these demos usually require Xcode, Objective C, or a combination of both. What if I'm working in Python, and don't want to depend on a GUI like Xcode just to run a simple script? In short, what is the simplest "Hello world" for interfacing with custom kernels on Apple GPUs?

Getting set up

Create a new directory to house your project. I will create one on my desktop.

mkdir ~/Desktop/metal
cd ~/Desktop/metal

In this new directory, create a new virtual environment.

python -m venv env
source env/bin/activate

We now need to install pyobjc, a set of Python bindings for many of macOS's built-in frameworks, including the Metal API that we'll use to run custom kernels on Apple GPUs.

pip install pyobjc
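Before going further, you can sanity-check the install by importing the bindings and asking for the default GPU. A minimal check, assuming a Metal-capable Mac (the printed device name will vary by machine):

import Metal

device = Metal.MTLCreateSystemDefaultDevice()
assert device is not None, "No Metal-capable GPU found"
print(device.name())  # device name varies, e.g. an Apple M-series chip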

Let's now dive right into the code.

Step 0: Write a Metal kernel

We won't dive too deeply into kernel writing, but we'll scratch the surface here to give you an idea. At their core, kernels are written in the Metal Shading Language2 (MSL), which is very similar to C++.

Here's a kernel that takes in an array of integers, and adds 2 to every item in that array, in-place. Create a file called add2.metal and write the following.

metal/add2.metal

/* import metal library, like `from metal import *` */
#include <metal_stdlib>
using namespace metal;

/**

Define a function called `add2_kernel` that doesn't return
a value. Instead, we modify the input array in-place.

:param in uint8_t*: An array of integers, as input.
:param id uint: The index of the current thread. We use this
    to index into the input array. In other words, we
    parallelize by assigning each thread to a different
    element of the input array.

*/
kernel void add2_kernel(device uint8_t *in  [[ buffer(0) ]],
                        uint id [[ thread_position_in_grid ]]) {
    in[id] = in[id] + 2;    /* add 2 in-place */
}

Here's another kernel that takes in a single array of floats and computes the log of every element. This time, the operation does not happen in-place. Instead, results are written to an output array. Create a file called log.metal and write the following.

metal/log.metal

#include <metal_stdlib>
using namespace metal;

/**

Define a function called `log_kernel` that doesn't return
a value. Instead, we write results to an output array.

:param in float*: An array of floats, as input.
:param out float*: An array of floats, which will contain
    our output.
:param id uint: The index of the current thread. We use this
    to index into the input array. In other words, we
    parallelize by assigning each thread to a different
    element of the input array.

*/
kernel void log_kernel(device float *in  [[ buffer(0) ]],
                       device float *out [[ buffer(1) ]],
                       uint id [[ thread_position_in_grid ]]
) {
    out[id] = log(in[id]);  /* log each element *not in-place */
}
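To build intuition for what this kernel computes, here's the same operation expressed as plain Python; on the GPU, each loop iteration instead runs on its own thread (a conceptual sketch only, not code we'll run):

import math

def log_kernel_reference(in_array, out_array):
    for id in range(len(in_array)):  # on the GPU, each id is its own thread
        out_array[id] = math.log(in_array[id])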

For this step, we'll import and use this log kernel from the comfort of Python. To do so, start by downloading a file with utilities you'll need.

wget https://raw.githubusercontent.com/alvinwan/guide-to-hacking/main/metal/v0-hello-world/utils.py -O utils.py

Now, create a short and simple script to load the kernel and run it. Create a file demo.py with the following contents.

metal/demo.py

from utils import load
import numpy as np

# create an array, with a single random float value
input_array = np.random.random(1).astype(np.float32)

# load kernel from file, as a runnable python function
log = load('log.metal', function_name='log_kernel')

# run kernel on input array above
output_array = log(input_array)

# check output is correct
error = np.abs(output_array - np.log(input_array)).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")

Now, run your hello world script.

python demo.py

The above script will load and execute the Metal kernel, then compare the kernel output to a numpy reference. The comparison should succeed and give the following success message.

✅ Reference matches output!

This completes your very first "Hello world" kernel on an Apple GPU.

How to run Metal from Python

There is no official Python API for accessing and running Metal kernels. However, staying in Python-land is still possible by jumping through a few hoops.

In short, we use Objective C's Metal API, which is exposed via Python bindings by the pyobjc library. The bindings are generated rather than manually defined, so you can expect a one-to-one translation between Python bindings and the original Objective C API.

For example, take the first few lines of sample Objective C code, found in the official "Performing Calculations on a GPU" tutorial.

id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];

Translating this into Python is fairly straightforward — we can read off each line and write the same code with Python syntax.

from Metal import *
device = MTLCreateSystemDefaultDevice()
defaultLibrary = device.newDefaultLibrary()
addFunction = defaultLibrary.newFunctionWithName_("add_arrays")

However, notice that the last method name wasn't a perfect translation: the Python method newFunctionWithName_ features a trailing underscore. This is because pyobjc replaces every colon in the Objective C selector (newFunctionWithName:) with an underscore. Here's another example. In Objective C, we have the following.

_mAddFunctionPSO = [device newComputePipelineStateWithFunction: addFunction error:&error];

The corresponding selector in the official documentation is newComputePipelineStateWithFunction:error:, so the corresponding Python binding is called newComputePipelineStateWithFunction_error_. This gives us the following Python translation.

_mAddFunctionPSO = device.newComputePipelineStateWithFunction_error_(addFunction, None)
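In other words, the mapping from an Objective C selector to its pyobjc name is purely mechanical. A tiny helper illustrates the rule (selector_to_pyobjc is our own illustrative name, not part of pyobjc):

def selector_to_pyobjc(selector):
    """Convert an Objective C selector into its pyobjc method name."""
    return selector.replace(":", "_")

assert selector_to_pyobjc("newFunctionWithName:") == "newFunctionWithName_"
assert selector_to_pyobjc("newComputePipelineStateWithFunction:error:") == \
    "newComputePipelineStateWithFunction_error_"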

Using the above, you should now be able to adapt any Metal API resource written in Objective C to your advantage. In this post, we'll cover the basics so you don't need to go digging and translating yourself.

Step 1: Write a launcher from scratch

Above, we used some scaffolding code to get up and running. Let's now remove that crutch and dive into how we load and execute Metal kernels from Python. Create a new file called run.py, with the following contents — start by importing the Metal API and C datatypes.

metal/run.py

import Metal
import ctypes
# Load the Metal kernel. Kernel adds 2 to input, in-place.
dev = Metal.MTLCreateSystemDefaultDevice()  # Get GPU
src = open('add2.metal').read()  # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)

Then, access the default GPU available to the device, and load the Metal kernel.

metal/run.py

import Metal
import ctypes

# Load the Metal kernel. Kernel adds 2 to input, in-place.
dev = Metal.MTLCreateSystemDefaultDevice()  # Get GPU
src = open('add2.metal').read()  # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)

Now, we have the Python representation of our kernel in func. There are three steps we need to follow to execute the kernel.

  1. Prepare input arguments.
  2. Construct a command that determines how to run the kernel.
  3. Execute the command.

Let's start with the first step, where we prepare a buffer for the first and only input argument to the kernel. Our array in this case holds just one value, and since that value is a 1-byte uint8_t, the buffer occupies just 1 byte.

metal/run.py

dev = Metal.MTLCreateSystemDefaultDevice()  # Get GPU
src = open('add2.metal').read()  # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

For the second step, we follow a few sub-steps: create a compute pipeline state from our kernel function, set our input buffer as the kernel's first argument, and specify how many threads to launch.

metal/run.py

# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start
pso = dev.newComputePipelineStateWithFunction_error_(func, None)[0]
computeEncoder.setComputePipelineState_(pso)  # set kernel to call
computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0)  # arg1
grd = grp = Metal.MTLSizeMake(1, 1, 1)  # 1 thread globally
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding()  # end

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

This completes our command definition. For our third step, we now execute the command.

metal/run.py

computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0)  # arg1
grd = grp = Metal.MTLSizeMake(1, 1, 1)  # 1 thread globally
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding()  # end

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Input was 0, kernel adds 2, so output is 2.
buffer = input_buffer.contents().as_buffer(1)  # get buffer
item = ctypes.c_uint8.from_buffer(buffer)  # cast to uint8
assert item.value == 2, "❌ Output does not match reference!"

This completes our kernel execution. Let's now add a few lines of code to verify the results.

metal/run.py

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Input was 0, kernel adds 2, so output is 2.
buffer = input_buffer.contents().as_buffer(1)  # get buffer
item = ctypes.c_uint8.from_buffer(buffer)  # cast to uint8
assert item.value == 2, "❌ Output does not match reference!"
print("✅ Reference matches output!")

This completes our launcher from scratch. Now, run your file to load, execute, and verify the Metal kernel.

python run.py

The comparison should succeed and give the following success message.

✅ Reference matches output!

This now completes your very first kernel launcher from scratch. For our next step, we'll initialize the input argument to a more useful value, other than a single zero.

Step 2: Populate inputs

Above, we left the input unpopulated, using its default value of zero and an array of just length one. Let's now initialize the input more reasonably, with more values, and non-zero ones. Since we'll generate random values, add import random to the top of run.py. Then, modify our buffer to be longer. In particular, we'll set up a 1024-length buffer for 1024 1-byte values.

metal/run.py

lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialize to random integers.
storage = Metal.MTLResourceStorageModeShared
N = 1024  # our input will have 1024 integers
input_buffer = dev.newBufferWithLength_options_(N, storage)
input_contents = input_buffer.contents().as_buffer(N)
random_integers = [random.randint(0, 253) for _ in range(N)]
for i in range(N):  # copy random values into buffer
    input_contents[i] = random_integers[i]

Next, grab a modifiable version of the input buffer's contents, and populate it with random integers.

metal/run.py

# Create input buffer. Initialize to random integers.
storage = Metal.MTLResourceStorageModeShared
N = 1024  # our input will have 1024 integers
input_buffer = dev.newBufferWithLength_options_(N, storage)
input_contents = input_buffer.contents().as_buffer(N)
random_integers = [random.randint(0, 253) for _ in range(N)]
for i in range(N):  # copy random values into buffer
    input_contents[i] = random_integers[i]

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

Now that our input is much longer, also update how our kernel parallelizes over the input. Previously, we asked the kernel to work with just one thread. Now, with 1024 elements, there are 1024 total "jobs" that need to be executed; this is the number of threads per grid. We'll also dispatch threads in groups of 32, the threadgroup size.

metal/run.py

commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start
pso = dev.newComputePipelineStateWithFunction_error_(func, None)[0]
computeEncoder.setComputePipelineState_(pso)  # set kernel to call
computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0)  # arg1
grp = Metal.MTLSizeMake(32, 1, 1)    # 32 threads per group
grd = Metal.MTLSizeMake(1024, 1, 1)  # 1024 threads per grid
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding()  # end

# Execute the 'command' we defined above
commandBuffer.commit()
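As an aside, 32 threads per group is a reasonable but hard-coded choice. If you'd rather derive it, the pipeline state object reports per-device limits you can query instead (a sketch, assuming the pso variable from the snippet above):

# Query per-device limits instead of hard-coding the threadgroup size.
max_threads = pso.maxTotalThreadsPerThreadgroup()  # e.g. 1024
simd_width = pso.threadExecutionWidth()            # e.g. 32
grp = Metal.MTLSizeMake(min(1024, max_threads), 1, 1)
grd = Metal.MTLSizeMake(1024, 1, 1)  # one thread per element, as before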

Finally, update our verification code at the end of the file to check all elements in the input array, instead of just the first item.

metal/run.py

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output_metal = list(input_contents)
output_python = [x + 2 for x in random_integers]
assert output_metal == output_python, "❌ Output does not match reference!"
print("✅ Reference matches output!")

This concludes a more practical kernel, which takes in randomized integer values, and produces a verifiably correct output. For our next step, we'll update our input data type to be floating point instead of integer.

Step 3: Working with floating point

Our kernel above operates only on integers, but real-world use cases for custom GPU kernels involve floating point numbers as well. As a result, let's update our kernel to use floating point. Update the method signature, and change the integer 2 into a float 2.0.

metal/add2.metal

    to index into the input array. In other words, we
    parallelize by assigning each thread to a different
    element of the input array.

*/
kernel void add2_kernel(device float *in  [[ buffer(0) ]],
                        uint id [[ thread_position_in_grid ]]) {
    in[id] = in[id] + 2.0;    /* add 2 in-place */
}

When switching from integers to floating point, there are two core issues:

  1. Floating point numbers generally take several bytes. FP16, the smallest popularly-supported bitwidth, requires 2 bytes.
  2. Floating point can't be "read directly". This is in contrast to integers, where you can quite easily read off the value of your base-2 number. With floating point, some arithmetic is needed to translate the mantissa and exponent bits into a value.

These two challenges above mean that we need to be careful. Previously, we had a number of simplifications, since each value was exactly a byte long. Now, we need to keep the notions of array length and byte length separate.
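To make the second point concrete, Python's built-in struct module shows how a single fp32 value occupies 4 bytes that aren't meaningful on their own:

import struct

raw = struct.pack('<f', 2.5)        # one fp32 value, as 4 little-endian bytes
print(list(raw))                    # [0, 0, 32, 64]: not readable directly
print(struct.unpack('<f', raw)[0])  # 2.5, recovered by decoding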

metal/run.py

lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialize to random floats.
storage = Metal.MTLResourceStorageModeShared
N = 1024   # our input will have 1024 floats
B = N * 4  # fp32 is 4 bytes per value
input_buffer = dev.newBufferWithLength_options_(B, storage)
input_contents = input_buffer.contents().as_buffer(B)
input_array = (ctypes.c_float * N).from_buffer(input_contents)
input_list = [random.random() for _ in range(N)]  # generate
input_array[:] = input_list  # copy random values into buffer

Next, before updating our input buffer, we need to cast the contents to floating point. For this example, we'll use FP32. Like before, we then generate random values and populate the input buffer with our generated values.

metal/run.py

storage = Metal.MTLResourceStorageModeShared
N = 1024   # our input will have 1024 floats
B = N * 4  # fp32 is 4 bytes per value
input_buffer = dev.newBufferWithLength_options_(B, storage)
input_contents = input_buffer.contents().as_buffer(B)
input_array = (ctypes.c_float * N).from_buffer(input_contents)
input_list = [random.random() for _ in range(N)]  # generate
input_array[:] = input_list  # copy random values into buffer

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

Since our ground truth and kernel-computed outputs are both now floating point, we also need to update the comparison in our verification step. In short, instead of checking for exact matches, we allow a tiny epsilon tolerance, such as 1e-5.

metal/run.py

commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output_metal = list(input_array)
output_python = [x + 2 for x in input_list]
assert all(abs(a - b) < 1e-5 for a, b in zip(output_metal, output_python)), \
    "❌ Output does not match reference!"
print("✅ Reference matches output!")

This completes our modifications for floating point. Now, run your file.

python run.py

Just like every time before, this should give you the following message, indicating success.

✅ Reference matches output!

This completes our latest iteration of the Metal kernel. You now have a working Python integration with Metal, via Python bindings for the Objective C Metal API.

Bonus: Interoperating with Numpy

Most of you reading this use NumPy or PyTorch, or some other library for linear algebra and vector math. Let's see a quick and dirty example of using Metal with NumPy. To start, add import numpy as np to the top of run.py. Then, create a random array of values using NumPy, and define a new buffer that reads directly from the generated array.

metal/run.py

lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")

# Create input buffer. Initialize to random floats.
storage = Metal.MTLResourceStorageModeShared
input_array = np.random.random(1024).astype(np.float32)
input_buffer = dev.newBufferWithBytes_length_options_(
    input_array, input_array.nbytes, storage)

# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue()  # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder()  # start

Next, after your kernel has finished executing, grab a readable view of the input buffer's contents, then cast that buffer into a floating point NumPy array.

metal/run.py

# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output = input_buffer.contents().as_buffer(input_array.nbytes)
output_array = np.frombuffer(output, dtype=np.float32)
error = np.abs((input_array + 2.0) - output_array).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")

Finally, in your verification code, use vector math to bound all the errors, instead of manually looping over both vectors.

metal/run.py

commandBuffer.waitUntilCompleted()

# Check output. Each output should be its input plus 2.
output = input_buffer.contents().as_buffer(input_array.nbytes)
output_array = np.frombuffer(output, dtype=np.float32)
error = np.abs((input_array + 2.0) - output_array).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")

This finally completes your NumPy-friendly Metal kernel. Let's run your script.

python run.py

Just like every time before, this should give you the following message, indicating success.

✅ Reference matches output!

This completes our final version of the Metal kernel, fully interoperable with NumPy. You can use a similar technique to interoperate with PyTorch as well.
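For instance, a contiguous CPU tensor in PyTorch exposes its memory as a NumPy array via .numpy(), at which point the same buffer-creation call applies. A sketch, assuming the dev and storage variables from run.py:

import torch

x = torch.rand(1024, dtype=torch.float32)  # a CPU tensor
x_np = x.numpy()  # zero-copy NumPy view of the tensor's memory
input_buffer = dev.newBufferWithBytes_length_options_(
    x_np, x_np.nbytes, storage)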

Conclusion

In short, executing Metal kernels directly from Python is possible because Python bindings exist for Objective C's Metal API. We leaned on these bindings throughout this post, building up our launcher incrementally until we ended up with a generic, NumPy-compatible script.

You can now use this script as a boilerplate, to accelerate your computational workloads and test actual, on-device inference speeds with custom Metal kernels on Apple GPUs.





  1. Of course, writing and optimizing CUDA kernels is no easy task. This is just to say that the pipeline for doing so is well established. 

  2. You can learn more about the Metal Shading Language (MSL) in the official Apple Metal specification. This document includes all the information you should need for writing kernels. The Wikipedia article on the Metal API may also be helpful.