On Apple silicon (I'm using a base M5), gemm performs poorly relative to Apple's Accelerate framework. In my benchmarks, dgemm peaks at about 64 GFLOPS (single thread), while Accelerate reaches about 480 GFLOPS.
Playing around with the SME instructions, I was able to get about 440 GFLOPS for dgemm on a base M5. I've also written kernels for sgemm, cgemm, and zgemm, all of which are in my repo here. The code is not yet optimized for small matrices, but it would be nice to get more performance out of the hardware.
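For anyone who wants to compare numbers, here is a minimal sketch of how a GFLOPS figure for a real-valued gemm is usually derived from a timed run. It assumes the conventional count of 2·m·n·k floating-point operations (one multiply and one add per inner-product term), which may not be exactly what the benchmarks above used. The function names `gemm_flops` and `gemm_gflops` are hypothetical helpers, not part of any BLAS library.

```c
#include <stddef.h>

/* Conventional operation count for a real GEMM with A (m x k) and B (k x n):
 * one multiply and one add per inner-product term, so 2*m*n*k in total.
 * Assumption: the alpha/beta scaling work is ignored, as is common. */
double gemm_flops(size_t m, size_t n, size_t k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Convert a measured wall-clock time (in seconds) for one GEMM call
 * into GFLOPS, using the operation count above. */
double gemm_gflops(size_t m, size_t n, size_t k, double seconds) {
    return gemm_flops(m, n, k) / seconds / 1e9;
}
```

With this convention, a 1000 x 1000 x 1000 dgemm that takes one second corresponds to 2 GFLOPS. Complex kernels such as cgemm and zgemm are normally counted differently, since each complex multiply-add costs more real operations.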