Request: Support SME for all GEMM kernels on Apple silicon

On Apple silicon (I'm using a base M5), gemm performs poorly relative to Apple's Accelerate framework. In my benchmarks, dgemm peaks at about 64 GFLOPS (single thread), while Accelerate reaches about 480 GFLOPS.

Playing around with the SME instructions, I was able to get about 440 GFLOPS for dgemm on a base M5. I've also written kernels for sgemm, cgemm, and zgemm; all of them are in my repo [here](https://github.com/vlovero/ARMv9.2-gemm/tree/main). The code is not yet optimized for small matrices, but it would be nice to get more performance out of the hardware.
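For anyone who wants to reproduce numbers like these, here is a minimal sketch of how a GFLOPS figure for a real-valued GEMM is usually derived. It assumes the conventional count of 2·m·n·k floating-point operations and column-major storage. The names `gemm_flops`, `gflops`, `ref_dgemm`, and `time_ref_dgemm` are hypothetical helpers written for illustration; they are not taken from my repo or from any library, and the naive `ref_dgemm` is only a stand-in for whichever kernel is being measured.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Conventional flop count for a real GEMM: one multiply and one add
   per (i, j, p) triple, i.e. 2*m*n*k. This is an assumed convention. */
double gemm_flops(size_t m, size_t n, size_t k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Convert a flop count and an elapsed time in seconds to GFLOPS. */
double gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

/* Naive column-major reference: C = A * B, with A (m x k), B (k x n),
   C (m x n). Hypothetical stand-in for the kernel under test. */
void ref_dgemm(size_t m, size_t n, size_t k,
               const double *a, const double *b, double *c) {
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p) {
                sum += a[i + p * m] * b[p + j * k];
            }
            c[i + j * m] = sum;
        }
    }
}

/* Time one call of the reference kernel using CPU time from clock(),
   which is adequate for a single-threaded measurement. Returns seconds. */
double time_ref_dgemm(size_t m, size_t n, size_t k,
                      const double *a, const double *b, double *c) {
    clock_t start = clock();
    ref_dgemm(m, n, k, a, b, c);
    clock_t stop = clock();
    return (double)(stop - start) / (double)CLOCKS_PER_SEC;
}
```

To get a rate, pass `gemm_flops(m, n, k)` and the measured time to `gflops`. For a real measurement the matrices should be large enough that the run takes well over the timer resolution, and the best of several runs is usually reported.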


Request: Support SME for all GEMM kernels on Apple silicon #5841
