Request: Support SME for all GEMM kernels on Apple silicon #5841

@vlovero

On Apple silicon (I'm using a base M5), gemm performs poorly relative to Apple's Accelerate framework. In my benchmarks, dgemm peaks at about 64 GFLOPS (single thread), while Accelerate reaches about 480 GFLOPS.
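
For anyone reproducing these numbers, here is a minimal sketch of how a GEMM timing is usually converted to GFLOPS. It assumes the conventional flop counts (2·m·n·k for real GEMM, 8·m·n·k for complex GEMM); the issue does not state how the figures above were measured, and the helper names `gemm_gflops` and `time_best` are hypothetical, not part of OpenBLAS or Accelerate.

```python
import time


def gemm_gflops(m, n, k, seconds, complex_valued=False):
    """Convert one GEMM timing into GFLOPS using conventional flop counts."""
    # Real GEMM: one multiply and one add per inner-product term -> 2*m*n*k.
    # Complex GEMM: a complex multiply-add is conventionally counted as
    # 8 real flops (6 for the multiply, 2 for the add) -> 8*m*n*k.
    flops_per_term = 8 if complex_valued else 2
    return flops_per_term * m * n * k / seconds / 1e9


def time_best(fn, repeats=5):
    """Return the fastest wall-clock time (seconds) over several calls to fn."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best
```

To benchmark a real kernel, pass `time_best` a zero-argument callable that performs a single GEMM call on preallocated matrices, then feed the result to `gemm_gflops` together with the matrix dimensions. Taking the best of several runs is one common choice for peak figures; reporting a median is equally defensible.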

Playing around with the SME instructions, I was able to get about 440 GFLOPS for dgemm on a base M5. I've also written kernels for sgemm, cgemm, and zgemm, all of which are in my repo here. The code is not optimized for small matrices yet, but it would be nice to get more performance out of the hardware.
