
SpeakerKit

March 7, 2025

SpeakerKit Pro diarizes this 6:22 audio file in ~2 seconds, right after WhisperKit Pro transcribed it in 25 seconds using Large v3 Turbo on an iPhone.

The top feature request from developers building with WhisperKit is on-device speaker diarization, the task of identifying "who spoke when". In response, we built SpeakerKit, the newest addition to the Argmax SDK family of on-device inference frameworks.

Highlights

  • Speed: Diarizes 4 minutes of audio in ~1 second on an iPhone. Several times faster than any other system we benchmarked, whether server-side or on-device.
  • Quality: Matches the error rate of state-of-the-art systems such as Pyannote across 13 datasets while delivering an order-of-magnitude speedup.
  • Size: ~10 megabytes in total. Bundle with your app or download in a blink.
  • Wide compatibility: Ship to all devices supported by iOS 16 or macOS 13 and newer. Android support is coming soon.
  • Modularity: Works together with WhisperKit to produce diarized transcripts ("who spoke what and when") and can also be used with any other transcription engine, a flexibility many server-side APIs do not offer.
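To make the "who spoke what and when" idea concrete, here is a minimal sketch of how word timestamps from any transcription engine could be combined with speaker segments from a diarizer. It is not the SpeakerKit or WhisperKit API: the `Word` and `SpeakerSegment` types, the function names, and the maximum-overlap rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical transcription output: one word with timestamps in seconds."""
    text: str
    start: float
    end: float


@dataclass
class SpeakerSegment:
    """Hypothetical diarization output: one speaker turn with timestamps in seconds."""
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, segments):
    """Label each word with the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for seg in segments:
            shared = overlap(word.start, word.end, seg.start, seg.end)
            if shared > best_overlap:
                best_speaker, best_overlap = seg.speaker, shared
        labeled.append((best_speaker, word))
    return labeled


def to_transcript(labeled):
    """Group consecutive words from the same speaker into one line per turn."""
    turns = []
    for speaker, word in labeled:
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word.text)
        else:
            turns.append((speaker, [word.text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in turns]


words = [Word("hello", 0.0, 0.4), Word("there", 0.5, 0.9), Word("hi", 1.1, 1.3)]
segments = [SpeakerSegment("A", 0.0, 1.0), SpeakerSegment("B", 1.0, 2.0)]
transcript = to_transcript(assign_speakers(words, segments))
```

Because the merge only needs timestamps on both sides, the same step works with any transcription engine that emits word-level timing.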

Benchmarks

We have also built SDBench, a Python toolkit for reproducibly benchmarking speaker diarization systems across 13+ widely used datasets. It follows standardized procedures to enable apples-to-apples comparison and a fine-grained understanding of tradeoffs. Because of conference submission restrictions, the code will be open-sourced and the accompanying paper published in April.

Roadmap

Commercial use cases for diarization generally involve diarizing transcripts, i.e. "who spoke what and when". Having attained state-of-the-art standalone diarization quality for SpeakerKit, as measured by diarization error rate (DER), our next focus is to attain the same level of quality for diarized transcripts, as measured by word-level DER (WDER), by optimizing the joint usage of WhisperKit and SpeakerKit.
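For readers new to these metrics, the sketch below shows how they are commonly computed. It assumes the usual definitions: DER is the missed, false-alarm, and speaker-confusion time divided by total reference speech time, and WDER is the fraction of aligned words given the wrong speaker. The function names are hypothetical, and this is not SDBench's implementation. It also assumes hypothesis speaker labels have already been mapped onto reference labels.

```python
def der(missed, false_alarm, confusion, total_reference):
    """Diarization error rate from durations in seconds.

    missed: reference speech the system labeled as non-speech
    false_alarm: non-speech the system labeled as speech
    confusion: speech attributed to the wrong speaker
    total_reference: total speech time in the reference annotation
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


def wder(aligned_words):
    """Word-level diarization error rate.

    aligned_words: (reference_speaker, hypothesis_speaker) pairs, one per word
    that the transcript alignment matched (correctly recognized or substituted).
    Assumes hypothesis labels were already mapped to reference labels.
    """
    pairs = list(aligned_words)
    if not pairs:
        raise ValueError("aligned_words must not be empty")
    wrong = sum(1 for ref, hyp in pairs if ref != hyp)
    return wrong / len(pairs)


example_der = der(missed=3.0, false_alarm=2.0, confusion=5.0, total_reference=100.0)
example_wder = wder([("A", "A"), ("A", "B"), ("B", "B"), ("B", "B")])
```

DER is measured in time and ignores the transcript. WDER is measured in words, so it also reflects how well speaker boundaries line up with word boundaries.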

SpeakerKit is more than just diarization. A major upcoming feature is speaker identification: Extracting voiceprints for a given speaker and identifying them in novel contexts.
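One common way to frame speaker identification is as nearest-neighbor matching between fixed-length voiceprint embeddings. The sketch below illustrates that idea only. The embedding vectors, the `identify` function, and the 0.7 threshold are assumptions for illustration, not SpeakerKit's design or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(embedding, enrolled, threshold=0.7):
    """Return the enrolled speaker most similar to `embedding`, or None.

    enrolled: dict mapping speaker name -> voiceprint vector (hypothetical format)
    threshold: minimum similarity to accept a match (illustrative value)
    """
    best_name, best_score = None, -1.0
    for name, voiceprint in enrolled.items():
        score = cosine_similarity(embedding, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
known = identify([0.9, 0.1, 0.0], enrolled)
stranger = identify([0.0, 0.0, 1.0], enrolled)
```

The threshold lets the matcher return no result for a voice that was never enrolled instead of forcing it onto the closest known speaker.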

Availability

We appreciate the 100+ applications for our Early Access Program (EAP). Due to engineering resource constraints, we were only able to grant access to a fraction of the applicants, such as MacWhisper and Detail. The EAP ends today and SpeakerKit Pro joins the Argmax SDK.

Argmax SDK is available with a license subscription for your application starting today!
