March 7, 2025
The top feature request from developers building with WhisperKit is on-device speaker diarization, the task of identifying "who spoke when". Responding to popular demand, we built SpeakerKit, the newest addition to the Argmax SDK family of on-device inference frameworks.
Highlights
Benchmarks
We have also built SDBench, a Python toolkit for reproducibly benchmarking speaker diarization systems across 13+ widely used datasets. It follows standardized procedures to enable apples-to-apples comparisons and a fine-grained understanding of tradeoffs. The code is now open source, and we encourage the community to contribute other state-of-the-art systems and relevant datasets to this benchmark.
Architecture
The system architecture is described in our research paper.
Roadmap
Commercial use cases for diarization generally involve diarizing transcripts, i.e., "who spoke what and when". Having attained state-of-the-art standalone diarization quality for SpeakerKit, as measured by Diarization Error Rate (DER), our next focus is to attain the same level of quality for diarized transcripts, as measured by Word Diarization Error Rate (WDER), by optimizing the joint usage of WhisperKit and SpeakerKit.
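For readers unfamiliar with the two metrics, here is a minimal Python sketch using their commonly cited definitions. It is an illustration only, not SDBench or SpeakerKit code: the function names and input shapes are hypothetical, and the exact scoring settings behind our reported numbers (for example, collar size or overlap handling) are not reflected here. DER is taken as the sum of missed speech, false alarm, and speaker confusion durations divided by total reference speech duration. WDER is taken as the fraction of ASR-aligned words (correct or substituted) that carry the wrong speaker label.

```python
from typing import Sequence, Tuple


def diarization_error_rate(
    missed: float,
    false_alarm: float,
    confusion: float,
    total_reference: float,
) -> float:
    """DER from error durations in seconds (hypothetical helper).

    Assumes the error durations were already computed by aligning
    hypothesis speaker segments against the reference.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (missed + false_alarm + confusion) / total_reference


def word_diarization_error_rate(
    aligned_words: Sequence[Tuple[str, str]],
) -> float:
    """WDER over aligned words (hypothetical helper).

    Each item is (reference_speaker, hypothesis_speaker) for one word that
    the ASR alignment matched to a reference word. Inserted and deleted
    words are assumed to be excluded, since they have no counterpart.
    Hypothesis speaker labels are assumed to be already mapped to
    reference speaker labels.
    """
    if not aligned_words:
        raise ValueError("aligned_words must not be empty")
    wrong = sum(1 for ref, hyp in aligned_words if ref != hyp)
    return wrong / len(aligned_words)


if __name__ == "__main__":
    # 2 s missed, 1 s false alarm, 3 s confusion over 60 s of reference speech.
    print(diarization_error_rate(2.0, 1.0, 3.0, 60.0))
    # One of four aligned words is attributed to the wrong speaker.
    print(word_diarization_error_rate(
        [("A", "A"), ("A", "B"), ("B", "B"), ("B", "B")]
    ))
```

The contrast between the two functions shows why the metrics can diverge: DER is weighted by time, while WDER is weighted by words, so a short segment dense with words counts for more under WDER.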
SpeakerKit is more than just diarization. A major upcoming feature is speaker identification: extracting voiceprints for a given speaker and identifying that speaker in novel contexts.
Availability
We appreciate the 100+ applications for our Early Access Program (EAP). Due to engineering resource constraints, we were only able to grant access to a fraction of the applicants, including MacWhisper and Detail. The EAP ends today and SpeakerKit Pro joins the Argmax SDK.
Argmax SDK is available with a license subscription for your application starting today!