Voice Notes on Mac: Siri Dictation vs WhisperKit vs Cloud Services
Compare voice-to-text options for Mac: Siri Dictation, WhisperKit (local), and cloud APIs. We test accuracy, privacy, speed, and offline support.
Sometimes the fastest way to capture an idea is to say it. But voice-to-text on Mac is not one-size-fits-all. Some options send your audio to the cloud. Others process everything on your device. The differences matter — especially for privacy and reliability.
Here is how the three main approaches compare.
The Three Approaches
| Feature | Siri Dictation | WhisperKit (Local) | Cloud APIs |
|---|---|---|---|
| Processing | Apple servers (partial on-device) | 100% on-device | Third-party servers |
| Internet required | Yes (mostly) | No | Yes |
| Privacy | Apple’s servers | Nothing leaves your Mac | Varies by provider |
| Languages | 60+ | 100+ | 100+ |
| Accuracy | Very good | Very good | Excellent |
| Speed | Near real-time | Near real-time | Depends on network |
| Cost | Free (built-in) | Free (open-source) | Pay per minute |
| Custom vocabulary | Limited | No | Some providers |
| Works offline | Limited languages | Yes, fully | No |
| Used in | All macOS apps | SlashNote | Various apps |
Siri Dictation: The Built-In Option
How it works
Siri Dictation is built into macOS. Press the dictation shortcut (the Fn/Globe key twice, or the dedicated microphone key on some keyboards) and start speaking. The system transcribes your speech and types it wherever your cursor is.
On newer Macs with Apple Silicon, some processing happens on-device for supported languages. But for most use cases, audio is still sent to Apple’s servers for processing.
Accuracy
Siri Dictation has improved significantly over the years. For everyday speech in common languages (English, Spanish, Mandarin, etc.), accuracy is very good — typically 95%+ for clear speech.
It handles punctuation commands well: saying “period,” “comma,” “new line,” or “question mark” inserts the correct punctuation.
Strengths:
- Excellent for common languages
- Good punctuation handling
- Continuous dictation (keeps listening)
- Works in every text field on macOS
Weaknesses:
- Technical jargon and proper nouns can be hit-or-miss
- Background noise reduces accuracy more than local models
- Occasional network lag causes delayed transcription
Privacy
Audio data goes to Apple’s servers for processing. Apple states that dictation data is tied to a random identifier rather than your Apple ID, is disassociated from that identifier after 6 months, and may be retained for up to 2 years before deletion.
With on-device dictation (Apple Silicon, supported languages), data stays local. But this isn’t available for all languages, and the system may still fall back to server processing.
Best for
- Quick dictation in any app (emails, messages, documents)
- Users who don’t want to install anything
- Casual note-taking with punctuation commands
WhisperKit: 100% On-Device
How it works
WhisperKit is an open-source speech-to-text engine based on OpenAI’s Whisper model, optimized to run on the Apple Neural Engine. It processes audio entirely on your Mac — no network connection needed, no data sent anywhere.
SlashNote uses WhisperKit for all voice features. You hold a modifier key (Cmd for raw transcription, Ctrl for AI-processed), speak, and release. The text appears in your note.
Accuracy
WhisperKit’s accuracy is comparable to Siri Dictation for most languages and often better for technical vocabulary. Because Whisper was trained on a massive multilingual dataset, it handles accents, code-switching (mixing languages), and domain-specific terms well.
Strengths:
- Strong with technical jargon and mixed-language speech
- Consistent accuracy regardless of network conditions
- Handles accents and dialects well
- Automatic language detection across 100+ languages
Weaknesses:
- No real-time streaming (processes after you finish speaking)
- No punctuation commands (relies on natural speech patterns)
- First inference may take a moment as the model loads
Privacy
This is WhisperKit’s strongest feature. Zero audio data leaves your device. Ever.
- No network requests during processing
- No audio stored after transcription
- No accounts, no API keys, no telemetry
- Runs on the Apple Neural Engine — fast, efficient, completely local
For anyone handling sensitive information — legal notes, medical dictation, private thoughts — this level of privacy is not available from cloud solutions.
Two modes in SlashNote
Voice Note (Hold Cmd): Raw transcription. You speak, WhisperKit converts to text, the text appears in your note. Simple, fast.
AI Voice Note (Hold Ctrl): You speak, WhisperKit converts to text, then AI processes the text into a structured note. Stream-of-consciousness input, organized output. The AI step uses your chosen provider (cloud or Ollama for fully local).
Best for
- Privacy-sensitive voice notes
- Offline use (airplane, spotty WiFi, no internet)
- Technical dictation (code terms, product names)
- Multilingual users who switch between languages
Cloud APIs: Maximum Power
How it works
Cloud speech-to-text APIs — OpenAI Whisper API, Google Cloud Speech-to-Text, Amazon Transcribe, Microsoft Azure Speech — send your audio to remote servers for processing. They return text with high accuracy and often include features like speaker diarization and custom vocabularies.
These APIs are typically used by apps rather than end users directly. If your note-taking app offers cloud-based voice input, it uses one of these services under the hood.
Accuracy
Cloud APIs generally offer the highest accuracy because they run the largest models on powerful server hardware.
OpenAI’s Whisper API and Google’s Speech-to-Text consistently top benchmarks:
- 97%+ word accuracy (word error rate under 3%) for clean speech
- Strong performance on noisy audio
- Excellent multilingual support
- Speaker diarization (who said what)
Strengths:
- Highest raw accuracy
- Best handling of noisy environments
- Speaker identification
- Custom vocabulary support (some providers)
- Real-time streaming capability
Weaknesses:
- Requires internet connection
- Audio sent to third-party servers
- Cost per minute of audio
- Latency depends on network conditions
- Rate limits on API usage
Privacy
This is where cloud APIs fall short. Your audio — your actual voice — is sent to servers operated by OpenAI, Google, Amazon, or Microsoft.
Each provider’s privacy policy is different:
- OpenAI Whisper API: Does not use API data for training by default. Audio is retained for 30 days for abuse monitoring.
- Google Cloud Speech: Data processed and deleted. May be used for service improvement unless opted out.
- Amazon Transcribe: Processed data may be used to improve the service. Can opt out.
- Microsoft Azure: Data retention varies by configuration.
For non-sensitive content, this may be acceptable. For private notes, medical dictation, or legal work, cloud processing introduces unnecessary risk.
Cost
Cloud APIs charge per minute of audio:
- OpenAI Whisper: ~$0.006 per minute
- Google Speech-to-Text: ~$0.006-0.024 per minute (depending on model)
- Amazon Transcribe: ~$0.024 per minute
- Azure Speech: ~$0.016 per minute
For occasional use, costs are minimal. For heavy dictation (hours per day), they add up.
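To make "they add up" concrete, here is a minimal sketch that estimates a monthly bill from the per-minute rates quoted above. The rates and usage patterns are illustrative, and the function name is invented for this example:

```python
# Rough monthly-cost estimate for cloud transcription,
# using the approximate per-minute rates quoted above.
RATES_PER_MINUTE = {
    "openai_whisper": 0.006,
    "google_stt": 0.016,        # mid-range of the $0.006-0.024 spread
    "amazon_transcribe": 0.024,
    "azure_speech": 0.016,
}

def monthly_cost(minutes_per_day: float, rate_per_minute: float,
                 days_per_month: int = 22) -> float:
    """Estimated monthly bill in dollars for a given daily dictation habit."""
    return minutes_per_day * days_per_month * rate_per_minute

# Occasional use: 10 minutes of dictation per working day
light = monthly_cost(10, RATES_PER_MINUTE["openai_whisper"])
# Heavy use: 3 hours of dictation per working day
heavy = monthly_cost(180, RATES_PER_MINUTE["amazon_transcribe"])

print(f"Light use (Whisper API): ${light:.2f}/month")   # about a dollar
print(f"Heavy use (Transcribe):  ${heavy:.2f}/month")   # closer to $100
```

At ten minutes a day the bill is pocket change; at three hours a day on the priciest rate it approaches $100 a month — while Siri Dictation and WhisperKit stay free.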
Best for
- Meeting transcription with multiple speakers
- Professional transcription services
- Apps that need maximum accuracy in noisy environments
- Use cases where privacy is not a primary concern
Head-to-Head Comparison
Accuracy test
For a simple test — reading a paragraph of mixed technical and everyday English in a quiet room — all three approaches score within 2-3% of each other. The differences emerge in edge cases:
| Scenario | Siri | WhisperKit | Cloud API |
|---|---|---|---|
| Quiet room, clear speech | Excellent | Excellent | Excellent |
| Background noise (cafe) | Good | Good | Very good |
| Technical jargon (programming) | Fair | Good | Very good |
| Mixed languages | Fair | Very good | Very good |
| Heavy accent | Good | Good | Very good |
| Offline | Limited | Full support | Not available |
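Accuracy figures like "95%+" are typically derived from word error rate (WER): the word-level edit distance between what you said and what was transcribed, divided by the length of the reference. A minimal sketch of the standard calculation (the function name is ours, not from any of the tools discussed):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "hold the command key and speak your note"
hyp = "hold the command key and speak your notes"
print(f"WER: {wer(ref, hyp):.1%}")  # one substitution out of eight words
```

A "95%+ accuracy" claim corresponds to a WER below 5% — roughly one wrong word per twenty.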
Speed test
| Metric | Siri | WhisperKit | Cloud API |
|---|---|---|---|
| Start-to-text (10 sec audio) | ~1-2 sec | ~2-3 sec | ~2-5 sec |
| Start-to-text (60 sec audio) | ~2-3 sec | ~5-8 sec | ~5-10 sec |
| First-time load | Instant | ~3-5 sec | Instant |
| Requires internet | Mostly yes | No | Yes |
Siri is fastest for short dictation because it streams in real time. WhisperKit processes after you finish speaking but is consistent. Cloud APIs depend on network conditions.
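One way to compare the batch-style numbers above is the real-time factor (RTF): processing time divided by audio duration. This sketch uses mid-range figures from the speed table; the values are illustrative, not benchmark results:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the audio is transcribed faster than it was spoken."""
    return processing_seconds / audio_seconds

# Mid-range figures from the 60-second row of the speed table above
for name, proc in [("Siri", 2.5), ("WhisperKit", 6.5), ("Cloud API", 7.5)]:
    rtf = real_time_factor(proc, 60)
    print(f"{name:<10} RTF ~ {rtf:.2f}")
```

All three come in well under 1.0, so none is a bottleneck for a one-minute note; the difference users feel is streaming (text appears as you talk) versus batch (text appears after you stop).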
Privacy scorecard
| Criterion | Siri | WhisperKit | Cloud API |
|---|---|---|---|
| Audio stays on device | Partial | Always | Never |
| No account required | Apple ID | No account | API key |
| Works without internet | Limited | Always | Never |
| Provider can hear your audio | Yes (mostly) | Never | Yes |
| Data retention | Up to 2 years | None | 30 days+ |
| Suitable for sensitive content | Depends | Yes | No |
Which Should You Use?
Use Siri Dictation if:
- You just need quick dictation in any app
- You don’t want to install anything extra
- Punctuation commands (“period”, “new paragraph”) are important to your workflow
- You primarily use one language
Use WhisperKit (via SlashNote) if:
- Privacy matters — no audio should leave your device
- You work offline regularly or on unreliable networks
- You dictate technical content (code, product names, jargon)
- You switch between languages
- You want voice input integrated with AI note processing
Use Cloud APIs if:
- You need the absolute highest accuracy
- You transcribe meetings with multiple speakers
- You need custom vocabulary for specialized domains
- Privacy is not a concern for the content being transcribed
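The guidance above can be condensed into a toy decision helper. The function and its parameters are invented for illustration; it simply mirrors the three checklists:

```python
def recommend(privacy_sensitive: bool, needs_offline: bool,
              multi_speaker: bool, wants_punctuation_commands: bool) -> str:
    """Map the article's checklists to a recommendation (illustrative only)."""
    if multi_speaker and not privacy_sensitive:
        return "Cloud API"            # speaker diarization, highest accuracy
    if privacy_sensitive or needs_offline:
        return "WhisperKit"           # nothing leaves the Mac, works offline
    if wants_punctuation_commands:
        return "Siri Dictation"       # "period", "new paragraph", etc.
    return "Siri Dictation"           # default: built in, zero install

print(recommend(privacy_sensitive=True, needs_offline=False,
                multi_speaker=False, wants_punctuation_commands=False))
```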
The Bigger Picture
Voice input is becoming a standard feature in productivity tools. The question is not whether to use it, but how.
For most note-taking — capturing ideas, recording thoughts, quick reminders — any of these three approaches works. The difference is where your voice data goes.
If you value privacy and want voice notes that stay entirely on your Mac, WhisperKit through SlashNote gives you that with no setup, no accounts, and no compromises on accuracy.