WhisperPen is a command-line tool that leverages speech recognition and AI to convert spoken words into enhanced text. It combines OpenAI's Whisper model for accurate speech recognition with Ollama's Qwen 2.5 32B model for text enhancement.
Speech Recognition
- Accept voice input from users
- Convert speech to text accurately
- Support Chinese language input
- Translate recognized Chinese into English
- Support wake word detection
  - Wake word: "小王小王" ("Xiao Wang, Xiao Wang")
  - Background listening
  - Low resource usage
  - Quick response time
AI Enhancement
- Use local Ollama platform
- Utilize Qwen 2.5 32B model
- Enhance text quality
- Maintain professional tone
Output Management
- Save to whisperpen.md
- Auto-copy to clipboard
- Support multiple outputs
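A minimal sketch of these output steps; the `whisperpen.md` file name comes from the spec, while the clipboard commands and function names are platform-specific assumptions:

```python
import platform
import subprocess
from pathlib import Path

def save_output(text: str, path: str = "whisperpen.md") -> None:
    """Append one enhanced result to the markdown output file,
    so multiple outputs accumulate rather than overwrite."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(text.rstrip() + "\n\n")

def copy_to_clipboard(text: str) -> None:
    """Best-effort clipboard copy using the platform's native command."""
    cmd = {"Darwin": ["pbcopy"], "Windows": ["clip"]}.get(
        platform.system(), ["xclip", "-selection", "clipboard"])
    subprocess.run(cmd, input=text.encode("utf-8"), check=False)
```

Appending (rather than truncating) is what makes "support multiple outputs" work with a single file.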
Improved Recognition
- Offline processing
- Noise reduction
- Better accuracy
- Fast response time
Performance Optimization
- Configuration caching
- Quick environment check
- Efficient resource usage
- Temporary file management
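The configuration-caching idea can be sketched as a small JSON cache that skips repeated environment checks on startup; the cache path and schema here are assumptions:

```python
import json
from pathlib import Path

# Hypothetical cache location; persists results of the environment check
# (model paths, audio devices) between runs.
CACHE_FILE = Path.home() / ".whisperpen" / "config_cache.json"

def load_cached_config(cache_file: Path = CACHE_FILE) -> dict:
    """Return the cached configuration, or {} if missing or corrupt."""
    if cache_file.exists():
        try:
            return json.loads(cache_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            pass  # corrupt cache: fall through and let the caller rebuild
    return {}

def save_cached_config(config: dict, cache_file: Path = CACHE_FILE) -> None:
    """Persist the configuration for the next launch."""
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(config, indent=2), encoding="utf-8")
```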
Speech Handler (`speech_handler.py`)

```python
class SpeechHandler:
    def __init__(self):
        # Initialize the Whisper model
        # Configure audio settings
        # Set up noise reduction
        ...
```
Text Processor (`text_processor.py`)

```python
class TextProcessor:
    def __init__(self):
        # Initialize the Qwen model
        # Configure processing parameters
        ...
```
File Handler (`file_handler.py`)

```python
class FileHandler:
    def __init__(self):
        # Set up file management
        # Configure clipboard access
        ...
```
Wake Word Detector (`wake_detector.py`)

```python
class WakeDetector:
    def __init__(self):
        # Initialize PocketSphinx
        # Configure the wake word model
        # Set up background listening
        ...
```
Audio Capture
- Sample rate: 44,100 Hz
- Bit depth: 16-bit
- Channels: Mono
- Noise reduction: Butterworth filter
- Preprocessing: scipy signal processing
- Volume normalization: Required
- Signal-to-noise ratio: Needs improvement
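The preprocessing chain above (a Butterworth filter via scipy, followed by volume normalization) might look like this; the cutoff frequency, filter order, and function name are illustrative choices, not the project's actual settings:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reduce_noise(samples: np.ndarray, sample_rate: int = 44100,
                 cutoff_hz: float = 100.0, order: int = 4) -> np.ndarray:
    """High-pass Butterworth filter to strip low-frequency rumble,
    then peak-normalize the volume."""
    b, a = butter(order, cutoff_hz, btype="highpass", fs=sample_rate)
    # filtfilt applies the filter forward and backward: zero phase shift.
    filtered = filtfilt(b, a, samples.astype(np.float64))
    peak = np.max(np.abs(filtered))
    return filtered / peak if peak > 0 else filtered
```

A high-pass stage like this mainly addresses hum and rumble; improving the signal-to-noise ratio further would need spectral methods beyond this sketch.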
Wake Word Detection
- Engine: PocketSphinx
- Wake word: "小王小王"
- Mode: Background listening
- Resource usage: Minimal
- Response time: < 0.5s
- States:
  - Sleeping (waiting for wake word)
  - Waking (transitioning)
  - Active (listening for commands)
  - Processing (handling input)
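The four states can be modeled as a small state machine; the allowed transitions below are an assumption about the intended flow (e.g. the detector may return to Sleeping from either Active or Processing):

```python
from enum import Enum, auto

class WakeState(Enum):
    SLEEPING = auto()    # waiting for the wake word
    WAKING = auto()      # transitioning
    ACTIVE = auto()      # listening for commands
    PROCESSING = auto()  # handling input

# Assumed legal transitions between the four states listed above.
_TRANSITIONS = {
    WakeState.SLEEPING: {WakeState.WAKING},
    WakeState.WAKING: {WakeState.ACTIVE},
    WakeState.ACTIVE: {WakeState.PROCESSING, WakeState.SLEEPING},
    WakeState.PROCESSING: {WakeState.ACTIVE, WakeState.SLEEPING},
}

class WakeStateMachine:
    def __init__(self):
        self.state = WakeState.SLEEPING

    def transition(self, new_state: WakeState) -> None:
        """Move to new_state, rejecting transitions not in the table."""
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Making illegal transitions raise early keeps the background listener from silently ending up in an inconsistent state.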
Speech Recognition
- Model: OpenAI Whisper base
- Model size: Upgrade to medium/large for better accuracy
- Language: Chinese
- Format: WAV
- Mode: Offline processing
- Initial prompt: Add language context
- Temperature: Lower for more accurate results
- Model Loading:
  - Cache model to disk
  - Lazy loading strategy
  - Optimize memory usage
  - Support model quantization
- Performance Optimization:
  - Model quantization (int8)
  - Batch processing
  - Smaller model for initial pass
  - Parallel processing
  - GPU acceleration if available
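The Whisper settings above (Chinese input, translation task, low temperature, language-context initial prompt) can be sketched as follows; this assumes the `openai-whisper` package, and the file name and helper are illustrative:

```python
def transcribe_options(language="zh", task="translate",
                       temperature=0.0, initial_prompt=None):
    """Build the keyword arguments passed to Whisper's transcribe().
    task="translate" makes Whisper emit English for Chinese audio;
    temperature=0.0 favors the most likely decoding."""
    opts = {"language": language, "task": task, "temperature": temperature}
    if initial_prompt is not None:
        opts["initial_prompt"] = initial_prompt  # short language-context hint
    return opts

# Typical usage (requires the openai-whisper package and an audio file):
#   import whisper
#   model = whisper.load_model("base")  # "medium"/"large" for better accuracy
#   result = model.transcribe("audio.wav", **transcribe_options(
#       initial_prompt="以下是普通话的句子。"))  # "The following are Mandarin sentences."
#   print(result["text"])
```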
Text Enhancement
- Model: qwen2.5:32b
- Task: Translation + Enhancement
- Context: Professional
- API: Ollama local deployment
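A minimal client sketch against Ollama's local `/api/generate` endpoint (its default port); the prompt wording and function names are illustrative assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_prompt(text: str) -> str:
    """Combine the translation and enhancement tasks into one instruction."""
    return ("Translate the following Chinese text to English, then "
            "rewrite it in a clear, professional tone. Return only the "
            f"final text.\n\n{text}")

def enhance(text: str, model: str = "qwen2.5:32b") -> str:
    """Send the prompt to the local Ollama server and return its reply."""
    payload = json.dumps({"model": model, "prompt": build_prompt(text),
                          "stream": False}).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Setting `"stream": False` returns one complete JSON object instead of a stream of partial responses, which keeps the client simple.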
Output Management
- Format: Markdown
- Location: whisperpen.md
- Clipboard: Automatic
- Cache: Configuration persistence
- Display Format:
  - Show original recognition
  - Show enhanced version
  - Use rich formatting
  - Support comparison view
Performance Metrics
- Recognition accuracy > 95%
- Processing time < 5s
- Memory usage < 4GB
Error Handling
- Audio capture failures
- Recognition errors
- Model loading issues
- File system errors
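One way to organize the failure categories above is a small exception hierarchy paired with user-facing hints; the class names and messages are illustrative:

```python
class WhisperPenError(Exception):
    """Base class for all WhisperPen failures."""

class AudioCaptureError(WhisperPenError):
    """Microphone or recording failures."""

class RecognitionError(WhisperPenError):
    """Whisper transcription failures."""

class ModelLoadError(WhisperPenError):
    """Whisper or Ollama model loading failures."""

class OutputError(WhisperPenError):
    """File system or clipboard failures."""

def describe(error: WhisperPenError) -> str:
    """Map an error to a helpful, user-facing message."""
    hints = {
        AudioCaptureError: "Check that a microphone is connected and allowed.",
        RecognitionError: "Try re-recording in a quieter environment.",
        ModelLoadError: "Verify the model is downloaded and Ollama is running.",
        OutputError: "Check write permissions for whisperpen.md.",
    }
    return hints.get(type(error), "Unexpected error; see logs for details.")
```

A shared base class lets the CLI catch `WhisperPenError` at the top level while still branching on the specific category for its message.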
User Experience
- Clear progress indicators
- Helpful error messages
- Intuitive interface
Planned Features
- Multiple language support
- Custom model selection
- Batch processing
- Configuration UI
Technical Debt
- Code optimization
- Test coverage
- Documentation
- Performance monitoring
All changes must be documented in:
- changelog.md - Feature and requirement changes
- README.md - User-facing documentation
- This design document - Technical specifications