A cross-platform desktop application built with Tauri and React for real-time voice-to-text transcription using Deepgram's API. Functional clone of Wispr Flow focusing on core voice input workflow with push-to-talk recording and live speech recognition.

Mausumi134/voice-to-text-desktop-app

Voice-to-Text Desktop App

A cross-platform desktop application built with Tauri and React that provides real-time voice-to-text transcription using Deepgram's API. It is a functional clone of Wispr Flow that focuses on the core voice-to-text workflow rather than on replicating the UI.

🎯 Features

  • Real-time Voice Transcription: Live speech-to-text using Deepgram's API
  • Push-to-Talk Interface: Hold to record, release to stop
  • Cross-platform: Works on Windows, macOS, and Linux
  • Audio Device Detection: Automatic microphone detection and configuration
  • Export Options: Copy to clipboard or download as text file
  • Clean UI: Modern, responsive interface with visual recording feedback

🛠 Tech Stack

  • Frontend: React with modern hooks and components
  • Backend: Rust with Tauri framework
  • Audio Processing: cpal for cross-platform audio capture
  • API Integration: Deepgram WebSocket streaming API
  • Build System: Vite for frontend, Cargo for Rust backend

📋 Prerequisites

Before running the application, ensure you have:

  1. Rust (latest stable version)

    • Install from rustup.rs
    • Verify installation: rustc --version
  2. Node.js (version 16 or higher)

    • Download from nodejs.org
    • Verify installation: node --version
  3. Deepgram API Key

    • Sign up at deepgram.com
    • Create a new project and copy your API key

🚀 Installation & Setup

  1. Clone the repository

    git clone <your-repo-url>
    cd voice-to-text-desktop-app
  2. Install dependencies

    npm install
  3. Run the development server

    npm run tauri dev
  4. For production build

    npm run tauri build

📱 Usage

  1. Launch the application

    • The app will open with an API key input screen
  2. Enter your Deepgram API key

    • Paste your API key and click "Start App"
  3. Grant microphone permissions

    • Allow the app to access your microphone when prompted
  4. Start recording

    • Hold the record button and speak clearly
    • Release the button to stop recording
  5. View transcription

    • Your speech will appear as text in real time
    • Use copy or download buttons to save the text

🏗 Architecture

Project Structure

voice-to-text-app/
├── src/                    # React frontend
│   ├── components/         # UI components
│   ├── hooks/             # Custom React hooks
│   └── App.jsx            # Main application
├── src-tauri/             # Rust backend
│   ├── src/
│   │   ├── audio/         # Audio capture & streaming
│   │   ├── commands.rs    # Tauri command handlers
│   │   └── lib.rs         # Main Rust application
│   └── Cargo.toml         # Rust dependencies
└── package.json           # Node.js dependencies

Technical Architecture

┌─────────────────────────────────────────┐
│                Frontend                 │
│  ┌─────────────┐  ┌─────────────────┐   │
│  │ UI Controls │  │ Audio Visualizer│   │
│  │ (Push-to-   │  │ & Text Display  │   │
│  │  Talk)      │  └─────────────────┘   │
│  └─────────────┘                        │
└─────────────────┬───────────────────────┘
                  │ Tauri Commands & Events
┌─────────────────▼───────────────────────┐
│              Rust Backend               │
│  ┌─────────────┐  ┌─────────────────┐   │
│  │ Audio       │  │ Deepgram        │   │
│  │ Capture     │  │ Integration     │   │
│  │ (cpal)      │  │ (WebSocket)     │   │
│  └─────────────┘  └─────────────────┘   │
└─────────────────────────────────────────┘

Key Components

  1. Audio Capture System

    • Uses cpal for cross-platform microphone access
    • Handles multiple audio formats (F32, I16, U16)
    • Adapts to device's native configuration
  2. Deepgram Integration

    • WebSocket streaming for real-time transcription
    • Secure TLS connection with proper headers
    • JSON response parsing and error handling
  3. Event-Driven Communication

    • Tauri events for real-time UI updates
    • Background thread processing
    • Clean separation between frontend and backend

🔧 Technical Decisions

Audio Processing

  • Device Compatibility: Uses the device's native audio configuration instead of forcing a specific format
  • Real-time Streaming: Direct PCM-to-WebSocket pipeline for minimal latency
  • Cross-platform: cpal library ensures compatibility across Windows, macOS, and Linux
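As a rough illustration of the format handling described above, the sketch below normalizes the three sample formats (F32, I16, U16) to 16-bit little-endian PCM before streaming. It is not taken from the app's source: the function names are hypothetical, and it assumes the audio is sent to Deepgram as linear16. I16 input needs no conversion and can go straight to the byte serializer.

```rust
/// Convert one f32 sample (nominally in [-1.0, 1.0]) to signed 16-bit PCM.
/// Out-of-range input is clamped rather than allowed to wrap.
pub fn f32_to_i16(sample: f32) -> i16 {
    (sample.clamp(-1.0, 1.0) * i16::MAX as f32) as i16
}

/// Convert one unsigned 16-bit sample to signed 16-bit PCM.
/// In the unsigned format, 32768 is the midpoint and represents silence.
pub fn u16_to_i16(sample: u16) -> i16 {
    (sample as i32 - 32768) as i16
}

/// Serialize signed 16-bit samples as little-endian bytes,
/// the layout expected for a raw linear16 stream.
pub fn i16_to_le_bytes(samples: &[i16]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}
```

In a capture callback, each incoming buffer would be mapped through the matching converter and the resulting bytes queued for the WebSocket sender.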

WebSocket Implementation

  • Secure Connection: TLS-enabled WebSocket to Deepgram's API
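  • Proper Headers: Sends the standard WebSocket upgrade headers (Host, Upgrade, Connection, Sec-WebSocket-Key, Sec-WebSocket-Version) along with the API key for authorization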
  • Error Recovery: Graceful handling of connection failures
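To make the connection setup concrete, here is a minimal sketch of how the streaming URL and authorization value could be assembled. The helper names are hypothetical and not from this repository. The endpoint, the query parameters (encoding, sample_rate, channels, language), and the `Token` authorization scheme reflect Deepgram's streaming API as commonly documented, so check the current API reference before relying on them.

```rust
/// Hypothetical helper: build the Deepgram streaming URL for raw linear16 audio.
/// `sample_rate` and `channels` should match the device's native configuration.
pub fn deepgram_listen_url(sample_rate: u32, channels: u16, language: &str) -> String {
    format!(
        "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate={}&channels={}&language={}",
        sample_rate, channels, language
    )
}

/// Hypothetical helper: value for the `Authorization` header of the handshake.
/// Surrounding whitespace is trimmed because pasted keys often carry a newline.
pub fn authorization_header(api_key: &str) -> String {
    format!("Token {}", api_key.trim())
}
```

Deriving the URL from the capture configuration keeps the declared sample rate in step with the audio actually sent, which matters because raw PCM carries no header describing its own format.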

State Management

  • Simple Architecture: React state with custom hooks
  • Event-based Updates: Real-time transcription via Tauri events
  • Thread Safety: Arc (atomic reference counting) for state shared across threads in Rust
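The shared-state pattern can be sketched with the standard library alone. The example below is an assumption about how such state might be organized, not the app's actual code: `RecordingState` and `run_session` are hypothetical names, and the worker thread stands in for the background task that would receive transcripts and forward them to the UI.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical state shared between command handlers and the worker thread.
#[derive(Default)]
pub struct RecordingState {
    /// Set while the push-to-talk button is held.
    pub is_recording: AtomicBool,
    /// Transcript text accumulated so far.
    pub transcript: Mutex<String>,
}

/// Spawn a worker that appends `words` to the transcript,
/// stopping early if recording is switched off.
pub fn run_session(state: Arc<RecordingState>, words: Vec<String>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for word in words {
            if !state.is_recording.load(Ordering::SeqCst) {
                break;
            }
            let mut text = state.transcript.lock().unwrap();
            if !text.is_empty() {
                text.push(' ');
            }
            text.push_str(&word);
        }
    })
}
```

Arc only provides shared ownership. The AtomicBool and Mutex inside it are what make concurrent reads and writes safe.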

⚠️ Known Limitations

  1. Internet Dependency: Requires an active internet connection to reach the Deepgram API
  2. API Key Required: A valid Deepgram API key is needed for the app to function
  3. Audio Device: Requires a working microphone for voice input
  4. Language Support: Currently configured for English (en-US)

🐛 Troubleshooting

Common Issues

"No input device available"

  • Ensure microphone is connected and working
  • Check system audio permissions
  • Try restarting the application

"TLS support not compiled in"

  • Ensure the WebSocket dependency in src-tauri/Cargo.toml is built with a TLS feature enabled
  • Run cargo clean and rebuild if necessary

"WebSocket protocol error"

  • Verify your Deepgram API key is correct
  • Check internet connection
  • Ensure firewall allows WebSocket connections

"Recording session error: stream configuration not supported"

  • This should be resolved in the current version
  • The app now adapts to your device's native audio format

Debug Mode

To see detailed logs, run the development version:

npm run tauri dev

Check the console for connection status and error messages.

🧪 Testing

The application has been tested with:

  • ✅ Real-time voice transcription
  • ✅ Multiple recording sessions
  • ✅ Copy and download functionality
  • ✅ Error handling scenarios
  • ✅ Audio device compatibility

📄 License

MIT License - See LICENSE file for details

🤝 Contributing

This is a technical assignment project demonstrating:

  • Cross-platform desktop app development
  • Real-time audio processing
  • API integration with WebSocket streaming
  • Clean code architecture
  • Modern Rust and React development

📞 Support

For technical questions or issues:

  1. Check the troubleshooting section above
  2. Review console logs in development mode
  3. Verify all prerequisites are installed correctly
