
RealTimeSTT

A fast Voice Activity Detection and Transcription System

Listens to the microphone, detects voice activity, and immediately transcribes speech using the faster_whisper model. Adapts to various environments with voice activity detection based on the ambient noise level.

Ideal for applications such as voice assistants, or anywhere immediate speech-to-text conversion with minimal latency is desired.

Features

  1. Voice Activity Detection: Automatically starts recording when speech is detected and stops when speech ends.
  2. Wake Word Detection: Starts recording when a specified wake word (or words) is detected.
  3. Buffer Management: Handles short-term and long-term audio buffers for efficient processing.
  4. Event Callbacks: Customizable callbacks for when recording starts or finishes.
  5. Noise Level Calculation: Adjusts based on the background noise for more accurate voice activity detection.
  6. Error Handling: Comprehensive error handling to catch and report any anomalies during execution.
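
The buffer management in point 3 can be illustrated with a small sketch (the class and parameter names here are hypothetical, not the library's internals): a bounded deque keeps only the most recent audio chunks, so that when voice activity is detected, a short stretch of pre-roll audio from just before the trigger can be included in the recording.

```python
from collections import deque

class PreRollBuffer:
    """Keep only the most recent audio chunks so a recording can
    include audio from just before voice activity was detected."""

    def __init__(self, pre_roll_seconds=0.5, chunk_duration=0.02):
        max_chunks = int(pre_roll_seconds / chunk_duration)
        # a bounded deque drops the oldest chunk automatically
        self._chunks = deque(maxlen=max_chunks)

    def feed(self, chunk):
        self._chunks.append(chunk)

    def drain(self):
        """Return the buffered pre-roll audio and clear the buffer."""
        chunks = list(self._chunks)
        self._chunks.clear()
        return chunks

buffer = PreRollBuffer()
for i in range(100):       # feed 100 chunks; only the newest 25 are kept
    buffer.feed(i)
print(buffer.drain())      # chunks 75..99
```

The deque's `maxlen` gives constant memory use regardless of how long the recorder listens before speech starts.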

Installation

pip install RealTimeSTT

GPU Support

To significantly improve transcription speed, especially in real-time applications, we strongly recommend utilizing GPU acceleration via CUDA. By default, the transcription is performed on the CPU.

  1. Install NVIDIA CUDA Toolkit 11.8.

  2. Install NVIDIA cuDNN 8.7.0 for CUDA 11.x:

    • Visit NVIDIA cuDNN Archive.
    • Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".
    • Download and install the software.
  3. Reconfigure PyTorch for CUDA:

    • If you have PyTorch installed, remove it: pip uninstall torch.
    • Install PyTorch again with CUDA support: pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118.

Note: To check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list.

Usage

Automatic Recording

from RealtimeSTT import AudioToTextRecorder

print(AudioToTextRecorder().text())

Manual Recording

from RealtimeSTT import AudioToTextRecorder

recorder = AudioToTextRecorder()
recorder.start()
recorder.stop()
print(recorder.text())

Callbacks

You can set callback functions to be executed when recording starts or stops:

from RealtimeSTT import AudioToTextRecorder

def my_start_callback():
    print("Recording started!")

def my_stop_callback():
    print("Recording stopped!")

recorder = AudioToTextRecorder(on_recording_started=my_start_callback,
                               on_recording_finished=my_stop_callback)

Configuration

The class comes with numerous configurable parameters such as buffer size, activity thresholds, and smoothing factors to fine-tune the recording and transcription process based on the specific needs of your application:

  • model: Specifies the size of the transcription model to use or the path to a converted model directory. Valid options are 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'. If a specific size is provided, the model is downloaded from the Hugging Face Hub.

  • language: Defines the language code for the speech-to-text engine. If not specified, the model will attempt to detect the language automatically.

  • wake_words: A comma-separated string of wake words to initiate recording. Supported wake words include 'alexa', 'americano', 'blueberry', 'bumblebee', 'computer', 'grapefruit', 'grasshopper', 'hey google', 'hey siri', 'jarvis', 'ok google', 'picovoice', 'porcupine', 'terminator'.

  • wake_words_sensitivity: Determines the sensitivity for wake word detection, ranging from 0 (least sensitive) to 1 (most sensitive). The default value is 0.5.

  • on_recording_started: A callable invoked when recording starts.

  • on_recording_finished: A callable invoked when recording ends.

  • min_recording_interval: Specifies the minimum interval (in seconds) for recording durations.

  • interval_between_records: Determines the interval (in seconds) between consecutive recordings.

  • buffer_duration: Indicates the duration (in seconds) to maintain pre-roll audio in the buffer.

  • voice_activity_threshold: The level above the long-term noise level at which the start of voice activity is detected.

  • voice_deactivity_sensitivity: Sensitivity level for voice deactivation detection, ranging from 0 (least sensitive) to 1 (most sensitive). The default value is 0.3.

  • voice_deactivity_silence_after_speech_end: Duration (in seconds) of silence required after speech ends to trigger voice deactivation. The default is 0.1 seconds.

  • long_term_smoothing_factor: Exponential smoothing factor utilized in calculating the long-term noise level.

  • short_term_smoothing_factor: Exponential smoothing factor for calculating the short-term noise level.

  • level: Sets the desired logging level for internal logging. Default is logging.WARNING.
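
The interplay of long_term_smoothing_factor, short_term_smoothing_factor, and voice_activity_threshold can be sketched in plain Python (the helper names and default values below are illustrative assumptions, not the library's actual internals): each incoming audio level updates two exponential moving averages, and voice activity begins when the fast short-term level rises above the slow long-term noise floor by more than the threshold.

```python
def ema(previous, sample, factor):
    """Exponential moving average: a factor close to 1 reacts slowly."""
    return factor * previous + (1.0 - factor) * sample

class NoiseTracker:
    def __init__(self, long_factor=0.99, short_factor=0.5, threshold=100.0):
        self.long_term = 0.0    # slow estimate: background noise floor
        self.short_term = 0.0   # fast estimate: current audio level
        self.long_factor = long_factor
        self.short_factor = short_factor
        self.threshold = threshold

    def update(self, level):
        self.long_term = ema(self.long_term, level, self.long_factor)
        self.short_term = ema(self.short_term, level, self.short_factor)
        # voice activity starts when the short-term level exceeds
        # the long-term noise floor by the configured threshold
        return self.short_term > self.long_term + self.threshold

tracker = NoiseTracker()
quiet = [tracker.update(50.0) for _ in range(200)]  # steady ambient noise
loud = [tracker.update(500.0) for _ in range(5)]    # sudden speech
print(any(quiet), loud[-1])  # → False True
```

Because the long-term average adapts slowly, a persistently noisy room raises the noise floor and the detector stays quiet, while a sudden jump in level (speech) still trips the threshold.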

Contribution

Contributions are always welcome!

License

MIT

Author

Kolja Beigel
Email: kolja.beigel@web.de
GitHub