  1. """
  2. Speech-to-Text (STT) Server with Real-Time Transcription and WebSocket Interface
  3. This server provides real-time speech-to-text (STT) transcription using the RealtimeSTT library. It allows clients to connect via WebSocket to send audio data and receive real-time transcription updates. The server supports configurable audio recording parameters, voice activity detection (VAD), and wake word detection. It is designed to handle continuous transcription as well as post-recording processing, enabling real-time feedback with the option to improve final transcription quality after the complete sentence is recognized.
  4. ### Features:
  5. - Real-time transcription using pre-configured or user-defined STT models.
  6. - WebSocket-based communication for control and data handling.
  7. - Flexible recording and transcription options, including configurable pauses for sentence detection.
  8. - Supports Silero and WebRTC VAD for robust voice activity detection.
  9. ### Starting the Server:
  10. You can start the server using the command-line interface (CLI) command `stt-server`, passing the desired configuration options.
  11. ```bash
  12. stt-server [OPTIONS]
  13. ```
  14. ### Available Parameters:
- `--model` (str, default: 'large-v2'): Path to the STT model or model size. Options: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2, or any huggingface CTranslate2 STT model like `deepdml/faster-whisper-large-v3-turbo-ct2`.
- `--realtime_model_type` (str, default: 'tiny.en'): Model size for real-time transcription. Same options as `--model`.
- `--language` (str, default: 'en'): Language code for the STT model. Leave empty for auto-detection.
- `--input_device_index` (int, default: 1): Index of the audio input device to use.
- `--silero_sensitivity` (float, default: 0.05): Sensitivity for Silero Voice Activity Detection (VAD). Lower values are less sensitive.
- `--webrtc_sensitivity` (int, default: 3): Sensitivity for WebRTC VAD. Higher values are less sensitive.
- `--min_length_of_recording` (float, default: 1.1): Minimum duration (in seconds) for a valid recording. Prevents short recordings.
- `--min_gap_between_recordings` (float, default: 0): Minimum time (in seconds) between consecutive recordings.
- `--enable_realtime_transcription` (flag, default: True): Enable real-time transcription of audio.
- `--realtime_processing_pause` (float, default: 0.02): Time interval (in seconds) between processing audio chunks for real-time transcription. Lower values increase responsiveness.
- `--silero_deactivity_detection` (flag, default: True): Use Silero model for end-of-speech detection.
- `--early_transcription_on_silence` (float, default: 0.2): Start transcription after the specified seconds of silence.
- `--beam_size` (int, default: 5): Beam size for the main transcription model.
- `--beam_size_realtime` (int, default: 3): Beam size for the real-time transcription model.
- `--initial_prompt` (str, default: '...'): Initial prompt for the transcription model to guide its output format and style.
- `--end_of_sentence_detection_pause` (float, default: 0.45): Duration of pause (in seconds) to consider as the end of a sentence.
- `--unknown_sentence_detection_pause` (float, default: 0.7): Duration of pause (in seconds) to consider as an unknown or incomplete sentence.
- `--mid_sentence_detection_pause` (float, default: 2.0): Duration of pause (in seconds) to consider as a mid-sentence break.
- `--control_port` (int, default: 8011): Port for the control WebSocket connection.
- `--data_port` (int, default: 8012): Port for the data WebSocket connection.

### WebSocket Interface:
The server supports two WebSocket connections:
1. **Control WebSocket**: Used to send and receive commands, such as setting parameters or calling recorder methods.
2. **Data WebSocket**: Used to send audio data for transcription and receive real-time transcription updates.

The server broadcasts real-time transcription updates to all connected clients on the data WebSocket.
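
### Example Client (sketch):
As a minimal, hypothetical client sketch (assuming the default ports above and the `websockets` package installed on the client side), a session might send one control command and stream one audio chunk as shown below. The binary framing mirrors `data_handler` in this file: a 4-byte little-endian metadata length, JSON metadata containing a `sampleRate` field, then raw 16-bit PCM audio.

```python
import asyncio
import json
import struct

import websockets  # client-side dependency, assumed available

async def demo():
    # Hypothetical example client; names and chunk contents are illustrative only.
    async with websockets.connect("ws://localhost:8011") as control:
        # Control channel: set a whitelisted recorder parameter
        await control.send(json.dumps({
            "command": "set_parameter",
            "parameter": "silero_sensitivity",
            "value": 0.1,
        }))
        print(await control.recv())  # e.g. {"status": "success", ...}

    async with websockets.connect("ws://localhost:8012") as data:
        # Data channel framing: 4-byte little-endian length prefix,
        # JSON metadata with the sample rate, then raw 16-bit PCM audio
        metadata = json.dumps({"sampleRate": 16000}).encode("utf-8")
        pcm_chunk = bytes(2048)  # 1024 silent 16-bit samples as a placeholder
        await data.send(struct.pack("<I", len(metadata)) + metadata + pcm_chunk)

        # Server events arrive as JSON text: realtime, fullSentence, recording_* ...
        print(await data.recv())

asyncio.run(demo())
```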
  40. """
  41. extended_logging = True
  42. send_recorded_chunk = False
  43. log_incoming_chunks = False
  44. stt_optimizations = False
  45. from .install_packages import check_and_install_packages
  46. from datetime import datetime
  47. import asyncio
  48. import base64
  49. import sys
  50. if sys.platform == 'win32':
  51. asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
  52. check_and_install_packages([
  53. {
  54. 'module_name': 'RealtimeSTT', # Import module
  55. 'attribute': 'AudioToTextRecorder', # Specific class to check
  56. 'install_name': 'RealtimeSTT', # Package name for pip install
  57. },
  58. {
  59. 'module_name': 'websockets', # Import module
  60. 'install_name': 'websockets', # Package name for pip install
  61. },
  62. {
  63. 'module_name': 'numpy', # Import module
  64. 'install_name': 'numpy', # Package name for pip install
  65. },
  66. {
  67. 'module_name': 'scipy.signal', # Submodule of scipy
  68. 'attribute': 'resample', # Specific function to check
  69. 'install_name': 'scipy', # Package name for pip install
  70. }
  71. ])
# Define ANSI color codes for terminal output
class bcolors:
    HEADER = '\033[95m'     # Magenta
    OKBLUE = '\033[94m'     # Blue
    OKCYAN = '\033[96m'     # Cyan
    OKGREEN = '\033[92m'    # Green
    WARNING = '\033[93m'    # Yellow
    FAIL = '\033[91m'       # Red
    ENDC = '\033[0m'        # Reset to default
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

print(f"{bcolors.BOLD}{bcolors.OKCYAN}Starting server, please wait...{bcolors.ENDC}")

import threading
import json
import websockets
from RealtimeSTT import AudioToTextRecorder
import numpy as np
from scipy.signal import resample

global_args = None
recorder = None
recorder_config = {}
recorder_ready = threading.Event()
stop_recorder = False
prev_text = ""

# Define allowed methods and parameters for security
allowed_methods = [
    'set_microphone',
    'abort',
    'stop',
    'clear_audio_queue',
    'wakeup',
    'shutdown',
    'text',
]
allowed_parameters = [
    'silero_sensitivity',
    'wake_word_activation_delay',
    'post_speech_silence_duration',
    'listen_start',
    'recording_stop_time',
    'last_transcription_bytes',
    'last_transcription_bytes_b64',
]

# Queues and connections for control and data
control_connections = set()
data_connections = set()
control_queue = asyncio.Queue()
audio_queue = asyncio.Queue()
def preprocess_text(text):
    # Remove leading whitespaces
    text = text.lstrip()

    # Remove starting ellipses if present
    if text.startswith("..."):
        text = text[3:]

    # Remove any leading whitespaces again after ellipses removal
    text = text.lstrip()

    # Uppercase the first letter
    if text:
        text = text[0].upper() + text[1:]

    return text

def text_detected(text, loop):
    global prev_text

    text = preprocess_text(text)

    if stt_optimizations:
        sentence_end_marks = ['.', '!', '?', '。']
        if text.endswith("..."):
            recorder.post_speech_silence_duration = global_args.mid_sentence_detection_pause
        elif text and text[-1] in sentence_end_marks and prev_text and prev_text[-1] in sentence_end_marks:
            recorder.post_speech_silence_duration = global_args.end_of_sentence_detection_pause
        else:
            recorder.post_speech_silence_duration = global_args.unknown_sentence_detection_pause

    prev_text = text

    # Put the message in the audio queue to be sent to clients
    message = json.dumps({
        'type': 'realtime',
        'text': text
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

    # Get current timestamp in HH:MM:SS.nnn format
    timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]

    if extended_logging:
        print(f" [{timestamp}] Realtime text: {bcolors.OKCYAN}{text}{bcolors.ENDC}\n", flush=True, end="")
    else:
        print(f"\r[{timestamp}] {bcolors.OKCYAN}{text}{bcolors.ENDC}", flush=True, end='')
def on_recording_start(loop):
    # Send a message to the client indicating recording has started
    message = json.dumps({
        'type': 'recording_start'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_recording_stop(loop):
    # Send a message to the client indicating recording has stopped
    message = json.dumps({
        'type': 'recording_stop'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_vad_detect_start(loop):
    message = json.dumps({
        'type': 'vad_detect_start'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_vad_detect_stop(loop):
    message = json.dumps({
        'type': 'vad_detect_stop'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)
def on_wakeword_detected(loop):
    # Send a message to the client when the wake word is detected
    message = json.dumps({
        'type': 'wakeword_detected'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)
def on_wakeword_detection_start(loop):
    # Send a message to the client when wake word detection starts
    message = json.dumps({
        'type': 'wakeword_detection_start'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_wakeword_detection_end(loop):
    # Send a message to the client when wake word detection ends
    message = json.dumps({
        'type': 'wakeword_detection_end'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_transcription_start(loop):
    # Send a message to the client when transcription starts
    message = json.dumps({
        'type': 'transcription_start'
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_realtime_transcription_update(text, loop):
    # Send real-time transcription updates to the client
    text = preprocess_text(text)
    message = json.dumps({
        'type': 'realtime_update',
        'text': text
    })
    asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

def on_recorded_chunk(chunk, loop):
    if send_recorded_chunk:
        bytes_b64 = base64.b64encode(chunk.tobytes()).decode('utf-8')
        message = json.dumps({
            'type': 'recorded_chunk',
            'bytes': bytes_b64
        })
        asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)
# Define the server's arguments
def parse_arguments():
    import argparse
    parser = argparse.ArgumentParser(description='Start the Speech-to-Text (STT) server with various configuration options.')

    parser.add_argument('--model', type=str, default='large-v2',
                        help='Path to the STT model or model size. Options include: tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2, or any huggingface CTranslate2 STT model such as deepdml/faster-whisper-large-v3-turbo-ct2. Default is large-v2.')

    parser.add_argument('--realtime_model_type', type=str, default='tiny.en',
                        help='Model size for real-time transcription. The options are the same as --model. This is used only if real-time transcription is enabled. Default is tiny.en.')

    parser.add_argument('--language', type=str, default='en',
                        help='Language code for the STT model to transcribe in a specific language. Leave this empty for auto-detection based on input audio. Default is en.')

    parser.add_argument('--input_device_index', type=int, default=1,
                        help='Index of the audio input device to use. Use this option to specify a particular microphone or audio input device based on your system. Default is 1.')

    parser.add_argument('--silero_sensitivity', type=float, default=0.05,
                        help='Sensitivity level for Silero Voice Activity Detection (VAD), with a range from 0 to 1. Lower values make the model less sensitive, useful for noisy environments. Default is 0.05.')

    parser.add_argument('--silero_use_onnx', action='store_true', default=False,
                        help='Enable ONNX version of Silero model for faster performance with lower resource usage. Default is False.')

    parser.add_argument('--webrtc_sensitivity', type=int, default=3,
                        help='Sensitivity level for WebRTC Voice Activity Detection (VAD), with a range from 0 to 3. Higher values make the model less sensitive, useful for cleaner environments. Default is 3.')

    parser.add_argument('--min_length_of_recording', type=float, default=1.1,
                        help='Minimum duration of valid recordings in seconds. This prevents very short recordings from being processed, which could be caused by noise or accidental sounds. Default is 1.1 seconds.')

    parser.add_argument('--min_gap_between_recordings', type=float, default=0,
                        help='Minimum time (in seconds) between consecutive recordings. Setting this helps avoid overlapping recordings when there’s a brief silence between them. Default is 0 seconds.')

    parser.add_argument('--enable_realtime_transcription', action='store_true', default=True,
                        help='Enable continuous real-time transcription of audio as it is received. When enabled, transcriptions are sent in near real-time. Default is True.')

    parser.add_argument('--realtime_processing_pause', type=float, default=0.02,
                        help='Time interval (in seconds) between processing audio chunks for real-time transcription. Lower values increase responsiveness but may put more load on the CPU. Default is 0.02 seconds.')

    parser.add_argument('--silero_deactivity_detection', action='store_true', default=True,
                        help='Use the Silero model for end-of-speech detection. This option can provide more robust silence detection in noisy environments, though it consumes more GPU resources. Default is True.')

    parser.add_argument('--early_transcription_on_silence', type=float, default=0.2,
                        help='Start transcription after the specified seconds of silence. This is useful when you want to trigger transcription mid-speech when there is a brief pause. Should be lower than post_speech_silence_duration. Set to 0 to disable. Default is 0.2 seconds.')

    parser.add_argument('--beam_size', type=int, default=5,
                        help='Beam size for the main transcription model. Larger values may improve transcription accuracy but increase the processing time. Default is 5.')

    parser.add_argument('--beam_size_realtime', type=int, default=3,
                        help='Beam size for the real-time transcription model. A smaller beam size allows for faster real-time processing but may reduce accuracy. Default is 3.')

    parser.add_argument('--initial_prompt', type=str,
                        default='End incomplete sentences with ellipses. Examples: Complete: The sky is blue. Incomplete: When the sky... Complete: She walked home. Incomplete: Because he...',
                        help='Initial prompt that guides the transcription model to produce transcriptions in a particular style or format. The default provides instructions for handling sentence completions and ellipsis usage.')

    parser.add_argument('--end_of_sentence_detection_pause', type=float, default=0.45,
                        help='The duration of silence (in seconds) that the model should interpret as the end of a sentence. This helps the system detect when to finalize the transcription of a sentence. Default is 0.45 seconds.')

    parser.add_argument('--unknown_sentence_detection_pause', type=float, default=0.7,
                        help='The duration of pause (in seconds) that the model should interpret as an incomplete or unknown sentence. This is useful for identifying when a sentence is trailing off or unfinished. Default is 0.7 seconds.')

    parser.add_argument('--mid_sentence_detection_pause', type=float, default=2.0,
                        help='The duration of pause (in seconds) that the model should interpret as a mid-sentence break. Longer pauses can indicate a pause in speech but not necessarily the end of a sentence. Default is 2.0 seconds.')

    parser.add_argument('--control_port', type=int, default=8011,
                        help='The port number used for the control WebSocket connection. Control connections are used to send and receive commands to the server. Default is port 8011.')

    parser.add_argument('--data_port', type=int, default=8012,
                        help='The port number used for the data WebSocket connection. Data connections are used to send audio data and receive transcription updates in real time. Default is port 8012.')

    parser.add_argument('--wake_words', type=str, default="Jarvis",
                        help='Specify the wake word(s) that will trigger the server to start listening. For example, setting this to "Jarvis" will make the system start transcribing when it detects the wake word "Jarvis". Default is "Jarvis".')

    parser.add_argument('--wake_words_sensitivity', type=float, default=0.5,
                        help='Sensitivity level for wake word detection, with a range from 0 (most sensitive) to 1 (least sensitive). Adjust this value based on your environment to ensure reliable wake word detection. Default is 0.5.')

    parser.add_argument('--wake_word_timeout', type=float, default=5.0,
                        help='Maximum time in seconds that the system will wait for a wake word before timing out. After this timeout, the system stops listening for wake words until reactivated. Default is 5.0 seconds.')
    parser.add_argument('--wake_word_activation_delay', type=float, default=20,
                        help='The delay in seconds before the wake word detection is activated after the system starts listening. This prevents false positives during the start of a session. Default is 20 seconds.')
    parser.add_argument('--wakeword_backend', type=str, default='pvporcupine',
                        help='The backend used for wake word detection. You can specify different backends such as "default" or any custom implementations depending on your setup. Default is "pvporcupine".')

    parser.add_argument('--openwakeword_model_paths', type=str, nargs='*',
                        help='A list of file paths to OpenWakeWord models. This is useful if you are using OpenWakeWord for wake word detection and need to specify custom models.')

    parser.add_argument('--openwakeword_inference_framework', type=str, default='tensorflow',
                        help='The inference framework to use for OpenWakeWord models. Supported frameworks could include "tensorflow", "pytorch", etc. Default is "tensorflow".')

    parser.add_argument('--wake_word_buffer_duration', type=float, default=1.0,
                        help='Duration of the buffer in seconds for wake word detection. This sets how long the system will store the audio before and after detecting the wake word. Default is 1.0 seconds.')

    parser.add_argument('--use_main_model_for_realtime', action='store_true',
                        help='Enable this option if you want to use the main model for real-time transcription, instead of the smaller, faster real-time model. Using the main model may provide better accuracy but at the cost of higher processing time.')

    parser.add_argument('--use_extended_logging', action='store_true',
                        help='Writes extensive log messages for the recording worker, that processes the audio chunks.')

    # Parse arguments
    args = parser.parse_args()

    # Replace escaped newlines with actual newlines in initial_prompt
    if args.initial_prompt:
        args.initial_prompt = args.initial_prompt.replace("\\n", "\n")

    return args
def _recorder_thread(loop):
    global recorder, prev_text, stop_recorder
    print(f"{bcolors.OKGREEN}Initializing RealtimeSTT server with parameters:{bcolors.ENDC}")
    for key, value in recorder_config.items():
        print(f"    {bcolors.OKBLUE}{key}{bcolors.ENDC}: {value}")
    recorder = AudioToTextRecorder(**recorder_config)
    print(f"{bcolors.OKGREEN}{bcolors.BOLD}RealtimeSTT initialized{bcolors.ENDC}")
    recorder_ready.set()

    def process_text(full_sentence):
        full_sentence = preprocess_text(full_sentence)
        message = json.dumps({
            'type': 'fullSentence',
            'text': full_sentence
        })
        # Use the passed event loop here
        asyncio.run_coroutine_threadsafe(audio_queue.put(message), loop)

        timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]

        if extended_logging:
            print(f" [{timestamp}] Full text: {bcolors.BOLD}Sentence:{bcolors.ENDC} {bcolors.OKGREEN}{full_sentence}{bcolors.ENDC}\n", flush=True, end="")
        else:
            print(f"\r[{timestamp}] {bcolors.BOLD}Sentence:{bcolors.ENDC} {bcolors.OKGREEN}{full_sentence}{bcolors.ENDC}\n")

    try:
        while not stop_recorder:
            recorder.text(process_text)
    except KeyboardInterrupt:
        print(f"{bcolors.WARNING}Exiting application due to keyboard interrupt{bcolors.ENDC}")
def decode_and_resample(
        audio_data,
        original_sample_rate,
        target_sample_rate):

    # Return the audio unchanged if no resampling is needed
    if original_sample_rate == target_sample_rate:
        return audio_data

    # Decode 16-bit PCM data to a numpy array
    audio_np = np.frombuffer(audio_data, dtype=np.int16)

    # Calculate the number of samples after resampling
    num_original_samples = len(audio_np)
    num_target_samples = int(num_original_samples * target_sample_rate /
                             original_sample_rate)

    # Resample the audio
    resampled_audio = resample(audio_np, num_target_samples)

    return resampled_audio.astype(np.int16).tobytes()
async def control_handler(websocket, path):
    print(f"{bcolors.OKGREEN}Control client connected{bcolors.ENDC}")
    global recorder
    control_connections.add(websocket)

    try:
        async for message in websocket:
            if not recorder_ready.is_set():
                print(f"{bcolors.WARNING}Recorder not ready{bcolors.ENDC}")
                continue

            if isinstance(message, str):
                # Handle text message (command)
                try:
                    command_data = json.loads(message)
                    command = command_data.get("command")
                    if command == "set_parameter":
                        parameter = command_data.get("parameter")
                        value = command_data.get("value")
                        if parameter in allowed_parameters and hasattr(recorder, parameter):
                            setattr(recorder, parameter, value)

                            # Format the value for output
                            if isinstance(value, float):
                                value_formatted = f"{value:.2f}"
                            else:
                                value_formatted = value

                            timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]
                            if extended_logging:
                                print(f" [{timestamp}] {bcolors.OKGREEN}Set recorder.{parameter} to: {bcolors.OKBLUE}{value_formatted}{bcolors.ENDC}")

                            # Optionally send a response back to the client
                            await websocket.send(json.dumps({"status": "success", "message": f"Parameter {parameter} set to {value}"}))
                        else:
                            if parameter not in allowed_parameters:
                                print(f"{bcolors.WARNING}Parameter {parameter} is not allowed (set_parameter){bcolors.ENDC}")
                                await websocket.send(json.dumps({"status": "error", "message": f"Parameter {parameter} is not allowed (set_parameter)"}))
                            else:
                                print(f"{bcolors.WARNING}Parameter {parameter} does not exist (set_parameter){bcolors.ENDC}")
                                await websocket.send(json.dumps({"status": "error", "message": f"Parameter {parameter} does not exist (set_parameter)"}))

                    elif command == "get_parameter":
                        parameter = command_data.get("parameter")
                        request_id = command_data.get("request_id")  # Get the request_id from the command data
                        if parameter in allowed_parameters and hasattr(recorder, parameter):
                            value = getattr(recorder, parameter)
                            if isinstance(value, float):
                                value_formatted = f"{value:.2f}"
                            else:
                                value_formatted = f"{value}"

                            value_truncated = value_formatted[:39] + "…" if len(value_formatted) > 40 else value_formatted

                            timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]
                            if extended_logging:
                                print(f" [{timestamp}] {bcolors.OKGREEN}Get recorder.{parameter}: {bcolors.OKBLUE}{value_truncated}{bcolors.ENDC}")

                            response = {"status": "success", "parameter": parameter, "value": value}
                            if request_id is not None:
                                response["request_id"] = request_id
                            await websocket.send(json.dumps(response))
                        else:
                            if parameter not in allowed_parameters:
                                print(f"{bcolors.WARNING}Parameter {parameter} is not allowed (get_parameter){bcolors.ENDC}")
                                await websocket.send(json.dumps({"status": "error", "message": f"Parameter {parameter} is not allowed (get_parameter)"}))
                            else:
                                print(f"{bcolors.WARNING}Parameter {parameter} does not exist (get_parameter){bcolors.ENDC}")
                                await websocket.send(json.dumps({"status": "error", "message": f"Parameter {parameter} does not exist (get_parameter)"}))

                    elif command == "call_method":
                        method_name = command_data.get("method")
                        if method_name in allowed_methods:
                            method = getattr(recorder, method_name, None)
                            if method and callable(method):
                                args = command_data.get("args", [])
                                kwargs = command_data.get("kwargs", {})
                                method(*args, **kwargs)
                                timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]
                                print(f" [{timestamp}] {bcolors.OKGREEN}Called method recorder.{bcolors.OKBLUE}{method_name}{bcolors.ENDC}")
                                await websocket.send(json.dumps({"status": "success", "message": f"Method {method_name} called"}))
                            else:
                                print(f"{bcolors.WARNING}Recorder does not have method {method_name}{bcolors.ENDC}")
                                await websocket.send(json.dumps({"status": "error", "message": f"Recorder does not have method {method_name}"}))
                        else:
                            print(f"{bcolors.WARNING}Method {method_name} is not allowed{bcolors.ENDC}")
                            await websocket.send(json.dumps({"status": "error", "message": f"Method {method_name} is not allowed"}))
                    else:
                        print(f"{bcolors.WARNING}Unknown command: {command}{bcolors.ENDC}")
                        await websocket.send(json.dumps({"status": "error", "message": f"Unknown command {command}"}))
                except json.JSONDecodeError:
                    print(f"{bcolors.WARNING}Received invalid JSON command{bcolors.ENDC}")
                    await websocket.send(json.dumps({"status": "error", "message": "Invalid JSON command"}))
            else:
                print(f"{bcolors.WARNING}Received unknown message type on control connection{bcolors.ENDC}")
    except websockets.exceptions.ConnectionClosed as e:
        print(f"{bcolors.WARNING}Control client disconnected: {e}{bcolors.ENDC}")
    finally:
        control_connections.remove(websocket)
async def data_handler(websocket, path):
    print(f"{bcolors.OKGREEN}Data client connected{bcolors.ENDC}")
    data_connections.add(websocket)
    try:
        while True:
            message = await websocket.recv()
            if isinstance(message, bytes):
                if log_incoming_chunks:
                    print(".", end='', flush=True)
                # Handle binary message (audio data)
                metadata_length = int.from_bytes(message[:4], byteorder='little')
                metadata_json = message[4:4+metadata_length].decode('utf-8')
                metadata = json.loads(metadata_json)
                sample_rate = metadata['sampleRate']
                chunk = message[4+metadata_length:]
                resampled_chunk = decode_and_resample(chunk, sample_rate, 16000)
                recorder.feed_audio(resampled_chunk)
            else:
                print(f"{bcolors.WARNING}Received non-binary message on data connection{bcolors.ENDC}")
    except websockets.exceptions.ConnectionClosed as e:
        print(f"{bcolors.WARNING}Data client disconnected: {e}{bcolors.ENDC}")
    finally:
        data_connections.remove(websocket)
        recorder.clear_audio_queue()  # Ensure audio queue is cleared if client disconnects
async def broadcast_audio_messages():
    while True:
        message = await audio_queue.get()
        for conn in list(data_connections):
            try:
                timestamp = datetime.now().strftime('%H:%M:%S.%f')[:-3]

                if extended_logging:
                    print(f" [{timestamp}] Sending message: {bcolors.OKBLUE}{message}{bcolors.ENDC}\n", flush=True, end="")
                await conn.send(message)
            except websockets.exceptions.ConnectionClosed:
                data_connections.remove(conn)

# Helper function to create event loop bound closures for callbacks
def make_callback(loop, callback):
    def inner_callback(*args, **kwargs):
        callback(*args, **kwargs, loop=loop)
    return inner_callback
async def main_async():
    global stop_recorder, recorder_config, global_args
    args = parse_arguments()
    global_args = args

    # Get the event loop here and pass it to the recorder thread
    loop = asyncio.get_event_loop()

    recorder_config = {
        'model': args.model,
        'realtime_model_type': args.realtime_model_type,
        'language': args.language,
        'input_device_index': args.input_device_index,
        'silero_sensitivity': args.silero_sensitivity,
        'silero_use_onnx': args.silero_use_onnx,
        'webrtc_sensitivity': args.webrtc_sensitivity,
        'post_speech_silence_duration': args.unknown_sentence_detection_pause,
        'min_length_of_recording': args.min_length_of_recording,
        'min_gap_between_recordings': args.min_gap_between_recordings,
        'enable_realtime_transcription': args.enable_realtime_transcription,
        'realtime_processing_pause': args.realtime_processing_pause,
        'silero_deactivity_detection': args.silero_deactivity_detection,
        'early_transcription_on_silence': args.early_transcription_on_silence,
        'beam_size': args.beam_size,
        'beam_size_realtime': args.beam_size_realtime,
        'initial_prompt': args.initial_prompt,
        'wake_words': args.wake_words,
        'wake_words_sensitivity': args.wake_words_sensitivity,
        'wake_word_timeout': args.wake_word_timeout,
        'wake_word_activation_delay': args.wake_word_activation_delay,
        'wakeword_backend': args.wakeword_backend,
        'openwakeword_model_paths': args.openwakeword_model_paths,
        'openwakeword_inference_framework': args.openwakeword_inference_framework,
        'wake_word_buffer_duration': args.wake_word_buffer_duration,
        'use_main_model_for_realtime': args.use_main_model_for_realtime,
        'spinner': False,
        'use_microphone': False,
        'on_realtime_transcription_update': make_callback(loop, text_detected),
        'on_recording_start': make_callback(loop, on_recording_start),
        'on_recording_stop': make_callback(loop, on_recording_stop),
        'on_vad_detect_start': make_callback(loop, on_vad_detect_start),
        'on_vad_detect_stop': make_callback(loop, on_vad_detect_stop),
        'on_wakeword_detected': make_callback(loop, on_wakeword_detected),
        'on_wakeword_detection_start': make_callback(loop, on_wakeword_detection_start),
        'on_wakeword_detection_end': make_callback(loop, on_wakeword_detection_end),
        'on_transcription_start': make_callback(loop, on_transcription_start),
        'on_recorded_chunk': make_callback(loop, on_recorded_chunk),
        'no_log_file': True,  # Disable logging to file
        'use_extended_logging': args.use_extended_logging,
    }
    control_server = await websockets.serve(control_handler, "localhost", args.control_port)
    data_server = await websockets.serve(data_handler, "localhost", args.data_port)
    print(f"{bcolors.OKGREEN}Control server started on {bcolors.OKBLUE}ws://localhost:{args.control_port}{bcolors.ENDC}")
    print(f"{bcolors.OKGREEN}Data server started on {bcolors.OKBLUE}ws://localhost:{args.data_port}{bcolors.ENDC}")

    # Task to broadcast audio messages
    broadcast_task = asyncio.create_task(broadcast_audio_messages())

    recorder_thread = threading.Thread(target=_recorder_thread, args=(loop,))
    recorder_thread.start()
    recorder_ready.wait()

    print(f"{bcolors.OKGREEN}Server started. Press Ctrl+C to stop the server.{bcolors.ENDC}")

    try:
        await asyncio.gather(control_server.wait_closed(), data_server.wait_closed(), broadcast_task)
    except KeyboardInterrupt:
        print(f"{bcolors.WARNING}{bcolors.BOLD}Shutting down gracefully...{bcolors.ENDC}")
    finally:
        # Shut down the recorder
        if recorder:
            stop_recorder = True
            recorder.abort()
            recorder.stop()
            recorder.shutdown()
            print(f"{bcolors.OKGREEN}Recorder shut down{bcolors.ENDC}")

            recorder_thread.join()
            print(f"{bcolors.OKGREEN}Recorder thread finished{bcolors.ENDC}")

        # Cancel all active tasks in the event loop
        tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
        for task in tasks:
            task.cancel()

        # Run pending tasks and handle cancellation
        await asyncio.gather(*tasks, return_exceptions=True)

        print(f"{bcolors.OKGREEN}All tasks cancelled, closing event loop now.{bcolors.ENDC}")

def main():
    try:
        asyncio.run(main_async())
    except KeyboardInterrupt:
        # Capture any final KeyboardInterrupt to prevent it from showing up in logs
        print(f"{bcolors.WARNING}Server interrupted by user.{bcolors.ENDC}")
        exit(0)

if __name__ == '__main__':
    main()