Google Speech-to-Text API Analysis

Analysis of Google Cloud Speech-to-Text transcription and speaker diarization

1. Only a few audio formats are supported, such as FLAC and LINEAR16 (WAV).
2. The Streaming method returns results quickly even for long audio files, but it has a 10 MB file size limit. (See the Google Cloud Speech-to-Text API Streaming example.)
3. The Long_Running method, which reads the audio from Google Cloud Storage, can process long audio files of 10 MB or more, but responses take longer and a timeout can occur. (See the Google Cloud Speech-to-Text API Long_Running example.)
4. Speaker diarization supports only the en-US, en-IN, and es-ES languages. (See the Google Cloud Speech-to-Text API speaker_diarization (English only) example.)
5. As of June 2019, the solution Google STT offers for long Korean audio is the Long_Running method (item 3 above). This method does not support speaker diarization.
6. To process an audio file of 10 MB or more with the Streaming method, split the file into pieces of 10 MB or less and process each piece. (See the Google Cloud Speech-to-Text API Streaming example, which also saves the results to a text file.)
7. To process a long file as a real-time stream through the Microphone Streaming API, see the Google Cloud Speech-to-Text API Microphone Streaming emulate file and the Google STT Python functions code.
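
The file-splitting approach mentioned above can be sketched with Python's standard `wave` module. This is only an illustration, not the code from the referenced Streaming example: the function name `split_wav`, the output naming scheme, and reading the "10M" limit as 10 * 1024 * 1024 bytes are all assumptions. It handles uncompressed LINEAR16 (WAV) input only and cuts at fixed byte boundaries, so a word may be split across two parts.

```python
import wave

# Assumed reading of the "10M" limit; adjust if the service counts differently.
MAX_CHUNK_BYTES = 10 * 1024 * 1024
WAV_HEADER_BYTES = 44  # size of the header the wave module writes for PCM data


def split_wav(src_path, prefix, max_bytes=MAX_CHUNK_BYTES):
    """Split a PCM WAV file into consecutive files of at most max_bytes each.

    Hypothetical helper: returns the list of part paths in playback order.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frame_size = params.sampwidth * params.nchannels
        # Reserve room for the header so the whole file stays under the limit.
        frames_per_chunk = (max_bytes - WAV_HEADER_BYTES) // frame_size
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold a single frame")
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width, and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            paths.append(out_path)
            index += 1
    return paths
```

Each returned path could then be sent through the Streaming method in order and the transcripts concatenated. Splitting at silence instead of at fixed offsets would likely give cleaner results, but that is outside this sketch.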

JSON representation: google.cloud.speech.types.RecognitionConfig
encoding: audio encoding (LINEAR16, FLAC, etc.); see the Google documentation
sample_rate_hertz: audio sampling rate (8000, 16000, 44100, 48000, etc.); see the Wikipedia article
language_code: language code (Korean: 'ko-KR', English: 'en-US')
enable_automatic_punctuation: automatically add punctuation (True or False)
enable_word_time_offsets: include start/end time for each word (True or False)
enable_speaker_diarization: enable speaker diarization (True or False)
diarization_speaker_count: number of speakers to assume when diarization is enabled
model: transcription model to use (video, phone_call, command_and_search, default); see the Google documentation
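
As a rough illustration of how these fields fit together, the sketch below builds the configuration as a plain Python dictionary and applies the diarization language restriction stated in the notes above. The helper name `build_recognition_config`, its defaults, and the `DIARIZATION_LANGUAGES` set are assumptions made for this example. Whether the client library accepts the string `"LINEAR16"` directly or requires its own enum value for `encoding` is not verified here.

```python
# Languages these notes list as supporting speaker diarization (as of June 2019).
DIARIZATION_LANGUAGES = {"en-US", "en-IN", "es-ES"}


def build_recognition_config(
    language_code="ko-KR",
    encoding="LINEAR16",
    sample_rate_hertz=16000,
    punctuation=True,
    word_time_offsets=False,
    speaker_count=None,
    model="default",
):
    """Return a dict whose keys mirror the RecognitionConfig fields above.

    Hypothetical helper: diarization is enabled only when speaker_count
    is given, and only for languages in DIARIZATION_LANGUAGES.
    """
    config = {
        "encoding": encoding,
        "sample_rate_hertz": sample_rate_hertz,
        "language_code": language_code,
        "enable_automatic_punctuation": punctuation,
        "enable_word_time_offsets": word_time_offsets,
        "model": model,
    }
    if speaker_count is not None:
        if language_code not in DIARIZATION_LANGUAGES:
            raise ValueError(
                f"speaker diarization is not listed as supported for {language_code}"
            )
        config["enable_speaker_diarization"] = True
        config["diarization_speaker_count"] = speaker_count
    return config


# Korean transcription with word timestamps, no diarization.
korean_config = build_recognition_config(word_time_offsets=True)

# English phone call with two speakers.
english_config = build_recognition_config(
    language_code="en-US", sample_rate_hertz=8000,
    speaker_count=2, model="phone_call",
)
```

Rejecting the unsupported combination up front keeps a Korean request from silently coming back without speaker labels.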