When processing video with OpenCV, we sometimes want to play or process the audio at the same time; since OpenCV handles only the image frames, the audio is processed with a separate Python library. This article explains points to keep in mind when processing audio while displaying video with OpenCV.
Video frame and Audio frame
The main function is shown below.
import time

import cv2

def main(video_file, audio_file):
    print(f"{video_file=},{audio_file=}")
    cap = cv2.VideoCapture(video_file)
    player = WithPyAudio(audio_file)  # WithPyAudio is defined below
    start_time = time.time()
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        cv2.imshow(video_file, frame)
        elapsed = (time.time() - start_time) * 1000  # msec
        play_time = int(cap.get(cv2.CAP_PROP_POS_MSEC))
        sleep = max(1, int(play_time - elapsed))
        if cv2.waitKey(sleep) & 0xFF == ord("q"):
            break
    player.close()
    cap.release()
    cv2.destroyAllWindows()
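For reference, a minimal way to invoke this function might look like the following (a sketch with hypothetical command-line handling; the repository linked below may differ):

import sys

if __name__ == "__main__":
    # usage: python main.py <video_file> <audio_file>
    main(sys.argv[1], sys.argv[2])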
The video file (video_file) and audio file (audio_file) are passed to the main function separately. A normal video file also contains its audio track, so the audio must be extracted beforehand and passed to the main function as a separate file. To extract the audio from the video with ffmpeg, execute the following:
ffmpeg -i test.mp4 -vn -f wav test.wav
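If you prefer to split the audio from Python instead of the shell, the same command can be run via subprocess (a sketch assuming ffmpeg is available on the PATH; the file names are placeholders):

import subprocess

def extract_audio(video_path, audio_path):
    # -vn drops the video stream, -f wav forces WAV output
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-f", "wav", audio_path],
        check=True,
    )

extract_audio("test.mp4", "test.wav")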
Video frames are obtained with cap.read(). No special processing is required for the audio data inside the loop.
ret, frame = cap.read()
if not ret:
    break
elapsed is set to the actual time that has passed since playback began, and play_time is set to the current playback position of the video, which can be obtained with cv2.CAP_PROP_POS_MSEC. By waiting until the elapsed time catches up with play_time, the video playback speed is adjusted to real time. For example, if the current frame's playback position is 500 ms but only 480 ms of wall-clock time has passed, cv2.waitKey sleeps for about 20 ms. No special processing is required to keep the audio playback at real time.
play_time = int(cap.get(cv2.CAP_PROP_POS_MSEC))
sleep = max(1, int(play_time - elapsed))
if cv2.waitKey(sleep) & 0xFF == ord("q"):
    break
Finally, the PyAudio side is described.
import wave

import pyaudio

class WithPyAudio:
    def __init__(self, audio_file):
        wf = wave.open(audio_file, "rb")
        p = pyaudio.PyAudio()
        self.wf = wf
        self.p = p
        print(f"{wf.getsampwidth()=},{wf.getnchannels()=},{wf.getframerate()=}")
        self.stream = p.open(
            format=p.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            output=True,
            stream_callback=self._stream_cb,
        )

    def close(self):
        self.stream.stop_stream()
        self.stream.close()
        self.p.terminate()
        self.wf.close()

    def _stream_cb(self, in_data, frame_count, time_info, status):
        data = self.wf.readframes(frame_count)
        # process audio here
        return (data, pyaudio.paContinue)
Create the wave file object wf and the PyAudio object p. When p.open is called, the callback function _stream_cb, passed as the stream_callback keyword argument, is invoked periodically to supply the next chunk of audio data.
def __init__(self, audio_file):
    wf = wave.open(audio_file, "rb")
    p = pyaudio.PyAudio()
    self.wf = wf
    self.p = p
    print(f"{wf.getsampwidth()=},{wf.getnchannels()=},{wf.getframerate()=}")
    self.stream = p.open(
        format=p.get_format_from_width(wf.getsampwidth()),
        channels=wf.getnchannels(),
        rate=wf.getframerate(),
        output=True,
        stream_callback=self._stream_cb,
    )
In _stream_cb, an audio frame is read from the wave file object wf and returned. If the audio is to be processed, the frames retrieved in this callback function are the place to do it.
def _stream_cb(self, in_data, frame_count, time_info, status):
    data = self.wf.readframes(frame_count)
    # process audio here
    return (data, pyaudio.paContinue)
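As an illustration of what "process audio here" could look like, the sketch below halves the playback volume. This assumes the WAV file contains 16-bit PCM samples (i.e. wf.getsampwidth() == 2); other sample widths would need a different dtype.

import numpy as np

def _stream_cb(self, in_data, frame_count, time_info, status):
    data = self.wf.readframes(frame_count)
    # interpret the raw bytes as 16-bit signed integer samples (assumed format)
    samples = np.frombuffer(data, dtype=np.int16)
    # example processing: halve the volume
    samples = (samples // 2).astype(np.int16)
    return (samples.tobytes(), pyaudio.paContinue)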
Source code is available at https://github.com/otamajakusi/opencv_video_with_audio
Application Examples
As an application example, consider performing object recognition on the video and speech recognition on the audio simultaneously, and highlighting the corresponding object whenever a specific sound is detected. Please refer to the video below for a more detailed explanation.
In this example, the Shogi board is recognized with YOLOv5 while the commentator's voice, such as "45 fu" (a fu is like a pawn in chess), "88 gyoku" or "79 gyoku" (a gyoku is the king in chess), is recognized with wav2vec2, and the position of the spoken coordinates is highlighted on the Shogi board.
That’s all.