When processing video with OpenCV, we sometimes want to play or process the audio at the same time; since OpenCV handles only the image frames, the audio is processed with a separate Python library. This article explains points to keep in mind when processing audio while displaying video with OpenCV.
Video frame and Audio frame
The main function is shown below.
import time

import cv2

def main(video_file, audio_file):
    print(f"{video_file=},{audio_file=}")
    cap = cv2.VideoCapture(video_file)
    player = WithPyAudio(audio_file)  # WithPyAudio is defined below
    start_time = time.time()
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        cv2.imshow(video_file, frame)
        elapsed = (time.time() - start_time) * 1000  # msec
        play_time = int(cap.get(cv2.CAP_PROP_POS_MSEC))
        sleep = max(1, int(play_time - elapsed))
        if cv2.waitKey(sleep) & 0xFF == ord("q"):
            break
    player.close()
    cap.release()
    cv2.destroyAllWindows()
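For reference, a minimal way to invoke this function might look like the following (a sketch with hypothetical command-line handling; the repository linked below may differ):

import sys

if __name__ == "__main__":
    # usage: python main.py <video_file> <audio_file>
    main(sys.argv[1], sys.argv[2])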
The video file (video_file) and audio file (audio_file) are passed to the main function separately. A normal video file also contains its audio track, so the audio must be extracted beforehand and passed to the main function as a separate file. To extract the audio from the video with ffmpeg, execute the following:
ffmpeg -i test.mp4 -vn -f wav test.wav
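If you prefer to split the audio from Python instead of the shell, the same command can be run via subprocess (a sketch assuming ffmpeg is available on the PATH; the file names are placeholders):

import subprocess

def extract_audio(video_path, audio_path):
    # -vn drops the video stream, -f wav forces WAV output
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-f", "wav", audio_path],
        check=True,
    )

extract_audio("test.mp4", "test.wav")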
Video frames are obtained with cap.read(). No special processing is required for the audio data inside the loop.
ret, frame = cap.read()
if not ret:
    break
elapsed is set to the actual time that has passed since playback began, and play_time is set to the current playback position of the video, which can be obtained with cv2.CAP_PROP_POS_MSEC. By waiting until the elapsed time catches up with play_time, the video playback speed is adjusted to real time. For example, if the current frame's playback position is 500 ms but only 480 ms of wall-clock time has passed, cv2.waitKey sleeps for about 20 ms. No special processing is required to keep the audio playback at real time.
play_time = int(cap.get(cv2.CAP_PROP_POS_MSEC))
sleep = max(1, int(play_time - elapsed))
if cv2.waitKey(sleep) & 0xFF == ord("q"):
    break
Finally, the PyAudio side is described.
import wave

import pyaudio

class WithPyAudio:
    def __init__(self, audio_file):
        wf = wave.open(audio_file, "rb")
        p = pyaudio.PyAudio()
        self.wf = wf
        self.p = p
        print(f"{wf.getsampwidth()=},{wf.getnchannels()=},{wf.getframerate()=}")
        self.stream = p.open(
            format=p.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            output=True,
            stream_callback=self._stream_cb,
        )

    def close(self):
        self.stream.stop_stream()
        self.stream.close()
        self.p.terminate()
        self.wf.close()

    def _stream_cb(self, in_data, frame_count, time_info, status):
        data = self.wf.readframes(frame_count)
        # process audio here
        return (data, pyaudio.paContinue)
Create the wave file object wf and the PyAudio object p. When p.open is called, the callback function _stream_cb, passed as the stream_callback keyword argument, is invoked periodically to supply the next chunk of audio data.
def __init__(self, audio_file):
    wf = wave.open(audio_file, "rb")
    p = pyaudio.PyAudio()
    self.wf = wf
    self.p = p
    print(f"{wf.getsampwidth()=},{wf.getnchannels()=},{wf.getframerate()=}")
    self.stream = p.open(
        format=p.get_format_from_width(wf.getsampwidth()),
        channels=wf.getnchannels(),
        rate=wf.getframerate(),
        output=True,
        stream_callback=self._stream_cb,
    )
In _stream_cb, an audio frame is read from the wave file object wf and returned. If the audio is to be processed, the frames retrieved in this callback function are the place to do it.
def _stream_cb(self, in_data, frame_count, time_info, status):
    data = self.wf.readframes(frame_count)
    # process audio here
    return (data, pyaudio.paContinue)
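As an illustration of what "process audio here" could look like, the sketch below halves the playback volume. This assumes the WAV file contains 16-bit PCM samples (i.e. wf.getsampwidth() == 2); other sample widths would need a different dtype.

import numpy as np

def _stream_cb(self, in_data, frame_count, time_info, status):
    data = self.wf.readframes(frame_count)
    # interpret the raw bytes as 16-bit signed integer samples (assumed format)
    samples = np.frombuffer(data, dtype=np.int16)
    # example processing: halve the volume
    samples = (samples // 2).astype(np.int16)
    return (samples.tobytes(), pyaudio.paContinue)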
Source code is available at https://github.com/otamajakusi/opencv_video_with_audio
Application Examples
As an application example, consider performing object recognition on the video and speech recognition on the audio simultaneously, and highlighting the corresponding object whenever a specific sound is detected. Please refer to the video below for a more detailed explanation.
In this example, the Shogi board is recognized with YOLOv5 while the commentator's voice, such as "45 fu" (a fu is like a pawn in chess), "88 gyoku" or "79 gyoku" (a gyoku is the king in chess), is recognized with wav2vec2, and the position of the spoken coordinates is highlighted on the Shogi board.
That’s all.