How to handle Video and Audio with OpenCV


When processing images with OpenCV, we sometimes want to handle audio at the same time. OpenCV itself cannot play audio, so the audio has to be processed with a separate Python library.

This article explains points to keep in mind when using OpenCV to process audio while displaying video.

Video frame and Audio frame

The main function is shown below.

import time

import cv2


def main(video_file, audio_file):
    print(f"{video_file=},{audio_file=}")
    cap = cv2.VideoCapture(video_file)

    player = WithPyAudio(audio_file)
    start_time = time.time()

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        cv2.imshow(video_file, frame)

        elapsed = (time.time() - start_time) * 1000  # msec
        play_time = int(cap.get(cv2.CAP_PROP_POS_MSEC))
        sleep = max(1, int(play_time - elapsed))
        if cv2.waitKey(sleep) & 0xFF == ord("q"):
            break

    player.close()
    cap.release()
    cv2.destroyAllWindows()

The video file (video_file) and the audio file (audio_file) are passed to the main function separately. A normal video file contains the audio as well, so it must be split out beforehand. To extract the audio from the video, use ffmpeg and run the following command:

ffmpeg -i test.mp4 -vn -f wav test.wav
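If the extraction step is scripted from Python, the same command can be assembled and run with subprocess. This is a minimal sketch; build_ffmpeg_cmd and extract_audio are hypothetical helper names, not part of the original repository:

```python
import subprocess


def build_ffmpeg_cmd(video_path: str, wav_path: str) -> list:
    """Build the ffmpeg command that drops the video stream (-vn) and writes a WAV file."""
    return ["ffmpeg", "-i", video_path, "-vn", "-f", "wav", wav_path]


def extract_audio(video_path: str, wav_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the extraction fails."""
    subprocess.run(build_ffmpeg_cmd(video_path, wav_path), check=True)


print(build_ffmpeg_cmd("test.mp4", "test.wav"))
# ['ffmpeg', '-i', 'test.mp4', '-vn', '-f', 'wav', 'test.wav']
```

This requires ffmpeg to be installed and on the PATH, just like the shell command above.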

Video frames are obtained with cap.read(). No special handling of the audio data is needed inside the loop.

        ret, frame = cap.read()
        if not ret:
            break

elapsed holds the actual wall-clock time since playback started, and play_time holds the current playback position of the video, obtained with cv2.CAP_PROP_POS_MSEC. By waiting until elapsed catches up with play_time, the video is displayed at real-time speed. No special processing is required to keep the audio at real-time speed, because pyaudio plays it back at the file's own sample rate.

        play_time = int(cap.get(cv2.CAP_PROP_POS_MSEC))
        sleep = max(1, int(play_time - elapsed))
        if cv2.waitKey(sleep) & 0xFF == ord("q"):
            break
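To make the delay calculation concrete, it can be factored into a small pure function (compute_wait_ms is a hypothetical name used here for illustration, not part of the original code):

```python
def compute_wait_ms(play_time_ms: float, elapsed_ms: float) -> int:
    """waitKey delay in msec: at least 1, larger when the video frame is ahead of real time."""
    return max(1, int(play_time_ms - elapsed_ms))


# The next frame's timestamp is 200 ms but only 180 ms of wall-clock time
# have passed, so wait 20 ms before showing it.
print(compute_wait_ms(200, 180))  # 20

# Decoding fell behind (elapsed exceeds the frame timestamp), so wait the
# minimum 1 ms; waitKey still needs to run to service the GUI event loop.
print(compute_wait_ms(200, 230))  # 1
```

The 1 ms floor matters because cv2.waitKey(0) would block forever, and the call is also what keeps the imshow window responsive.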

Finally, the pyaudio playback class is described.

import wave

import pyaudio


class WithPyAudio:
    def __init__(self, audio_file):
        super().__init__()
        wf = wave.open(audio_file, "rb")
        p = pyaudio.PyAudio()

        self.wf = wf
        self.p = p
        print(f"{wf.getsampwidth()=},{wf.getnchannels()=},{wf.getframerate()=}")

        self.stream = p.open(
            format=p.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            output=True,
            stream_callback=self._stream_cb,
        )

    def close(self):
        self.stream.stop_stream()
        self.stream.close()
        self.p.terminate()
        self.wf.close()

    def _stream_cb(self, in_data, frame_count, time_info, status):
        data = self.wf.readframes(frame_count)
        # process audio here
        return (data, pyaudio.paContinue)

A wave file object wf and a PyAudio object p are created. When p.open is called, the callback function _stream_cb, passed as the stream_callback keyword argument, is invoked periodically (on a separate thread) whenever the stream needs more audio data.

    def __init__(self, audio_file):
        super().__init__()
        wf = wave.open(audio_file, "rb")
        p = pyaudio.PyAudio()

        self.wf = wf
        self.p = p
        print(f"{wf.getsampwidth()=},{wf.getnchannels()=},{wf.getframerate()=}")

        self.stream = p.open(
            format=p.get_format_from_width(wf.getsampwidth()),
            channels=wf.getnchannels(),
            rate=wf.getframerate(),
            output=True,
            stream_callback=self._stream_cb,
        )

In _stream_cb, an audio frame is read from the wave file object wf and returned. To process the audio, operate on the frames retrieved in this callback function before returning them.

    def _stream_cb(self, in_data, frame_count, time_info, status):
        data = self.wf.readframes(frame_count)
        # process audio here
        return (data, pyaudio.paContinue)
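As an illustration of what could replace the "# process audio here" comment, here is a sketch that halves the playback volume of 16-bit PCM samples. The halve_volume helper is my addition for this example, not part of the original repository, and it assumes the WAV file contains 16-bit samples in native (typically little-endian) byte order:

```python
import array


def halve_volume(data: bytes) -> bytes:
    """Halve the amplitude of a buffer of signed 16-bit PCM samples."""
    samples = array.array("h", data)  # "h" = signed 16-bit, native byte order
    for i in range(len(samples)):
        samples[i] //= 2
    return samples.tobytes()


# Inside _stream_cb this would be used as:
#     data = self.wf.readframes(frame_count)
#     data = halve_volume(data)
#     return (data, pyaudio.paContinue)
```

Because the callback runs on pyaudio's own thread, any processing done here must finish faster than the buffer duration, or playback will stutter.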

Source code is available at https://github.com/otamajakusi/opencv_video_with_audio


Application Examples

As an application example, consider performing object recognition on the video and speech recognition on the audio simultaneously, and highlighting the corresponding object when a specific sound is detected. Please refer to the video below for a more detailed explanation.

In this example, YOLOv5 recognizes the Shogi board while the commentator's speech, such as "45 fu" (a fu is like a pawn in chess), "88 gyoku" (a gyoku is the king), or "79 gyoku", is recognized, and the position of the spoken coordinates is highlighted on the Shogi board. Speech recognition is done using wav2vec2.

That’s all.

Reference

GitHub - otamajakusi/opencv_video_with_audio: opencv video with audio play
[This is also a Shogi AI] Speech-recognizing Shogi commentary and displaying it on the board
When watching Shogi, the board squares the commentator mentions (44 fu, 15 fu, 84 hisha, ...) are hard to spot at a glance, so this is a system that automatically displays the commented positions using speech recognition and object recognition. Object recognition uses YOLOv5 and speech recognition uses wav2vec2; Japanese is recognized (decoded) as alphabet characters.
Audio and video synchronization with OpenCV and ffpyplayer