Thanks to rapid advancements in AI technology, what was once a highly challenging task – video translation – has now become much more accessible, though the results may not yet be perfect.
Compared to text translation, video translation is more intricate, but at its heart, it's still rooted in text-based translation. (While technologies capable of directly converting sound from one language to another exist, they are currently not mature enough and have limited practical applications.)
The general workflow for video translation can be broken down into several stages:
- Speech Recognition: Extract human voice from the video and convert it into text.
- Text Translation: Translate the extracted text into the target language.
- Voice Synthesis: Generate target language speech based on the translated text.
- Synchronization Adjustment: Ensure the dubbed audio and subtitle files are synchronized with the video content.
- Embedding and Output: Embed the translated subtitles and dubbed audio into the video, generating a new video file.
Detailed Discussion of Each Stage:
Speech Recognition
The goal of this step is to accurately convert the spoken content within the video into text, complete with timestamps. Currently, there are multiple methods to achieve this, including using OpenAI's Whisper model, Alibaba's FunASR series models, or directly calling online speech recognition APIs like Baidu Speech Recognition.
When selecting a model, you can choose from various sizes, ranging from tiny up to large-v3, depending on your requirements. Generally, larger models offer higher recognition accuracy at the cost of slower inference and greater memory use.
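As a concrete illustration, here is a minimal recognition sketch using the open-source whisper package (pip install openai-whisper); the model name and result fields follow that library's documented API, and the input filename is a placeholder:

```python
# Minimal speech-recognition sketch with OpenAI's open-source Whisper.
# Requires: pip install openai-whisper (plus ffmpeg on the PATH).
import whisper

model = whisper.load_model("large-v3")           # or "tiny", "base", "small", "medium"
result = model.transcribe("original_video.mp4")  # placeholder input file

# Each segment carries start/end timestamps in seconds plus the recognized
# text, which maps directly onto the entries of an SRT subtitle file.
for seg in result["segments"]:
    print(f"{seg['start']:8.2f} --> {seg['end']:8.2f}  {seg['text'].strip()}")
```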
Text Translation
Once the text is obtained, translation can begin. It's crucial to note that subtitle translation differs from general text translation, primarily because subtitle translation requires careful consideration of timestamp matching.
When using traditional translation engines (e.g., Baidu Translate, Tencent Translate), only the subtitle text lines should be transmitted for translation. Avoid including line numbers or timestamp lines to prevent exceeding character limits or altering the subtitle format.
Ideally, the translated subtitles should maintain the same number of lines as the original, without any blank lines.
However, different translation engines, especially AI-powered ones, may merge lines based on context. This is particularly common when a line contains only a few isolated characters or one or two words and reads as a continuation of the preceding sentence; such a line is very likely to be merged into the previous one.
While this often results in a smoother and more elegant translation, it can also lead to subtitles not strictly matching the original line count and the appearance of blank lines.
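One way to enforce that line-count contract is to keep the index and timestamp lines at home and send only the dialogue text, then verify the count on the way back. A rough sketch, with translate_batch as a hypothetical stand-in for whatever engine is actually called:

```python
# Sketch: translate only the dialogue lines of an SRT file, keeping the
# index/timestamp lines local so the engine cannot disturb the format.
import re

def translate_batch(texts):
    """Hypothetical stand-in for a real engine (Baidu, Tencent, an LLM API...).
    Echoes its input so the sketch runs end to end."""
    return ["[translated] " + t for t in texts]

def split_srt(srt_text):
    """Split an SRT file into (header_lines, dialogue) pairs, where the
    header holds the index and timestamp lines."""
    blocks = re.split(r"\n\s*\n", srt_text.strip())
    return [(b.splitlines()[:2], "\n".join(b.splitlines()[2:])) for b in blocks]

pairs = split_srt(open("subtitles.srt", encoding="utf-8").read())
translated = translate_batch([text for _, text in pairs])

# If the engine merged or dropped lines, the counts no longer match and the
# subtitles need re-splitting before they can be reassembled.
assert len(translated) == len(pairs), "engine changed the line count"

rebuilt = "\n\n".join("\n".join(header + [text])
                      for (header, _), text in zip(pairs, translated))
open("subtitles_translated.srt", "w", encoding="utf-8").write(rebuilt + "\n")
```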
Voice Synthesis
After translation is complete, voiceovers can be generated based on the translated subtitles.
Currently, EdgeTTS (the online text-to-speech service behind Microsoft Edge) offers a free and virtually unlimited channel for voice synthesis. By sending the subtitle text line by line to EdgeTTS, you obtain one voiceover clip per line; the clips are then merged into a complete audio track.
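A minimal sketch of this per-line approach, using the edge-tts package (pip install edge-tts) for synthesis and pydub for the merge; the voice name, the timings, and the total track length are illustrative assumptions:

```python
# Dub each subtitle line with edge-tts, then place every clip at its
# subtitle's start time on a silent base track with pydub.
import asyncio
import edge_tts
from pydub import AudioSegment

lines = [
    (0.0, "Hello and welcome."),                   # (start seconds, text)
    (2.5, "Today we look at video translation."),
]

async def synthesize():
    for i, (_, text) in enumerate(lines):
        tts = edge_tts.Communicate(text, voice="en-US-AriaNeural")
        await tts.save(f"line_{i:04d}.mp3")        # one clip per subtitle line

asyncio.run(synthesize())

track = AudioSegment.silent(duration=60_000)       # total video length in ms (assumed)
for i, (start, _) in enumerate(lines):
    clip = AudioSegment.from_file(f"line_{i:04d}.mp3")
    track = track.overlay(clip, position=int(start * 1000))
track.export("dubbed_audio.m4a", format="ipod")    # ffmpeg's "ipod" muxer writes m4a
```

Overlaying each clip at its subtitle's start time, rather than simply concatenating the clips, keeps the dub anchored to the original timeline.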
Synchronization Adjustment
Ensuring subtitles, audio, and video are synchronized is the biggest challenge in video translation.
Different languages inevitably take different amounts of time to pronounce the same content, which leads to synchronization issues. Strategies for resolving this include speeding up the audio playback, extending the duration of the corresponding video segments, and exploiting the silent intervals between subtitles.
If no adjustments are made and the translated content is simply embedded according to the original subtitle timestamps, it will inevitably lead to scenarios where the subtitles have disappeared but someone is still speaking, or the person in the video has long finished speaking, yet the audio continues to play.
There are two relatively simple ways to solve this problem:
- Accelerate Audio Playback: Force the audio to finish within the subtitle's designated time interval. This achieves synchronization, but the speaking speed fluctuates from line to line, which makes for a poor listening experience.
- Slow Down Video Playback: Stretch the video segment corresponding to the subtitle until its length matches the new voiceover duration. This also achieves synchronization, but the video may appear to stutter or play in slow motion.
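In ffmpeg terms, these two methods correspond to the atempo audio filter and the setpts video filter. A hedged illustration driving ffmpeg from Python; the filenames and factors are placeholders:

```python
# Speed up a dubbed clip / slow down a video segment with ffmpeg filters.
import subprocess

def speed_up_audio(src, dst, factor):
    # atempo changes tempo without changing pitch; older ffmpeg builds cap a
    # single atempo instance at 2.0, so chain instances for larger factors.
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-filter:a", f"atempo={factor}", dst], check=True)

def slow_down_video(src, dst, factor):
    # factor < 1.0 slows playback; setpts scales each timestamp by 1/factor.
    # -an drops the segment's own audio, since the dub replaces it anyway.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an",
                    "-filter:v", f"setpts={1/factor}*PTS", dst], check=True)

speed_up_audio("clip.m4a", "clip_fast.m4a", 1.25)         # 25% faster speech
slow_down_video("segment.mp4", "segment_slow.mp4", 0.8)   # 0.8x playback speed
```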
Both methods can be used simultaneously: accelerating the audio while extending the video segment. This prevents the audio from speeding up too much and the video from extending excessively.
Depending on the video's actual content, the silent interval between one subtitle and the next can also be exploited. First check whether the voiceover can finish within the subtitle's own interval plus the following silent gap; if it can, no audio acceleration is needed at all, which gives a better result. The downside is that the person on screen may visibly finish speaking while the audio is still playing.
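Putting these pieces together, one simple policy is to split the required tempo change between the audio and the video so that neither is pushed too far. The even split and the cap below are illustrative assumptions, not fixed rules:

```python
# Sketch: divide the speed-up needed for one subtitle between the audio and
# the video, splitting the ratio evenly in log space.
import math

def plan_sync(voice_dur, slot_dur, max_atempo=1.5):
    """voice_dur: length of the synthesized clip in seconds.
    slot_dur: time span the subtitle occupies in the original video.
    Returns (audio_speed, video_speed) for the atempo/setpts filters above."""
    if voice_dur <= slot_dur:
        return 1.0, 1.0                  # the clip already fits; do nothing
    ratio = voice_dur / slot_dur
    audio_speed = min(math.sqrt(ratio), max_atempo)   # cap the speech speed-up
    video_speed = slot_dur * audio_speed / voice_dur  # < 1.0 means slow down
    return audio_speed, video_speed

# A 4.0 s dub in a 2.5 s slot: audio ~1.26x faster, video at ~0.79x speed,
# so both the stretched segment and the accelerated clip land at ~3.16 s.
print(plan_sync(4.0, 2.5))
```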
Embedding and Output
After completing the above steps, the translated subtitles and voiceover can be embedded into the original video with a tool such as ffmpeg, producing the final translated video file.
```
ffmpeg -y -i original_video.mp4 -i dubbed_audio.m4a \
  -map 0:v:0 -map 1:a:0 \
  -c:v libx264 -c:a aac \
  -vf "subtitles=subtitles.srt" out.mp4
```

The -map options take the video from the original file and the audio from the dubbed track (without them, ffmpeg's default stream selection may keep the original soundtrack instead), while the subtitles filter burns the translated subtitles into the picture.
Unresolved Problem: Multiple Speaker Recognition
Speaker diarization means working out who is speaking and when, which is what would allow a different voice to be synthesized for each character in a video. Most diarization approaches require the number of speakers to be specified in advance, which is just about workable for a simple one- or two-person dialogue. For most videos, however, the speaker count cannot be determined beforehand, and the resulting synthesis is often poor. This aspect has therefore been set aside for now.
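For reference, this is roughly what a diarization pass looks like with the open-source pyannote.audio pipeline; the model name and the Hugging Face token requirement follow that project's documentation, and the speaker count is exactly the parameter that is hard to know in advance:

```python
# Sketch: who-spoke-when segmentation with pyannote.audio.
# Requires: pip install pyannote.audio, plus a Hugging Face access token.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",   # placeholder
)

# num_speakers must be guessed up front, which is the sticking point.
diarization = pipeline("audio.wav", num_speakers=2)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```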
Summary
The above outlines only the basic principles of the process. In practice, achieving high-quality translation involves many additional considerations, such as pre-processing the various original video formats (mov/mp4/avi/mkv), separating the audio track from the silent video, isolating human voices from background noise, batching subtitle lines for faster translation and handling the merged results that come back, re-splitting subtitles when blank lines appear, generating and embedding dual-language subtitles, and more.
Through this series of steps, a video's content can be carried over into the target language end to end. Technical challenges remain along the way, but continued advancement and optimization promise further improvements in both the quality and the efficiency of video translation.