Speech Recognition Model Classification and Explanation | pyVideoTrans, an open source video translation tool (pyvideotrans.com, github.com/jianchang512/pyvideotrans)

There are 14 speech recognition models, grouped into 3 categories. All of them transcribe the human speech in a video into subtitle text.

To keep the download small, the software bundles only the smallest model, "tiny", by default. This model has the lowest recognition accuracy, so download one of the larger models for better results.

Models Usable in Both OpenAI and Faster Modes

tiny, tiny.en: Smallest model, fastest speed, least resource consumption, and lowest accuracy.
base, base.en: Slightly larger than tiny.
small, small.en: Slightly larger than base.
medium, medium.en: Medium-sized model. For Chinese recognition, choose medium or a larger model.
large-v1, large-v2, large-v3: Largest models with the highest accuracy. They require 8 GB to 12 GB or more of available video memory (VRAM).

Models ending in .en are intended only for audio and video in which the spoken language is English.

Models Only Usable in Faster Mode

distil-whisper-small.en: Only for English videos.
distil-whisper-medium.en: Only for English videos.
distil-whisper-large-v2: Requires 8 GB or more of VRAM. It currently performs well on English videos but very poorly on other languages.
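The two lists above can be written down as a small lookup. This is only a sketch: the model names and the mode labels "openai" and "faster" come from this page, while the helper name supported_modes is hypothetical and not part of pyVideoTrans.

```python
# Model names copied from the two lists above.
BOTH_MODES = frozenset({
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "large-v1", "large-v2", "large-v3",
})
FASTER_ONLY = frozenset({
    "distil-whisper-small.en",
    "distil-whisper-medium.en",
    "distil-whisper-large-v2",
})


def supported_modes(model_name: str) -> set:
    """Return the recognition modes in which a model can be used.

    Hypothetical helper for illustration only.
    """
    if model_name in BOTH_MODES:
        return {"openai", "faster"}
    if model_name in FASTER_ONLY:
        return {"faster"}
    raise ValueError(f"unknown model: {model_name!r}")
```

Calling supported_modes("medium") gives both modes, while supported_modes("distil-whisper-large-v2") gives only "faster". The two sets together contain the 14 models mentioned at the top of the page.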

Category 1: Models with the .en Suffix

Examples are tiny.en, base.en, and medium.en. As the name suggests, these models are intended only for videos whose original spoken language is English. For such videos, a model with the .en suffix gives better results than the equivalent model without the suffix.

Category 2: Models without the .en Suffix

Models such as tiny and large-v1 can be used for all supported languages.

Category 3: Models Starting with distil

There are currently only three models in this category, and all of them should be used only for videos whose original spoken language is English. This applies even to distil-whisper-large-v2, which has no .en suffix: its results for videos in other languages are very poor.

The advantage of these models is speed. Note that distil models work only in "faster" mode and cannot be used in "openai" mode.

distil-whisper-small.en
distil-whisper-medium.en
distil-whisper-large-v2
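The three categories follow from the model name alone, so the rule can be sketched in a few lines. The function names category and english_only are hypothetical and used here only for illustration. The distil prefix is checked first because distil-whisper-small.en and distil-whisper-medium.en also end in .en but are listed under Category 3.

```python
def category(model_name: str) -> int:
    """Return the category number used on this page (1, 2, or 3).

    Hypothetical helper; it only inspects the model name.
    """
    if model_name.startswith("distil"):
        return 3  # Category 3: models starting with distil
    if model_name.endswith(".en"):
        return 1  # Category 1: models with the .en suffix
    return 2      # Category 2: models without the .en suffix


def english_only(model_name: str) -> bool:
    """True if the model should only be used for English speech."""
    return category(model_name) in (1, 3)
```

With this rule, english_only("medium") is False, so medium remains a candidate for Chinese or any other supported language, while every .en and distil model is flagged as English-only.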

Faster Model Download

All models are downloaded from this address: https://github.com/jianchang512/stt/releases/tag/0.0

After opening the page, choose a download according to the mode you want to use. The faster models are recommended because they run faster.

The downloaded package for a faster model contains a folder. Copy that folder into the "models" folder in the software directory.

For example, after downloading the "medium" model, you will see a folder inside the package. Copy this folder to the "models" directory.
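The copy step can also be scripted. The following is a minimal sketch that assumes the package has already been unpacked; the function name install_faster_model and both path arguments are hypothetical, so substitute the folder you actually extracted and your own software directory.

```python
import shutil
from pathlib import Path


def install_faster_model(extracted_folder, models_dir):
    """Copy an unpacked faster-model folder into the "models" directory.

    Hypothetical helper: both arguments are placeholder paths.
    Returns the path of the copied folder.
    """
    src = Path(extracted_folder)
    if not src.is_dir():
        raise NotADirectoryError(f"not a folder: {src}")
    # Keep the folder name unchanged, as the instructions above describe.
    dest = Path(models_dir) / src.name
    # dirs_exist_ok (Python 3.8+) lets an interrupted copy be repeated.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

The folder is copied as a whole rather than file by file, so the result inside "models" has the same layout as the folder in the package.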

OpenAI Model Download

OpenAI models are downloaded from the same address: https://github.com/jianchang512/stt/releases/tag/0.0

Scroll down the page and download the .pt file for the model you want. Copy this file directly into the "models" directory.
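After copying, you can check what the "models" directory contains. The sketch below relies only on the layout described on this page: folders are treated as faster models and .pt files as OpenAI models. The function name list_installed_models is hypothetical, and the exact folder and file names in the release are not assumed.

```python
from pathlib import Path


def list_installed_models(models_dir):
    """Group the contents of the "models" directory by mode.

    Hypothetical helper: folders count as faster models, .pt files as
    OpenAI models; anything else is ignored.
    """
    found = {"faster": [], "openai": []}
    root = Path(models_dir)
    if not root.is_dir():
        return found
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            found["faster"].append(entry.name)
        elif entry.suffix == ".pt":
            # Drop only the final ".pt", so "medium.en.pt" stays "medium.en".
            found["openai"].append(entry.stem)
    return found
```

If a model you downloaded does not appear in the result, it was most likely copied to the wrong location, for example one folder level too deep.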
