Converting video to text on a Mac M1 with whisper and ffmpeg

2023-05-29 tech mac whisper ffmpeg

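At work and in daily life, we are often asked to watch some video and write up our takeaways.

I watch Bilibili a lot these days, and there are already plugins that summarize videos for you. One example is Glarity Summary, which I noted a few days ago in "Some resources on ChatGPT": it uses ChatGPT to generate summaries for Google searches, YouTube videos, and all kinds of web pages.

Because work and personal matters are involved, some videos cannot be uploaded to an online service for summarizing, so the summary has to be produced locally. My current approach is to convert the video to audio and then the audio to text. With a purpose-built prompt, I hand the desensitized text to GPT to produce the output I want. It is not a perfect solution, but it solves the problem for now. This post briefly records the process.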

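This approach still has plenty of room for improvement. For example, whisper runs on the CPU here; I have not been able to resolve the errors I get when running it on the GPU, and will look into it when I have time.

I suspect my Python version is the cause: the environment from "Notes on running conda and jupyter notebook on a Mac M1" uses Python 3.8, which may not work.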

1. ffmpeg

Install ffmpeg:

brew install ffmpeg

Convert the video to audio:

ffmpeg -i "input.mp4" -vn -acodec libmp3lame output.mp3
ffmpeg -i "input.mov" -vn -acodec libmp3lame output.mp3

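-i input.mov specifies the input file path and name.
-vn tells FFmpeg to leave out the video stream and process only the audio stream.
-acodec libmp3lame selects libmp3lame as the audio codec, which encodes the audio stream as MP3.
output.mp3 specifies the output MP3 file path and name.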
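If the conversion needs to be scripted, the same command can be assembled from Python. This is only a minimal sketch: the function names `build_extract_audio_cmd` and `extract_audio` are hypothetical helpers made up for illustration, and it assumes `ffmpeg` is on the PATH.

```python
import subprocess


def build_extract_audio_cmd(src, dst):
    """Build the ffmpeg argument list used above (hypothetical helper)."""
    # -vn drops the video stream; -acodec libmp3lame encodes the audio as MP3.
    return ["ffmpeg", "-i", src, "-vn", "-acodec", "libmp3lame", dst]


def extract_audio(src, dst):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_audio_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Only prints the command; call extract_audio(...) to actually run it.
    print(" ".join(build_extract_audio_cmd("input.mp4", "output.mp3")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.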


Split the audio into 30-second segments:

ffmpeg -i output.mp3 -f segment -segment_time 30 -c copy output_%03d.mp3
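To get a rough idea of which files the segment command will produce, the names can be estimated from the audio duration. The sketch below is an approximation under stated assumptions: `segment_names` is a hypothetical helper, and because `-c copy` cuts on frame boundaries, the real number of files may differ slightly from this estimate.

```python
import math


def segment_names(duration_seconds, segment_seconds=30, template="output_{:03d}.mp3"):
    """Estimate the file names produced by the segment command (hypothetical helper)."""
    # One file per started segment, numbered from 000 like output_%03d.mp3.
    count = math.ceil(duration_seconds / segment_seconds)
    return [template.format(i) for i in range(count)]


if __name__ == "__main__":
    # A made-up 61-second file would be split into three segments.
    print(segment_names(61))
```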


While we are at it, one more ffmpeg command, converting mp3 to ogg:

ffmpeg -i zzzpv.mp3 -c:a libvorbis -q:a 10 -map_metadata 0 -id3v2_version 3 -write_id3v1 1 zzzpv.ogg

2. whisper

https://github.com/openai/whisper

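whisper is a multitask speech recognition model that can perform multilingual speech recognition, speech translation, and language identification. It is a Transformer sequence-to-sequence model trained on a large, diverse audio dataset, and is compatible with Python 3.8-3.11 and recent PyTorch versions. It comes in five model sizes, including English-only versions, and its performance varies by language. It can be used from the command line or from Python, and its code and model weights are released under the MIT license.

whisper offers several models. On my Mac M1 (CPU mode), the small model is fast and the medium model is very slow.

Activate the virtual environment and install whisper: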

conda activate ~/Workspace/pytorch-test/env
pip install --upgrade git+https://github.com/openai/whisper.git

Write a hello world to try it out:

import whisper
# model = whisper.load_model("small")
model = whisper.load_model("medium")
audio = whisper.load_audio("/Users/kelu/Desktop/output_000.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to("cpu")

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
# options = whisper.DecodingOptions(fp16 = False, prompt="以下是普通话的句子")  # add a prompt to steer the output toward Simplified Chinese
options = whisper.DecodingOptions(fp16 = False)
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
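The `whisper.pad_or_trim` call is the reason the audio is cut into 30-second pieces: it fits the input to a fixed 30-second window, padding short clips with silence and cutting off anything longer. The sketch below imitates that behavior on a plain Python list. It is an illustration rather than whisper's own code; `pad_or_trim_list` is a hypothetical name, and the constants are assumed to mirror whisper's 16 kHz sample rate and 30-second chunk length.

```python
SAMPLE_RATE = 16000      # samples per second (assumed to match whisper)
CHUNK_SECONDS = 30       # length of one window in seconds
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Return exactly `length` samples: trim the tail or pad with zeros."""
    if len(samples) >= length:
        return samples[:length]          # anything past the window is dropped
    return samples + [0.0] * (length - len(samples))  # silence padding


if __name__ == "__main__":
    short_clip = [0.1] * 10
    long_clip = [0.1] * (N_SAMPLES + 20000)
    print(len(pad_or_trim_list(short_clip)), len(pad_or_trim_list(long_clip)))
```

In other words, passing a clip longer than 30 seconds through this decoding path would silently discard the remainder, so the audio needs to be segmented first.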

Downloading the model takes some time.

I am also keeping my two audio files here as a backup: output_1.mp3, output_2.mp3.

Wrap it in a loop:

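import whisper

# fp16=False because the model runs on the CPU; the prompt steers the output toward Simplified Chinese
options = whisper.DecodingOptions(fp16=False, prompt="以下是普通话的句子")
model = whisper.load_model("medium")

# 361 is the number of 30-second segments produced for this particular video
for i in range(361):
    file_name = f"output_{i:03d}.mp3"
    audio = whisper.load_audio("/Users/kelu/Desktop/" + file_name)
    audio = whisper.pad_or_trim(audio)  # fit the clip to whisper's 30-second window
    mel = whisper.log_mel_spectrogram(audio).to("cpu")
    result = whisper.decode(model, mel, options)
    print(result.text)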
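A possible variation on the loop, sketched here under assumptions, discovers the segment files instead of hard-coding their count and uses whisper's higher-level `model.transcribe` call. The helper names `find_segments` and `transcribe_segments` are hypothetical, and the second function has not been tested against the setup in this post.

```python
from pathlib import Path


def find_segments(directory, pattern="output_*.mp3"):
    """Return segment files in name order, so no hard-coded count is needed."""
    # Zero-padded numbers (output_000, output_001, ...) sort correctly as text.
    return sorted(Path(directory).glob(pattern))


def transcribe_segments(directory, model_name="medium"):
    """Transcribe every segment and join the text (needs whisper installed)."""
    import whisper  # imported lazily so find_segments works without whisper

    model = whisper.load_model(model_name)
    texts = []
    for path in find_segments(directory):
        # fp16=False for CPU; initial_prompt plays the role of the prompt above.
        result = model.transcribe(str(path), fp16=False, initial_prompt="以下是普通话的句子")
        texts.append(result["text"])
    return "\n".join(texts)


if __name__ == "__main__":
    # Lists the segments found in the current directory without loading a model.
    for segment in find_segments("."):
        print(segment.name)
```

Collecting the text into one string makes it easier to paste the whole transcript into a prompt afterwards.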

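3. Unresolved issues

I tried to run whisper on mps as described in the article in reference 1, but did not succeed. I have also seen a lot of discussion about this online. I will follow up when I have the energy.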
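As a starting point for that follow-up, the sketch below checks whether PyTorch reports the mps backend as available and falls back to the CPU otherwise. It is not a fix for the errors mentioned above: `pick_device` is a hypothetical helper, and even when it returns "mps", whisper may still fail on that device.

```python
def pick_device():
    """Return "mps" if PyTorch reports it as available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, nothing to probe
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Selected device: {pick_device()}")
```

The result could then be passed along as `whisper.load_model("small", device=pick_device())`, with the mel spectrogram moved to `model.device` instead of a hard-coded "cpu".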