Have you ever needed a transcript from a video but found manual transcription tedious and time-consuming? In this post, I’ll show you how to automate transcription from any video source using OpenAI’s Whisper AI.
We’ll build a Python script that:
- Downloads audio from a video source (YouTube in our example)
- Transcribes it accurately with Whisper
- Saves the transcript as a text file
This approach works with any video platform that yt-dlp supports, not just YouTube!
Prerequisites
Before we start, make sure you have the following installed:
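- Python 3.9+ (older versions, including 3.7, are no longer supported by recent yt-dlp releases; check the current minimum for the versions you install)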
- FFmpeg (required for audio processing)
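If you want to confirm both prerequisites before installing anything, a small check like the one below can help. It is only a sketch: the helper name `missing_prerequisites` is made up for this post, and the default minimum Python version is an assumption you should adjust to whatever your installed packages require.

```python
import shutil
import sys


def missing_prerequisites(min_python=(3, 9), tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # min_python is an assumed default, not an official requirement
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required, found {sys.version.split()[0]}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    return problems


for problem in missing_prerequisites():
    print(f"Missing prerequisite: {problem}")
```

If nothing is printed, Python is recent enough and an `ffmpeg` executable is reachable from your shell.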
Setting Up Your Environment
First, install the required Python packages:
```bash
pip install yt-dlp openai-whisper torch
```
The Complete Solution
Here’s the full code to download and transcribe videos:
```python
import asyncio
import os
import yt_dlp
import whisper
import torch


async def download_audio(video_url):
    """Download audio from a video URL and return the file path."""
    print("Downloading audio...")

    # Extract video info first to get the title
    ydl_info_opts = {
        'quiet': True,
        'no_warnings': True,
    }

    # yt-dlp calls are blocking, so run them in a worker thread
    loop = asyncio.get_running_loop()
    video_info = await loop.run_in_executor(
        None,
        lambda: yt_dlp.YoutubeDL(ydl_info_opts).extract_info(video_url, download=False)
    )

    # Create a safe filename from the title
    safe_title = "".join([c if c.isalnum() or c in " -_" else "_" for c in video_info.get('title', 'video')])
    filename = f"{safe_title}.mp3"

    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': f'{safe_title}.%(ext)s',
        'progress_hooks': [lambda d: print(f"Download progress: {d.get('_percent_str', 'downloading...')}")],
        'overwrites': True,
    }

    # Download the video and extract audio
    await loop.run_in_executor(
        None,
        lambda: yt_dlp.YoutubeDL(ydl_opts).download([video_url])
    )

    return filename


async def transcribe_video(video_url, language="en", output_file=None):
    """Download a video and transcribe it using Whisper."""
    try:
        # Download the audio file
        audio_file = await download_audio(video_url)

        # Load Whisper model with GPU if available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")
        print("Loading Whisper model...")
        # Use the turbo model for faster transcription
        model = whisper.load_model("turbo", device=device)

        # Transcribe the audio
        print("Transcribing...")
        result = model.transcribe(
            audio_file,
            language=language,
            without_timestamps=True,
            fp16=(device == "cuda")
        )
        transcript = result["text"]

        # Write transcript to file if specified
        if output_file:
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(transcript)
            print(f"Transcript saved to {output_file}")

        return transcript
    except Exception as e:
        print(f"Error during transcription: {e}")
        raise
    finally:
        # Clean up downloaded files if needed
        # Uncomment if you want to remove the audio file after transcription
        # if 'audio_file' in locals() and os.path.exists(audio_file):
        #     os.remove(audio_file)
        #     print(f"Removed temporary file: {audio_file}")
        pass


async def main():
    url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # Replace with your video URL
    language = "en"  # Change to desired language code (e.g., "zh" for Chinese)
    output_file = "transcript.txt"

    transcript = await transcribe_video(url, language, output_file)
    print("\nTranscript:")
    print(transcript)


if __name__ == "__main__":
    asyncio.run(main())
```
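The script hard-codes the URL, language, and output file in `main()`. If you would rather pass them on the command line, one possible approach is a small `argparse` parser. This is a sketch, not part of the script above: `build_parser` and the `--language` and `--output-file` flag names are my own choices.

```python
import argparse


def build_parser():
    """Build a command-line parser mirroring the hard-coded values in main()."""
    parser = argparse.ArgumentParser(description="Download a video's audio and transcribe it.")
    parser.add_argument("url", help="URL of the video to transcribe")
    parser.add_argument("--language", default="en", help="language code, e.g. en, es, zh")
    parser.add_argument("--output-file", default="transcript.txt", help="where to save the transcript")
    return parser


# Parsing an explicit list keeps the example self-contained;
# a real script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(["https://example.com/video", "--language", "es"])
print(args.url, args.language, args.output_file)
```

Inside `main()`, you could then call `await transcribe_video(args.url, args.language, args.output_file)` in place of the hard-coded values.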
How It Works
Think of this script as a recording assistant that handles the tedious parts of transcription for you. Here’s how the process works:
1. Downloading the Audio
The download_audio function is like a smart audio recorder. It:
- Takes any video URL (not just YouTube)
- Extracts just the audio track (saving bandwidth)
- Converts it to MP3 format
- Returns the path to the audio file
The blocking yt-dlp calls run in a worker thread via run_in_executor, so the event loop stays responsive during downloads.
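The filename cleanup inside `download_audio` is easy to try on its own. The sketch below wraps the same expression from the script in a helper; the name `safe_filename` and the `extension` parameter are introduced here purely for illustration.

```python
def safe_filename(title, extension="mp3"):
    """Replace every character that is not alphanumeric, a space, '-' or '_' with '_'."""
    safe_title = "".join(c if c.isalnum() or c in " -_" else "_" for c in title)
    return f"{safe_title}.{extension}"


print(safe_filename("Intro: What/Why? (Part 1)"))
```

A useful side effect is that characters such as `/` and `%` never reach the filesystem or yt-dlp’s output template, since the cleaned title is interpolated directly into `outtmpl`.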
2. Transcription with Whisper
Next, our transcribe_video function:
- Feeds the audio to OpenAI’s Whisper model
- Uses the language you specify ("en" by default); pass language=None to let Whisper detect the language automatically
- Converts speech to text with high accuracy
- Returns the transcript and optionally saves it to a file
Whisper is like having a professional transcriber working at superhuman speed. It handles accents, background noise, and technical terminology surprisingly well.
3. Hardware Acceleration
The script automatically uses a CUDA-capable GPU if PyTorch detects one, which makes transcription much faster. If you’re transcribing long videos, this can reduce waiting time from hours to minutes.
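The device check can be isolated into a helper, which also makes it easy to see what the script will pick on your machine. This is a sketch under one assumption of mine: `pick_device` (a made-up name) imports PyTorch lazily and falls back to the CPU if it is not installed, which the original script does not do.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check still works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Half precision only pays off on the GPU, so mirror the script's fp16 flag
use_fp16 = device == "cuda"
print(f"device={device}, fp16={use_fp16}")
```

If this prints `device=cpu` on a machine with an NVIDIA card, the installed PyTorch build is probably a CPU-only one.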
Customization Options
Language Support
Whisper supports multiple languages. To transcribe in a specific language, change the language parameter:
```python
# Run these inside an async function such as main()

# For Chinese transcription
transcript = await transcribe_video(url, language="zh", output_file="chinese_transcript.txt")

# For Spanish transcription
transcript = await transcribe_video(url, language="es", output_file="spanish_transcript.txt")
```
Model Size
You can choose different Whisper models based on your needs:
- “tiny” - Fastest but less accurate
- “base” - Good balance for short clips
- “small” - Better accuracy for most uses
- “medium” - High accuracy
- “large” - Highest accuracy but slower
- “turbo” - Optimized for speed
To change the model:
```python
model = whisper.load_model("medium", device=device)
```
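If you are unsure which size fits your GPU, a rough lookup can serve as a starting point. The sketch below is an illustration only: `choose_model` is a hypothetical helper, and the memory figures are approximate values I am assuming from the Whisper README, so verify them against the version you install. I left "turbo" out because it trades accuracy for speed rather than sitting on the same size ladder.

```python
# Approximate VRAM needs in GB, ordered from smallest to largest model.
# These figures are assumptions; check the Whisper README for your version.
APPROX_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def choose_model(available_vram_gb):
    """Return the largest listed model whose approximate VRAM need fits the budget."""
    chosen = "tiny"  # fall back to the smallest model
    for name, needed in APPROX_VRAM_GB:
        if needed <= available_vram_gb:
            chosen = name
    return chosen


for budget in (1, 4, 8, 12):
    print(budget, "GB ->", choose_model(budget))
```

The result can be passed straight to `whisper.load_model(choose_model(8), device=device)`.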
Working with Local Files
If you already have a video or audio file locally, you can skip the download step:
```python
import torch
import whisper


def transcribe_local_file(file_path, language="en", output_file=None):
    # Use the GPU when available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("turbo", device=device)

    result = model.transcribe(file_path, language=language)
    transcript = result["text"]

    # Save the transcript only when an output path is given
    if output_file:
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(transcript)

    return transcript


# Usage
transcript = transcribe_local_file("my_video.mp4", "en", "transcript.txt")
```
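If you have a whole folder of recordings, the same idea extends to a loop. The sketch below is one way to do it, with several assumptions of mine: the names `find_media_files` and `transcribe_folder`, the list of file extensions, and the choice to write each transcript next to its source file. The transcription step is passed in as a callable, so the helper does not depend on Whisper itself.

```python
from pathlib import Path

# Assumed set of extensions; extend it to match your files
MEDIA_SUFFIXES = {".mp3", ".mp4", ".m4a", ".wav", ".mkv", ".webm"}


def find_media_files(folder):
    """Return media files directly inside folder, sorted by path."""
    folder = Path(folder)
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in MEDIA_SUFFIXES
    )


def transcribe_folder(folder, transcribe):
    """Call transcribe(source, destination) for each media file and return the output paths.

    transcribe is any callable taking two string paths, for example
    lambda src, dst: transcribe_local_file(src, "en", dst).
    """
    outputs = []
    for media in find_media_files(folder):
        output = media.with_suffix(".txt")  # transcript sits next to the source file
        transcribe(str(media), str(output))
        outputs.append(output)
    return outputs
```

Note that `transcribe_local_file` reloads the model on every call. For large batches, it is probably worth loading the model once and reusing it.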
Practical Applications
This transcription tool can be useful for:
- Creating subtitles for your own videos
- Researching content from lectures or talks
- Converting interviews to text for analysis
- Making video content more accessible
- Creating searchable archives of speech content
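For the subtitle use case, Whisper’s result also contains a `segments` list whose entries carry `start`, `end`, and `text` fields, and these can be turned into an SRT file. The sketch below uses hand-written sample data instead of a real transcription, and the helper names `srt_timestamp` and `segments_to_srt` are my own. For real subtitles, you would likely want to drop `without_timestamps=True` from the `transcribe` call so that the segment times are meaningful.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Hand-written sample data standing in for result["segments"]
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " See you later."},
]
print(segments_to_srt(sample))
```

Writing the returned string to a `.srt` file with UTF-8 encoding gives you something most video players can load.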
Conclusion
With just a few lines of Python code, you can harness the power of OpenAI’s Whisper to generate accurate transcripts from virtually any video source. This approach saves hours of manual transcription work while providing high-quality results.
Give it a try with your own videos or any online content you need to transcribe. The ability to quickly convert speech to text opens up new possibilities for content analysis, accessibility, and productivity.
Happy transcribing!