Effortless Video Transcription with OpenAI Whisper

Have you ever needed a transcript from a video but found manual transcription tedious and time-consuming? In this post, I’ll show you how to automate transcription from any video source that yt-dlp can download, using OpenAI’s Whisper speech recognition model.

We’ll build a Python script that:

Downloads audio from a video source (YouTube in our example)
Transcribes it accurately with Whisper
Saves the transcript as a text file

This approach works with any video platform that yt-dlp supports, not just YouTube!

Prerequisites

Before we start, make sure you have the following installed:

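A recent Python 3 release (openai-whisper requires at least Python 3.8, and current yt-dlp releases need a newer version still, so 3.7 will not work)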
FFmpeg (required for audio processing)
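
Before installing anything else, it can help to confirm that FFmpeg is actually reachable from Python. The snippet below is a minimal sketch using only the standard library; the helper names `find_ffmpeg` and `ffmpeg_status` are made up for this post and are not part of yt-dlp or Whisper.

```python
import shutil


def find_ffmpeg():
    """Return the full path to the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


def ffmpeg_status() -> str:
    """Build a short human-readable message about FFmpeg availability."""
    path = find_ffmpeg()
    if path is None:
        return "FFmpeg not found on PATH"
    return f"FFmpeg found at {path}"


if __name__ == "__main__":
    print(ffmpeg_status())
```

If the message says FFmpeg was not found, install it with your system's package manager and make sure its directory is on your PATH before running the main script.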

Setting Up Your Environment

First, install the required Python packages:

pip install yt-dlp openai-whisper torch

The Complete Solution

Here’s the full code to download and transcribe videos:

import asyncio
import os
import yt_dlp
import whisper
import torch

async def download_audio(video_url):
    """Download audio from a video URL and return the file path."""
    print("Downloading audio...")
    
    # Extract video info first to get the title
    ydl_info_opts = {
        'quiet': True,
        'no_warnings': True,
    }
    
    # Inside a coroutine, get_running_loop() is the recommended way to reach the loop
    loop = asyncio.get_running_loop()
    video_info = await loop.run_in_executor(
        None, 
        lambda: yt_dlp.YoutubeDL(ydl_info_opts).extract_info(video_url, download=False)
    )
    
    # Create a safe filename from the title
    safe_title = "".join([c if c.isalnum() or c in " -_" else "_" for c in video_info.get('title', 'video')])
    filename = f"{safe_title}.mp3"
    
    ydl_opts = {
        'format': 'bestaudio/best',
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
            'preferredquality': '192',
        }],
        'outtmpl': f'{safe_title}.%(ext)s',
        'progress_hooks': [lambda d: print(f"Download progress: {d.get('_percent_str', 'downloading...')}")],
        'overwrites': True,
    }
    
    # Download the video and extract audio
    await loop.run_in_executor(
        None, 
        lambda: yt_dlp.YoutubeDL(ydl_opts).download([video_url])
    )
    
    return filename

async def transcribe_video(video_url, language="en", output_file=None):
    """Download a video and transcribe it using Whisper."""
    try:
        # Download the audio file
        audio_file = await download_audio(video_url)
        
        # Load Whisper model with GPU if available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {device}")
        print("Loading Whisper model...")
        
        # Use the turbo model for faster transcription
        model = whisper.load_model("turbo", device=device)
        
        # Transcribe the audio
        print("Transcribing...")
        result = model.transcribe(
            audio_file, 
            language=language, 
            without_timestamps=True, 
            fp16=(device == "cuda")
        )
        
        transcript = result["text"]
        
        # Write transcript to file if specified
        if output_file:
            with open(output_file, "w", encoding="utf-8") as f:
                f.write(transcript)
            print(f"Transcript saved to {output_file}")
        
        return transcript
        
    except Exception as e:
        print(f"Error during transcription: {e}")
        raise
    finally:
        # Clean up downloaded files if needed
        # Uncomment if you want to remove the audio file after transcription
        # if 'audio_file' in locals() and os.path.exists(audio_file):
        #     os.remove(audio_file)
        #     print(f"Removed temporary file: {audio_file}")
        pass

async def main():
    url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # Replace with your video URL
    language = "en"  # Change to desired language code (e.g., "zh" for Chinese)
    output_file = "transcript.txt"
    
    transcript = await transcribe_video(url, language, output_file)
    
    print("\nTranscript:")
    print(transcript)

if __name__ == "__main__":
    asyncio.run(main())

How It Works

Think of this script as a recording assistant that handles the tedious parts of transcription for you. Here’s how the process works:

1. Audio Extraction

The download_audio function is like a smart audio recorder. It:

Takes any video URL (not just YouTube)
Extracts just the audio track (saving bandwidth)
Converts it to MP3 format
Returns the path to the audio file

This is handled asynchronously so your program doesn’t freeze during downloads.
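
To make the asynchronous part more concrete, here is a small self-contained sketch of the same idea: a blocking function is pushed onto a worker thread so the event loop stays responsive. It uses `asyncio.to_thread`, which needs Python 3.9 or newer, instead of the `run_in_executor` call in the main script. The function `blocking_download` is only a stand-in that sleeps; it is a hypothetical placeholder, not a yt-dlp API.

```python
import asyncio
import time


def blocking_download(url: str) -> str:
    """Stand-in for a blocking call such as a yt-dlp download (hypothetical)."""
    time.sleep(0.1)  # simulate network and disk work
    return f"downloaded:{url}"


async def download_in_thread(url: str) -> str:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop can keep servicing other tasks meanwhile.
    return await asyncio.to_thread(blocking_download, url)


async def download_many(urls):
    # gather runs the downloads concurrently and returns results in input order
    return await asyncio.gather(*(download_in_thread(u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(download_many(["video-a", "video-b"])))
```

The same pattern would let you download several videos at once, although transcription itself is still limited by your CPU or GPU.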

2. Transcription with Whisper

Next, our transcribe_video function:

Feeds the audio to OpenAI’s Whisper model
Uses the language you specify ("en" by default); pass language=None to let Whisper detect the language automatically
Converts speech to text with high accuracy
Returns the transcript and optionally saves it to a file

Whisper is like having a professional transcriber working at superhuman speed. It handles accents, background noise, and technical terminology surprisingly well.
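
The two settings that matter most in the `model.transcribe` call are `language` and `fp16`. As a rough sketch, they could be gathered in a small helper like the one below. The name `build_transcribe_options` is hypothetical and exists only for illustration; the keyword arguments themselves are the ones the main script already passes to Whisper.

```python
def build_transcribe_options(language=None, device="cpu"):
    """Hypothetical helper: keyword arguments for Whisper's model.transcribe()."""
    return {
        # None asks Whisper to detect the spoken language on its own
        "language": language,
        # Half precision is only enabled on CUDA; the CPU path uses fp32
        "fp16": device == "cuda",
    }


if __name__ == "__main__":
    print(build_transcribe_options())              # auto-detect on CPU
    print(build_transcribe_options("es", "cuda"))  # Spanish on a CUDA GPU
```

With a loaded model, usage would look like `result = model.transcribe(audio_file, **build_transcribe_options(None, device))`.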

3. Hardware Acceleration

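The script automatically uses a CUDA-capable GPU if PyTorch detects one and falls back to the CPU otherwise. On a GPU it also runs the model in half precision (fp16), making transcription much faster. If you’re transcribing long videos, this can reduce waiting time from hours to minutes.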
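
The device check can be pulled out into its own function. This is a minimal sketch, assuming you only care about CUDA versus CPU; the name `pick_device` is made up for this post, and the extra `ImportError` guard is an addition so the function still returns a sensible value when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check degrades gracefully
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Using device: {pick_device()}")
```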

Customization Options

Language Support

Whisper supports multiple languages. To transcribe in a specific language, change the language parameter:

# For Chinese transcription
transcript = await transcribe_video(url, language="zh", output_file="chinese_transcript.txt")

# For Spanish transcription
transcript = await transcribe_video(url, language="es", output_file="spanish_transcript.txt")

Model Size

You can choose different Whisper models based on your needs:

“tiny” - Fastest but less accurate
“base” - Good balance for short clips
“small” - Better accuracy for most uses
“medium” - High accuracy
“large” - Highest accuracy but slower
“turbo” - An optimized variant of the large model: much faster, with a small trade-off in accuracy

To change the model:

model = whisper.load_model("medium", device=device)
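
If you are unsure which model fits your GPU, a simple lookup can help. The sketch below is an illustration only: the VRAM figures are the approximate values listed in the Whisper README at the time of writing, the preference order is my own assumption, and `pick_model` is a hypothetical helper rather than anything Whisper provides.

```python
# Approximate VRAM needs in GB (rough guidance from the Whisper README, not guarantees)
APPROX_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "turbo": 6,
    "large": 10,
}

# Assumed preference order, most preferred first
PREFERENCE = ["large", "turbo", "medium", "small", "base", "tiny"]


def pick_model(available_vram_gb: float) -> str:
    """Return the most preferred model whose approximate VRAM need fits."""
    for name in PREFERENCE:
        if APPROX_VRAM_GB[name] <= available_vram_gb:
            return name
    return "tiny"  # smallest model as a last resort


if __name__ == "__main__":
    print(pick_model(8))
```

The returned name can be passed straight to `whisper.load_model`. Actual memory use varies with audio length and settings, so treat the result as a starting point.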

Working with Local Files

If you already have a video or audio file locally, you can skip the download step:

import torch  # needed for the CUDA availability check below
import whisper

def transcribe_local_file(file_path, language="en", output_file=None):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("turbo", device=device)
    
    result = model.transcribe(file_path, language=language)
    transcript = result["text"]
    
    if output_file:
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(transcript)
    
    return transcript

# Usage
transcript = transcribe_local_file("my_video.mp4", "en", "transcript.txt")
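
When working through several local files, you may not want to type an output name each time. Here is a small sketch that derives the transcript path from the media path using `pathlib`; the helper name `default_transcript_path` is hypothetical.

```python
from pathlib import Path


def default_transcript_path(media_path: str) -> str:
    """Swap the media file's final extension for .txt, keeping the directory."""
    return str(Path(media_path).with_suffix(".txt"))


if __name__ == "__main__":
    print(default_transcript_path("my_video.mp4"))
```

You could then call `transcribe_local_file(path, "en", default_transcript_path(path))` for each file.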

Practical Applications

This transcription tool can be useful for:

Creating subtitles for your own videos
Researching content from lectures or talks
Converting interviews to text for analysis
Making video content more accessible
Creating searchable archives of speech content
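
For the subtitle use case, the result returned by Whisper's `transcribe` also contains a "segments" list in which each entry has "start", "end", and "text" keys. The sketch below turns such a list into SRT text. The helper names are made up for this post, and for useful subtitle timing you would likely drop `without_timestamps=True` from the main script's `transcribe` call.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Convert Whisper-style segments (dicts with start, end, text) to SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['text'].strip()}\n")
    # Blank lines separate consecutive subtitle entries
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 5.0, "text": " Welcome to the demo."},
    ]
    print(segments_to_srt(demo))
```

Saving the output of `segments_to_srt(result["segments"])` to a file ending in .srt gives you something most video players can load.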

Conclusion

With just a few lines of Python code, you can harness the power of OpenAI’s Whisper to generate accurate transcripts from virtually any video source. This approach saves hours of manual transcription work while providing high-quality results.

Give it a try with your own videos or any online content you need to transcribe. The ability to quickly convert speech to text opens up new possibilities for content analysis, accessibility, and productivity.

Happy transcribing!