{"id":2695,"date":"2025-03-28T07:02:21","date_gmt":"2025-03-28T07:02:21","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/03\/28\/talk-to-videos\/"},"modified":"2025-03-28T07:02:21","modified_gmt":"2025-03-28T07:02:21","slug":"talk-to-videos","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/03\/28\/talk-to-videos\/","title":{"rendered":"Talk to Videos"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">Large language models (LLMs) are improving in efficiency and can now understand different data formats, opening up a wide range of applications across domains. Initially, LLMs could process only text. Image understanding was added by coupling an LLM with a separate image-encoding model. In contrast, <code>gpt-4o<\/code> was trained on both text and images, making it a natively multimodal LLM that understands both. Other modalities such as audio are integrated into modern LLMs through other AI models, e.g., OpenAI\u2019s Whisper models.<\/p>\n<p class=\"wp-block-paragraph\">LLMs are increasingly used as information processors that handle data in different formats. Integrating multiple modalities into LLMs opens up numerous applications in education, <a href=\"https:\/\/towardsdatascience.com\/tag\/business\/\" title=\"Business\">business<\/a>, and other sectors. One such application is using LLMs to process educational videos, documentaries, webinars, presentations, business meetings, lectures, and other content, and to interact with this content more naturally. The audio track of these videos contains rich information that can be used in a number of applications. 
In educational settings, it can be used for personalized learning, improving accessibility for students with special needs, creating study aids, supporting remote learning without requiring a teacher\u2019s presence to explain the content, and assessing students\u2019 knowledge of a topic. In business settings, it can be used for training new employees with onboarding videos, extracting and generating knowledge from recorded meetings and presentations, creating customized learning materials from product demonstration videos, and extracting insights from recorded industry conferences without watching hours of video, to name a few.<\/p>\n<p class=\"wp-block-paragraph\">This article discusses the development of an application to interact with videos in a natural way and create learning content from them. The application has the following features:<\/p>\n
<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Takes an input video, either through a URL or from a local path, and extracts the audio from it<\/li>\n<li class=\"wp-block-list-item\">Transcribes the audio using OpenAI\u2019s <code>gpt-4o-transcribe<\/code> model, which has demonstrated improved Word Error Rate (WER) performance over existing Whisper models across multiple established benchmarks<\/li>\n<li class=\"wp-block-list-item\">Creates a vector store of the transcript and builds a retrieval-augmented generation (RAG) pipeline to hold a conversation with the video transcript<\/li>\n<li class=\"wp-block-list-item\">Responds to users\u2019 questions in text and speech using different voices, selectable from the application\u2019s UI<\/li>\n<li class=\"wp-block-list-item\">Creates learning content such as:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">A hierarchical representation of the video content that gives users quick insight into the main concepts and supporting details<\/li>\n<li class=\"wp-block-list-item\">Quizzes that transform passive video watching into active learning by challenging users to recall and apply information presented in the video<\/li>\n<li class=\"wp-block-list-item\">Flashcards generated from the video content that support active recall and spaced repetition learning techniques<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The entire workflow of the application is shown in the following figure.<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-178.png?ssl=1\" alt=\"\" class=\"wp-image-600562\"><figcaption class=\"wp-element-caption\">Application workflow (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The whole codebase, along with detailed instructions for installation and usage, is available on <a href=\"https:\/\/github.com\/umairalipathan1980\/Talk-to-Videos\/tree\/main\">GitHub<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The structure of the GitHub repository is shown below. 
The main Streamlit application implements the GUI and calls functions from the feature and helper modules (<code>.py<\/code> files).<\/p>\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-179.png?ssl=1\" alt=\"\" class=\"wp-image-600563\"><figcaption class=\"wp-element-caption\">GitHub code structure (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In addition, you can visualize the codebase by opening the \u201c<em>codebase visualization<\/em>\u201d HTML file in a browser, which describes the structure of each module.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-180.png?ssl=1\" alt=\"\" class=\"wp-image-600564\"><figcaption class=\"wp-element-caption\">Codebase visualization (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Let\u2019s walk through the step-by-step development of this application. I will not discuss the entire code, only its major parts. The full code in the GitHub repository is commented throughout.<\/p>\n<p class=\"has-heading-6-font-size wp-block-paragraph\"><strong>Video Input and Processing<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Video input and processing logic are implemented in <code>transcriber.py<\/code>. When the application loads, it calls <code>verify_ffmpeg<\/code> to check whether FFmpeg is present in the application\u2019s root directory. 
FFmpeg is required for downloading a video (if the input is a URL) and for extracting audio from the video, which is then used to create a transcript.<\/p>\n<pre class=\"wp-block-prismatic-blocks\" datatext=\"el1742994526403\"><code class=\"language-python\">def verify_ffmpeg():\n    \"\"\"Verify that FFmpeg is available and print its location.\"\"\"\n    # Add FFmpeg to PATH\n    os.environ['PATH'] = FFMPEG_LOCATION + os.pathsep + os.environ['PATH']\n    # Check if FFmpeg binaries exist\n    ffmpeg_path = os.path.join(FFMPEG_LOCATION, 'ffmpeg.exe')\n    ffprobe_path = os.path.join(FFMPEG_LOCATION, 'ffprobe.exe')\n    if not os.path.exists(ffmpeg_path):\n        raise FileNotFoundError(f\"FFmpeg executable not found at: {ffmpeg_path}\")\n    if not os.path.exists(ffprobe_path):\n        raise FileNotFoundError(f\"FFprobe executable not found at: {ffprobe_path}\")\n    print(f\"FFmpeg found at: {ffmpeg_path}\")\n    print(f\"FFprobe found at: {ffprobe_path}\")\n    # Try to execute FFmpeg to make sure it works\n    try:\n        # Add shell=True for Windows and capture errors properly\n        result = subprocess.run([ffmpeg_path, '-version'], \n                               stdout=subprocess.PIPE, \n                               stderr=subprocess.PIPE,\n                               shell=True,  # This can help with permission issues on Windows\n                               check=False)\n        if result.returncode == 0:\n            print(f\"FFmpeg version: {result.stdout.decode().splitlines()[0]}\")\n        else:\n            error_msg = result.stderr.decode()\n            print(f\"FFmpeg error: {error_msg}\")\n            # Check for specific permission errors\n            if \"Access is denied\" in error_msg:\n                print(\"Permission error detected. 
Trying alternative approach...\")\n                # Try an alternative approach - just check file existence without execution\n                if os.path.exists(ffmpeg_path) and os.path.exists(ffprobe_path):\n                    print(\"FFmpeg files exist but execution test failed due to permissions.\")\n                    print(\"WARNING: The app may fail when trying to process videos.\")\n                    # Return paths anyway and hope for the best when actually used\n                    return ffmpeg_path, ffprobe_path\n                \n            raise RuntimeError(f\"FFmpeg execution failed: {error_msg}\")\n    except Exception as e:\n        print(f\"Error checking FFmpeg: {e}\")\n        # Fallback option if verification fails but files exist\n        if os.path.exists(ffmpeg_path) and os.path.exists(ffprobe_path):\n            print(\"WARNING: FFmpeg files exist but verification failed.\")\n            print(\"Attempting to continue anyway, but video processing may fail.\")\n            return ffmpeg_path, ffprobe_path \n        raise\n    return ffmpeg_path, ffprobe_path\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The video input is either a URL (for instance, a YouTube URL) or a local file path. The <code>process_video<\/code> function handles URL input by combining the download and transcription steps. 
The helper functions <code>get_video_info<\/code> and <code>get_video_id<\/code> use the <code>yt_dlp<\/code> package to extract metadata (title, description, duration) for a URL without downloading the video.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Wrapper that downloads the audio from a URL and then transcribes it\ndef process_video(youtube_url, output_dir, api_key, model=\"gpt-4o-transcribe\"):\n    \"\"\"\n    Process a YouTube video to generate a transcript\n    Wrapper function that combines download and transcription\n    Args:\n        youtube_url: URL of the YouTube video\n        output_dir: Directory to save the output\n        api_key: OpenAI API key\n        model: The model to use for transcription (default: gpt-4o-transcribe)\n    Returns:\n        dict: Dictionary containing transcript and file paths\n    \"\"\"\n    # First download the audio\n    print(\"Downloading video...\")\n    audio_path = process_video_download(youtube_url, output_dir)\n    \n    print(\"Transcribing video...\")\n    # Then transcribe the audio\n    transcript, transcript_path = process_video_transcribe(audio_path, output_dir, api_key, model=model)\n    \n    # Return the combined results\n    return {\n        'transcript': transcript,\n        'transcript_path': transcript_path,\n        'audio_path': audio_path\n    }\n\ndef get_video_info(youtube_url):\n    \"\"\"Get video information without downloading.\"\"\"\n    # Check local cache first\n    global _video_info_cache\n    if youtube_url in _video_info_cache:\n        return _video_info_cache[youtube_url]\n        \n    # Extract info if not cached\n    with yt_dlp.YoutubeDL() as ydl:\n        info = ydl.extract_info(youtube_url, download=False)\n        # Cache the result\n        _video_info_cache[youtube_url] = info\n        # Also cache the video ID separately\n        _video_id_cache[youtube_url] = info.get('id', 'video')\n        return info\n\ndef 
get_video_id(youtube_url):\n    \"\"\"Get just the video ID without re-extracting if already known.\"\"\"\n    global _video_id_cache\n    if youtube_url in _video_id_cache:\n        return _video_id_cache[youtube_url]\n    \n    # If not in cache, extract from URL directly if possible\n    if \"v=\" in youtube_url:\n        video_id = youtube_url.split(\"v=\")[1].split(\"&amp;\")[0]\n        _video_id_cache[youtube_url] = video_id\n        return video_id\n    elif \"youtu.be\/\" in youtube_url:\n        video_id = youtube_url.split(\"youtu.be\/\")[1].split(\"?\")[0]\n        _video_id_cache[youtube_url] = video_id\n        return video_id\n    \n    # If we can't extract directly, fall back to full info extraction\n    info = get_video_info(youtube_url)\n    video_id = info.get('id', 'video')\n    return video_id\n<\/code><\/pre>\n<p class=\"wp-block-paragraph\">After the video input is given, the code in <code>app.py<\/code> checks whether a transcript for the input video already exists (in the case of URL input). This is done by calling the following two helper functions from <code>transcriber.py<\/code>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def get_transcript_path(youtube_url, output_dir):\n    \"\"\"Get the expected transcript path for a given YouTube URL.\"\"\"\n    # Get video ID with caching\n    video_id = get_video_id(youtube_url)\n    # Return expected transcript path\n    return os.path.join(output_dir, f\"{video_id}_transcript.txt\")\n\ndef transcript_exists(youtube_url, output_dir):\n    \"\"\"Check if a transcript already exists for this video.\"\"\"\n    transcript_path = get_transcript_path(youtube_url, output_dir)\n    return os.path.exists(transcript_path)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">If <code>transcript_exists<\/code> returns <code>True<\/code>, the existing transcript is reused and the next step is to create the vector store for the RAG. 
If no existing transcript is found, the next step is to download audio from the URL and convert it to a standard audio format. The function <code>process_video_download<\/code> downloads the audio with <code>yt_dlp<\/code>, which uses FFmpeg to convert it to <code>.mp3<\/code> format. If the input is a local video file, <code>app.py<\/code> converts it to an <code>.mp3<\/code> file.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def process_video_download(youtube_url, output_dir):\n    \"\"\"\n    Download audio from a YouTube video\n    Args:\n        youtube_url: URL of the YouTube video\n        output_dir: Directory to save the output\n        \n    Returns:\n        str: Path to the downloaded audio file\n    \"\"\"\n    # Create output directory if it doesn't exist\n    os.makedirs(output_dir, exist_ok=True)\n    \n    # Extract video ID from URL\n    video_id = None\n    if \"v=\" in youtube_url:\n        video_id = youtube_url.split(\"v=\")[1].split(\"&amp;\")[0]\n    elif \"youtu.be\/\" in youtube_url:\n        video_id = youtube_url.split(\"youtu.be\/\")[1].split(\"?\")[0]\n    else:\n        raise ValueError(\"Could not extract video ID from URL\")\n    # Set output paths\n    audio_path = os.path.join(output_dir, f\"{video_id}.mp3\")\n    \n    # Configure yt-dlp options\n    ydl_opts = {\n        'format': 'bestaudio\/best',\n        'postprocessors': [{\n            'key': 'FFmpegExtractAudio',\n            'preferredcodec': 'mp3',\n            'preferredquality': '192',\n        }],\n        'outtmpl': os.path.join(output_dir, f\"{video_id}\"),\n        'quiet': True\n    }\n    \n    # Download audio\n    with yt_dlp.YoutubeDL(ydl_opts) as ydl:\n        ydl.download([youtube_url])\n    \n    # Verify audio file exists\n    if not os.path.exists(audio_path):\n        # Try with an extension that yt-dlp might have used\n        potential_paths = [\n            os.path.join(output_dir, f\"{video_id}.mp3\"),\n            
os.path.join(output_dir, f\"{video_id}.m4a\"),\n            os.path.join(output_dir, f\"{video_id}.webm\")\n        ]\n        \n        for path in potential_paths:\n            if os.path.exists(path):\n                # Convert to mp3 if it's not already\n                if not path.endswith('.mp3'):\n                    ffmpeg_path = verify_ffmpeg()[0]\n                    output_mp3 = os.path.join(output_dir, f\"{video_id}.mp3\")\n                    subprocess.run([\n                        ffmpeg_path, '-y', '-i', path, '-c:a', 'libmp3lame', \n                        '-q:a', '2', output_mp3\n                    ], check=True, capture_output=True)\n                    os.remove(path)  # Remove the original file\n                    audio_path = output_mp3\n                else:\n                    audio_path = path\n                break\n        else:\n            raise FileNotFoundError(f\"Could not find downloaded audio file for video {video_id}\")\n    return audio_path<\/code><\/pre>\n<p class=\"has-heading-6-font-size wp-block-paragraph\"><strong>Audio Transcription Using OpenAI\u2019s <code>gpt-4o-transcribe<\/code> Model<\/strong><\/p>\n<p class=\"wp-block-paragraph\">After extracting audio and converting it to a standard audio format, the next step is to transcribe the audio to text. For this purpose, I used OpenAI\u2019s newly launched <code>gpt-4o-transcribe<\/code> speech-to-text model, accessible through the <a href=\"https:\/\/platform.openai.com\/docs\/guides\/speech-to-text\">speech-to-text API<\/a>. This model has outperformed OpenAI\u2019s <em>Whisper<\/em> models in both transcription accuracy and language coverage.<\/p>\n<p class=\"wp-block-paragraph\">The function <code>process_video_transcribe<\/code> in <code>transcriber.py<\/code> receives the converted audio file and calls the <code>gpt-4o-transcribe<\/code> model through OpenAI\u2019s speech-to-text API. 
The <code>gpt-4o-transcribe<\/code> model currently has an audio file size limit of 25 MB and a duration limit of 1500 seconds. To work within these limits, I split longer files into multiple chunks and transcribe the chunks separately. The <code>process_video_transcribe<\/code> function checks whether the input file exceeds the size and\/or duration limit. If either threshold is exceeded, it calls the <code>split_and_transcribe<\/code> function, which first calculates the number of chunks needed based on size and on duration and takes the maximum of the two as the final number of chunks. It then finds the start and end times for each chunk and extracts the chunks from the audio file. Finally, it transcribes each chunk using the <code>gpt-4o-transcribe<\/code> model through OpenAI\u2019s speech-to-text API and combines the transcripts of all chunks into the final transcript.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def process_video_transcribe(audio_path, output_dir, api_key, progress_callback=None, model=\"gpt-4o-transcribe\"):\n    \"\"\"\n    Transcribe an audio file using OpenAI API, with automatic chunking for large files\n    Always uses the selected model, with no fallback\n    \n    Args:\n        audio_path: Path to the audio file\n        output_dir: Directory to save the transcript\n        api_key: OpenAI API key\n        progress_callback: Function to call with progress updates (0-100)\n        model: The model to use for transcription (default: gpt-4o-transcribe)\n        \n    Returns:\n        tuple: (transcript text, transcript path)\n    \"\"\"\n    # Extract video ID from audio path\n    video_id = os.path.basename(audio_path).split('.')[0]\n    transcript_path = os.path.join(output_dir, f\"{video_id}_transcript.txt\")\n    \n    # Setup OpenAI client\n    client = OpenAI(api_key=api_key)\n    \n    # Update progress\n    if progress_callback:\n        progress_callback(10)\n    \n    # Get file size 
in MB\n    file_size_mb = os.path.getsize(audio_path) \/ (1024 * 1024)\n    \n    # Universal chunking thresholds - apply to both models\n    max_size_mb = 25  # 25MB chunk size for both models\n    max_duration_seconds = 1500  # 1500 seconds chunk duration for both models\n    \n    # Load the audio file to get its duration\n    try:\n        audio = AudioSegment.from_file(audio_path)\n        duration_seconds = len(audio) \/ 1000  # pydub uses milliseconds\n    except Exception as e:\n        print(f\"Error loading audio to check duration: {e}\")\n        audio = None\n        duration_seconds = 0\n    \n    # Determine if chunking is needed\n    needs_chunking = False\n    chunking_reason = []\n    \n    if file_size_mb &gt; max_size_mb:\n        needs_chunking = True\n        chunking_reason.append(f\"size ({file_size_mb:.2f}MB exceeds {max_size_mb}MB)\")\n    \n    if duration_seconds &gt; max_duration_seconds:\n        needs_chunking = True\n        chunking_reason.append(f\"duration ({duration_seconds:.2f}s exceeds {max_duration_seconds}s)\")\n    \n    # Log the decision\n    if needs_chunking:\n        reason_str = \" and \".join(chunking_reason)\n        print(f\"Audio needs chunking due to {reason_str}. Using {model} for transcription.\")\n    else:\n        print(f\"Audio file is within limits. 
Using {model} for direct transcription.\")\n    \n    # Check if file needs chunking\n    if needs_chunking:\n        if progress_callback:\n            progress_callback(15)\n        \n        # Split the audio file into chunks and transcribe each chunk using the selected model only\n        full_transcript = split_and_transcribe(\n            audio_path, client, model, progress_callback, \n            max_size_mb, max_duration_seconds, audio\n        )\n    else:\n        # File is small enough, transcribe directly with the selected model\n        with open(audio_path, \"rb\") as audio_file:\n            if progress_callback:\n                progress_callback(30)\n                \n            transcript_response = client.audio.transcriptions.create(\n                model=model, \n                file=audio_file\n            )\n            \n            if progress_callback:\n                progress_callback(80)\n            \n            full_transcript = transcript_response.text\n    \n    # Save transcript to file\n    with open(transcript_path, \"w\", encoding=\"utf-8\") as f:\n        f.write(full_transcript)\n    \n    # Update progress\n    if progress_callback:\n        progress_callback(100)\n    \n    return full_transcript, transcript_path\n\ndef split_and_transcribe(audio_path, client, model, progress_callback=None, \n                         max_size_mb=25, max_duration_seconds=1500, audio=None):\n    \"\"\"\n    Split an audio file into chunks and transcribe each chunk \n    \n    Args:\n        audio_path: Path to the audio file\n        client: OpenAI client\n        model: Model to use for transcription (will not fall back to other models)\n        progress_callback: Function to call with progress updates\n        max_size_mb: Maximum file size in MB\n        max_duration_seconds: Maximum duration in seconds\n        audio: Pre-loaded AudioSegment (optional)\n        \n    Returns:\n        str: Combined transcript from all chunks\n    
\"\"\"\n    # Load the audio file if not provided\n    if audio is None:\n        audio = AudioSegment.from_file(audio_path)\n    \n    # Get audio duration in seconds\n    duration_seconds = len(audio) \/ 1000\n    \n    # Calculate the number of chunks needed based on both size and duration\n    file_size_mb = os.path.getsize(audio_path) \/ (1024 * 1024)\n    \n    chunks_by_size = math.ceil(file_size_mb \/ (max_size_mb * 0.9))  # Use 90% of max to be safe\n    chunks_by_duration = math.ceil(duration_seconds \/ (max_duration_seconds * 0.95))  # Use 95% of max to be safe\n    num_chunks = max(chunks_by_size, chunks_by_duration)\n    \n    print(f\"Splitting audio into {num_chunks} chunks based on size ({chunks_by_size}) and duration ({chunks_by_duration})\")\n    \n    # Calculate chunk duration in milliseconds\n    chunk_length_ms = len(audio) \/\/ num_chunks\n    \n    # Create temp directory for chunks if it doesn't exist\n    temp_dir = os.path.join(os.path.dirname(audio_path), \"temp_chunks\")\n    os.makedirs(temp_dir, exist_ok=True)\n    \n    # Split the audio into chunks and transcribe each chunk\n    transcripts = []\n    \n    for i in range(num_chunks):\n        if progress_callback:\n            # Update progress: 20% for splitting, 60% for transcribing\n            progress_percent = 20 + int((i \/ num_chunks) * 60)\n            progress_callback(progress_percent)\n        \n        # Calculate start and end times for this chunk\n        start_ms = i * chunk_length_ms\n        end_ms = min((i + 1) * chunk_length_ms, len(audio))\n        \n        # Extract the chunk\n        chunk = audio[start_ms:end_ms]\n        \n        # Save the chunk to a temporary file\n        chunk_path = os.path.join(temp_dir, f\"chunk_{i}.mp3\")\n        chunk.export(chunk_path, format=\"mp3\")\n        \n        # Log chunk information\n        chunk_size_mb = os.path.getsize(chunk_path) \/ (1024 * 1024)\n        chunk_duration = len(chunk) \/ 1000\n        
print(f\"Chunk {i+1}\/{num_chunks}: {chunk_size_mb:.2f}MB, {chunk_duration:.2f}s\")\n        \n        # Transcribe the chunk \n        try:\n            with open(chunk_path, \"rb\") as chunk_file:\n                transcript_response = client.audio.transcriptions.create(\n                    model=model,\n                    file=chunk_file\n                )\n                \n                # Add to our list of transcripts\n                transcripts.append(transcript_response.text)\n        except Exception as e:\n            print(f\"Error transcribing chunk {i+1} with {model}: {e}\")\n            # Add a placeholder for the failed chunk\n            transcripts.append(f\"[Transcription failed for segment {i+1}]\")\n        \n        # Clean up the temporary chunk file\n        os.remove(chunk_path)\n    \n    # Clean up the temporary directory (os.rmdir raises OSError if it is not empty)\n    try:\n        os.rmdir(temp_dir)\n    except OSError:\n        print(f\"Note: Could not remove temporary directory {temp_dir}\")\n    \n    # Combine all transcripts with proper spacing\n    full_transcript = \" \".join(transcripts)\n    \n    return full_transcript<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The following screenshot of the Streamlit app shows the video processing and transcribing workflow for one of my webinars, <em>\u201c<\/em><a href=\"https:\/\/www.youtube.com\/watch?v=BJC-mqdRXgw\"><em>Integrating LLMs into Business<\/em><\/a><em>,\u201d<\/em> available on my YouTube channel.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"516\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-187-1024x516.png?resize=1024%2C516&#038;ssl=1\" alt=\"\" class=\"wp-image-600572\"><figcaption class=\"wp-element-caption\">Snapshot of the Streamlit app showing the process of extracting audio and transcribing (image by author)<\/figcaption><\/figure>\n<p class=\"has-heading-6-font-size 
wp-block-paragraph\"><strong>Retrieval Augmented Generation (RAG) for Interactive Conversations<\/strong><\/p>\n<p class=\"wp-block-paragraph\">After generating the video transcript, the application builds a RAG pipeline to support both text- and speech-based interactions. The conversational logic is implemented in the <code>VideoRAG<\/code> class in <code>rag_system.py<\/code>, which initializes the chunk size and overlap, OpenAI embeddings, a <code>ChatOpenAI<\/code> instance that generates responses with the <code>gpt-4o<\/code> model, and a <code>ConversationBufferMemory<\/code> that maintains chat history for contextual continuity.<\/p>\n<p class=\"wp-block-paragraph\">The <code>create_vector_store<\/code> method splits the transcript into chunks and creates a vector store using FAISS. The <code>handle_question_submission<\/code> method processes text questions and appends each new question and its answer to the conversation history. The <code>handle_speech_input<\/code> function implements the complete voice-to-text-to-voice pipeline. 
It records the question audio, transcribes it, processes the query through the RAG system, and synthesizes speech for the response.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">class VideoRAG:\n    def __init__(self, api_key=None, chunk_size=1000, chunk_overlap=200):\n        \"\"\"Initialize the RAG system with OpenAI API key.\"\"\"\n        # Use provided API key or fall back to Streamlit secrets\n        self.api_key = api_key if api_key else st.secrets[\"OPENAI_API_KEY\"]\n        if not self.api_key:\n            raise ValueError(\"OpenAI API key is required either as a parameter or in Streamlit secrets\")\n            \n        self.embeddings = OpenAIEmbeddings(openai_api_key=self.api_key)\n        self.llm = ChatOpenAI(\n            openai_api_key=self.api_key,\n            model=\"gpt-4o\",\n            temperature=0\n        )\n        self.chunk_size = chunk_size\n        self.chunk_overlap = chunk_overlap\n        self.vector_store = None\n        self.chain = None\n        self.memory = ConversationBufferMemory(\n            memory_key=\"chat_history\",\n            return_messages=True\n        )\n    \n    def create_vector_store(self, transcript):\n        \"\"\"Create a vector store from the transcript.\"\"\"\n        # Split the text into chunks\n        text_splitter = RecursiveCharacterTextSplitter(\n            chunk_size=self.chunk_size,\n            chunk_overlap=self.chunk_overlap,\n            separators=[\"\\n\\n\", \"\\n\", \" \", \"\"]\n        )\n        chunks = text_splitter.split_text(transcript)\n        \n        # Create vector store\n        self.vector_store = FAISS.from_texts(chunks, self.embeddings)\n        \n        # Create prompt template for the RAG system\n        system_template = \"\"\"You are a specialized AI assistant that answers questions about a specific video. 
\n        \n        You have access to snippets from the video transcript, and your role is to provide accurate information ONLY based on these snippets.\n        \n        Guidelines:\n        1. Only answer questions based on the information provided in the context from the video transcript, otherwise say that \"I don't know. The video doesn't cover that information.\"\n        2. The question may ask you to summarize the video or tell what the video is about. In that case, present a summary of the context. \n        3. Don't make up information or use knowledge from outside the provided context\n        4. Keep your answers concise and directly related to the question\n        5. If asked about your capabilities or identity, explain that you're an AI assistant that specializes in answering questions about this specific video\n        \n        Context from the video transcript:\n        {context}\n        \n        Chat History:\n        {chat_history}\n        \"\"\"\n        user_template = \"{question}\"\n        \n        # Create the messages for the chat prompt\n        messages = [\n            SystemMessagePromptTemplate.from_template(system_template),\n            HumanMessagePromptTemplate.from_template(user_template)\n        ]\n        \n        # Create the chat prompt\n        qa_prompt = ChatPromptTemplate.from_messages(messages)\n        \n        # Initialize the RAG chain with the custom prompt\n        self.chain = ConversationalRetrievalChain.from_llm(\n            llm=self.llm,\n            retriever=self.vector_store.as_retriever(\n                search_kwargs={\"k\": 5}\n            ),\n            memory=self.memory,\n            combine_docs_chain_kwargs={\"prompt\": qa_prompt},\n            verbose=True\n        )\n        \n        return len(chunks)\n    \n    def set_chat_history(self, chat_history):\n        \"\"\"Set chat history from external session state.\"\"\"\n        if not self.memory:\n            return\n            \n    
    # Clear existing memory\n        self.memory.clear()\n        \n        # Convert standard chat history format to LangChain message format\n        for message in chat_history:\n            if message[\"role\"] == \"user\":\n                self.memory.chat_memory.add_user_message(message[\"content\"])\n            elif message[\"role\"] == \"assistant\":\n                self.memory.chat_memory.add_ai_message(message[\"content\"])\n    \n    def ask(self, question, chat_history=None):\n        \"\"\"Ask a question to the RAG system.\"\"\"\n        if not self.chain:\n            raise ValueError(\"Vector store not initialized. Call create_vector_store first.\")\n        \n        # If chat history is provided, update the memory\n        if chat_history:\n            self.set_chat_history(chat_history)\n        \n        # Get response\n        response = self.chain.invoke({\"question\": question})\n        return response[\"answer\"]<\/code><\/pre>\n<p class=\"wp-block-paragraph\">See the following snapshot of the Streamlit app, showing the interactive conversation interface with the video.<\/p>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-182.png?ssl=1\" alt=\"\" class=\"wp-image-600567\"><figcaption class=\"wp-element-caption\">Snapshot showing conversational interface and interactive learning content (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The following snapshot shows a conversation with the video with speech input and text+speech output.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"922\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-184-922x1024.png?resize=922%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-600569\"><figcaption 
class=\"wp-element-caption\">Conversation with video (image by author)<\/figcaption><\/figure>\n<p class=\"has-heading-6-font-size wp-block-paragraph\"><strong>Feature Generation<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The application generates three features: hierarchical summary, quiz, and flashcards. Please refer to the commented code for each feature in the <a href=\"https:\/\/github.com\/umairalipathan1980\/Talk-to-Videos\/tree\/main\">GitHub repo<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">The <code>SummaryGenerator<\/code> class in <code>summary.py<\/code> summarizes the video as a hierarchy, giving users quick insight into the main concepts and supporting details. The system retrieves key contextual segments from the transcript using RAG. Using a prompt (see <code>generate_summary<\/code>), it creates a hierarchical summary with three levels: main points, sub-points, and additional details. The <code>create_summary_popup_html<\/code> method transforms the generated summary into an interactive visual representation using CSS and JavaScript.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># summary.py\nclass SummaryGenerator:\n    def __init__(self):\n        pass\n    \n    def generate_summary(self, rag_system, api_key, model=\"gpt-4o\", temperature=0.2):\n        \"\"\"\n        Generate a hierarchical bullet-point summary from the video transcript\n        \n        Args:\n            rag_system: The RAG system with vector store\n            api_key: OpenAI API key\n            model: Model to use for summary generation\n            temperature: Creativity level (0.0-1.0)\n            \n        Returns:\n            str: Hierarchical bullet-point summary text\n        \"\"\"\n        if not rag_system:\n            st.error(\"Please transcribe the video first before creating a summary!\")\n            return \"\"\n        \n        with 
st.spinner(\"Generating hierarchical summary...\"):\n            # Create LLM for summary generation\n            summary_llm = ChatOpenAI(\n                openai_api_key=api_key,\n                model=model,\n                temperature=temperature  # Lower temperature for more factual summaries\n            )\n            \n            # Define the prompt before the try block so the except fallback can always reference it\n            prompt = \"\"\"Based on the video transcript, create a hierarchical bullet-point summary of the content.\n                Structure your summary with exactly these levels:\n                \n                \u2022 Main points (use \u2022 or * at the start of the line for these top-level points)\n                  - Sub-points (use - at the start of the line for these second-level details)\n                    * Additional details (use spaces followed by * for third-level points)\n                \n                For example:\n                \u2022 First main point\n                  - Important detail about the first point\n                  - Another important detail\n                    * A specific example\n                    * Another specific example\n                \u2022 Second main point\n                  - Detail about second point\n                \n                Be consistent with the exact formatting shown above. Each bullet level must start with the exact character shown (\u2022 or *, -, and spaces+*).\n                Create 3-5 main points with 2-4 sub-points each, and add third-level details where appropriate.\n                Focus on the most important information from the video.\n                \"\"\"\n            \n            # Use the RAG system to get relevant context\n            try:\n                # Get broader context since we're summarizing the whole video\n                relevant_docs = rag_system.vector_store.similarity_search(\n                    \"summarize the main points of this video\", k=10\n                )\n                context = \"\\n\\n\".join([doc.page_content for doc in relevant_docs])\n                \n                # Use the LLM with context to generate the summary\n                messages = [\n                    {\"role\": \"system\", \"content\": f\"You are given the following context from a video transcript:\\n\\n{context}\\n\\nUse this context to create a hierarchical summary according to the instructions.\"},\n                    {\"role\": \"user\", \"content\": prompt}\n                ]\n                \n                response = summary_llm.invoke(messages)\n                return response.content\n            except Exception as e:\n                # Fallback to the regular RAG system if there's an error\n                st.warning(f\"Using standard summary generation due to error: {str(e)}\")\n                return rag_system.ask(prompt)\n    \n    def create_summary_popup_html(self, summary_content):\n        \"\"\"\n        Create HTML for the summary popup with properly formatted hierarchical bullets\n        \n        Args:\n            summary_content: Raw summary text with markdown bullet formatting\n            \n        Returns:\n            str: HTML for the popup with properly formatted bullets\n        \"\"\"\n        # Instead of relying on markdown conversion, let's manually parse and format the bullet points\n        lines = summary_content.strip().split('\\n')\n        formatted_html = []\n        \n        in_list = False\n        list_level = 0\n        \n        for line in lines:\n            line = line.strip()\n            \n            # Skip empty lines\n            if not line:\n                continue\n                \n            # Detect if this is a markdown header\n            if line.startswith('# '):\n      
          if in_list:\n                    # Close any open lists\n                    for _ in range(list_level):\n                        formatted_html.append('&lt;\/ul&gt;')\n                    in_list = False\n                    list_level = 0\n                formatted_html.append(f'&lt;h1&gt;{line[2:]}&lt;\/h1&gt;')\n                continue\n                \n            # Check line for bullet point markers\n            if line.startswith('\u2022 ') or line.startswith('* '):\n                # Top level bullet\n                content = line[2:].strip()\n                \n                if not in_list:\n                    # Start a new list\n                    formatted_html.append('&lt;ul class=\"top-level\"&gt;')\n                    in_list = True\n                    list_level = 1\n                elif list_level &gt; 1:\n                    # Close nested lists to get back to top level\n                    for _ in range(list_level - 1):\n                        formatted_html.append('&lt;\/ul&gt;&lt;\/li&gt;')\n                    list_level = 1\n                else:\n                    # Close previous list item if needed\n                    if formatted_html and not formatted_html[-1].endswith('&lt;\/ul&gt;&lt;\/li&gt;') and in_list:\n                        formatted_html.append('&lt;\/li&gt;')\n                        \n                formatted_html.append(f'&lt;li class=\"top-level-item\"&gt;{content}')\n                \n            elif line.startswith('- '):\n                # Second level bullet\n                content = line[2:].strip()\n                \n                if not in_list:\n                    # Start new lists\n                    formatted_html.append('&lt;ul class=\"top-level\"&gt;&lt;li class=\"top-level-item\"&gt;Second level items')\n                    formatted_html.append('&lt;ul class=\"second-level\"&gt;')\n                    in_list = True\n                    list_level = 2\n                elif 
list_level == 1:\n                    # Add a nested list\n                    formatted_html.append('&lt;ul class=\"second-level\"&gt;')\n                    list_level = 2\n                elif list_level &gt; 2:\n                    # Close deeper nested lists to get to second level\n                    for _ in range(list_level - 2):\n                        formatted_html.append('&lt;\/ul&gt;&lt;\/li&gt;')\n                    list_level = 2\n                else:\n                    # Close previous list item if needed\n                    if formatted_html and not formatted_html[-1].endswith('&lt;\/ul&gt;&lt;\/li&gt;') and list_level == 2:\n                        formatted_html.append('&lt;\/li&gt;')\n                        \n                formatted_html.append(f'&lt;li class=\"second-level-item\"&gt;{content}')\n                \n            elif line.startswith('  * ') or line.startswith('    * '):\n                # Third level bullet\n                content = line.strip()[2:].strip()\n                \n                if not in_list:\n                    # Start new lists (all levels)\n                    formatted_html.append('&lt;ul class=\"top-level\"&gt;&lt;li class=\"top-level-item\"&gt;Top level')\n                    formatted_html.append('&lt;ul class=\"second-level\"&gt;&lt;li class=\"second-level-item\"&gt;Second level')\n                    formatted_html.append('&lt;ul class=\"third-level\"&gt;')\n                    in_list = True\n                    list_level = 3\n                elif list_level == 2:\n                    # Add a nested list\n                    formatted_html.append('&lt;ul class=\"third-level\"&gt;')\n                    list_level = 3\n                elif list_level &lt; 3:\n                    # We missed a level, adjust\n                    formatted_html.append('&lt;li&gt;Missing level&lt;\/li&gt;')\n                    formatted_html.append('&lt;ul class=\"third-level\"&gt;')\n                    list_level 
= 3\n                else:\n                    # Close previous list item if needed\n                    if formatted_html and not formatted_html[-1].endswith('&lt;\/ul&gt;&lt;\/li&gt;') and list_level == 3:\n                        formatted_html.append('&lt;\/li&gt;')\n                        \n                formatted_html.append(f'&lt;li class=\"third-level-item\"&gt;{content}')\n            else:\n                # Regular paragraph\n                if in_list:\n                    # Close any open lists\n                    for _ in range(list_level):\n                        formatted_html.append('&lt;\/ul&gt;')\n                        if list_level &gt; 1:\n                            formatted_html.append('&lt;\/li&gt;')\n                    in_list = False\n                    list_level = 0\n                formatted_html.append(f'&lt;p&gt;{line}&lt;\/p&gt;')\n        \n        # Close any open lists\n        if in_list:\n            # Close final item\n            formatted_html.append('&lt;\/li&gt;')\n            # Close any open lists\n            for _ in range(list_level):\n                if list_level &gt; 1:\n                    formatted_html.append('&lt;\/ul&gt;&lt;\/li&gt;')\n                else:\n                    formatted_html.append('&lt;\/ul&gt;')\n        \n        summary_html = '\\n'.join(formatted_html)\n        \n        html = f\"\"\"\n        &lt;div id=\"summary-popup\" class=\"popup-overlay\"&gt;\n            &lt;div class=\"popup-content\"&gt;\n                &lt;div class=\"popup-header\"&gt;\n                    &lt;h2&gt;Hierarchical Summary&lt;\/h2&gt;\n                    &lt;button onclick=\"closeSummaryPopup()\" class=\"close-button\"&gt;\u00d7&lt;\/button&gt;\n                &lt;\/div&gt;\n                &lt;div class=\"popup-body\"&gt;\n                    {summary_html}\n                &lt;\/div&gt;\n            &lt;\/div&gt;\n        &lt;\/div&gt;\n        \n        &lt;style&gt;\n        .popup-overlay {{\n   
         position: fixed;\n            top: 0;\n            left: 0;\n            width: 100%;\n            height: 100%;\n            background-color: rgba(0, 0, 0, 0.5);\n            z-index: 1000;\n            display: flex;\n            justify-content: center;\n            align-items: center;\n        }}\n        \n        .popup-content {{\n            background-color: white;\n            padding: 20px;\n            border-radius: 10px;\n            width: 80%;\n            max-width: 800px;\n            max-height: 80vh;\n            overflow-y: auto;\n            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);\n        }}\n        \n        .popup-header {{\n            display: flex;\n            justify-content: space-between;\n            align-items: center;\n            border-bottom: 1px solid #ddd;\n            padding-bottom: 10px;\n            margin-bottom: 15px;\n        }}\n        \n        .close-button {{\n            background: none;\n            border: none;\n            font-size: 24px;\n            cursor: pointer;\n            color: #555;\n        }}\n        \n        .close-button:hover {{\n            color: #000;\n        }}\n        \n        \/* Style for hierarchical bullets *\/\n        .popup-body ul {{\n            padding-left: 20px;\n            margin-bottom: 5px;\n        }}\n        \n        .popup-body ul.top-level {{\n            list-style-type: disc;\n        }}\n        \n        .popup-body ul.second-level {{\n            list-style-type: circle;\n            margin-top: 5px;\n        }}\n        \n        .popup-body ul.third-level {{\n            list-style-type: square;\n            margin-top: 3px;\n        }}\n        \n        .popup-body li.top-level-item {{\n            margin-bottom: 12px;\n            font-weight: bold;\n        }}\n        \n        .popup-body li.second-level-item {{\n            margin-bottom: 8px;\n            font-weight: normal;\n        }}\n        \n        .popup-body 
li.third-level-item {{\n            margin-bottom: 5px;\n            font-weight: normal;\n            font-size: 0.95em;\n        }}\n        \n        .popup-body p {{\n            margin-bottom: 10px;\n        }}\n        &lt;\/style&gt;\n        \n        &lt;script&gt;\n        function closeSummaryPopup() {{\n            document.getElementById('summary-popup').style.display = 'none';\n            \n            \/\/ Send message to Streamlit\n            window.parent.postMessage({{\n                type: \"streamlit:setComponentValue\",\n                value: true\n            }}, \"*\");\n        }}\n        &lt;\/script&gt;\n        \"\"\"\n        return html<\/code><\/pre>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-186.png?ssl=1\" alt=\"\" class=\"wp-image-600571\"><figcaption class=\"wp-element-caption\">Hierarchical summary (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The Talk-to-Videos app generates quizzes from the video through the <code>QuizGenerator<\/code> class in <code>quiz.py<\/code>. The quiz generator creates multiple-choice questions targeting specific facts and concepts presented in the video. Unlike the RAG chain, where I use a temperature of zero, I raise the LLM temperature to 0.4 to encourage some creativity in quiz generation. A structured prompt guides the quiz generation process. The <code>parse_quiz_response<\/code> method extracts and validates the generated quiz elements to make sure that each question has all the required components. To prevent users from recognizing a pattern in the answers and to promote real understanding, the quiz generator shuffles the answer options. Questions are presented one at a time, followed by immediate feedback on each answer. 
After completing all questions, the <code>calculate_quiz_results<\/code> method scores the user\u2019s answers, and the user is presented with an overall score, a visual breakdown of correct versus incorrect answers, and feedback on the performance level. In this way, the quiz generation functionality transforms passive video watching into active learning by challenging users to recall and apply information presented in the video.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># quiz.py\nclass QuizGenerator:\n    def __init__(self):\n        pass\n    \n    def generate_quiz(self, rag_system, api_key, transcript=None, model=\"gpt-4o\", temperature=0.4):\n        \"\"\"\n        Generate quiz questions based on the video transcript\n        \n        Args:\n            rag_system: The RAG system with vector store\n            api_key: OpenAI API key\n            transcript: The full transcript text (optional)\n            model: Model to use for question generation\n            temperature: Creativity level (0.0-1.0)\n            \n        Returns:\n            list: List of question objects\n        \"\"\"\n        if not rag_system:\n            st.error(\"Please transcribe the video first before creating a quiz!\")\n            return []\n        \n        # Create a temporary LLM with slightly higher temperature for more creative questions\n        creative_llm = ChatOpenAI(\n            openai_api_key=api_key,\n            model=model,\n            temperature=temperature\n        )\n\n        num_questions = 10\n        \n        # Prompt to generate quiz\n        prompt = f\"\"\"Based on the video transcript, generate {num_questions} multiple-choice questions to test understanding of the content.\n        For each question:\n        1. The question should be specific to information mentioned in the video\n        2. Include 4 options (A, B, C, D)\n        3. 
Clearly indicate the correct answer\n        \n        Format your response exactly as follows for each question:\n        QUESTION: [question text]\n        A: [option A]\n        B: [option B]\n        C: [option C]\n        D: [option D]\n        CORRECT: [letter of correct answer]\n       \n        Make sure all questions are based on facts from the video.\"\"\"\n        \n        try:\n            if transcript:\n                # If we have the full transcript, use it\n                messages = [\n                    {\"role\": \"system\", \"content\": f\"You are given the following transcript from a video:\\n\\n{transcript}\\n\\nUse this transcript to create quiz questions according to the instructions.\"},\n                    {\"role\": \"user\", \"content\": prompt}\n                ]\n                \n                response = creative_llm.invoke(messages)\n                response_text = response.content\n            else:\n                # Fallback to RAG approach if no transcript is provided\n                relevant_docs = rag_system.vector_store.similarity_search(\n                    \"what are the main topics covered in this video?\", k=5\n                )\n                context = \"\\n\\n\".join([doc.page_content for doc in relevant_docs])\n                \n                # Use the creative LLM with context to generate questions\n                messages = [\n                    {\"role\": \"system\", \"content\": f\"You are given the following context from a video transcript:\\n\\n{context}\\n\\nUse this context to create quiz questions according to the instructions.\"},\n                    {\"role\": \"user\", \"content\": prompt}\n                ]\n                \n                response = creative_llm.invoke(messages)\n                response_text = response.content\n        except Exception as e:\n            # Fallback to the regular RAG system if there's an error\n            st.warning(f\"Using standard question generation due to error: 
{str(e)}\")\n            response_text = rag_system.ask(prompt)\n        \n        return self.parse_quiz_response(response_text)\n\n    # The rest of the class remains unchanged\n    def parse_quiz_response(self, response_text):\n        \"\"\"\n        Parse the LLM response to extract questions, options, and correct answers\n        \n        Args:\n            response_text: Raw text response from LLM\n            \n        Returns:\n            list: List of parsed question objects\n        \"\"\"\n        quiz_questions = []\n        current_question = {}\n        \n        for line in response_text.strip().split('\\n'):\n            line = line.strip()\n            if line.startswith('QUESTION:'):\n                if current_question and 'question' in current_question and 'options' in current_question and 'correct' in current_question:\n                    quiz_questions.append(current_question)\n                current_question = {\n                    'question': line[len('QUESTION:'):].strip(),\n                    'options': [],\n                    'correct': None\n                }\n            elif line.startswith(('A:', 'B:', 'C:', 'D:')):\n                option_letter = line[0]\n                option_text = line[2:].strip()\n                current_question.setdefault('options', []).append((option_letter, option_text))\n            elif line.startswith('CORRECT:'):\n                current_question['correct'] = line[len('CORRECT:'):].strip()\n        \n        # Add the last question\n        if current_question and 'question' in current_question and 'options' in current_question and 'correct' in current_question:\n            quiz_questions.append(current_question)\n        \n        # Randomize options for each question\n        randomized_questions = []\n        for q in quiz_questions:\n            # Get the original correct answer\n            correct_letter = q['correct']\n            correct_option = None\n            \n            # Find the 
correct option text\n            for letter, text in q['options']:\n                if letter == correct_letter:\n                    correct_option = text\n                    break\n            \n            if correct_option is None:\n                # If we can't find the correct answer, keep the question as is\n                randomized_questions.append(q)\n                continue\n                \n            # Collect the option texts and shuffle them\n            option_texts = [text for _, text in q['options']]\n            random.shuffle(option_texts)\n            \n            # Keep the letters in their original order (A, B, C, D)\n            option_letters = [letter for letter, _ in q['options']]\n            \n            # Re-pair the fixed letters with the shuffled texts so the correct answer can land on a new letter\n            options_pairs = list(zip(option_letters, option_texts))\n            \n            # Find the new position of the correct answer\n            new_correct_letter = None\n            for letter, text in options_pairs:\n                if text == correct_option:\n                    new_correct_letter = letter\n                    break\n            \n            # Create a new question with randomized options\n            new_q = {\n                'question': q['question'],\n                'options': options_pairs,\n                'correct': new_correct_letter\n            }\n            \n            randomized_questions.append(new_q)\n        \n        return randomized_questions\n    \n    def calculate_quiz_results(self, questions, user_answers):\n        \"\"\"\n        Calculate quiz results based on user answers\n        \n        Args:\n            questions: List of question objects\n            user_answers: Dictionary of user answers keyed by question_key\n            \n        Returns:\n            tuple: (results dict, correct count)\n        \"\"\"\n        correct_count = 0\n        results = {}\n        \n        for i, 
question in enumerate(questions):\n            question_key = f\"quiz_q_{i}\"\n            user_answer = user_answers.get(question_key)\n            correct_answer = question['correct']\n            \n            # Only count as correct if user selected an answer and it matches\n            is_correct = user_answer is not None and user_answer == correct_answer\n            if is_correct:\n                correct_count += 1\n            \n            results[question_key] = {\n                'user_answer': user_answer,\n                'correct_answer': correct_answer,\n                'is_correct': is_correct\n            }\n        \n        return results, correct_count<\/code><\/pre>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"1024\" width=\"980\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-188-980x1024.png?resize=980%2C1024&#038;ssl=1\" alt=\"\" class=\"wp-image-600573\"><figcaption class=\"wp-element-caption\">Quiz result (image by author)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Talk-to-Videos also generates flashcards from the video content, which support active recall and spaced repetition learning techniques. This is done through the <code>FlashcardGenerator<\/code> class in <code>flashcards.py<\/code><em>,<\/em> which creates a mix of different flashcards focusing on key term definitions, conceptual questions, fill-in-the-blank statements, and True\/False questions with explanations. A prompt guides the LLM to output flashcards in a structured JSON format, with each card containing distinct \u201cfront\u201d and \u201cback\u201d elements. The <code>shuffle_flashcards<\/code> method produces a randomized presentation, and each flashcard is validated to ensure that it contains both front and back components before being presented to the user. The answer to each flashcard is initially hidden. 
It is revealed at the user\u2019s input using a classic flashcard reveal functionality. Users can generate a new set of flashcards for more practice. The flashcard and quiz systems are interconnected with each other so that users can switch between them as needed.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># flashcards.py\nclass FlashcardGenerator:\n    \"\"\"Class to generate flashcards from video content using the RAG system.\"\"\"\n    \n    def __init__(self):\n        \"\"\"Initialize the flashcard generator.\"\"\"\n        pass\n    \n    def generate_flashcards(self, rag_system, api_key, transcript=None, num_cards=10, model=\"gpt-4o\") -&gt; List[Dict[str, str]]:\n        \"\"\"\n        Generate flashcards based on the video content.\n        \n        Args:\n            rag_system: The initialized RAG system with video content\n            api_key: OpenAI API key\n            transcript: The full transcript text (optional)\n            num_cards: Number of flashcards to generate (default: 10)\n            model: The OpenAI model to use\n            \n        Returns:\n            List of flashcard dictionaries with 'front' and 'back' keys\n        \"\"\"\n        # Import here to avoid circular imports\n        from langchain_openai import ChatOpenAI\n        \n        # Initialize language model\n        llm = ChatOpenAI(\n            openai_api_key=api_key,\n            model=model,\n            temperature=0.4\n        )\n        \n        # Create the prompt for flashcard generation\n        prompt = f\"\"\"\n        Create {num_cards} educational flashcards based on the video content.\n        \n        Each flashcard should have:\n        1. A front side with a question, term, or concept\n        2. A back side with the answer, definition, or explanation\n        \n        Focus on the most important and educational content from the video. 
\n        Create a mix of different types of flashcards:\n        - Key term definitions\n        - Conceptual questions\n        - Fill-in-the-blank statements\n        - True\/False questions with explanations\n        \n        Format your response as a JSON array of objects with 'front' and 'back' properties.\n        Example:\n        [\n            {{\"front\": \"What is photosynthesis?\", \"back\": \"The process by which plants convert light energy into chemical energy.\"}},\n            {{\"front\": \"The three branches of government are: Executive, Legislative, and _____\", \"back\": \"Judicial\"}}\n        ]\n        \n        Make sure your output is valid JSON format with exactly {num_cards} flashcards.\n        \"\"\"\n        \n        try:\n            # Determine the context to use\n            if transcript:\n                # Use the full transcript if provided\n                # Create messages for the language model\n                messages = [\n                    {\"role\": \"system\", \"content\": f\"You are an educational content creator specializing in creating effective flashcards. Use the following transcript from a video to create educational flashcards:\\n\\n{transcript}\"},\n                    {\"role\": \"user\", \"content\": prompt}\n                ]\n            else:\n                # Fallback to RAG system if no transcript is provided\n                relevant_docs = rag_system.vector_store.similarity_search(\n                    \"key points and educational concepts in the video\", k=15\n                )\n                context = \"\\n\\n\".join([doc.page_content for doc in relevant_docs])\n                \n                # Create messages for the language model\n                messages = [\n                    {\"role\": \"system\", \"content\": f\"You are an educational content creator specializing in creating effective flashcards. 
Use the following context from a video to create educational flashcards:\\n\\n{context}\"},\n                    {\"role\": \"user\", \"content\": prompt}\n                ]\n            \n            # Generate flashcards\n            response = llm.invoke(messages)\n            content = response.content\n            \n            # Extract JSON content in case there's text around it\n            json_start = content.find('[')\n            json_end = content.rfind(']') + 1\n            \n            if json_start &gt;= 0 and json_end &gt; json_start:\n                json_content = content[json_start:json_end]\n                flashcards = json.loads(json_content)\n            else:\n                # Fallback in case of improper JSON formatting\n                raise ValueError(\"Failed to extract valid JSON from response\")\n            \n            # Verify we have the expected number of cards (or adjust as needed)\n            actual_cards = min(len(flashcards), num_cards)\n            flashcards = flashcards[:actual_cards]\n            \n            # Validate each flashcard has required fields\n            validated_cards = []\n            for card in flashcards:\n                if 'front' in card and 'back' in card:\n                    validated_cards.append({\n                        'front': card['front'],\n                        'back': card['back']\n                    })\n            \n            return validated_cards\n        \n        except Exception as e:\n            # Handle errors gracefully\n            print(f\"Error generating flashcards: {str(e)}\")\n            # Return a few basic flashcards in case of error\n            return [\n                {\"front\": \"Error generating flashcards\", \"back\": f\"Please try again. 
Error: {str(e)}\"},\n                {\"front\": \"Tip\", \"back\": \"Try regenerating flashcards or using a different video\"}\n            ]\n    \n    def shuffle_flashcards(self, flashcards: List[Dict[str, str]]) -&gt; List[Dict[str, str]]:\n        \"\"\"Return a shuffled copy of the flashcards, leaving the input list unchanged\"\"\"\n        shuffled = flashcards.copy()\n        random.shuffle(shuffled)\n        return shuffled<\/code><\/pre>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/03\/image-189.png?ssl=1\" alt=\"\" class=\"wp-image-600574\"><figcaption class=\"wp-element-caption\">Flashcards (image by author)<\/figcaption><\/figure>\n<p class=\"has-heading-6-font-size wp-block-paragraph\"><strong>Potential Extensions and Improvements<\/strong><\/p>\n<p class=\"wp-block-paragraph\">This application can be extended and improved in several ways. For instance:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Visual features of the video (such as keyframes) could be combined with the audio to extract more meaningful information.<\/li>\n<li class=\"wp-block-list-item\">Team-based learning experiences could be enabled, where office colleagues or classmates share notes, quiz scores, and summaries.<\/li>\n<li class=\"wp-block-list-item\">Creating navigable transcripts that allow users to click on a specific section to jump to that point in the video.<\/li>\n<li class=\"wp-block-list-item\">Creating step-by-step action plans for implementing concepts from the video in real business settings.<\/li>\n<li class=\"wp-block-list-item\">Modifying the RAG prompt to elaborate on the answers and provide simpler explanations of difficult concepts.<\/li>\n<li class=\"wp-block-list-item\">Generating questions that build metacognitive skills in learners by prompting them to reflect on their thinking process and learning strategies while engaging with 
video content.<\/li>\n<\/ul>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/talk-to-videos\/\">Talk to Videos<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Umair Ali Khan<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/talk-to-videos\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Talk to Videos Large language models (LLMs) are improving in efficiency and are now able to understand different data formats, offering possibilities for myriads of applications in different domains. Initially, LLMs were inherently able to process only text. The image understanding feature was integrated by coupling an LLM with another image encoding model. 
However, gpt-4o [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,58,240,2160,71,2161],"tags":[318,73,2162],"class_list":["post-2695","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-business","category-editors-pick","category-education","category-large-language-models","category-videos","tag-llms","tag-models","tag-videos"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2695"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2695"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2695\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2695"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2695"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2695"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}