{"id":515,"date":"2024-12-12T07:03:47","date_gmt":"2024-12-12T07:03:47","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/12\/translating-a-memoir-a-technical-journey-08913ca60020\/"},"modified":"2024-12-12T07:03:47","modified_gmt":"2024-12-12T07:03:47","slug":"translating-a-memoir-a-technical-journey-08913ca60020","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/12\/translating-a-memoir-a-technical-journey-08913ca60020\/","title":{"rendered":"Translating a Memoir: A Technical Journey"},"content":{"rendered":"<div>\n<h4><em>Leveraging GPT-3.5 and Unstructured APIs for translations<\/em><\/h4>\n<p><strong>This blog post details how I utilised GPT to translate the personal memoir of a family friend, making it accessible to a broader audience.<\/strong> Specifically, I employed GPT-3.5 for translation and Unstructured\u2019s APIs for efficient content extraction and formatting.<\/p>\n<p>The memoir, a heartfelt account by my family friend Carmen Rosa, chronicles her upbringing in Bolivia and her romantic journey in Paris with an Iranian man during the vibrant 1970s. 
The book was originally written in Spanish; we aimed to preserve the essence of her narrative while using LLMs to extend its reach to English-speaking readers.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"Cover image of \u201cUn Destino Sorprendente\u201d, used with permission of author Carmen Rosa Wichtendahl.\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2ATDfCHL6K8wYGU9fGiWHnWw.png?ssl=1\"><figcaption>Cover image of \u201cUn Destino Sorprendente\u201d, used with permission of author Carmen Rosa Wichtendahl.<\/figcaption><\/figure>\n<p>Below you can read about the translation process in more detail, or you can <a href=\"https:\/\/colab.research.google.com\/drive\/1FxdtBO8iy1vnXG3VpjRIJ5yZEIEPHk9u#scrollTo=HDEwiI1j3mwH\">open the Colab Notebook<\/a>.<\/p>\n<h3>Translating the\u00a0document<\/h3>\n<p>I followed these steps to translate the\u00a0book:<\/p>\n<ol>\n<li>\n<strong>Import Book Data:<\/strong> I imported the book from a Docx document using the Unstructured API and divided it into chapters and paragraphs.<\/li>\n<li>\n<strong>Translation Technique: <\/strong>I translated each chapter using GPT-3.5. For each paragraph, I provided the last three translated sentences (if available) from the same chapter. This approach served two purposes:\n<ul>\n<li>\n<strong><em>Style Consistency:<\/em><\/strong> Maintaining a consistent style throughout the translation by providing context from previous translations.<\/li>\n<li>\n<strong><em>Token Limit:<\/em><\/strong> Limiting the number of tokens processed at once to avoid exceeding the model\u2019s context\u00a0limit.<\/li>\n<\/ul>\n<\/li>\n<li>\n<strong>Exporting translation as Docx: <\/strong>I used the python-docx library to save the translated content in Docx\u00a0format.<\/li>\n<\/ol>\n<h3>Technical implementation<\/h3>\n<h4>1. 
Libraries<\/h4>\n<p>We\u2019ll start by installing and importing the necessary libraries.<\/p>\n<pre>pip install --upgrade openai <br>pip install python-dotenv<br>pip install unstructured<br>pip install python-docx<\/pre>\n<pre>import openai<br><br># Unstructured<br>from unstructured.partition.docx import partition_docx<br>from unstructured.cleaners.core import group_broken_paragraphs<br><br># Data and other libraries<br>import pandas as pd<br>import re<br>from typing import List, Dict<br>import os<br>from dotenv import load_dotenv<\/pre>\n<h4>2. Connecting to OpenAI\u2019s\u00a0API<\/h4>\n<p>The code below reads the OpenAI API key from an\u00a0.env\u00a0file and creates the client that the translation functions use later on. You need to save your API key in that\u00a0.env\u00a0file first.<\/p>\n<pre>import openai<br><br># Specify the path to the .env file<br>dotenv_path = '\/content\/.env'<br><br>_ = load_dotenv(dotenv_path) # read local .env file<br><br># Create the client that translate_title and translate_chapter call<br>client = openai.OpenAI(api_key=os.environ['OPENAI_API_KEY'])<\/pre>\n<h4>3. Loading the\u00a0book<\/h4>\n<p>The code below imports the book in Docx format and divides it into individual paragraphs.<\/p>\n<pre>elements = partition_docx(<br>    filename=\"\/content\/libro.docx\", <br>    paragraph_grouper=group_broken_paragraphs<br>)<\/pre>\n<p>The code below prints the element at index 10 of elements.<\/p>\n<pre>print(elements[10])<br><br># Prints: Destino sorprendente, es el t\u00edtulo que la autora le puso ...<\/pre>\n<h4>4. Group book into titles and\u00a0chapters<\/h4>\n<p>The next step involves creating a list of chapters. Each chapter will be represented as a dictionary containing a title and a list of paragraphs. This structure simplifies the process of translating each chapter and paragraph individually. 
Here\u2019s an example of this\u00a0format:<\/p>\n<pre>[<br>  {\"title\": title 1, \"content\": [paragraph 1, paragraph 2, ..., paragraph n]},<br>  {\"title\": title 2, \"content\": [paragraph 1, paragraph 2, ..., paragraph n]},<br>  ...<br>  {\"title\": title n, \"content\": [paragraph 1, paragraph 2, ..., paragraph n]},<br>]<\/pre>\n<p>To achieve this, we\u2019ll create a function called group_by_chapter. Here are the key steps involved:<\/p>\n<ol>\n<li>\n<strong>Extract Relevant Information: <\/strong>We can identify narrative text and titles by reading element.category. Those are the only categories we\u2019re interested in translating at this\u00a0point.<\/li>\n<li>\n<strong>Identify Narrative Titles:<\/strong> We recognise that some elements tagged as titles are really part of the narrative text. To account for this, we treat only bold titles as chapter titles and assume that titles that are italicised, or that carry no emphasis at all, belong to the narrative.<\/li>\n<\/ol>\n<pre>def group_by_chapter(elements: List) -&gt; List[Dict]:<br>    chapters = []<br>    current_title = None<br><br>    for element in elements:<br><br>      text_style = element.metadata.emphasized_text_tags # list of emphasis tags such as 'b' or 'i', or None<br>      # sorted so that bold-and-italic text always compares equal to ['b', 'i']<br>      unique_text_style = sorted(set(text_style)) if text_style is not None else None<br><br>      # we consider an element a title if it is a title category and the style is bold<br>      is_title = (element.category == \"Title\") &amp; (unique_text_style == ['b'])<br><br>      # we consider an element a narrative content if it is a narrative text category or<br>      # if it is a title category, but it is unstyled, italic, or italic and bold<br>      is_narrative = (element.category == \"NarrativeText\") | (<br>          ((element.category == \"Title\") &amp; (unique_text_style is None)) |<br>          ((element.category == \"Title\") &amp; (unique_text_style == ['i'])) |<br>          ((element.category == \"Title\") &amp; (unique_text_style == ['b', 'i']))<br>      )<br><br>      # for new titles<br>      if is_title:<br>        
print(f\"Adding title {element.text}\")<br><br>        # Add previous chapter when a new one comes in, unless current title is None<br>        if current_title is not None:<br>          chapters.append(current_chapter)<br><br>        current_title = element.text<br>        current_chapter = {\"title\": current_title, \"content\": []}<br><br>      elif is_narrative:<br>        # Narrative text found before the first title has no chapter to join, so skip it<br>        if current_title is None:<br>          continue<br>        print(f\"Adding Narrative {element.text}\")<br>        current_chapter[\"content\"].append(element.text)<br><br>      else:<br>        print(f'### No need to convert. Element type: {element.category}')<br><br>    # Add the last chapter, since no later title triggers the append above<br>    if current_title is not None:<br>      chapters.append(current_chapter)<br><br>    return chapters<\/pre>\n<p>Calling the function on the parsed elements returns one dictionary per chapter, for\u00a0example:<\/p>\n<pre>book_chapters = group_by_chapter(elements)<br>book_chapters[2] <br><br># Returns <br>{'title': 'Proemio',<br> 'content': [<br>    'La autobiograf\u00eda es considerada ...',<br>    'Dentro de las artes literarias, ...',<br>    'Se encuentra m\u00e1s pr\u00f3xima a los, ...',<br>  ]<br>}<\/pre>\n<h4>5. Book translation<\/h4>\n<p>To translate the book, we follow these\u00a0steps:<\/p>\n<ol>\n<li>\n<strong>Translate Chapter Titles:<\/strong> We translate the title of each\u00a0chapter.<\/li>\n<li>\n<strong>Translate Paragraphs:<\/strong> We translate each paragraph, providing the model with the last three translated sentences as\u00a0context.<\/li>\n<li>\n<strong>Save Translations:<\/strong> We save both the translated titles and\u00a0content.<\/li>\n<\/ol>\n<p>The function below automates this\u00a0process.<\/p>\n<pre>def translate_book(book_chapters: List[Dict]) -&gt; List[Dict]:<br>  translated_book = []<br>  for chapter in book_chapters:<br>    print(f\"Translating following chapter: {chapter['title']}.\")<br>    translated_title = translate_title(chapter['title'])<br>    translated_chapter_content = translate_chapter(chapter['content'])<br>    translated_book.append({<br>        \"title\": translated_title,<br>        \"content\": translated_chapter_content<br>        })<br>  return translated_book<\/pre>\n<p>For the title, we ask GPT for a 
simple translation as\u00a0follows:<\/p>\n<pre>def translate_title(title: str) -&gt; str:<br>  response = client.chat.completions.create(<br>    model=\"gpt-3.5-turbo\",<br>    messages=[{<br>        \"role\": \"system\",<br>        \"content\": f\"Translate the following book title into English:\\n{title}\"<br>        }]<br>  )<br>  return response.choices[0].message.content<\/pre>\n<p>To translate a single chapter, we provide the model with the corresponding paragraphs. We instruct the model as\u00a0follows:<\/p>\n<ol>\n<li>\n<strong>Identify the role:<\/strong> We inform the model that it is a helpful translator for a\u00a0book.<\/li>\n<li>\n<strong>Provide context:<\/strong> We share the last three translated sentences from the\u00a0chapter.<\/li>\n<li>\n<strong>Request translation:<\/strong> We ask the model to translate the next paragraph.<\/li>\n<\/ol>\n<p>During this process, the function combines all translated paragraphs into a single\u00a0string.<\/p>\n<pre># Function to translate a chapter using OpenAI API<br>def translate_chapter(chapter_paragraphs: List[str]) -&gt; str:<br>    translated_content = \"\"<br><br>    for i, paragraph in enumerate(chapter_paragraphs):<br><br>        print(f\"Translating paragraph {i + 1} out of {len(chapter_paragraphs)}\")<br><br>        # Builds the message dynamically based on whether there is previous translated content<br>        messages = [{<br>          \"role\": \"system\", <br>          \"content\": \"You are a helpful translator for a book.\"<br>        }]<br><br>        if translated_content:<br>            latest_content = get_last_three_sentences(translated_content)<br>            messages.append(<br>                {<br>                  \"role\": \"system\",<br>                  \"content\": f\"This is the latest text from the book that you've translated from Spanish into English:\\n{latest_content}\"<br>                }<br>            )<br><br>        # Adds the user message for the current paragraph<br>    
    messages.append(<br>            {<br>              \"role\": \"user\", <br>              \"content\": f\"Translate the following text from the book into English:\\n{paragraph}\"<br>            }<br>        )<br><br>        # Calls the API<br>        response = client.chat.completions.create(<br>            model=\"gpt-3.5-turbo\",<br>            messages=messages<br>        )<br><br>        # Extracts the translated content and appends it, separated by a blank line<br>        paragraph_translation = response.choices[0].message.content<br>        translated_content += paragraph_translation + '\\n\\n'<br><br>    return translated_content<\/pre>\n<p>The supporting function below returns the last three sentences of the text translated so far.<\/p>\n<pre>def get_last_three_sentences(paragraph: str) -&gt; str:<br>    # Use regex to split the text into sentences, dropping the empty strings<br>    # produced by the blank lines between paragraphs<br>    sentences = [s for s in re.split(r'(?&lt;!\\w\\.\\w.)(?&lt;![A-Z][a-z]\\.)(?&lt;=\\.|\\?)\\s', paragraph.strip()) if s]<br><br>    # Get the last three sentences (or fewer if the text has fewer than 3 sentences)<br>    last_three = sentences[-3:]<br><br>    # Join the sentences into a single string<br>    return ' '.join(last_three)<\/pre>\n<h4>6. Book\u00a0export<\/h4>\n<p>Finally, we pass the list of chapters to a function that adds each title as a heading and each chapter\u2019s content as a paragraph. After each chapter, a page break is added to separate the chapters. 
The resulting document is then saved locally as a Docx\u00a0file.<\/p>\n<pre>from typing import Dict, List<br><br>from docx import Document<br><br>def create_docx_from_chapters(chapters: List[Dict], output_filename: str) -&gt; None:<br>    doc = Document()<br><br>    for chapter in chapters:<br>        # Add chapter title as Heading 1<br>        doc.add_heading(chapter['title'], level=1)<br><br>        # Add chapter content as normal text<br>        doc.add_paragraph(chapter['content'])<br><br>        # Add a page break after each chapter<br>        doc.add_page_break()<br><br>    # Save the document<br>    doc.save(output_filename)<\/pre>\n<h3>Limitations<\/h3>\n<p>While using GPT and APIs for translation is fast and efficient, there are key limitations compared to human translation:<\/p>\n<ul>\n<li>\n<strong>Pronoun and Reference Errors: <\/strong>GPT misinterpreted pronouns or references in a few cases, potentially attributing actions or statements to the wrong person in the narrative. A human translator can better resolve such ambiguities.<\/li>\n<li>\n<strong>Cultural Context:<\/strong> GPT missed subtle cultural references and idioms that a human translator could interpret more accurately. In this case, several slang terms unique to Santa Cruz, Bolivia, were retained in the original language without additional context or explanation.<\/li>\n<\/ul>\n<p>Combining AI with human review can balance speed and quality, so that translations are both accurate and authentic.<\/p>\n<h3>Conclusion<\/h3>\n<p>This project demonstrates an approach to translating a book using a combination of GPT-3.5 and Unstructured\u2019s APIs. By automating the translation process, we significantly reduced the manual effort required. 
While the initial translation output may require some human revisions to refine the nuances and ensure the highest quality, this approach serves as a strong foundation for efficient and effective book translation.<\/p>\n<p>If you have any feedback or suggestions on how to improve this process or the quality of the translations, please feel free to share them in the comments\u00a0below.<\/p>\n<h3>Appendix<\/h3>\n<ul>\n<li>Link to <a href=\"https:\/\/colab.research.google.com\/drive\/1FxdtBO8iy1vnXG3VpjRIJ5yZEIEPHk9u#scrollTo=XaX0tVYlP9NU\">Colab\u00a0Notebook<\/a>\n<\/li>\n<li>Link to <a href=\"https:\/\/docs.google.com\/document\/d\/1slQQXHqSq4n3d4zCNg9KNF4cIq3RpY4Y\/edit?usp=sharing&amp;ouid=114182714261805101186&amp;rtpof=true&amp;sd=true\">book in original language (Spanish)<\/a>\n<\/li>\n<li>Link to <a href=\"https:\/\/docs.google.com\/document\/d\/1DGosQWbXsMlnFlgxha8UOyeX1OqxN2Lk\/edit\">translated book (English)<\/a>\n<\/li>\n<\/ul>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/translating-a-memoir-a-technical-journey-08913ca60020\">Translating a Memoir: A Technical Journey<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p>Valeria Cortez<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Ftranslating-a-memoir-a-technical-journey-08913ca60020\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Translating a Memoir: A Technical Journey Leveraging GPT-3.5 and unstructured APIs for 
translations This blog post details how I utilised GPT to translate the personal memoir of a family friend, making it accessible to a broader audience. Specifically, I employed GPT-3.5 for translation and Unstructured\u2019s APIs for efficient content extraction and formatting. The memoir, a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,626,367,83,71,627],"tags":[628,630,629],"class_list":["post-515","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-automated-translation","category-chatgpt","category-data-science","category-large-language-models","category-unstructured-data","tag-import","tag-translation","tag-unstructured"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/515"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=515"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/515\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=515"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=515"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=515"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}