{"id":3770,"date":"2025-05-13T07:02:33","date_gmt":"2025-05-13T07:02:33","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/05\/13\/empowering-llms-to-think-deeper-by-erasing-thoughts\/"},"modified":"2025-05-13T07:02:33","modified_gmt":"2025-05-13T07:02:33","slug":"empowering-llms-to-think-deeper-by-erasing-thoughts","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/05\/13\/empowering-llms-to-think-deeper-by-erasing-thoughts\/","title":{"rendered":"Empowering LLMs to Think Deeper by Erasing Thoughts"},"content":{"rendered":"<p>    Empowering LLMs to Think Deeper by Erasing Thoughts<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h2 class=\"wp-block-heading\"><mdspan datatext=\"el1747095842970\" class=\"mdspan-comment\">Introduction<\/mdspan><\/h2>\n<p class=\"wp-block-paragraph\">Recent <strong>large language models (LLMs)<\/strong> \u2014 such as <a href=\"https:\/\/openai.com\/index\/learning-to-reason-with-llms\/\">OpenAI\u2019s o1\/o3<\/a>, DeepSeek\u2019s R1 and Anthropic\u2019s Claude 3.7 \u2014 demonstrate that allowing the model to think deeper and longer at test time can significantly enhance model\u2019s reasoning capability. The core approach underlying their deep thinking capability is called <strong><a href=\"https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2022\/file\/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf\">chain-of-thought (CoT)<\/a><\/strong>, where the model iteratively generates intermediate reasoning steps and appends them to the current context until producing the final answer.<\/p>\n<p class=\"wp-block-paragraph\">However, as tasks become increasingly complex, the steps needed to solve them grow dramatically. For instance, consider solving NP-hard problems using CoT \u2014 the reasoning trace would inevitably span exponential steps, assuming a fixed-size Transformer as the base model and P \u2260 NP.\u00a0This raises an important question:<\/p>\n<p class=\"has-text-align-center wp-block-paragraph\"><strong><em>Will CoT-based test-time scaling hit hard ceilings? <\/em><\/strong><\/p>\n<p class=\"wp-block-paragraph\">Unfortunately, probably yes. Various limitations will emerge for harder tasks: <strong>(1)<\/strong> chains will inevitably exceed model\u2019s context windows, <strong>(2)<\/strong> critical information becomes buried and nearly impossible to retrieve from numerous preceding tokens, and\u00a0<strong>(3)<\/strong>\u00a0the self-attention complexity makes generating each new token prohibitively expensive.<\/p>\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" height=\"683\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/image-103-1024x683.png?resize=1024%2C683&#038;ssl=1\" alt=\"\" class=\"wp-image-603837\" style=\"width:564px;height:auto\"><figcaption class=\"wp-element-caption\">Generated by ChatGPT, prompted by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In this article, we challenge the conventional \u201cwrite-only\u201d CoT reasoning paradigm that dominates current LLM architectures, from both theoretical and practical perspectives. Furthermore, we will explore a fundamentally different reasoning approach that allows LLM to not only generate thoughts, but also erase thoughts. This capacity for thought erasure not only offers significant practical benefits in performance and efficiency, but proves fundamental for achieving optimal reasoning efficiency from a computational theory perspective.<\/p>\n<p class=\"wp-block-paragraph\">This post is based on the paper <em><a href=\"https:\/\/arxiv.org\/pdf\/2503.14337?\">C. Yang et al., \u201cPENCIL: Long thoughts with short memory\u201d<\/a><\/em> accepted in International Conference on <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning\/\" title=\"Machine Learning\">Machine Learning<\/a> 2025, a collaboration with Nathan Srebro, David McAllester, Zhiyuan Li. <a href=\"https:\/\/github.com\/chr26195\/PENCIL\">Code<\/a> is also available.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"161\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_7647c44b45bb639c4b3dcc0c8cb4cfd4-1024x161.png?resize=1024%2C161&#038;ssl=1\" alt=\"\" class=\"wp-image-603838\"><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Not Everything Needs to Be Remembered<\/h2>\n<p class=\"wp-block-paragraph\">The idea of selectively discarding information has deep roots in computer science history, from the earliest computational models to modern systems. The classic Turing machine overwrites symbols on its tape rather than preserving every state; programming languages reclaim memory through stack frames that are automatically released when functions complete their execution; and modern garbage collectors continuously identify and remove objects no longer accessible to the program. These mechanisms weren\u2019t merely efficiency optimizations \u2014 they were essential design choices that made complex computation possible within finite resources.<\/p>\n<p class=\"wp-block-paragraph\">This idea also applies to human reasoning. In theorem proving, once a lemma is established, we discard its detailed derivation while preserving the result; when exploring problem-solving approaches, we simply mark unproductive paths as \u201cfailed\u201d without retaining their full traces. Throughout complex reasoning, we naturally compress information, retaining conclusions while discarding the scaffolding used to reach them.<\/p>\n<h2 class=\"wp-block-heading\">\n<img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/270f.png?ssl=1\" alt=\"\u270f\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> PENCIL: A New Reasoning Paradigm<\/h2>\n<p class=\"wp-block-paragraph\">Therefore, we propose <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/270f.png?ssl=1\" alt=\"\u270f\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> PENCIL, a new reasoning paradigm for LLMs. Unlike <img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/2712.png?ssl=1\" alt=\"\u2712\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> CoT that only generates thoughts, PENCIL recursively generates and erases thoughts until reaching the final answer. It maintains only the minimal context required for generating future thoughts, so the model can think longer and deeper to solve harder tasks using shorter working memory. The following figure illustrates how PENCIL works<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"493\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_ccc44eabe5cd182766cc2126acf3949d-1-1024x493.png?resize=1024%2C493&#038;ssl=1\" alt=\"\" class=\"wp-image-603847\"><figcaption class=\"wp-element-caption\">Chain-of-Thought (left) preserves all reasoning steps in context, creating lengthy outputs. PENCIL (right) alternates between generation (bold) and reduction (blue): discarding intermediate thoughts when no longer needed. After reaching the solution, PENCIL returns only the final answer, hiding the thinking process.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">How Do Models Erase Thoughts?<\/h3>\n<p class=\"wp-block-paragraph\">PENCIL\u2019s erasure mechanism draws on two classical ideas. First, from <strong>rewriting rules in logic and classical automated theorem proving<\/strong>, which continuously apply predefined rules to simplify complex logical or arithmetic expressions into canonical forms until reaching a final answer. Second, from <strong>functional programming languages<\/strong>, which creates stack frames to store local variables when calling functions and releases corresponding memory when functions return, automatically discarding intermediate states that are no longer needed.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Specifically, we introduce three special tokens, called [CALL], [SEP], and [RETURN], and use the following reduction rule to implement erasure:<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/unnamed-1.png?ssl=1\" alt=\"\" class=\"wp-image-603848\" style=\"width:381px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\">where <strong>C <\/strong>stands for context, <strong>T <\/strong>stands for intermediate thoughts, and <strong>A <\/strong>stands for answer. Whenever the generated sequence completely matches the pattern on the left, PENCIL triggers the reduction rule, erasing thoughts and merging the answer<strong> <\/strong>back into the context. It is important to note that <strong>C<\/strong>, <strong>T<\/strong> and <strong>A<\/strong> can themselves contain special tokens, thereby supporting <strong>recursive<\/strong> structures similar to nested function calls \u2014 for example, <strong>C<\/strong> may contain another [CALL] token, indicating that a new thinking subroutine has been initiated.\u00a0<\/p>\n<h3 class=\"wp-block-heading\">How to Use PENCIL?<\/h3>\n<p class=\"wp-block-paragraph\">PENCIL\u2019s erasure mechanism flexibly supports various reasoning patterns, such as:<\/p>\n<p class=\"wp-block-paragraph\">1\u20e3 <strong>Task Decomposition<\/strong>: Using [CALL] to initiate subproblems, generate intermediate results, and then use [SEP] and [RETURN] to merge outputs and erase subproblem reasoning details;<\/p>\n<p class=\"wp-block-paragraph\">2\u20e3 <strong>Branch and Backtrack<\/strong>: Using a [CALL], [SEP], [RETURN] triplet to manage an exploration branch in a search tree, erasing invalid paths upon conflicts or failures.<\/p>\n<p class=\"wp-block-paragraph\">3\u20e3 <strong>Summarization \/ Tail Recursion<\/strong>: Condensing a lengthy reasoning trace into concise summary, similar to tail recursion optimization in programming:<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/unnamed-2.png?ssl=1\" alt=\"\" class=\"wp-image-603849\" style=\"width:542px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\">where <strong>T<\/strong> represents the original complex reasoning process (or a more difficult problem), and <strong>T\u2019<\/strong> represents the summarized or simplified content (or an equivalent, more tractable problem).<\/p>\n<h3 class=\"wp-block-heading\">Example on a NP-Complete Task<\/h3>\n<p class=\"wp-block-paragraph\">For example, consider a classic NP-Complete problem Boolean Satisfiability (SAT): given a Boolean formula, determine whether there exists a variable assignment that makes it true. This problem is (widely believed to) require <strong>exponential time<\/strong> but only <strong>polynomial space<\/strong> to solve, with the simplest approach being traversing a binary search tree of depth n. <\/p>\n<p class=\"wp-block-paragraph\">Traditional CoT would accumulate intermediate calculations, causing the context length to grow proportionally with the number of nodes in the search tree, which is exponential time complexity of O(2^n). In comparison, PENCIL can recursively branch to try True\/False for a variable, backtracking upon conflict and erasing all thoughts within that branch. This thus keeps the context length proportional to the search depth, which is space complexity of only O(n).<\/p>\n<p class=\"wp-block-paragraph\">The following figure compares the maximum context length of the vanilla CoT without reduction (blue) and PENCIL with reduction (red). As problem complexity increases, PENCIL achieves dramatic space efficiency, notably reducing context length from 151,192 to just 3,335 tokens for Einstein\u2019s Puzzle.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"234\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_fef59a87380cb75144e50dd7feb2b7cb-1024x234.png?resize=1024%2C234&#038;ssl=1\" alt=\"\" class=\"wp-image-603846\"><figcaption class=\"wp-element-caption\">Maximal sequence length with and without the reduction rule.<\/figcaption><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Training and Experiments<\/h2>\n<p class=\"wp-block-paragraph\">The core difference between CoT and PENCIL during training is the calculation of the loss function:<\/p>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/unnamed-3.png?ssl=1\" alt=\"\" class=\"wp-image-603850\" style=\"width:470px;height:auto\"><\/figure>\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/unnamed-4.png?ssl=1\" alt=\"\" class=\"wp-image-603851\" style=\"width:505px;height:auto\"><\/figure>\n<p class=\"wp-block-paragraph\">For CoT, the loss for each new token is based on the complete historical context; for PENCIL, after each \u201cwrite-erase\u201d iteration, the model calculates loss for new tokens only on the reduced sequence. Although both generate the same number of tokens, PENCIL significantly shortens the context length corresponding to each token and thus is more efficient. <\/p>\n<p class=\"wp-block-paragraph\">It\u2019s also worthwhile to note that after each reduction, the KV cache for the shared prefix <strong>C<\/strong> can be directly reused, with only the cache for the shorter part <strong>A<\/strong> needing recalculation.\u00a0<\/p>\n<h3 class=\"wp-block-heading\">Experimental Results<\/h3>\n<p class=\"wp-block-paragraph\">Our experiments focus on three inherently hard reasoning tasks: 3-SAT (NP-Complete), QBF (PSPACE-Complete), and Einstein\u2019s Puzzle (natural language reasoning). For each task, we wrote a generator to generate a training set where special tokens are included. We train a small transformer (SAT\/QBF with 10.6M parameters; Einstein\u2019s Puzzle with 25.2M parameters) starting with random initialization for these tasks.<\/p>\n<p class=\"wp-block-paragraph\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f4ca.png?ssl=1\" alt=\"\ud83d\udcca\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> Compared to CoT, we found PENCIL can solve larger-scale reasoning problems. As shown in the figure below, in SAT (left) and QBF (right) tasks, when problem size is small, both CoT and PENCIL perfectly solve problems; but as size increases, traditional CoT accuracy drops significantly (e.g., only about 50% for SAT at n=10), while PENCIL maintains high accuracy \u2265 99%. This is primarily because CoT\u2019s context sequence length explodes exponentially, whereas PENCIL avoids explosion by dynamic reduction.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"208\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_6749e4a9cffc711e4a140c67bbafb1cd-1024x208.png?resize=1024%2C208&#038;ssl=1\" alt=\"\" class=\"wp-image-603840\"><figcaption class=\"wp-element-caption\">Performance comparison on 3-SAT (left) and QBF (right)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/26a1.png?ssl=1\" alt=\"\u26a1\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> Additionally, PENCIL significantly saves computational resources. As shown in the figure, for QBF (n=3\u20136) tasks, we compared the convergence speed of CoT (blue) and PENCIL (red) under the same FLOPs budget. PENCIL quickly reaches 100% accuracy while CoT, due to continuously expanding context length, requires more FLOPs to approach optimality. As the problem size increases, the gap between the two becomes more pronounced.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"419\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_9a14c6b33e35d657204c34941405f132-1024x419.png?resize=1024%2C419&#038;ssl=1\" alt=\"\" class=\"wp-image-603839\"><figcaption class=\"wp-element-caption\">Comparison of convergence speed for training on the QBF problem (with n ranges from 3<br \/>to 6). Circles and vertical lines indicate the first time each method reaches optimal performance.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\"><img data-recalc-dims=\"1\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/s.w.org\/images\/core\/emoji\/15.0.3\/72x72\/1f9e9.png?ssl=1\" alt=\"\ud83e\udde9\" class=\"wp-smiley\" style=\"height: 1em; max-height: 1em;\"> We further considered a very difficult logical reasoning problem: <a href=\"https:\/\/en.wikipedia.org\/wiki\/Zebra_Puzzle\">Einstein\u2019s Puzzle<\/a>. Each problem consists of 5 houses and 5 attribute categories of people living in them \u2014 color, nationality, drink, cigarette, and pet (e.g., Red\/Green\/Blue, Brit\/German\/Swede, Bird\/Dog\/Fish, etc.). Given clues like \u201cthe green house is right next to the bird owner\u2019s\u201d and \u201cthe dog owner lives in the red house,\u201d the task is to deduce \u201cwho owns the fish?\u201d This problem presents an extreme challenge for existing LLMs: <a href=\"https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/deb3c28192f979302c157cb653c15e90-Paper-Conference.pdf\">even GPT-4 struggles to solve it<\/a>. The figure below shows a simplified version with only 3 houses and 3 attribute categories:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"280\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/unnamed-5-1024x280.png?resize=1024%2C280&#038;ssl=1\" alt=\"\" class=\"wp-image-603852\"><figcaption class=\"wp-element-caption\">Illustration of Einstein\u2019s Puzzle.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">As shown below, for this problem that even large models struggle with, PENCIL achieves 97% accuracy using only a small 25.2M parameter model, while traditional CoT achieves only 25% accuracy (close to random guessing).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"234\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_fef59a87380cb75144e50dd7feb2b7cb-1024x234.png?resize=1024%2C234&#038;ssl=1\" alt=\"\" class=\"wp-image-603846\"><figcaption class=\"wp-element-caption\">Performance on Einstein\u2019s Puzzle<\/figcaption><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Theory: Universal Efficient Computation<\/h2>\n<p class=\"wp-block-paragraph\">We further demonstrate PENCIL\u2019s fundamental advantage over traditional CoT from the theoretical <strong>expressive power <\/strong>perspective: PENCIL is Turing complete with optimal space complexity, and thus can solve arbitrary computable tasks efficiently. This is something fundamentally impossible for CoT!<\/p>\n<h3 class=\"wp-block-heading\">Main Results<\/h3>\n<p class=\"wp-block-paragraph\">Specifically, we prove: <em>Using a fixed, finite-sized Transformer, PENCIL can simulate any Turing machine with optimal time and space complexity, thereby efficiently solving all computable problems.<\/em><\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"137\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_b729e68f993210394f395adbb2bc6b31-1-1024x137.png?resize=1024%2C137&#038;ssl=1\" alt=\"\" class=\"wp-image-603844\"><\/figure>\n<p class=\"wp-block-paragraph\">In other words, for any Turing machine running in T time and S space, PENCIL requires only O(T) tokens while maintaining a maximum context length of O(S) to produce identical results. While <a href=\"https:\/\/arxiv.org\/pdf\/2310.07923\">previous work<\/a> established that traditional CoT can make Transformers Turing complete, it demands O(T) context length with each token representing an intermediate computation step. This distinction between maximum context length becomes crucial because for most algorithms, space complexity S is substantially smaller than time complexity T, especially for harder problems.<\/p>\n<p class=\"wp-block-paragraph\">Consider NP-Complete problems like Traveling Salesman or Hamiltonian Circuit, which are widely believed to require exponential time but solvable in polynomial space. Traditional CoT cannot solve these within polynomial context length constraints, and requires at least exponential length that exceeds practical memory limitations of any real system. PENCIL, in contrast, can solve them using only polynomial maximum context length, making previously intractable reasoning tasks feasible.<\/p>\n<h3 class=\"wp-block-heading\">Proof Sketch<\/h3>\n<p class=\"wp-block-paragraph\">We now briefly introduce our proof idea, where the key insight is to have PENCIL use a series of \u201cSimulation-Summarization\u201d iterations to clean the memory.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"312\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/upload_a7c8d1aec88775838c65371db6c01f21-1024x312.png?resize=1024%2C312&#038;ssl=1\" alt=\"\" class=\"wp-image-603843\"><figcaption class=\"wp-element-caption\">PENCIL simulates Turing machine iteratively using two phases: simulating computation steps from the previous state, and summarizing into the new state using the reduction rule.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\"><strong>Step 1: Using CoT to Encode Turing Machine Transitions\u00a0 <\/strong>As illustrated in the left part of the figure above, we encode each Turing machine state transition as a token encoding \u201cnew state\u201d, \u201cwritten symbol\u201d, and \u201chead movement direction\u201d triplet in the embedding. The model can use self-attention to calculate the current head position and determine the symbol at this position. Without reduction, this process generates T tokens with context length O(T).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Step 2: Alternating \u201cSimulation-Summarization\u201d\u00a0 <\/strong>PENCIL achieves space\/time optimality through alternating:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Simulation<\/strong>: Continuously generate Turing machine state transition tokens, simulating multiple computation steps;<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Summarization<\/strong>: When new tokens exceed twice the space needed, summarize the computation using S tokens. The reduction rule then discards previous thoughts, keeping only the latest Turing machine state for the next round.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">This strategy maintains total token generation at O(T) while limiting context length to O(S).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Step 3: Transformer Implementation <\/strong>To prove this process can be implemented by Transformers, we developed the Full-Access Sequence Processing (FASP) programming language and proved that any algorithm written in FASP can be implemented by a fixed-sized Transformer. In a FASP program, each variable corresponds to a Transformer sub-module, and each line of code transforms existing variables to a new variable through predefined functions, which is equivalent to constructing a more complex Transformer based on sub-modules. The variable returned by the program is the desired Transformer that encodes the algorithm. We wrote a FASP program that implements the \u201cSimulation-Summarization\u201d operation, which implies there exists a constant-sized Transformer that can perform the same function<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\">\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p class=\"wp-block-paragraph\">In conclusion, we propose a new reasoning paradigm PENCIL, which alternates between generation and erasure, and enables models to think deeper to solve more complicated problems. Theoretically, we prove that PENCIL achieves Turing completeness with optimal time and space efficiency and thus can efficiently solve any computable problems. Looking forward, a promising direction would be to fine-tune LLMs to incorporate PENCIL\u2019s memory-efficient reasoning capabilities. We hope these findings will inspire reexamining current reasoning models from the perspective of theory of computation.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" height=\"189\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/05\/unnamed-6-1024x189.png?resize=1024%2C189&#038;ssl=1\" alt=\"\" class=\"wp-image-603872\"><\/figure>\n<p class=\"wp-block-paragraph\">\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/empowering-llms-to-think-deeper-by-erasing-thoughts\/\">Empowering LLMs to Think Deeper by Erasing Thoughts<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Chenxiao Yang<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/towardsdatascience.com\/empowering-llms-to-think-deeper-by-erasing-thoughts\/\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Empowering LLMs to Think Deeper by Erasing Thoughts Introduction Recent large language models (LLMs) \u2014 such as OpenAI\u2019s o1\/o3, DeepSeek\u2019s R1 and Anthropic\u2019s Claude 3.7 \u2014 demonstrate that allowing the model to think deeper and longer at test time can significantly enhance model\u2019s reasoning capability. The core approach underlying their deep thinking capability is called [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,1030,67,71,87,70],"tags":[103,1400,2655],"class_list":["post-3770","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-chain-of-thought","category-deep-dives","category-large-language-models","category-llm","category-machine-learning","tag-model","tag-reasoning","tag-thoughts"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3770"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=3770"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/3770\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=3770"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=3770"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=3770"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}