{"id":1952,"date":"2025-02-20T07:02:49","date_gmt":"2025-02-20T07:02:49","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/20\/formulation-of-feature-circuits-with-sparse-autoencoders-in-llm\/"},"modified":"2025-02-20T07:02:49","modified_gmt":"2025-02-20T07:02:49","slug":"formulation-of-feature-circuits-with-sparse-autoencoders-in-llm","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/20\/formulation-of-feature-circuits-with-sparse-autoencoders-in-llm\/","title":{"rendered":"Formulation of Feature Circuits with Sparse Autoencoders in LLM"},"content":{"rendered":"<div>\n<p class=\"wp-block-paragraph\">Large language models (LLMs) have made impressive progress and can perform a variety of tasks, from generating human-like text to answering questions. However, understanding how these models work remains challenging, especially due to a phenomenon called superposition, where multiple features are mixed into one neuron, making it very difficult to extract human-understandable representations from the original model structure. This is where methods like the sparse autoencoder come in: they disentangle the features for interpretability.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">In this blog post, we will use a sparse autoencoder to find feature circuits for one particularly interesting case, subject-verb agreement, and understand how the model components contribute to the task.<\/p>\n<h2 class=\"wp-block-heading\">Key concepts\u00a0<\/h2>\n<h3 class=\"wp-block-heading\">Feature circuits\u00a0<\/h3>\n<p class=\"wp-block-paragraph\">In the context of neural networks, <strong>feature circuits<\/strong> are how networks learn to combine input features to form complex patterns at higher levels. 
We use the metaphor of \u201ccircuits\u201d to describe how features are processed across layers in a neural network because such processes remind us of electronic circuits that process and combine signals.<\/p>\n<p class=\"wp-block-paragraph\">These feature circuits form gradually through the connections between neurons and layers, where each neuron or layer is responsible for transforming input features, and their interactions lead to useful feature combinations that work together to make the final predictions.<\/p>\n<p class=\"wp-block-paragraph\">Here is one example of feature circuits: in many vision neural networks, we can find \u201ca circuit as a family of units detecting curves in different angular orientations. Curve detectors are primarily implemented from earlier, less sophisticated curve detectors and line detectors. These curve detectors are used in the next layer to create 3D geometry and complex shape detectors\u201d [1].\u00a0<\/p>\n<p class=\"wp-block-paragraph\">In the following sections, we will work on one feature circuit in LLMs for a subject-verb agreement task.\u00a0<\/p>\n<h3 class=\"wp-block-heading\">Superposition and Sparse Autoencoder\u00a0<\/h3>\n<p class=\"wp-block-paragraph\">In the context of <a href=\"https:\/\/towardsdatascience.com\/tag\/machine-learning\/\" title=\"Machine Learning\">Machine Learning<\/a>, superposition refers to the phenomenon in which one neuron in a model represents multiple overlapping features rather than a single, distinct one. For example, InceptionV1 contains one neuron that responds to cat faces, fronts of cars, and cat legs.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">This is where the <a href=\"https:\/\/towardsdatascience.com\/tag\/sparse-autoencoder\/\" title=\"Sparse Autoencoder\">Sparse Autoencoder<\/a> (SAE) comes in.<\/p>\n<p class=\"wp-block-paragraph\">The SAE helps us disentangle the network\u2019s activations into a set of sparse features. 
These sparse features are normally human-understandable, allowing us to get a better understanding of the model. By applying an SAE to the hidden-layer activations of an LLM, we can isolate the features that contribute to the model\u2019s output.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">You can find the details of how the SAE works in my previous <a href=\"https:\/\/medium.com\/towards-data-science\/sparse-autoencoder-from-superposition-to-interpretable-features-4764bb37927d\">blog post<\/a>.\u00a0<\/p>\n<h2 class=\"wp-block-heading\">Case study: Subject-Verb Agreement<\/h2>\n<h3 class=\"wp-block-heading\">Subject-Verb Agreement\u00a0<\/h3>\n<p class=\"wp-block-paragraph\">Subject-verb agreement is a fundamental grammar rule in English. The subject and the verb in a sentence must agree in number, that is, both singular or both plural. For example:<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u201cThe cat <strong>runs<\/strong>.\u201d (Singular subject, singular verb)<\/li>\n<li class=\"wp-block-list-item\">\u201cThe cats <strong>run<\/strong>.\u201d (Plural subject, plural verb)<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">This rule, simple for humans, is important for tasks like text generation, translation, and question answering. But how do we know if an LLM has actually learned this rule?\u00a0<\/p>\n<p class=\"wp-block-paragraph\">We will now explore how the LLM forms a feature circuit for such a task.\u00a0<\/p>\n<h3 class=\"wp-block-heading\">Building the Feature Circuit<\/h3>\n<p class=\"wp-block-paragraph\">Let\u2019s now walk through the process of creating the feature circuit. We do it in 4 steps:<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We start by inputting sentences into the model. 
For this case study, we consider sentences like:\u00a0<\/li>\n<\/ol>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\u201cThe cat runs.\u201d (singular subject)<\/li>\n<li class=\"wp-block-list-item\">\u201cThe cats run.\u201d (plural subject)<\/li>\n<\/ul>\n<ol start=\"2\" class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We run the model on these sentences to get hidden activations. These activations represent how the model processes the sentences at each layer.<\/li>\n<li class=\"wp-block-list-item\">We pass the activations to an SAE to \u201cdecompress\u201d the features.\u00a0<\/li>\n<li class=\"wp-block-list-item\">We construct a feature circuit as a computational graph:\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The input nodes represent the singular and plural sentences.<\/li>\n<li class=\"wp-block-list-item\">The hidden nodes represent the model layers that process the input.\u00a0<\/li>\n<li class=\"wp-block-list-item\">The sparse nodes represent the features obtained from the SAE.<\/li>\n<li class=\"wp-block-list-item\">The output node represents the final decision. 
In this case: runs or run.\u00a0<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n<h3 class=\"wp-block-heading\">Toy Model\u00a0<\/h3>\n<p class=\"wp-block-paragraph\">We start by building a toy language model with the following code. It is deliberately simplistic and may carry no real linguistic meaning.\u00a0This is a network with two simple layers.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">For the subject-verb agreement, the model is supposed to:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Take as input a sentence with either a singular or a plural subject.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Transform that information into an abstract representation in the hidden layer.\u00a0<\/li>\n<li class=\"wp-block-list-item\">Select the correct verb form as output.<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch.nn as nn\n\n\n# ====== Define Base Model (Simulating Subject-Verb Agreement) ======\nclass SubjectVerbAgreementNN(nn.Module):\n   def __init__(self):\n       super().__init__()\n       self.hidden = nn.Linear(2, 4)  # 2 input \u2192 4 hidden activations\n       self.output = nn.Linear(4, 2)  # 4 hidden \u2192 2 output (runs\/run)\n       self.relu = nn.ReLU()\n\n\n   def forward(self, x):\n       x = self.relu(self.hidden(x))  # Compute hidden activations\n       return self.output(x)  # Predict verb<\/code><\/pre>\n<p class=\"wp-block-paragraph\">It is unclear what happens inside the hidden layer. 
So we introduce the following sparse autoencoder:\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># ====== Define Sparse Autoencoder (SAE) ======\nclass SparseAutoencoder(nn.Module):\n   def __init__(self, input_dim, hidden_dim):\n       super().__init__()\n       self.encoder = nn.Linear(input_dim, hidden_dim)  # Decompress to sparse features\n       self.decoder = nn.Linear(hidden_dim, input_dim)  # Reconstruct\n       self.relu = nn.ReLU()\n\n\n   def forward(self, x):\n       encoded = self.relu(self.encoder(x))  # Sparse activations\n       decoded = self.decoder(encoded)  # Reconstruct original activations\n       return encoded, decoded<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We train the original model <code>SubjectVerbAgreementNN<\/code> and the <code>SparseAutoencoder<\/code> with sentences designed to represent different singular and plural forms of verbs, such as \u201cThe cat runs\u201d, \u201cthe babies run\u201d. However, as before, for the toy model these inputs may not carry actual meaning.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Now we visualise the feature circuit. As introduced before, a feature circuit is a unit of neurons for processing specific features. 
In our model, the feature circuit consists of:\u00a0<\/p>\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The <strong>hidden layer<\/strong>, which transforms language properties into an abstract representation.<\/li>\n<li class=\"wp-block-list-item\">The <strong>SAE<\/strong> with <strong>independent features<\/strong> that contribute directly to the subject-verb agreement task.\u00a0<\/li>\n<\/ol>\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" data-dominant-color=\"f1f7fa\" data-has-transparency=\"true\" style=\"--dominant-color: #f1f7fa;\" loading=\"lazy\" decoding=\"async\" width=\"894\" height=\"549\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.53.18%25E2%2580%25AFPM.png?resize=894%2C549&#038;ssl=1\" alt=\"\" class=\"wp-image-598145 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.53.18\u202fPM.png 894w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.53.18\u202fPM-300x184.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.53.18\u202fPM-768x472.png 768w\" sizes=\"auto, (max-width: 894px) 100vw, 894px\"><figcaption class=\"wp-element-caption\">Trained Feature Circuit: Singular vs. Plural (Dog\/Dogs)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The plot visualizes the feature circuit as a graph:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Hidden activations and the encoder\u2019s outputs are all nodes of the graph.<\/li>\n<li class=\"wp-block-list-item\">The output nodes represent the correct verb form.<\/li>\n<li class=\"wp-block-list-item\">Edges in the graph are weighted by activation strength, showing which pathways are most important in the subject-verb agreement decision. 
For example, you can see that the path from H3 to F2 plays an important role.\u00a0<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\">GPT2-Small\u00a0<\/h3>\n<p class=\"wp-block-paragraph\">For a real case, we run similar code on GPT2-small. We show the graph of a feature circuit representing the decision to choose the singular verb. <\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"c9dfec\" data-has-transparency=\"true\" style=\"--dominant-color: #c9dfec;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"121\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.54.48%25E2%2580%25AFPM-1024x121.png?resize=1024%2C121&#038;ssl=1\" alt=\"\" class=\"wp-image-598146 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.54.48\u202fPM-1024x121.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.54.48\u202fPM-300x35.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.54.48\u202fPM-768x91.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-19-at-12.54.48\u202fPM.png 1492w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Feature Circuit for Subject-Verb agreement (run\/runs). For code details and a larger version of the above, please refer to my <a href=\"https:\/\/colab.research.google.com\/drive\/1p50X-tTnUSA6wTxePkiiq_oTjhJftgFb#scrollTo=22HvXjpsKpw4\">notebook<\/a>. <\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Conclusion\u00a0<\/h2>\n<p class=\"wp-block-paragraph\">Feature circuits help us to understand how different parts in a complex LLM lead to a final output. 
We show that an SAE can be used to form a feature circuit for a subject-verb agreement task.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">However, we have to admit this method still needs some human intervention, in the sense that we don\u2019t always know if a circuit can really form without a proper design.<\/p>\n<h2 class=\"wp-block-heading\">Reference\u00a0<\/h2>\n<p class=\"wp-block-paragraph\">[1] Zoom In: <a href=\"https:\/\/distill.pub\/2020\/circuits\/zoom-in\/\">An Introduction to Circuits<\/a><\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/formulation-of-feature-circuits-with-sparse-autoencoders-in-llm\/\">Formulation of Feature Circuits with Sparse Autoencoders in LLM<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Shuyang<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/formulation-of-feature-circuits-with-sparse-autoencoders-in-llm\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Formulation of Feature Circuits with Sparse Autoencoders in LLM Large Language models (LLMs) have witnessed impressive progress and these large models can do a variety of tasks, from generating human-like text to answering questions. 
However, understanding how these models work still remains challenging, especially due to a phenomenon called superposition where features are mixed into one [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,286,71,70,673,1801,1006],"tags":[1802,321,117],"class_list":["post-1952","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-explainable-ai","category-large-language-models","category-machine-learning","category-neural-networks","category-sparse-autoencoder","category-superposition","tag-circuits","tag-feature","tag-features"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1952"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=1952"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/1952\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=1952"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=1952"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=1952"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}