{"id":671,"date":"2024-12-19T07:01:23","date_gmt":"2024-12-19T07:01:23","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/19\/navigating-soft-actor-critic-reinforcement-learning-8e1a7406ce48\/"},"modified":"2024-12-19T07:01:23","modified_gmt":"2024-12-19T07:01:23","slug":"navigating-soft-actor-critic-reinforcement-learning-8e1a7406ce48","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/19\/navigating-soft-actor-critic-reinforcement-learning-8e1a7406ce48\/","title":{"rendered":"Navigating Soft Actor-Critic Reinforcement Learning"},"content":{"rendered":"<p>    Navigating Soft Actor-Critic Reinforcement Learning<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n    <!-- no image --><br \/>\n \t<BR><br \/>\n<BR><\/BR><\/p>\n<div>\n<h4>Understanding the theory and implementation of SAC RL in the context of Bioengineering<\/h4>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/905\/1%2AyfkhYMLsj6VZ_P-H2fPtFQ.png?ssl=1\"><figcaption>Image generated by the author using ChatGPT-4o<\/figcaption><\/figure>\n<h3>Introduction<\/h3>\n<p>The research domain of Reinforcement Learning (RL) has evolved greatly over the past years. The use of deep reinforcement learning methods such as Proximal Policy Optimisation (PPO) (Schulman, 2017) and Deep Deterministic Policy Gradient (DDPG) (Lillicrap, 2015) has enabled agents to solve tasks in high-dimensional environments. However, many of these model-free RL algorithms have struggled with stability during the training process. These challenges arise from brittle convergence properties, high variance in gradient estimation, very high sample complexity, and sensitivity to hyperparameters in continuous action spaces. Given these problems, it is worth considering a more recent RL algorithm that avoids such issues and extends applicability to complex, real-world problems. 
This algorithm is Soft Actor-Critic (SAC) (Haarnoja, 2018).<\/p>\n<figure><img decoding=\"async\" alt=\"\" src=\"https:\/\/cdn-images-1.medium.com\/max\/694\/0*KRk_7825CvmUF0uj\"><figcaption>Model Architecture of Soft Actor-Critic Networks. Image taken from <a href=\"https:\/\/arxiv.org\/abs\/2301.03220\">(Du,\u00a02023)<\/a><\/figcaption><\/figure>\n<p>SAC is an off-policy Actor-Critic deep RL algorithm designed to address the stability and efficiency limitations of its predecessors. It is based on the maximum entropy RL framework, in which the actor aims to maximise the expected reward while also maximising the entropy of its policy. It combines off-policy updates with a more stable formulation of the stochastic Actor-Critic method. Because it is off-policy, SAC can reuse past experience through experience replay, which speeds up learning and improves sample efficiency; on-policy methods such as PPO instead require new samples for each gradient step. Using a stochastic policy and maximising its entropy promotes robustness and exploration by encouraging more randomness in the actions. Additionally, unlike PPO and DDPG, SAC uses twin Q-networks with a separate Actor network and entropy tuning to improve stability and convergence when combining off-policy learning with high-dimensional, nonlinear function approximation.<\/p>\n<p>Off-policy RL methods have had a wide impact on bioengineering systems that improve patient lives. More specifically, RL has been applied to domains such as robotic arm control, drug delivery and, most notably, de novo drug design (Svensson, 2024). Svensson et al. used a number of on- and off-policy frameworks and different variants of replay buffers to learn an RNN-based molecule generation policy whose molecules are active against DRD2 (a dopamine receptor). 
The paper finds that applying experience replay to high-, intermediate- and low-scoring molecules improves both the structural diversity and the number of active molecules generated. Replay buffers also improve sample efficiency when training agents. The authors further report that off-policy methods, and SAC in particular, help promote structural diversity by preventing mode collapse.<\/p>\n<h3>Theoretical Explanation<\/h3>\n<p>SAC uses \u2018soft\u2019 value functions, obtained by augmenting the objective function with an entropy term, <strong>\u0397(\u03c0(a|s))<\/strong>. Accordingly, the network seeks to maximise both the expected return and the entropy of the policy. Entropy measures the unpredictability of a random variable; here, it measures how spread out the policy\u2019s action distribution is in a given state. Thus, the new entropy-regularised objective becomes:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/656\/1%2AcfQ1iGi6F7XmMXbRoBPCPQ.png?ssl=1\"><figcaption>Entropy Regularised Objective<\/figcaption><\/figure>\n<p><strong>\u03b1 <\/strong>is the temperature parameter. It sets the weight of the entropy term relative to the reward, and so balances exploration against exploitation.<\/p>\n<p>With soft value functions, maximising the entropy leads the policy to assign similar probabilities to actions that have similar Q-values. It also helps prevent the agent from repeatedly choosing actions that exploit inconsistencies in the approximated Q-values. This is how SAC reduces brittleness: the network explores more and avoids assigning very high probabilities to one narrow range of actions. 
This part is inspired by <a href=\"https:\/\/medium.com\/u\/61d2676ad14\">Vaishak V.Kumar<\/a>\u2019s explanation of entropy maximisation in \u201cSoft Actor-Critic Demystified\u201d.<\/p>\n<p>The SAC paper authors note that, in principle, there is no need for a separate function approximator for the state value, since it is related to the Q-function and the policy by the following equation. In practice, however, training three separate approximators (policy, Q-function and state value) made training more stable and provided better convergence.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/439\/1%2Azg8hqrq-3jq3phU8jjGALA.png?ssl=1\"><figcaption>Soft State Value\u00a0Function<\/figcaption><\/figure>\n<p>The three function approximator networks are characterised as\u00a0follows:<\/p>\n<ul>\n<li>\n<strong>Policy Network (Actor): <\/strong>the stochastic policy outputs actions sampled from a Gaussian distribution. The policy parameters are learned by minimising the Kullback-Leibler divergence given in this equation:<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/528\/1%2A5-1Ih1trFi-ZKVSabMeUoQ.png?ssl=1\"><figcaption>Minimising KL-Divergence<\/figcaption><\/figure>\n<p>The KL divergence, also called relative entropy, measures the difference between two probability distributions. So, in the equation, we minimise the difference between the policy distribution and the exponentiated Q-function normalised by a function Z. 
Since the target density is defined by the Q-function, which is differentiable, we apply the reparametrisation trick to the policy to reduce the variance of the gradient estimate.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/147\/1%2AuzEqV8Xxguc0Pai9BC5ywg.png?ssl=1\"><figcaption>Reparametrised Policy<\/figcaption><\/figure>\n<p>\u03f5\u209c is a noise vector sampled from a Gaussian\u00a0distribution.<\/p>\n<p>The policy objective then becomes the following expression:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/635\/1%2A-OP-5Uq0l_yUNaalRUlnRw.png?ssl=1\"><figcaption>Policy Objective<\/figcaption><\/figure>\n<p>The policy objective is optimised using the following gradient estimate:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/548\/1%2A-X1VYmDQeGXtHSBqiEkr5g.png?ssl=1\"><figcaption>Policy Gradient Estimator<\/figcaption><\/figure>\n<ul>\n<li>\n<strong>Q-Network (Critic): <\/strong>consists of two Q-value networks that estimate the expected return of state-action pairs. The soft Q-function parameters are trained by minimising the soft Bellman residual given\u00a0here:<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/518\/1%2Are4umf-pJdNWHN2XMevU4w.png?ssl=1\"><figcaption>Soft Q-function Objective<\/figcaption><\/figure>\n<p>where:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/459\/1%2AVSzO-HTTGM5ZYgbn0RhZyw.png?ssl=1\"><figcaption>Immediate Q-value<\/figcaption><\/figure>\n<p>The soft Q-function objective minimises the squared difference between the network\u2019s Q-value estimate and the immediate Q-value. 
The immediate Q-value (Q hat) is the reward of the current state-action pair plus the discounted expectation of the target value function at the next time step. Finally, the objective is optimised using a stochastic gradient estimate given by the following:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/569\/1%2AUowPPIgsfR169BA0AbZLtg.png?ssl=1\"><figcaption>Stochastic Gradient Estimator<\/figcaption><\/figure>\n<p><strong>Target Value Network (Critic): <\/strong>a separate soft value function which helps stabilise the training process. The soft value function approximator minimises the squared residual error as\u00a0follows:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/695\/1%2ADMu1dQLXR4DXD4vEMR7Y1g.png?ssl=1\"><figcaption>Soft Value Function Objective<\/figcaption><\/figure>\n<p>This soft value function objective minimises the squared difference between the value function and the expectation of the Q-value plus the entropy of the policy function <strong>\u03c0<\/strong>. The negative log-probability term in this objective represents the entropy of the policy. Information entropy carries a negative sign so that it is positive, since the log of a probability value (between 0 and 1) is negative. 
Similarly, the objective is optimised using an unbiased gradient estimator, given in the following expression:<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/566\/1%2AwC98khX7wHqu4a3gdCq0Qg.png?ssl=1\"><figcaption>Unbiased Gradient Estimator<\/figcaption><\/figure>\n<h3>Code Implementation<\/h3>\n<p>The code implemented in this article is taken from the following GitHub repository (quantumiracle, 2023):<\/p>\n<p><a href=\"https:\/\/github.com\/quantumiracle\/Popular-RL-Algorithms\">GitHub &#8211; quantumiracle\/Popular-RL-Algorithms: PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC\/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet..<\/a><\/p>\n<pre>pip install gymnasium torch<\/pre>\n<p>SAC is designed for environments with continuous action spaces, so the simulations here mainly use the robotic arm \u2018Reacher\u2019 environment, along with the Pendulum-v1 environment from the gymnasium package.<\/p>\n<p>The Pendulum environment was run using a different repository, which implements the same algorithm with fewer deprecated libraries (MrSyee,\u00a02020):<\/p>\n<p><a href=\"https:\/\/github.com\/MrSyee\/pg-is-all-you-need?tab=readme-ov-file\">GitHub &#8211; MrSyee\/pg-is-all-you-need: Policy Gradient is all you need! 
A step-by-step tutorial for well-known PG methods.<\/a><\/p>\n<p>In terms of the network architectures, as described in the <em>Theoretical Explanation<\/em>, there are three main components:<\/p>\n<p><strong>Policy Network:<\/strong> implements a Gaussian Actor network that computes the mean and log standard deviation of the action distribution.<\/p>\n<pre>import torch<br>import torch.nn as nn<br>import torch.nn.functional as F<br><br>class PolicyNetwork(nn.Module):<br>    def __init__(self, state_dim, action_dim, hidden_dim):<br>        super(PolicyNetwork, self).__init__()<br>        self.fc1 = nn.Linear(state_dim, hidden_dim)<br>        self.fc2 = nn.Linear(hidden_dim, hidden_dim)<br>        self.mean = nn.Linear(hidden_dim, action_dim)<br>        self.log_std = nn.Linear(hidden_dim, action_dim)<br><br>    def forward(self, state):<br>        x = F.relu(self.fc1(state))<br>        x = F.relu(self.fc2(x))<br>        mean = self.mean(x)<br>        log_std = torch.clamp(self.log_std(x), -20, 2)  # Limit log_std to prevent instability<br>        return mean, log_std<\/pre>\n<p><strong>Soft Q-Network: <\/strong>estimates the expected future return of a state-action pair under the current\u00a0policy.<\/p>\n<pre>class SoftQNetwork(nn.Module):<br>    def __init__(self, state_dim, action_dim, hidden_dim):<br>        super(SoftQNetwork, self).__init__()<br>        self.fc1 = nn.Linear(state_dim + action_dim, hidden_dim)<br>        self.fc2 = nn.Linear(hidden_dim, hidden_dim)<br>        self.out = nn.Linear(hidden_dim, 1)<br><br>    def forward(self, state, action):<br>        x = torch.cat([state, action], dim=-1)<br>        x = F.relu(self.fc1(x))<br>        x = F.relu(self.fc2(x))<br>        return self.out(x)<\/pre>\n<p><strong>Value Network: <\/strong>estimates the state\u00a0value.<\/p>\n<pre>class ValueNetwork(nn.Module):<br>    def __init__(self, state_dim, hidden_dim):<br>        super(ValueNetwork, self).__init__()<br>        self.fc1 = nn.Linear(state_dim, hidden_dim)<br>        self.fc2 = nn.Linear(hidden_dim, hidden_dim)<br>        
self.out = nn.Linear(hidden_dim, 1)<br><br>    def forward(self, state):<br>        x = F.relu(self.fc1(state))<br>        x = F.relu(self.fc2(x))<br>        return self.out(x)<\/pre>\n<p>The following snippet shows the key steps of a SAC update. It starts by sampling a batch from the replay buffer for experience replay. For each network, the loss is computed, the gradients are reset to zero so that gradients from previous batches do not accumulate, the loss is backpropagated, and the optimiser updates the weights. These steps are applied in turn to the two Q-networks, the value network and the policy network, followed by a soft update of the target value\u00a0network.<\/p>\n<pre>def update(batch_size, reward_scale, gamma=0.99, soft_tau=1e-2):<br>    # The networks, optimisers, replay_buffer, alpha and device are assumed to be defined outside this function<br>    # Sample a batch<br>    state, action, reward, next_state, done = replay_buffer.sample(batch_size)<br>    state, next_state, action, reward, done = map(lambda x: torch.FloatTensor(x).to(device), <br>                                                  [state, next_state, action, reward, done])<br><br>    # Update Q-networks<br>    target_value = target_value_net(next_state)<br>    target_q = reward + (1 - done) * gamma * target_value<br>    q1_loss = F.mse_loss(soft_q_net1(state, action), target_q.detach())<br>    q2_loss = F.mse_loss(soft_q_net2(state, action), target_q.detach())<br><br>    soft_q_optimizer1.zero_grad()<br>    q1_loss.backward()<br>    soft_q_optimizer1.step()<br><br>    soft_q_optimizer2.zero_grad()<br>    q2_loss.backward()<br>    soft_q_optimizer2.step()<br><br>    # Sample fresh actions from the current policy; the value and policy updates both use them<br>    new_action, log_prob, _, _, _ = policy_net.evaluate(state)<br>    predicted_new_q = torch.min(soft_q_net1(state, new_action), soft_q_net2(state, new_action))<br><br>    # Update Value Network (the target is detached so only value_net receives gradients)<br>    value_target = predicted_new_q - alpha * log_prob<br>    value_loss = F.mse_loss(value_net(state), value_target.detach())<br>    value_optimizer.zero_grad()<br>    value_loss.backward()<br>    value_optimizer.step()<br><br>    # Update Policy Network<br>    policy_loss = (alpha * log_prob - predicted_new_q).mean()<br>    policy_optimizer.zero_grad()<br>    policy_loss.backward()<br>    policy_optimizer.step()<br><br>    # Soft Update Target Network<br>    for target_param, param in zip(target_value_net.parameters(), value_net.parameters()):<br>        target_param.data.copy_(soft_tau * param.data + (1 - soft_tau) * target_param.data)<\/pre>\n<p>Finally, to train and then test the agent with the sac.py file, run the following commands:<\/p>\n<pre>python sac.py --train<br>python sac.py --test<\/pre>\n<h3>Results and Visualisation<\/h3>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AtcB7MHayhNcln57a1ABOJg.gif?ssl=1\"><figcaption>Training a \u2018Reacher\u2019 Robotic Arm, (generated by the\u00a0author)<\/figcaption><\/figure>\n<p>In training the SAC agent in both environments, I noticed that the action space of the problem affects the efficiency and performance of training. When I trained the agent on the simple pendulum environment, learning converged much faster and with smaller oscillations. 
The Reacher environment has a more complicated continuous action space; there, the algorithm trained relatively well, but the big jump in rewards was not as clear. The Reacher agent was also trained for four times as many episodes as the pendulum agent.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AIiNSqMC6_U6Ddpv1QLMAHw.png?ssl=1\"><figcaption>Learning Performance by Maximising Reward (generated by the\u00a0author)<\/figcaption><\/figure>\n<p>The action distribution below shows that the policy explores a diverse range of actions during training before it converges on a single policy. This increased exploration is the hallmark of entropy-regularised algorithms such as SAC. 
We can also see that the peaks correspond to action values with high expected rewards, which drive the policy toward more deterministic behaviour.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/806\/1%2AvDpqhd5qYW4oMQykpAE8uA.png?ssl=1\"><figcaption>Action Space Usage Distribution (generated by the\u00a0author)<\/figcaption><\/figure>\n<p>On the subject of more deterministic behaviour, the entropy decreases on average over the training episodes. This is expected, since the reason for maximising the entropy is to encourage exploration. Exploration matters most early in training, when the agent needs to try many state-action pairs to find those with higher\u00a0returns.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1000\/1%2AyHQg8NgM1lGxTdVm37C9Uw.png?ssl=1\"><figcaption>Entropy Valuation Over Training Episodes (generated by the\u00a0author)<\/figcaption><\/figure>\n<h3>Conclusion<\/h3>\n<p>The SAC algorithm is an off-policy RL framework that balances exploitation and exploration through an entropy term. Its objective function maximises both the expected return and the entropy during training, which addresses many of the issues that earlier frameworks suffer from. The use of twin Q-networks and automatic temperature tuning helps address high sample complexity, brittle convergence properties and complex hyperparameter tuning. SAC has proven to be highly effective in continuous control tasks. The results on action distribution and entropy show that the algorithm favours exploration and diverse action sampling in the early phases of training. As the agent trains, it converges to a more specific policy, which reduces the entropy and settles on high-reward actions. 
Consequently, it has been used effectively across a wide range of bioengineering domains, including robotic control, drug discovery and drug delivery. Future implementations should focus on scaling the framework to more complex tasks and reducing its computational cost.<\/p>\n<h3>References<\/h3>\n<p>Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015). Continuous control with deep reinforcement learning. [online] arXiv.org. Available at: <a href=\"https:\/\/arxiv.org\/abs\/1509.02971\">https:\/\/arxiv.org\/abs\/1509.02971<\/a>.<\/p>\n<p>Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal Policy Optimization Algorithms. [online] arXiv.org. Available at: <a href=\"https:\/\/arxiv.org\/abs\/1707.06347\">https:\/\/arxiv.org\/abs\/1707.06347<\/a>.<\/p>\n<p>Haarnoja, T., Zhou, A., Abbeel, P. and Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290 [cs, stat]. [online] Available at: <a href=\"https:\/\/arxiv.org\/abs\/1801.01290\">https:\/\/arxiv.org\/abs\/1801.01290<\/a>.<\/p>\n<p>Du, H., Li, Z., Niyato, D., Yu, R., Xiong, Z., Shen, X. and Kim, D.I. (2023). Enabling AI-Generated Content (AIGC) Services in Wireless Edge Networks. doi:https:\/\/doi.org\/10.48550\/arxiv.2301.03220.<\/p>\n<p>Svensson, H.G., Tyrchan, C., Engkvist, O. and Chehreghani, M.H. (2024). Utilizing reinforcement learning for de novo drug design. Machine Learning, 113(7), pp.4811\u20134843. doi:https:\/\/doi.org\/10.1007\/s10994-024-06519-w.<\/p>\n<p>quantumiracle (2019). <em>GitHub - quantumiracle\/Popular-RL-Algorithms: PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC\/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet..<\/em> [online] GitHub. 
Available at: <a href=\"https:\/\/github.com\/quantumiracle\/Popular-RL-Algorithms\">https:\/\/github.com\/quantumiracle\/Popular-RL-Algorithms<\/a> [Accessed 12 Dec.\u00a02024].<\/p>\n<p>MrSyee (2019). <em>GitHub - MrSyee\/pg-is-all-you-need: Policy Gradient is all you need! A step-by-step tutorial for well-known PG methods.<\/em> [online] GitHub. Available at: <a href=\"https:\/\/github.com\/MrSyee\/pg-is-all-you-need?tab=readme-ov-file\">https:\/\/github.com\/MrSyee\/pg-is-all-you-need?tab=readme-ov-file<\/a>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/863\/1%2AiWaQe8bKNtMrIwfvkKakJw.png?ssl=1\"><\/figure>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/navigating-soft-actor-critic-reinforcement-learning-8e1a7406ce48\">Navigating Soft Actor-Critic Reinforcement Learning<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium, where people are continuing the conversation by highlighting and responding to this story.<\/p>\n<\/div>\n<p> \t<BR><br \/>\n <BR><\/BR><br \/>\n    Mohammed AbuSadeh<br \/>\n \t<BR><br \/>\n<BR><\/BR><br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fnavigating-soft-actor-critic-reinforcement-learning-8e1a7406ce48\">Go to original source<\/a><br \/>\n \t<BR><br \/>\n <BR><\/BR><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Navigating Soft Actor-Critic Reinforcement Learning Understanding the theory and implementation of SAC RL in the context of Bioengineering Image generated by the author using ChatGPT-4o Introduction The research domain of Reinforcement Learning (RL) has evolved greatly over the past years. 
The use of deep reinforcement learning methods such as Proximal Policy Optimisation (PPO) (Schulman, 2017) [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,772,70,771,504,770],"tags":[774,773,775],"class_list":["post-671","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-bioengineering","category-machine-learning","category-optimisation","category-reinforcement-learning","category-soft-actor-critic","tag-actor","tag-policy","tag-rl"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/671"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=671"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/671\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}