{"id":427,"date":"2024-12-07T07:01:14","date_gmt":"2024-12-07T07:01:14","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2024\/12\/07\/combining-large-and-small-llms-for-inference-time-and-quality-boosts-1779b6b5100b\/"},"modified":"2024-12-07T07:01:14","modified_gmt":"2024-12-07T07:01:14","slug":"combining-large-and-small-llms-for-inference-time-and-quality-boosts-1779b6b5100b","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2024\/12\/07\/combining-large-and-small-llms-for-inference-time-and-quality-boosts-1779b6b5100b\/","title":{"rendered":"Combining Large and Small LLMs to Boost Inference Time and Quality"},"content":{"rendered":"<div>\n<h4>Implementing Speculative and Contrastive Decoding<\/h4>\n<p>Large language models comprise billions of parameters (weights). For each token it generates, a model has to perform computationally expensive calculations across all of these parameters.<\/p>\n<p>Large language models accept a sentence, or sequence of tokens, and generate a probability distribution over the next\u00a0token.<\/p>\n<p>Thus, decoding <strong>n<\/strong> tokens (or generating <strong>n <\/strong>words from the model) typically requires running the model <strong>n <\/strong>times. At each iteration, the new token is appended to the input sentence and passed to the model again. This can be\u00a0costly.<\/p>\n<p>Additionally, the decoding strategy can influence the quality of the generated text. Generating tokens greedily, by always taking the token with the highest probability in the output distribution, can result in repetitive text. 
Random sampling from the distribution can result in unintended drift.<\/p>\n<p>Thus, a solid decoding strategy is required to ensure\u00a0both:<\/p>\n<ul>\n<li>High Quality\u00a0Outputs<\/li>\n<li>Fast Inference Time<\/li>\n<\/ul>\n<p><strong>Both requirements can be addressed by using a combination of a large and a small language model, as long as the amateur and expert models are similar (e.g., same architecture but different sizes).<\/strong><\/p>\n<ul>\n<li>\n<strong>Target\/Large (Expert) Model: <\/strong>Main LM with a larger number of parameters (e.g.\u00a0OPT-13B)<\/li>\n<li>\n<strong>Amateur\/Small Model:<\/strong> Smaller version of the main LM with fewer parameters (e.g. OPT-125M)<\/li>\n<\/ul>\n<p><strong>Speculative<\/strong> and <strong>contrastive<\/strong> decoding leverage large and small LLMs to achieve reliable and efficient text generation.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AsYU-r355eE8LL8ug8tngmQ.png?ssl=1\"><\/figure>\n<h3>Contrastive Decoding for High Quality Inference<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2210.15097\">Contrastive Decoding<\/a> is a strategy that exploits the fact that failures in large LLMs (such as repetition and incoherence) are even more pronounced in small LLMs. 
Thus, this strategy favors the tokens with the largest difference in log probability between the large and the small\u00a0model.<\/p>\n<p>For a single prediction, contrastive decoding generates two probability distributions:<\/p>\n<ul>\n<li>\n<em>q = <\/em>token probabilities under the amateur\u00a0model<\/li>\n<li>\n<em>p = <\/em>token probabilities under the expert\u00a0model<\/li>\n<\/ul>\n<p>The next token is chosen based on the following criteria:<\/p>\n<ul>\n<li>Discard all tokens that do not have sufficiently high probability under the expert model (discard <em>p(x) &lt; alpha *\u00a0max(p)<\/em>)<\/li>\n<li>From the remaining tokens, select the one with the largest difference between large model and small model log probabilities, <em>max(log p(x) &#8211;\u00a0log q(x)).<\/em>\n<\/li>\n<\/ul>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AhkZ4H_FMF0VhW5y-pUFX6g.png?ssl=1\"><\/figure>\n<h4>Implementing Contrastive Decoding<\/h4>\n<pre>from transformers import AutoTokenizer, AutoModelForCausalLM<br>import torch<br><br># Load models and tokenizer<br>tokenizer = AutoTokenizer.from_pretrained('gpt2')<br>amateur_lm = AutoModelForCausalLM.from_pretrained('gpt2')<br>expert_lm = AutoModelForCausalLM.from_pretrained('gpt2-large')<br><br>def contrastive_decoding(prompt, max_length=50):<br>    input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids<br><br>    while input_ids.shape[1] &lt; max_length:<br><br>        # Generate amateur model output<br>        # (despite the name, amateur_logits holds softmax probabilities)<br>        amateur_outputs = amateur_lm(input_ids, return_dict=True)<br>        amateur_logits = torch.softmax(amateur_outputs.logits[:, -1, :], dim=-1)<br>        log_probs_amateur = torch.log(amateur_logits)<br><br>        # Generate expert model output<br>        # (expert_logits likewise holds softmax probabilities)<br>        expert_outputs = expert_lm(input_ids, return_dict=True)<br>        expert_logits = torch.softmax(expert_outputs.logits[:, -1, :], dim=-1)<br>        log_probs_exp = 
torch.log(expert_logits)<br><br>        log_probs_diff = log_probs_exp - log_probs_amateur<br><br>        # Set an alpha threshold to eliminate less confident tokens in expert<br>        alpha = 0.1<br>        candidate_exp_prob = torch.max(expert_logits)<br><br>        # Mask tokens below threshold for expert model (True marks an excluded token)<br>        V_head = expert_logits &lt; alpha * candidate_exp_prob<br><br>        # Select the next token from the log-probabilities difference, ignoring masked values<br>        token = torch.argmax(log_probs_diff.masked_fill(V_head, float(\"-inf\"))).unsqueeze(0)<br><br>        # Append token and accumulate generated text<br>        input_ids = torch.cat([input_ids, token.unsqueeze(1)], dim=-1)<br><br>    return tokenizer.batch_decode(input_ids)<br><br>prompt = \"Large Language Models are\"<br>generated_text = contrastive_decoding(prompt, max_length=25)<br>print(generated_text)<\/pre>\n<h3>Speculative Decoding For Fast Inference<\/h3>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2211.17192\">Speculative decoding<\/a> is based on the principle that the generated text should follow the same distribution as if it had been sampled from the larger model alone. Thus, this strategy aims to accept as many predictions from the smaller model as possible, provided they align with the distribution of the larger\u00a0model.<\/p>\n<p>The smaller model generates <strong>n<\/strong> tokens in sequence, as possible guesses. 
However, all <strong>n<\/strong> guesses are then checked by the larger expert model in a single call, which is faster than sequential generation.<\/p>\n<p>This results in a cache for each model, with <strong>n <\/strong>probability distributions in each\u00a0cache.<\/p>\n<ul>\n<li>\n<em>q = <\/em>token probabilities under the amateur\u00a0model<\/li>\n<li>\n<em>p = <\/em>token probabilities under the expert\u00a0model<\/li>\n<\/ul>\n<p>Next, the sampled tokens from the amateur model are accepted or rejected based on the following conditions:<\/p>\n<ul>\n<li>If the probability of the token is at least as high in the expert distribution (p) as in the amateur distribution (q), or <em>p(x) &gt;= q(x), <\/em>accept the\u00a0token<\/li>\n<li>If the probability of the token is lower in the expert distribution (p) than in the amateur distribution (q), or <em>p(x) &lt; q(x)<\/em>, reject the token with probability <em>1 &#8211; p(x) \/\u00a0q(x)<\/em>\n<\/li>\n<\/ul>\n<p>If a token is rejected, the next token is sampled from an adjusted distribution proportional to <em>max(0, p(x) &#8211; q(x))<\/em>; if every guess is accepted, one extra token is sampled directly from the expert distribution. 
Additionally, the amateur and expert model reset the cache and re-generate <strong>n <\/strong>guesses and probability distributions <em>p<\/em> and\u00a0<em>q<\/em>.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AQxlaFBJg4mEtmZfA33RBqA.png?ssl=1\"><figcaption>Here, the blue signifies accepted tokens, and red\/green signify tokens rejected and then sampled from the expert or adjusted distribution.<\/figcaption><\/figure>\n<h4>Implementing Speculative Decoding<\/h4>\n<pre>from transformers import AutoTokenizer, AutoModelForCausalLM<br>import torch<br><br># Load models and tokenizer<br>tokenizer = AutoTokenizer.from_pretrained('gpt2')<br>amateur_lm = AutoModelForCausalLM.from_pretrained('gpt2')<br>expert_lm = AutoModelForCausalLM.from_pretrained('gpt2-large')<br><br># Sample next token from output distribution<br>def sample_from_distribution(logits):<br>    sampled_index = torch.multinomial(logits, 1)<br>    return sampled_index<br><br>def generate_cache(input_ids, n_tokens):<br>    # Store logits at each step for amateur and expert models<br>    amateur_logits_per_step = []<br>    generated_tokens = []<br><br>    batch_input_ids = []<br><br>    with torch.no_grad():<br>        for _ in range(n_tokens):<br>            # Generate amateur model output<br>            amateur_outputs = amateur_lm(input_ids, return_dict=True)<br>            amateur_logits = torch.softmax(amateur_outputs.logits[:, -1, :], dim=-1)<br>            amateur_logits_per_step.append(amateur_logits)<br><br>            # Sampling from amateur logits<br>            next_token = sample_from_distribution(amateur_logits)<br>            generated_tokens.append(next_token)<br><br>            # Append to input_ids for next generation step<br>            input_ids = torch.cat([input_ids, next_token], dim=-1)<br>            batch_input_ids.append(input_ids.squeeze(0))<br><br>    # Feed IDs to expert model as batch <br>  
  # A single forward pass over the full drafted sequence scores every guess:<br>    # output position i predicts token i + 1, so the last n_tokens + 1 positions<br>    # hold the expert distribution for each guess plus one bonus token.<br>    with torch.no_grad():<br>        expert_outputs = expert_lm(batch_input_ids[-1].unsqueeze(0), return_dict=True)<br>    expert_logits = torch.softmax(expert_outputs.logits[0, -(n_tokens + 1):, :], dim=-1)<br><br>    return amateur_logits_per_step, expert_logits, torch.cat(generated_tokens, dim=-1)<br><br>def speculative_decoding(prompt, n_tokens=5, max_length=50):<br>    input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids<br><br>    while input_ids.shape[1] &lt; max_length:<br>        amateur_logits_per_step, expert_logits, generated_ids = generate_cache(<br>            input_ids, n_tokens<br>        )<br><br>        accepted = 0<br>        for n in range(n_tokens):<br>            token = generated_ids[:, n][0]<br>            r = torch.rand(1).item()<br><br>            # Extract probabilities<br>            p_x = expert_logits[n][token].item()<br>            q_x = amateur_logits_per_step[n][0][token].item()<br><br>            # Speculative decoding acceptance criterion:<br>            # when q(x) &gt; p(x), reject with probability 1 - p(x) \/ q(x)<br>            if ((q_x &gt; p_x) and (r &lt; (1 - p_x \/ q_x))):<br>                break  # Reject token and restart the loop<br>            else:<br>                accepted += 1<br>                <br>            # Check length, keeping the tokens accepted so far<br>            if (input_ids.shape[1] + accepted) &gt;= max_length:<br>                return tokenizer.batch_decode(torch.cat([input_ids, generated_ids[:, :accepted]], dim=-1))<br><br>        input_ids = torch.cat([input_ids, generated_ids[:, :accepted]], dim=-1)<br><br>        if accepted &lt; n_tokens:<br>            diff = expert_logits[accepted] - amateur_logits_per_step[accepted][0]<br>            clipped_diff = torch.clamp(diff, min=0) <br><br>            # Sample a token from the adjusted expert distribution<br>            normalized_result = clipped_diff \/ torch.sum(clipped_diff, dim=0, keepdim=True)<br>            next_token = sample_from_distribution(normalized_result)<br>            input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], 
dim=-1)<br>        else:<br>            # Sample directly from the expert logits for the last accepted token<br>            next_token = sample_from_distribution(expert_logits[-1])<br>            input_ids = torch.cat([input_ids, next_token.unsqueeze(1)], dim=-1)<br><br>    return tokenizer.batch_decode(input_ids)<br><br># Example usage<br>prompt = \"Large Language models are\"<br>generated_text = speculative_decoding(prompt, n_tokens=3, max_length=25)<br>print(generated_text)<\/pre>\n<h4>Evaluation<\/h4>\n<p>We can evaluate both decoding approaches by comparing them to a naive decoding method, where we randomly pick the next token from the probability distribution.<\/p>\n<pre>def sequential_sampling(prompt, max_length=50):<br>    \"\"\"<br>    Perform sequential sampling with the given model.<br>    \"\"\"<br>    # Tokenize the input prompt<br>    input_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids<br><br>    with torch.no_grad():<br>          while input_ids.shape[1] &lt; max_length:<br>            # Sample from the model output logits for the last token<br>            outputs = expert_lm(input_ids, return_dict=True)<br>            logits = outputs.logits[:, -1, :]<br><br>            probabilities = torch.softmax(logits, dim=-1)<br>            next_token = torch.multinomial(probabilities, num_samples=1)<br>            input_ids = torch.cat([input_ids, next_token], dim=-1)<br><br>    return tokenizer.batch_decode(input_ids)<\/pre>\n<p>To evaluate contrastive decoding, we can use the following metrics for lexical richness.<\/p>\n<ul>\n<li>\n<strong>n-gram Entropy<\/strong>: Measures the unpredictability or diversity of n-grams in the generated text. High entropy indicates more diverse text, while low entropy suggests repetition or predictability.<\/li>\n<li>\n<strong>distinct-n<\/strong>: Measures the proportion of unique n-grams in the generated text. 
Higher distinct-n values indicate more lexical diversity.<\/li>\n<\/ul>\n<pre>from collections import Counter<br>import math<br><br>def ngram_entropy(text, n):<br>    \"\"\"<br>    Compute n-gram entropy for a given text.<br>    \"\"\"<br>    # Tokenize the text<br>    tokens = text.split()<br>    if len(tokens) &lt; n:<br>        return 0.0  # Not enough tokens to form n-grams<br>    <br>    # Create n-grams<br>    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]<br>    <br>    # Count frequencies of n-grams<br>    ngram_counts = Counter(ngrams)<br>    total_ngrams = sum(ngram_counts.values())<br>    <br>    # Compute entropy<br>    entropy = -sum((count \/ total_ngrams) * math.log2(count \/ total_ngrams)<br>                   for count in ngram_counts.values())<br>    return entropy<br><br>def distinct_n(text, n):<br>    \"\"\"<br>    Compute distinct-n metric for a given text.<br>    \"\"\"<br>    # Tokenize the text<br>    tokens = text.split()<br>    if len(tokens) &lt; n:<br>        return 0.0  # Not enough tokens to form n-grams<br>    <br>    # Create n-grams<br>    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]<br>    <br>    # Count unique and total n-grams<br>    unique_ngrams = set(ngrams)<br>    total_ngrams = len(ngrams)<br>    <br>    return len(unique_ngrams) \/ total_ngrams if total_ngrams &gt; 0 else 0.0<br><br>prompts = [<br>    \"Large Language models are\",<br>    \"Barack Obama was\",<br>    \"Decoding strategy is important because\",<br>    \"A good recipe for Halloween is\",<br>    \"Stanford is known for\"<br>]<br><br># Initialize accumulators for metrics<br>naive_entropy_totals = [0, 0, 0]  # For n=1, 2, 3<br>naive_distinct_totals = [0, 0]    # For n=1, 2<br>contrastive_entropy_totals = [0, 0, 0]<br>contrastive_distinct_totals = [0, 0]<br><br>for prompt in prompts:<br>    naive_generated_text = sequential_sampling(prompt, max_length=50)[0]<br><br>    for n in range(1, 4):<br>        
naive_entropy_totals[n - 1] += ngram_entropy(naive_generated_text, n)<br><br>    for n in range(1, 3):<br>        naive_distinct_totals[n - 1] += distinct_n(naive_generated_text, n)<br><br>    contrastive_generated_text = contrastive_decoding(prompt, max_length=50)[0]<br><br>    for n in range(1, 4):<br>        contrastive_entropy_totals[n - 1] += ngram_entropy(contrastive_generated_text, n)<br><br>    for n in range(1, 3):<br>        contrastive_distinct_totals[n - 1] += distinct_n(contrastive_generated_text, n)<br><br># Compute averages<br>naive_entropy_averages = [total \/ len(prompts) for total in naive_entropy_totals]<br>naive_distinct_averages = [total \/ len(prompts) for total in naive_distinct_totals]<br>contrastive_entropy_averages = [total \/ len(prompts) for total in contrastive_entropy_totals]<br>contrastive_distinct_averages = [total \/ len(prompts) for total in contrastive_distinct_totals]<br><br># Display results<br>print(\"Naive Sampling:\")<br>for n in range(1, 4):<br>    print(f\"Average Entropy (n={n}): {naive_entropy_averages[n - 1]}\")<br>for n in range(1, 3):<br>    print(f\"Average Distinct-{n}: {naive_distinct_averages[n - 1]}\")<br><br>print(\"\\nContrastive Decoding:\")<br>for n in range(1, 4):<br>    print(f\"Average Entropy (n={n}): {contrastive_entropy_averages[n - 1]}\")<br>for n in range(1, 3):<br>    print(f\"Average Distinct-{n}: {contrastive_distinct_averages[n - 1]}\")<\/pre>\n<p>The following results show that, on this small set of prompts, contrastive decoding scores slightly higher than naive sampling on each of these\u00a0metrics.<\/p>\n<blockquote><p>\n<strong>Naive Sampling:<\/strong><br \/>Average Entropy (n=1): 4.990499826537679<br \/>Average Entropy (n=2): 5.174765791328267<br \/>Average Entropy (n=3): 5.14373124004409<br \/>Average Distinct-1: 0.8949694135740648<br \/>Average Distinct-2: 0.9951219512195122<\/p><\/blockquote>\n<blockquote><p>\n<strong>Contrastive Decoding:<\/strong><br \/>Average Entropy (n=1): 5.182773920916605<br \/>Average Entropy (n=2): 
5.3495681172235665<br \/>Average Entropy (n=3): 5.313720275712986<br \/>Average Distinct-1: 0.9028425204970866<br \/>Average Distinct-2: 1.0<\/p><\/blockquote>\n<p>To evaluate speculative decoding, we can look at the average runtime for a set of prompts for different <strong>n<\/strong>\u00a0values.<\/p>\n<pre>import time<br>import matplotlib.pyplot as plt<br><br># Parameters<br>n_tokens = range(1, 11)<br>speculative_decoding_times = []<br>naive_decoding_times = []<br><br>prompts = [<br>    \"Large Language models are\",<br>    \"Barack Obama was\",<br>    \"Decoding strategy is important because\",<br>    \"A good recipe for Halloween is\",<br>    \"Stanford is known for\"<br>]<br><br># Loop through n_tokens values<br>for n in n_tokens:<br>    avg_time_naive, avg_time_speculative = 0, 0<br><br>    for prompt in prompts:<br>        start_time = time.time()<br>        _ = sequential_sampling(prompt, max_length=25)<br>        avg_time_naive += (time.time() - start_time)<br><br>        start_time = time.time()<br>        _ = speculative_decoding(prompt, n_tokens=n, max_length=25)<br>        avg_time_speculative += (time.time() - start_time)<br><br>    naive_decoding_times.append(avg_time_naive \/ len(prompts))<br>    speculative_decoding_times.append(avg_time_speculative \/ len(prompts))<br><br>avg_time_naive = sum(naive_decoding_times) \/ len(naive_decoding_times)<br><br># Plotting the results<br>plt.figure(figsize=(8, 6))<br>plt.bar(n_tokens, speculative_decoding_times, width=0.6, label='Speculative Decoding Time', alpha=0.7)<br>plt.axhline(y=avg_time_naive, color='red', linestyle='--', label='Naive Decoding Time')<br><br># Labels and title<br>plt.xlabel('n_tokens', fontsize=12)<br>plt.ylabel('Average Time (s)', fontsize=12)<br>plt.title('Speculative Decoding Runtime vs n_tokens', fontsize=14)<br>plt.legend()<br>plt.grid(axis='y', linestyle='--', alpha=0.7)<br><br># Save the plot, then show it (saving after plt.show() can produce an empty image)<br>plt.savefig(\"plot.png\")<br>plt.show()<\/pre>\n<p>We can see that the average 
runtime for the naive decoding is much higher than for speculative decoding across <strong>n<\/strong>\u00a0values.<\/p>\n<figure><img data-recalc-dims=\"1\" decoding=\"async\" alt=\"\" src=\"https:\/\/i0.wp.com\/cdn-images-1.medium.com\/max\/1024\/1%2AslaABX5O1pPoxtZ7rOKarA.png?ssl=1\"><\/figure>\n<p>Combining large and small language models for decoding strikes a balance between quality and efficiency. While these approaches introduce additional complexity in system design and resource management, their benefits apply to conversational AI, real-time translation, and content creation.<\/p>\n<p>These approaches require careful consideration of deployment constraints. For instance, the additional memory and compute demands of running dual models may limit feasibility on edge devices, though this can be mitigated through techniques like model quantization.<\/p>\n<p><strong>Unless otherwise noted, all images are by the\u00a0author.<\/strong><\/p>\n<hr>\n<p><a href=\"https:\/\/towardsdatascience.com\/combining-large-and-small-llms-for-inference-time-and-quality-boosts-1779b6b5100b\">Combining Large and Small LLMs to Boost Inference Time and Quality<\/a> was originally published in <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a> on Medium.<\/p>\n<\/div>\n<p>Richa Gadgil<br \/>\n<a href=\"https:\/\/medium.com\/m\/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fcombining-large-and-small-llms-for-inference-time-and-quality-boosts-1779b6b5100b\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Combining 
Large and Small LLMs to Boost Inference Time and Quality Implementing Speculative and Contrastive Decoding Large Language models are comprised of billions of parameters (weights). For each word it generates, the model has to perform computationally expensive calculations across all of these parameters. Large Language models accept a sentence, or sequence of tokens, and [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,513,70,514],"tags":[516,515,103],"class_list":["post-427","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-inference","category-machine-learning","category-machine-learning-systems","tag-decoding","tag-large","tag-model"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/427"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=427"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/427\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=427"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=427"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=427"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}