{"id":2003,"date":"2025-02-22T07:00:55","date_gmt":"2025-02-22T07:00:55","guid":{"rendered":"https:\/\/mailitics.com\/index.php\/2025\/02\/22\/the-next-ai-revolution-a-tutorial-using-vaes-to-generate-high-quality-synthetic-data\/"},"modified":"2025-02-22T07:00:55","modified_gmt":"2025-02-22T07:00:55","slug":"the-next-ai-revolution-a-tutorial-using-vaes-to-generate-high-quality-synthetic-data","status":"publish","type":"post","link":"https:\/\/mailitics.com\/index.php\/2025\/02\/22\/the-next-ai-revolution-a-tutorial-using-vaes-to-generate-high-quality-synthetic-data\/","title":{"rendered":"The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data"},"content":{"rendered":"<p>The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data<\/p>\n<div>\n<h2 class=\"wp-block-heading\"><strong>What is synthetic data?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Synthetic data is data created by a computer to replicate or augment existing data.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Why is it useful?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">We have all experienced the success of ChatGPT, Llama and, more recently, DeepSeek. These language models are used across society and have triggered many claims that we are rapidly approaching Artificial General Intelligence, that is, AI capable of replicating any human function.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Before getting too excited (or scared, depending on your perspective), note that we are also rapidly approaching a hurdle to the advancement of these language models. According to a paper from the research institute Epoch <a href=\"https:\/\/towardsdatascience.com\/#%5B1%5D\">[1]<\/a>, <em>we are running out of data<\/em>. 
They estimate that by 2028 we will have reached the upper limit of possible data upon which to train language models.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"f6f5f9\" data-has-transparency=\"true\" style=\"--dominant-color: #f6f5f9;\" fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"667\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.04%25E2%2580%25AFPM-1024x667.png?resize=1024%2C667&#038;ssl=1\" alt=\"\" class=\"wp-image-598366 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.04\u202fPM-1024x667.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.04\u202fPM-300x195.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.04\u202fPM-768x500.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.04\u202fPM.png 1192w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by Author. Graph based on estimated dataset projections. This is a reconstructed visualisation inspired by Epoch research group [1].<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\"><strong>What happens if we run out of data?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Well, if we run out of data then we aren\u2019t going to have anything new with which to train our language models. These models will then stop improving. 
If we want to pursue Artificial General Intelligence, we will have to come up with new ways of improving AI that do not rely solely on increasing the volume of real-world training data.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">One potential saviour is synthetic data, which can be generated to mimic existing data and has already been used to improve the performance of models like Gemini and DBRX.\u00a0<\/p>\n<h2 class=\"wp-block-heading\"><strong>Synthetic data beyond LLMs<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Beyond overcoming data scarcity for large language models, synthetic data can be used in the following situations:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>Sensitive data<\/strong>: if we don\u2019t want to share or use sensitive attributes, synthetic data can be generated which mimics the properties of these features while maintaining anonymity.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Expensive data<\/strong>: if collecting data is expensive, we can generate a large volume of synthetic data from a small amount of real-world data.<\/li>\n<li class=\"wp-block-list-item\">\n<strong>Lack of data<\/strong>: datasets are biased when there is a disproportionately low number of data points from a particular group. Synthetic data can be used to balance a dataset.\u00a0<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Imbalanced datasets<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Imbalanced datasets can be problematic (though not always), as they may not contain enough information about the under-represented group to effectively train a predictive model. 
For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as men.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">In this article we show the imbalance in the popular UCI <a href=\"https:\/\/archive.ics.uci.edu\/dataset\/2\/adult\">Adult dataset<\/a> <a href=\"https:\/\/towardsdatascience.com\/#%5B2%5D\">[2]<\/a>, and how we can use a <strong>variational autoencoder<\/strong> to generate <a href=\"https:\/\/towardsdatascience.com\/tag\/synthetic-data\/\" title=\"Synthetic Data\">synthetic data<\/a> to improve classification on this example.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">We first download the Adult dataset. This dataset contains features such as age, education and occupation which can be used to predict the target outcome \u2018income\u2019.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\nimport matplotlib.pyplot as plt\n\n# Download dataset into a dataframe\nurl = \"https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/adult\/adult.data\"\ncolumns = [\n   \"age\", \"workclass\", \"fnlwgt\", \"education\", \"education-num\", \"marital-status\",\n   \"occupation\", \"relationship\", \"race\", \"sex\", \"capital-gain\",\n   \"capital-loss\", \"hours-per-week\", \"native-country\", \"income\"\n]\ndata = pd.read_csv(url, header=None, names=columns, na_values=\" ?\", skipinitialspace=True)\n\n# Drop rows with missing values\ndata = data.dropna()\n\n# Split into features and target (1 = earns over 50K, 0 = earns 50K or less)\nX = data.drop(columns=[\"income\"])\ny = data['income'].map({'&gt;50K': 1, '&lt;=50K': 0}).values\n\n# Plot distribution of income\nplt.figure(figsize=(8, 6))\nplt.hist(data['income'], bins=2, edgecolor='black')\nplt.title('Distribution of Income')\nplt.xlabel('Income')\nplt.ylabel('Frequency')\nplt.show()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In the Adult dataset, income is a binary variable, representing individuals who earn above, and below, $50,000. 
We plot the distribution of income over the entire dataset below. We can see that the dataset is heavily imbalanced with a far larger number of individuals who earn less than $50,000.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"9cc2dc\" data-has-transparency=\"false\" style=\"--dominant-color: #9cc2dc;\" decoding=\"async\" width=\"1024\" height=\"746\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.13%25E2%2580%25AFPM-1024x746.png?resize=1024%2C746&#038;ssl=1\" alt=\"\" class=\"wp-image-598367 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.13\u202fPM-1024x746.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.13\u202fPM-300x218.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.13\u202fPM-768x559.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.13\u202fPM.png 1236w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by Author. Original dataset: Number of data instances with the label \u226450k and &gt;50k. 
There is a disproportionately large representation of individuals who earn less than 50k in the dataset.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Despite this imbalance, we can still train a machine learning classifier on the Adult dataset and use it to determine whether unseen (test) individuals should be classified as earning above or below 50k.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import numpy as np\nimport seaborn as sns\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import confusion_matrix\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\n\n# Preprocessing: One-hot encode categorical features, scale numerical features\nnumerical_features = [\"age\", \"fnlwgt\", \"education-num\", \"capital-gain\", \"capital-loss\", \"hours-per-week\"]\ncategorical_features = [\n   \"workclass\", \"education\", \"marital-status\", \"occupation\", \"relationship\",\n   \"race\", \"sex\", \"native-country\"\n]\n\npreprocessor = ColumnTransformer(\n   transformers=[\n       (\"num\", StandardScaler(), numerical_features),\n       (\"cat\", OneHotEncoder(), categorical_features)\n   ]\n)\n\nX_processed = preprocessor.fit_transform(X)\n\n# Convert to numpy array for PyTorch compatibility\nX_processed = X_processed.toarray().astype(np.float32)\ny_processed = y.astype(np.float32)\n# Split dataset in train and test sets\nX_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)\n\n# Train a random forest classifier on the imbalanced training set\nrf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)\nrf_classifier.fit(X_model_train, y_model_train)\n\n# Make predictions\ny_pred = rf_classifier.predict(X_model_test)\n\n# Compute the confusion matrix (rows: actual, columns: predicted)\ncm = confusion_matrix(y_model_test, y_pred)\n\n# Display confusion matrix\nplt.figure(figsize=(6, 4))\nsns.heatmap(cm, annot=True, fmt=\"d\", cmap=\"YlGnBu\", xticklabels=[\"Negative\", \"Positive\"], yticklabels=[\"Negative\", \"Positive\"])\nplt.xlabel(\"Predicted\")\nplt.ylabel(\"Actual\")\nplt.title(\"Confusion Matrix\")\nplt.show()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Printing out the confusion matrix of our classifier shows that our model performs fairly well 
despite the imbalance. Our model has an overall error rate of 16%, but the error rate for the positive class (income &gt; 50k) is 36%, whereas the error rate for the negative class (income \u2264 50k) is 8%.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">This discrepancy shows that the model is indeed biased towards the negative class. The model frequently misclassifies individuals who earn more than 50k as earning less than 50k.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">Below we show how we can use a <a href=\"https:\/\/towardsdatascience.com\/tag\/variational-autoencoder\/\" title=\"Variational Autoencoder\">Variational Autoencoder<\/a> to generate synthetic data of the positive class to balance this dataset. We then train the same model on the synthetically balanced dataset and reduce model errors on the test set.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" loading=\"lazy\" data-dominant-color=\"d0d7c8\" data-has-transparency=\"true\" style=\"--dominant-color: #d0d7c8;\" decoding=\"async\" width=\"1024\" height=\"723\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.24%25E2%2580%25AFPM-1024x723.png?resize=1024%2C723&#038;ssl=1\" alt=\"\" class=\"wp-image-598368 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.24\u202fPM-1024x723.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.24\u202fPM-300x212.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.24\u202fPM-768x542.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.24\u202fPM.png 1264w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by Author. 
Confusion matrix for predictive model on original dataset.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\"><strong>How can we generate synthetic data?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">There are many different methods for generating synthetic data. These include more traditional methods such as SMOTE and Gaussian noise, which generate new data by interpolating between or perturbing existing data points. Alternatively, generative models such as Variational Autoencoders or Generative Adversarial Networks (GANs) are well suited to generating new data, as their architectures learn the distribution of the real data and use it to generate synthetic samples.<\/p>\n<p class=\"wp-block-paragraph\"><strong>In this tutorial we use a variational autoencoder to generate synthetic data.<\/strong><\/p>\n<h2 class=\"wp-block-heading\"><strong>Variational Autoencoders<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Variational Autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data which closely resembles existing data. 
The continuity of this space is one of their big selling points, as it means the model generalises well and doesn\u2019t just memorise the latent codes of specific inputs.<\/p>\n<p class=\"wp-block-paragraph\">A VAE consists of an <strong>encoder<\/strong>, which maps input data into a probability distribution (mean and variance), and a <strong>decoder<\/strong>, which reconstructs the data from the latent space.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">To train through the sampling step, VAEs use the reparameterization trick, where a random noise vector is scaled and shifted using the learned mean and standard deviation. This keeps sampling differentiable so gradients can reach the encoder, while the KL divergence term in the loss encourages smooth and continuous representations in the latent space.<\/p>\n<p class=\"wp-block-paragraph\">Below we construct a <strong>BasicVAE<\/strong> class which implements this process with a simple architecture.<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">\n<strong>The encoder<\/strong> compresses the input into a smaller, hidden representation, producing both a mean and a log variance that define a Gaussian distribution: this is our magic sampling bucket. 
Instead of directly sampling, the model applies the reparameterization trick to generate latent variables, which are then passed to the decoder.\u00a0<\/li>\n<li class=\"wp-block-list-item\">\n<strong>The decoder<\/strong> reconstructs the original data from these latent variables, ensuring the generated data maintains characteristics of the original dataset.\u00a0<\/li>\n<\/ul>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch\nimport torch.nn as nn\n\nclass BasicVAE(nn.Module):\n   def __init__(self, input_dim, latent_dim):\n       super().__init__()\n       # Encoder: Single small layer\n       self.encoder = nn.Sequential(\n           nn.Linear(input_dim, 8),\n           nn.ReLU()\n       )\n       # Two heads: mean and log variance of the latent Gaussian\n       self.fc_mu = nn.Linear(8, latent_dim)\n       self.fc_logvar = nn.Linear(8, latent_dim)\n\n       # Decoder: Single small layer\n       self.decoder = nn.Sequential(\n           nn.Linear(latent_dim, 8),\n           nn.ReLU(),\n           nn.Linear(8, input_dim),\n           nn.Sigmoid()  # Outputs values in range [0, 1]; standardised numerical features can fall outside this range\n       )\n\n   def encode(self, x):\n       h = self.encoder(x)\n       mu = self.fc_mu(h)\n       logvar = self.fc_logvar(h)\n       return mu, logvar\n\n   def reparameterize(self, mu, logvar):\n       # z = mu + eps * std, with eps ~ N(0, I), keeps sampling differentiable\n       std = torch.exp(0.5 * logvar)\n       eps = torch.randn_like(std)\n       return mu + eps * std\n\n   def decode(self, z):\n       return self.decoder(z)\n\n   def forward(self, x):\n       mu, logvar = self.encode(x)\n       z = self.reparameterize(mu, logvar)\n       return self.decode(z), mu, logvar<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Given our BasicVAE architecture, we construct our loss function and model training below.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import torch.optim as optim\nfrom torch.utils.data import DataLoader, TensorDataset\n\ndef vae_loss(recon_x, x, mu, logvar):\n   # Reconstruction loss: mean squared error between input and reconstruction\n   recon_loss = nn.MSELoss()(recon_x, x)\n\n   # KL divergence between the latent Gaussian and a standard normal, summed over the batch\n   kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())\n   return 
recon_loss + kld_loss \/ x.size(0)\n\ndef train_vae(model, data_loader, epochs, learning_rate):\n   optimizer = optim.Adam(model.parameters(), lr=learning_rate)\n   model.train()\n   losses = []\n   reconstruction_mse = []\n\n   for epoch in range(epochs):\n       total_loss = 0\n       total_mse = 0\n       for batch in data_loader:\n           batch_data = batch[0]\n           optimizer.zero_grad()\n           reconstructed, mu, logvar = model(batch_data)\n           loss = vae_loss(reconstructed, batch_data, mu, logvar)\n           loss.backward()\n           optimizer.step()\n           total_loss += loss.item()\n\n           # Compute batch-wise MSE for comparison\n           mse = nn.MSELoss()(reconstructed, batch_data).item()\n           total_mse += mse\n\n       # Record and report the per-batch averages for this epoch\n       losses.append(total_loss \/ len(data_loader))\n       reconstruction_mse.append(total_mse \/ len(data_loader))\n       print(f\"Epoch {epoch+1}\/{epochs}, Loss: {losses[-1]:.4f}, MSE: {reconstruction_mse[-1]:.4f}\")\n   return losses, reconstruction_mse\n\n# Append the income label as the last column so the VAE models features and label together\ncombined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)\n\n# Train-test split\nX_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)\n\nbatch_size = 128\n\n# Create DataLoaders\ntrain_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)\ntest_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)\n\nbasic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)\n\nbasic_losses, basic_mse = train_vae(\n   basic_vae, train_loader, epochs=50, learning_rate=0.001,\n)\n\n# Visualize results\nplt.figure(figsize=(12, 6))\nplt.plot(basic_mse, label=\"Basic VAE\")\nplt.ylabel(\"Reconstruction MSE\")\nplt.title(\"Training Reconstruction MSE\")\nplt.legend()\nplt.show()<\/code><\/pre>\n<p class=\"wp-block-paragraph\"><strong>vae_loss <\/strong>consists of two components: <strong>reconstruction loss<\/strong>, which 
measures how well the generated data matches the original input using Mean Squared Error (MSE), and <strong>KL divergence loss<\/strong>, which ensures that the learned latent space follows a normal distribution.<\/p>\n<p class=\"wp-block-paragraph\"><strong>train_vae<\/strong> optimises the VAE using the Adam optimizer over multiple epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using <strong>vae_loss<\/strong>. These errors are then corrected via backpropagation where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases over training.<\/p>\n<p class=\"wp-block-paragraph\">We can see that our model learns quickly how to reconstruct our data, evidencing efficient learning.\u00a0<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"fcfcfd\" data-has-transparency=\"true\" style=\"--dominant-color: #fcfcfd;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"510\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.34%25E2%2580%25AFPM-1024x510.png?resize=1024%2C510&#038;ssl=1\" alt=\"\" class=\"wp-image-598369 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.34\u202fPM-1024x510.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.34\u202fPM-300x149.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.34\u202fPM-768x382.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.34\u202fPM.png 1274w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by Author. 
Reconstruction MSE of BasicVAE on the Adult dataset.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Now that we have trained our BasicVAE to reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more samples of the positive class (individuals who earn over 50k) in order to balance out the classes and reduce the bias in our model.<\/p>\n<p class=\"wp-block-paragraph\">To do this we select all the samples from our VAE dataset where income is the positive class (earn more than 50k). We then encode these samples into the latent space. As we have only selected samples of the positive class to encode, the resulting latent vectors reflect properties of the positive class, and we can sample around them to create synthetic data.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">We draw 15000 new latent vectors around the mean of these encodings, adding a small amount of Gaussian noise, and decode them back into the input data space as our synthetic data points.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Assumption: sample_df is not defined in the original code; here it is taken to be\n# the VAE training data (features plus the income label in the last column)\nsample_df = pd.DataFrame(X_train)\n\n# Create column names\ncol_number = sample_df.shape[1]\ncol_names = [str(i) for i in range(col_number)]\nsample_df.columns = col_names\n\n# Define the feature value to filter\nfeature_value = 1.0  # Specify the feature value - here we set the income to 1\n\n# Keep only the rows whose income label is 1 : Over 50k\nselected_samples = sample_df[sample_df[col_names[-1]] == feature_value]\nselected_samples = selected_samples.values\nselected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)\n\nbasic_vae.eval()  # Set model to evaluation mode\nwith torch.no_grad():\n   mu, logvar = basic_vae.encode(selected_samples_tensor)\n   latent_vectors = basic_vae.reparameterize(mu, logvar)\n\n# Compute the mean latent vector for this feature\nmean_latent_vector = latent_vectors.mean(dim=0)\n\n\nnum_samples = 15000  # Number of new samples\nlatent_dim = 8\n# Perturb the mean latent vector with small Gaussian noise\nlatent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)\n\nwith 
torch.no_grad():\n   generated_samples = basic_vae.decode(latent_samples)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Now that we have generated synthetic data of the positive class, we can combine it with the original training data to create a balanced dataset.\u00a0<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Convert the decoded tensor to a NumPy array before building the dataframe\nnew_data = pd.DataFrame(generated_samples.numpy())\n\n# Create column names\ncol_number = new_data.shape[1]\ncol_names = [str(i) for i in range(col_number)]\nnew_data.columns = col_names\n\n# Drop the decoded income column and label every synthetic row as positive (1)\nX_synthetic = new_data.drop(col_names[-1],axis=1)\ny_synthetic = np.ones(X_synthetic.shape[0], dtype=np.float32)\n\n# Append the synthetic positives to the original training data\nX_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)\ny_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)\n\nmapping = {1: '&gt;50K', 0: '&lt;=50K'}\nmap_function = np.vectorize(lambda x: mapping[x])\n# Apply mapping\ny_mapped = map_function(y_synthetic_train)\n\nplt.figure(figsize=(8, 6))\nplt.hist(y_mapped, bins=2, edgecolor='black')\nplt.title('Distribution of Income')\nplt.xlabel('Income')\nplt.ylabel('Frequency')\nplt.show()<\/code><\/pre>\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" data-dominant-color=\"69a2ca\" data-has-transparency=\"true\" style=\"--dominant-color: #69a2ca;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"758\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.45%25E2%2580%25AFPM-1024x758.png?resize=1024%2C758&#038;ssl=1\" alt=\"\" class=\"wp-image-598370 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.45\u202fPM-1024x758.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.45\u202fPM-300x222.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.45\u202fPM-768x569.png 
768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.45\u202fPM.png 1232w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by Author. Synthetic dataset: Number of data instances with the label \u226450k and &gt;50k. There are now a balanced number of individuals earning more and less than 50k.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">We can now use our balanced training synthetic dataset to retrain our random forest classifier. We can then evaluate this new model on the original test data to see how effective our synthetic data is at reducing the model bias.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)\nrf_classifier.fit(X_synthetic_train, y_synthetic_train)\n\n# Step 5: Make predictions\ny_pred = rf_classifier.predict(X_model_test)\n\ncm = confusion_matrix(y_model_test, y_pred)\n\n# Create heatmap\nplt.figure(figsize=(6, 4))\nsns.heatmap(cm, annot=True, fmt=\"d\", cmap=\"YlGnBu\", xticklabels=[\"Negative\", \"Positive\"], yticklabels=[\"Negative\", \"Positive\"])\nplt.xlabel(\"Predicted\")\nplt.ylabel(\"Actual\")\nplt.title(\"Confusion Matrix\")\nplt.show()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Our new classifier, trained on the balanced synthetic dataset makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset and our error rate is now reduced to 14%.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-recalc-dims=\"1\" data-dominant-color=\"cfd6c7\" data-has-transparency=\"true\" style=\"--dominant-color: #cfd6c7;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"703\" src=\"https:\/\/i0.wp.com\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.55%25E2%2580%25AFPM-1024x703.png?resize=1024%2C703&#038;ssl=1\" alt=\"\" 
class=\"wp-image-598371 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.55\u202fPM-1024x703.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.55\u202fPM-300x206.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.55\u202fPM-768x527.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/Screenshot-2025-02-21-at-3.24.55\u202fPM.png 1226w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\"><figcaption class=\"wp-element-caption\">Image by Author. Confusion matrix for predictive model on synthetic dataset.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">However, we have not been able to reduce the discrepancy in errors by a significant amount: our error rate for the positive class is still 36%. This could be due to the following reasons:\u00a0<\/p>\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">We have discussed how one of the benefits of VAEs is the learning of a continuous latent space. 
However, if the majority class dominates, the latent space might skew towards the majority class.<\/li>\n<li class=\"wp-block-list-item\">The model may not have properly learned a distinct representation for the minority class due to the lack of data, making it hard to sample from that region accurately.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><strong>In this tutorial we have introduced and built a BasicVAE architecture which can be used to generate synthetic data which improves the classification accuracy on an imbalanced dataset.\u00a0<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Follow for future articles where I will show how we can build more sophisticated VAE architectures which address the above problems with imbalanced sampling and more.<\/p>\n<p class=\"wp-block-paragraph\" id=\"[1]\">[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., &amp; Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. <em>arXiv preprint arXiv:2211.04325<\/em>, <em>3<\/em>.<\/p>\n<p class=\"wp-block-paragraph\" id=\"[2]\">[2] Becker, B. &amp; Kohavi, R. (1996). Adult [Dataset]. 
UCI Machine Learning Repository. <a href=\"https:\/\/doi.org\/10.24432\/C5XW20\">https:\/\/doi.org\/10.24432\/C5XW20<\/a>.<\/p>\n<p>The post <a href=\"https:\/\/towardsdatascience.com\/the-next-ai-revolution-a-tutorial-using-vaes-to-generate-high-quality-synthetic-data\/\">The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data<\/a> appeared first on <a href=\"https:\/\/towardsdatascience.com\/\">Towards Data Science<\/a>.<\/p>\n<\/div>\n<p>Torty Sivill<br \/>\n<a href=\"https:\/\/towardsdatascience.com\/the-next-ai-revolution-a-tutorial-using-vaes-to-generate-high-quality-synthetic-data\/\">Go to original source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data What is synthetic data? Data created by a computer intended to replicate or augment existing data. Why is it useful? We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. 
These language models are being used ubiquitously across society [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,69,83,166,704,1417,843],"tags":[84,73,805],"class_list":["post-2003","post","type-post","status-publish","format-standard","hentry","category-aimldsaimlds","category-artificial-intelligence","category-data-science","category-hands-on-tutorials","category-imbalanced-data","category-synthetic-data","category-variational-autoencoder","tag-data","tag-models","tag-synthetic"],"_links":{"self":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2003"}],"collection":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/comments?post=2003"}],"version-history":[{"count":0,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/posts\/2003\/revisions"}],"wp:attachment":[{"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/media?parent=2003"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/categories?post=2003"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mailitics.com\/index.php\/wp-json\/wp\/v2\/tags?post=2003"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}