Tag: synthetic
-
Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add
arXiv:2601.16120v1 Announce Type: new Abstract: Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and…
-
Evaluating Synthetic Data — The Million Dollar Question
Learn how to evaluate synthetic data quality using the Maximum Similarity Test, a simple, quantitative approach for assessing fidelity, utility, and privacy in synthetic datasets. Andrew Skabar, Towards Data Science.
-
Bias-Corrected Data Synthesis for Imbalanced Learning
arXiv:2510.26046v1 Announce Type: new Abstract: Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the…
-
Has anyone validated synthetic financial data (Gaussian Copula vs CTGAN) in practice?
I’ve been experimenting with generating synthetic datasets for financial indicators (GDP, inflation, unemployment, etc.) and found that CTGAN offered stronger privacy protection in simple linkage tests, but its overall analytical utility was much weaker. In contrast, Gaussian Copula provided reasonably strong privacy and…
-
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks
arXiv:2508.21146v1 Announce Type: cross Abstract: Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and…
-
Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI
arXiv:2508.14936v1 Announce Type: cross Abstract: Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies…
-
How I Won the “Mostly AI” Synthetic Data Challenge
A deep dive into how post-processing can supercharge synthetic data generation. Daniel Gärber, Towards Data Science.
-
Boosting Statistic Learning with Synthetic Data from Pretrained Large Models
arXiv:2505.04992v1 Announce Type: new Abstract: The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose…
-
Golden Ratio Mixing of Real and Synthetic Data for Stabilizing Generative Model Training
arXiv:2502.18049v1 Announce Type: new Abstract: Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies…
-
The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data
What is synthetic data? Data created by a computer intended to replicate or augment existing data. Why is it useful? We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society…
-
Synthetic Control Sample for Before and After A/B Test
Learn a simple way to use linear regression to create a synthetic control sample for your A/B test. Gustavo R Santos, Towards Data Science.