Tag: synthetic

Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

arXiv:2601.16120v1 (announce type: new). Abstract: Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and…

January 23, 2026
Evaluating Synthetic Data — The Million Dollar Question

Learn how to evaluate synthetic data quality using the Maximum Similarity Test, a simple, quantitative approach for assessing fidelity, utility, and privacy in synthetic datasets. By Andrew Skabar, Towards Data Science.

November 8, 2025
Bias-Corrected Data Synthesis for Imbalanced Learning

arXiv:2510.26046v1 (announce type: new). Abstract: Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the…

October 31, 2025
Has anyone validated synthetic financial data (Gaussian Copula vs CTGAN) in practice?

I’ve been experimenting with generating synthetic datasets for financial indicators (GDP, inflation, unemployment, etc.) and found that CTGAN offered stronger privacy protection in simple linkage tests, but its overall analytical utility was much weaker. In contrast, Gaussian Copula provided reasonably strong privacy and…

September 15, 2025
Privacy Auditing Synthetic Data Release through Local Likelihood Attacks

arXiv:2508.21146v1 (announce type: cross). Abstract: Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and…

September 1, 2025
Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI

arXiv:2508.14936v1 (announce type: cross). Abstract: Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies…

August 22, 2025
How I Won the “Mostly AI” Synthetic Data Challenge

A deep dive into how post-processing can supercharge synthetic data generation. By Daniel Gärber, Towards Data Science.

August 7, 2025
Boosting Statistic Learning with Synthetic Data from Pretrained Large Models

arXiv:2505.04992v1 (announce type: new). Abstract: The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose…

May 9, 2025
Golden Ratio Mixing of Real and Synthetic Data for Stabilizing Generative Model Training

arXiv:2502.18049v1 (announce type: new). Abstract: Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies…

February 26, 2025
The Next AI Revolution: A Tutorial Using VAEs to Generate High-Quality Synthetic Data

What is synthetic data? Data created by a computer intended to replicate or augment existing data. Why is it useful? We have all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society…

February 22, 2025
Synthetic Control Sample for Before and After A/B Test

Learn a simple way to use linear regression to create a synthetic control sample for your A/B test. By Gustavo R Santos, Towards Data Science.

December 20, 2024