Yusof, Yuhanis and Fajila, Fathima (2026) Conditional Tabular Generative Adversarial Network-based Synthetic Data Generation for Model Generalisation Improvement. Journal of ICT, 25 (1). pp. 1-16. ISSN 1675-414X
PDF (517kB) - Published Version. Available under License Attribution 4.0 International (CC BY 4.0).
Abstract
Access to extensive and varied datasets is essential for developing strong predictive models in data analytics. However, many real-world applications suffer from small and imbalanced datasets, leading to overfitting, poor generalisation, and low model performance. Traditional data augmentation techniques are often unsuitable for tabular data, as they fail to preserve complex feature relationships. To address this challenge, this study adapts the Conditional Tabular Generative Adversarial Network (CTGAN) for synthetic data generation. The proposed approach involves five phases: (1) Data Acquisition, (2) Data Preparation, (3) Model Training, (4) Synthetic Data Generation, and (5) Evaluation. Experimental results on three benchmark datasets show that the proposed approach produces data that closely follows the statistical distribution of the original dataset, with Wasserstein Distance < 0.05 for numerical features and Jensen-Shannon Divergence < 0.08 for categorical features. Additionally, models trained on a combination of synthetic and real data achieved up to 15% higher classification accuracy than models trained on the small real datasets alone. Training on a combination of real and synthetic data for the minority class in large datasets significantly improves the F1-score, with gains of approximately 9–10%. This approach also yields a modest increase in overall accuracy (around 1.5%), suggesting enhanced model generalisation. These results indicate that the adapted CTGAN is a viable option for data augmentation, addressing the problems of limited and imbalanced data in machine learning model training.
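The two fidelity metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes the Wasserstein Distance is the one-dimensional Wasserstein-1 distance computed per numerical feature, and that the Jensen-Shannon Divergence is computed per categorical feature from category frequencies with base-2 logarithms. The function names `wasserstein_1d` and `js_divergence` are hypothetical, and the abstract does not state how the features were scaled before the metrics were computed.

```python
import math
from collections import Counter


def wasserstein_1d(real, synth):
    """Wasserstein-1 distance between two empirical 1-D samples.

    Computed as the integral of |F_real(x) - F_synth(x)| over x,
    where F is the empirical CDF of each sample.
    """
    a, b = sorted(real), sorted(synth)
    points = sorted(set(a) | set(b))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance the counters so that i/len(a) and j/len(b)
        # equal the two empirical CDFs evaluated at `left`.
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total


def js_divergence(real, synth):
    """Jensen-Shannon divergence between the category frequencies
    of two samples, using base-2 logs (so the value lies in [0, 1])."""
    p_counts, q_counts = Counter(real), Counter(synth)
    cats = set(p_counts) | set(q_counts)
    p = {c: p_counts[c] / len(real) for c in cats}
    q = {c: q_counts[c] / len(synth) for c in cats}
    m = {c: 0.5 * (p[c] + q[c]) for c in cats}  # mixture distribution

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(x[c] * math.log2(x[c] / y[c]) for c in cats if x[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Under these assumptions, a synthetic numerical column would meet the reported criterion when `wasserstein_1d(real_col, synth_col) < 0.05`, and a categorical column when `js_divergence(real_col, synth_col) < 0.08`. The Wasserstein distance depends on the scale of the feature, so the first threshold is only meaningful once features share a common scale.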
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | Deep learning, CTGAN, data augmentation, synthetic data. |
| Subjects: | Q Science > Q Science (General) |
| Divisions: | School of Computing |
| Depositing User: | Mrs. Norazmilah Yaakub |
| Date Deposited: | 01 Mar 2026 07:30 |
| Last Modified: | 01 Mar 2026 07:30 |
| URI: | https://repo.uum.edu.my/id/eprint/34570 |