
Conditional Tabular Generative Adversarial Network-based Synthetic Data Generation for Model Generalisation Improvement

Yusof, Yuhanis and Fajila, Fathima (2026) Conditional Tabular Generative Adversarial Network-based Synthetic Data Generation for Model Generalisation Improvement. Journal of ICT, 25 (1). pp. 1-16. ISSN 1675-414X

PDF - Published Version
Available under License Attribution 4.0 International (CC BY 4.0).

Abstract

Access to extensive and varied datasets is essential for developing strong predictive models in data analytics. However, many real-world applications suffer from small and imbalanced datasets, leading to overfitting, poor generalisation, and low model performance. Traditional data augmentation techniques are often unsuitable for tabular data, as they fail to preserve complex feature relationships. To address this challenge, this study adapts the Conditional Tabular Generative Adversarial Network (CTGAN) for synthetic data generation. The proposed approach involves five phases: (1) Data Acquisition, (2) Data Preparation, (3) Model Training, (4) Synthetic Data Generation, and (5) Evaluation. Experimental results on three benchmark datasets show that the proposed approach produces data that closely follows the statistical distribution of the original dataset, with Wasserstein Distance < 0.05 for numerical features and Jensen-Shannon Divergence < 0.08 for categorical features. Additionally, models trained on a combination of synthetic and real data achieved up to 15% improvement in classification accuracy compared to those trained on small real datasets alone. For large datasets, training on a combination of real and synthetic data for the minority class significantly improves the F1-score, with gains of approximately 9–10%. This approach also yields a modest increase in overall accuracy (around 1.5%), suggesting enhanced model generalisation. These results indicate that the adapted CTGAN is a viable option for data augmentation, addressing the problems of limited and imbalanced training data in machine learning.
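
The two fidelity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names are hypothetical, the Jensen-Shannon divergence is computed with base-2 logarithms (so it lies in [0, 1]), and any feature scaling applied before measuring the Wasserstein Distance is an assumption, since the abstract does not state one.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def wasserstein_1d(real, synth):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Integrates the absolute difference between the two empirical CDFs.
    (Hypothetical helper; scaling of the inputs is left to the caller.)
    """
    a, b = sorted(real), sorted(synth)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def js_divergence(real, synth):
    """Jensen-Shannon divergence (base 2) between category frequencies."""
    p_counts, q_counts = Counter(real), Counter(synth)
    jsd = 0.0
    for category in set(p_counts) | set(q_counts):
        p = p_counts[category] / len(real)
        q = q_counts[category] / len(synth)
        m = (p + q) / 2  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * log2(p / m)
        if q > 0:
            jsd += 0.5 * q * log2(q / m)
    return jsd


if __name__ == "__main__":
    # Toy values for illustration only, not data from the study.
    real_numeric = [0.10, 0.20, 0.35, 0.50, 0.80]
    synth_numeric = [0.12, 0.22, 0.30, 0.55, 0.78]
    real_categorical = ["a", "a", "b", "c"]
    synth_categorical = ["a", "b", "b", "c"]
    print(wasserstein_1d(real_numeric, synth_numeric))
    print(js_divergence(real_categorical, synth_categorical))
```

Lower values on both measures mean the synthetic column is closer to the real one; the thresholds reported in the abstract (0.05 and 0.08) would be applied per feature.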

Item Type: Article
Uncontrolled Keywords: Deep learning, CTGAN, data augmentation, synthetic data.
Subjects: Q Science > Q Science (General)
Divisions: School of Computing
Depositing User: Mrs. Norazmilah Yaakub
Date Deposited: 01 Mar 2026 07:30
Last Modified: 01 Mar 2026 07:30
URI: https://repo.uum.edu.my/id/eprint/34570
