Surprising results on a simple regression dataset #231
Hi @Sun-Haotian, thanks for the detailed issue, code snippets, and data. The CTGAN model is designed and tested on enterprise datasets that contain records of real-world user behavior or natural events. I'm curious how you are identifying your dataset as "simple". It contains several properties that may not be suitable for CTGAN:
- There are not that many rows.
- Judging by the names and descriptions, there seems to be a mathematical relationship between the columns.
- The rows don't appear to be fully independent or naturally collected. For example, the first 9 rows of the training data have the exact same floating-point value for 5 out of the 8 columns.
Was this dataset created using a formula or manual properties? What problem are you hoping to solve using the synthetic data?

Hi @npatki, thank you very much for your reply. I will answer your questions in detail and propose some other thoughts based on my observations since my last post. Another interesting observation is that if I set both batch_size and epochs to very large values, such as 50,000, I can get good synthetic data. However, by definition, batch_size cannot exceed the number of rows in the dataset, and in TGAN, setting batch_size larger than the number of rows raises an error. For the CTGAN code, I am not sure why setting batch_size to a very large value does not raise an error and instead yields good synthetic data. Could you provide some explanation of this behavior? Thank you again for your reply and attention to my question. I sincerely look forward to discussing this with you further.
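One plausible explanation (an assumption here, not confirmed from CTGAN's source) for why an oversized batch_size does not raise an error: if the training loop samples each batch's row indices with replacement, any batch_size works mechanically, and a 50,000-row "batch" simply revisits the same 111 rows many times over. A minimal sketch of that sampling pattern, not CTGAN's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 111          # rows in the training data
batch_size = 50_000   # deliberately much larger than n_rows

# Sampling indices *with replacement* never fails, regardless of batch_size.
idx = rng.choice(n_rows, size=batch_size, replace=True)

print(idx.shape)            # (50000,)
print(int(idx.max()) < n_rows)   # True: every index still points at a real row
```

Under this sampling scheme, a very large batch_size combined with many epochs amounts to far more gradient updates over the same small dataset, which could account for the improved output without any error being raised.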
Environment details
If you are already running CTGAN, please indicate the following details about the environment in which you are running it:
Problem description
I am trying to use CTGAN to generate synthetic data for my regression dataset. The full dataset has 159 data points, which I manually split into a training set of 111 points and a test set of 48 points. There are 7 input features (the first 7 columns) and 1 output feature, "Normerr" (the last column). Due to their mechanical implications, "su/sy", "D/t", and "a/t" should be smaller than 1, larger than 22, and smaller than 0.8, respectively, and these limits are reflected in the data.
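The three physical limits above can be checked programmatically on any generated sample. A minimal sketch, assuming the rows are held in a NumPy array with a hypothetical column layout (index 0 = "su/sy", 1 = "D/t", 2 = "a/t"; the actual spreadsheet order may differ):

```python
import numpy as np

# Hypothetical column layout: index 0 = "su/sy", 1 = "D/t", 2 = "a/t".
synthetic = np.array([
    [0.45, 35.0, 0.30],   # satisfies all three limits
    [1.20, 35.0, 0.30],   # violates su/sy < 1
    [0.45, 18.0, 0.95],   # violates D/t > 22 and a/t < 0.8
])

# Keep only rows that respect su/sy < 1, D/t > 22, and a/t < 0.8.
mask = (synthetic[:, 0] < 1.0) & (synthetic[:, 1] > 22.0) & (synthetic[:, 2] < 0.8)
valid = synthetic[mask]

print(len(valid))  # 1
```

A filter like this makes it easy to measure how often the generator violates the known physical limits, which is itself a useful quality signal for the synthetic data.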
My goal is to generate synthetic data that exhibits ML utility. I used Gaussian process regression (GPR) in sklearn, training on the 111 points and testing on the 48 points (i.e., the original training and test sets); the coefficient of determination is 0.97 on the training set and 0.87 on the test set, which is satisfactory to me. I then used the 111 training points to generate 300 synthetic data points and considered 2 scenarios to assess ML utility: 1. train on the 111 real points and test on the 300 synthetic points; 2. train on the 300 synthetic points, using the same hyperparameters as the GPR model trained on the 111 real points, and test on the 48 real points. Both scenarios give poor results. I also notice that the generated synthetic data contains many extreme values (e.g., the lowest value in the Cv column, 15.2, repeats more than 10 times across many rounds of trials).
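The two utility scenarios above can be sketched with scikit-learn's GaussianProcessRegressor. The arrays below are random stand-ins for the real Excel files (same shapes, not the real values), and the default kernel is used rather than the hyperparameters mentioned in the issue:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Stand-ins for the real data: 111 training rows, 48 test rows,
# 300 synthetic rows, 7 input features, 1 target ("Normerr").
X_train, y_train = rng.normal(size=(111, 7)), rng.normal(size=111)
X_test,  y_test  = rng.normal(size=(48, 7)),  rng.normal(size=48)
X_syn,   y_syn   = rng.normal(size=(300, 7)), rng.normal(size=300)

# Scenario 1: train on real data, evaluate R^2 on the synthetic sample.
gpr_real = GaussianProcessRegressor().fit(X_train, y_train)
r2_real_on_syn = gpr_real.score(X_syn, y_syn)

# Scenario 2: train on synthetic data, evaluate R^2 on the real test set.
gpr_syn = GaussianProcessRegressor().fit(X_syn, y_syn)
r2_syn_on_test = gpr_syn.score(X_test, y_test)

print(r2_real_on_syn, r2_syn_on_test)
```

With real data in place of the random stand-ins, these two R^2 values correspond directly to the "train real / test synthetic" and "train synthetic / test real" numbers discussed in the issue.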
What I already tried
I think I have tried adjusting every parameter that the package exposes, but I have not noticed any improvement yet.
A successful application of CTGAN would be a great help to my study, and I would sincerely appreciate it if anyone could help generate high-quality synthetic data with ML utility as described above. Please forgive my poor coding skills.
One more thing: my colleague successfully used TGAN to generate synthetic data on a similar dataset, but he ran it on a GPU in Colab and it took hours to get results. With CTGAN, I get results on a CPU in minutes with epochs = 300 or 500. Is something wrong here? Sorry, I am not familiar with deep learning; I just followed the manual.
Thanks in advance!
159testCTGAN.xlsx
159trainingCTGAN.xlsx