Details about generating the ICSD version #2
Comments
Hey Yanjun, Yes, the first step is to download all the CIFs into a directory. You can do this using the excellent repo https://github.com/simonverret/materials_data_api_scripts, or you can use the copy of that code already included in the 3DSC repo under 3DSC/superconductors_3D/dataset_preparation/dataset_download/materials_data_api_scripts-master. I remember that I made some small changes to the code, so I would recommend trying the code in my repo first, but if you get stuck, just check out Simon's original repo. This code should then download all the CIF files into a directory and also give you a .csv file with information about all the downloaded CIFs. You should then put the CIFs under 3DSC/data/source/ICSD/raw/cifs/ and the .csv under 3DSC/data/source/ICSD/raw/0_all_data_ICSD.csv. You can then run the script generate_3DSC.py. If you run into any issues, I would recommend executing this script not via the command line but in a Python debugger. That way, you can easily step through the code line by line and see exactly how it works. That's what I usually do when I want to understand code that is new to me. Both PyCharm and Spyder have good debuggers. Let me know how it works! Best regards,
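Before running generate_3DSC.py, it may help to confirm that the inputs are where the script expects them. The following is only a minimal sketch of such a check, based on the two paths named above. The function name check_icsd_inputs is made up for illustration and is not part of the 3DSC repo, and the assumption that the structure files end in .cif is mine.

```python
from pathlib import Path


def check_icsd_inputs(repo_root):
    """Return a list of problems with the raw ICSD input layout (empty if none).

    Hypothetical helper, not part of 3DSC. `repo_root` is the path to the
    cloned 3DSC directory.
    """
    raw_dir = Path(repo_root) / "data" / "source" / "ICSD" / "raw"
    cif_dir = raw_dir / "cifs"
    csv_path = raw_dir / "0_all_data_ICSD.csv"

    problems = []
    if not cif_dir.is_dir():
        problems.append(f"missing directory: {cif_dir}")
    elif not any(cif_dir.glob("*.cif")):
        # Assumption: the downloaded structures use the .cif extension.
        problems.append(f"no .cif files in: {cif_dir}")
    if not csv_path.is_file():
        problems.append(f"missing file: {csv_path}")
    return problems


if __name__ == "__main__":
    for problem in check_icsd_inputs("3DSC"):
        print(problem)
```

An empty result only says that the files are in place. It does not check that the .csv has the columns the pipeline needs.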
Hi Timo, Thank you! I'm now trying to download the CIFs. I saw that download.py is missing from your icsd folder, so I copied the one from Simon's repo into the folder and ran it, but unfortunately I got: (ICSD) yanjunliu@dhcp-vl2041-23489 materials_datasets % python icsd/download.py Best wishes,
Hi Timo, Since Simon hasn't replied to my issue, I tried another ICSD client: https://github.com/lrcfmd/ICSDClient. This one works correctly. However, it fetches CIFs based on the collection code instead of the ICSD ID. Do you have the collection codes for the CIFs listed in the ICSD version of 3DSC? Thank you! Best wishes,
Hey Yanjun, to me the first error message looks like you haven't correctly set up your ICSD credentials. Have you double-checked that against the instructions in Simon's repo? As for the second question, it's very unfortunate that the ICSD has collection codes that differ from the ICSD IDs, and I currently don't have access to the collection codes. Can you maybe just download all of them and then check the ICSD ID in each downloaded CIF? Best regards,
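The "download everything, then filter by ID" idea could be sketched roughly as below. This is an illustration under assumptions, not code from the 3DSC repo. Both function names are hypothetical. The default tag _database_code_ICSD is a standard CIF data name, but which tag in your files actually carries the ID used by 3DSC (rather than the collection code) is an assumption that should be checked against a known entry first.

```python
from pathlib import Path


def read_cif_tag(cif_path, tag="_database_code_ICSD"):
    """Return the value of a single-line CIF data item, or None if absent.

    Assumption: the tag and its value sit on one line, as in `_tag value`.
    """
    text = Path(cif_path).read_text(errors="replace")
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == tag:
            return parts[1].strip().strip("'\"")
    return None


def select_cifs(cif_dir, wanted_ids, tag="_database_code_ICSD"):
    """Map each wanted ID (as a string) to the CIF file that carries it."""
    wanted = {str(i) for i in wanted_ids}
    found = {}
    for path in sorted(Path(cif_dir).glob("*.cif")):
        value = read_cif_tag(path, tag)
        if value in wanted:
            found[value] = path
    return found
```

IDs that are missing from the returned mapping were not found in any file, which is a useful signal that the chosen tag or the ID list needs another look.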
Hi Timo, I downloaded all the ICSD CIFs, and I can select the ones that are in the dataset. However, since I used a different script, I don't have the required .csv file. Could you describe what's in the .csv file? Or would it be possible to share that .csv file with me? Thank you! Best wishes,
Hey Yanjun, good work! I don't have access to the file currently, but it should be quite straightforward to see from the code which properties are needed. I'd recommend you to go through the code in the files From what I see right now, you should extract the following properties: Additionally, there is one property called 'file_id', which should be an absolute path to each cif structure in the directory From there, you can try to execute the code and see whether any error about an unknown property comes up. I would recommend trying everything on a small sample of only 100 or so CIF files first to speed up this process. That should be as easy as reducing the input csv to just the first 100 rows, since the code looks up the paths in the csv and then reads in those CIFs; I don't think it ever reads in all the CIFs in the directory. Let me know how it goes! Best regards,
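The "first 100 rows" test run and the 'file_id' requirement can be sketched with the standard library alone. The helper names below are hypothetical and not part of 3DSC. The sketch assumes the input csv has a single header row followed by one row per CIF, and it only checks the 'file_id' column, not the other properties.

```python
import csv
from pathlib import Path


def write_sample_csv(src_csv, dst_csv, n_rows=100):
    """Copy the header and the first n_rows data rows of src_csv to dst_csv."""
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        for i, row in enumerate(csv.reader(src)):
            if i > n_rows:  # row 0 is the header, so keep rows 0..n_rows
                break
            writer.writerow(row)


def bad_file_ids(csv_path, column="file_id"):
    """Return the file_id values that are not absolute paths to existing files."""
    bad = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            path = Path(row[column])
            if not (path.is_absolute() and path.is_file()):
                bad.append(row[column])
    return bad
```

Keep a backup of the full csv and swap the sample in only for the trial run. An empty list from bad_file_ids means every listed path is absolute and points to an existing file.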
Hi Timo, Thank you! This is very detailed. I checked some of the CIFs in the list, and not all of them have the complete set of important_cols. For example, in the attached CIF, for which the artificial doping will not even be applied, the '_cell_measurement_temperature', '_diffrn_ambient_temperature', To enable uploading, I converted it to .txt format. Best wishes, Yanjun
Hey Yanjun, the best way to treat missing entries is usually to set them to Btw, you should not set the Best regards,
(Attachment: 1_all_data_ICSD_cifs_normalized.csv) The past three weeks were a bit crazy because of the March Meeting deadline, and now I finally have time to resume my attempts at generating the dataset 😂. I seem to have been able to extract 0_all_data_ICSD.csv and run _1_clean_cifs.py to get the CIFs in the cleaned folder and the new csv. I attached the two csv files I got. Could you take a look and see whether they look fine? One question: all the CIFs contain only 'space_group_name_H-M_alt' instead of '_symmetry_space_group_name_H-M'. Should I assign the 'space_group_name_H-M_alt' symbols from the CIFs to the '_symmetry_space_group_name_H-M' column in the csv? Also, there seems to be another csv file needed, named ICSD_content_type.csv. Could you let me know what it is and how to generate it? Thank you! Best regards,
Hey Yanjun, the input files should have For the ICSD_content_type.csv, this is another csv which contains information about whether each structure is experimental or theoretical. Unfortunately, right now I do not remember exactly where I got this from; I think it came from somewhere in the ICSD. This csv file has three columns: Also, if you don't know where to get this file from, I would recommend just making a pseudo csv in which all structures are in the experimental group. I think this information was not particularly relevant; it was just some extra information I was hoping to maybe play with.
Hi Timo, Thank you! I'm now able to run _2_2_clean_ICSD.py, and I tried to run python generate_3DSC.py -d ICSD -n 4. However, this time it seems that I'm missing ICSD_subset.csv. Could you point me to where I should look? Best wishes,
Hey Yanjun, I think it's very difficult for me to debug this right now, since I don't have access to the files anymore. It would be very helpful if you could send me all the csv files that you have so far, plus a selection of 20 ICSD entries, 10 of which are mentioned in the file superconductors_3D/data/final/ICSD/3DSC_ICSD_only_IDs.csv and 10 of which are not. That way, I can debug this on my own and then write a full tutorial for you on how to do this. Could you please send these files to the email address you have for me?
Hey Yanjun, thank you very much for your help and the files you sent me. I have greatly simplified the installation and the run by making the generation of ML features optional. These features are not necessary for the dataset itself and were a major issue because they required quite large Python packages that are difficult to install. The ML features will still be generated if the corresponding packages are installed. I have also used the example data that you provided to showcase the structure of the input data and explained this in a little tutorial in the README. It should be pretty much plug & play now. If you have any more questions, please let me know.
Hi,
I got a license to access the ICSD API, but it's a bit unclear to me what I should do to generate the 3DSC_ICSD dataset. Should I download all the CIFs myself and put them into a folder? Or is the download process already built in? Sorry that I'm not very good at reading code. Thank you!
Best wishes,
Yanjun