The CIF Bond Analyzer (CBA) is an interactive, command-line-based application
designed for high-throughput extraction of bonding information from CIF
(Crystallographic Information File) files. CBA offers Site Analysis, System
Analysis for binary/ternary systems, and Coordination Analysis. The outputs are
saved in .json, .xlsx, and .png formats.
This README.md serves as both a tutorial and documentation. Last updated: July 9, 2024.
The code is designed for interactive use without the need to write any code.
CBA accepts any .cif files.
CBA simplifies crystal structure analysis by automating the extraction of
minimum bond lengths, which are crucial for understanding geometric
configurations and identifying irregularities. Histograms and figures assist in
identifying distinct bond lengths and structural patterns.
Copy each line into your command-line application:
$ git clone https://github.com/bobleesj/cif-bond-analyzer.git
$ cd cif-bond-analyzer
$ pip install -r requirements.txt
$ python main.py
Once the code is executed using python main.py, the following prompt will
appear, asking you to choose one of the three analysis options:
Welcome! Please choose an option to proceed:
[1] Conduct site analysis.
[2] Conduct system analysis.
[3] Conduct coordination analysis.
Enter your choice (1-3): 1
For any option, CBA will ask you to choose folders containing .cif files:
Folders with .cif files:
1. 20240623_ErCoIn_nested, 16 files, 136 nested files
2. 20240612_ternary_only, 2 files
3. 20240611_ternary_binary_combined, 5 files
4. 20240623_teranry_3_unique_elements, 3 files
5. 20240611_binary_2_unique_elements, 4 files
Would you like to process each folder above sequentially?
(Default: Y) [Y/n]:
You may then choose to process folders either sequentially or select specific
folders by entering numbers associated with the folders prompted. For each
folder, CBA generates site pair data saved in site_pairs.json or
site_pairs.xlsx.
The following discusses formatting, supercell generation, and atomic mixing information.
CBA uses the CifEnsemble object from cifkit
(https://github.com/bobleesj/cifkit) to conduct preprocessing automatically.
- CBA standardizes the site labels in atom_site_label. Some site labels may contain a comma or a symbol such as M due to atomic mixing. CBA reformats each atom_site_label so it can be parsed into an element type that matches atom_site_type_symbol.
- CBA removes the content of publ_author_address. This section often has an incorrect format that otherwise requires manual modifications.
- CBA relocates any ill-formatted files, such as those with duplicate labels in atom_site_label, missing fractional coordinates, or files that require supercell generation.
For each .cif file, a unit cell is generated by applying the symmetry
operations. A supercell is generated by applying ±1 shifts from the unit cell.
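The ±1 shift step can be illustrated with a minimal sketch. It assumes atoms are given as (label, fractional coordinate) tuples; the function name generate_supercell is hypothetical and is not taken from CBA or cifkit.

```python
from itertools import product


def generate_supercell(unit_cell_atoms):
    """Copy every unit-cell atom with shifts of -1, 0, and +1 along each axis.

    unit_cell_atoms: list of (label, (x, y, z)) in fractional coordinates.
    This is an illustrative helper, not the cifkit implementation.
    """
    supercell = []
    for label, (x, y, z) in unit_cell_atoms:
        # 3 shifts per axis gives 27 translated copies of each atom
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            supercell.append((label, (x + dx, y + dy, z + dz)))
    return supercell


unit_cell = [("Er1", (0.0, 0.0, 0.0)), ("Co1", (0.5, 0.5, 0.5))]
supercell = generate_supercell(unit_cell)
print(len(supercell))  # 2 atoms x 27 shifts
```

Including the zero shift keeps the original unit cell inside the supercell, so every atom has neighbors on all sides when distances are measured.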
Each bonding pair is defined with one of four atomic mixing categories:
- Full occupancy is assigned when a single atomic site occupies the fractional coordinate with an occupancy value of 1.
- Full occupancy with mixing is assigned when multiple atomic sites collectively occupy the fractional coordinate to a sum of 1.
- Deficiency without mixing is assigned when a single atomic site occupies the fractional coordinate with an occupancy value less than 1.
- Deficiency with atomic mixing is assigned when multiple atomic sites occupy the fractional coordinate with a sum less than 1.
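The four categories above reduce to two questions: how many sites share the coordinate, and whether their occupancies sum to 1. The sketch below is an assumption-laden illustration for a single coordinate. Only the string "full_occupancy" appears in the CBA output shown in this README; the other three label strings, the function name, and the tolerance are hypothetical.

```python
def classify_mixing(occupancies, tol=1e-6):
    """Classify one fractional coordinate from the occupancies of its sites.

    occupancies: occupancy values of every atomic site at that coordinate.
    Label strings other than "full_occupancy" are placeholders.
    """
    is_mixed = len(occupancies) > 1
    is_full = abs(sum(occupancies) - 1.0) <= tol
    if is_full:
        return "full_occupancy_with_mixing" if is_mixed else "full_occupancy"
    return "deficiency_with_mixing" if is_mixed else "deficiency_without_mixing"


print(classify_mixing([1.0]))
print(classify_mixing([0.6, 0.4]))
print(classify_mixing([0.8]))
print(classify_mixing([0.5, 0.3]))
```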
CBA provides three options for analysis.
- Purpose: Site Analysis determines the shortest distance and its nearest neighbor for each label in atom_site_label.
- Process: For each atom in the unit cell, Euclidean distances are calculated from the atom to all atoms in the supercell. The position of the atom in the unit cell for each site label is determined based on the atom with the greatest number of shortest distances to its neighbors.
- Example: Suppose a .cif file contains four site labels under atom_site_label: Er1, Er2, Er3, and Er4. The bonding pair from the site label Er4 to its nearest neighbor Er2 is unique and recorded. The bonding pair from Er3 to Er2 is also considered unique. However, the pairs Er4-Er2 and Er2-Er4 are considered identical. Of the two, the pair with the shorter distance is recorded.
Data for each folder is saved in site_pairs.json or site_pairs.xlsx. Below
is an example of the JSON structure for bond pairs:
{
  "Co-Co": {
    "250361": [
      {
        "dist": 2.529,
        "mixing": "full_occupancy",
        "formula": "ErCo2",
        "tag": "rt",
        "structure": "MgCu2"
      }
    ],
    "1955204": [
      {
        "dist": 2.46,
        "mixing": "full_occupancy",
        "formula": "Er2Co17",
        "tag": "hex",
        "structure": "Th2Ni17"
      },
      {
        "dist": 2.274,
        "mixing": "full_occupancy",
        "formula": "Er2Co17",
        "tag": "hex",
        "structure": "Th2Ni17"
      }
    ]
  }
}
The minimum bond pair for each file is saved in element_pairs.json and
element_pairs.xlsx.
{
  "Co-Co": {
    "250361": [
      {
        "dist": 2.529,
        "mixing": "full_occupancy",
        "formula": "ErCo2",
        "tag": "rt",
        "structure": "MgCu2"
      }
    ],
    "1955204": [
      {
        "dist": 2.274,
        "mixing": "full_occupancy",
        "formula": "Er2Co17",
        "tag": "hex",
        "structure": "Th2Ni17"
      }
    ]
  }
}
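The relationship between the two JSON structures can be reproduced with a short sketch: for each element pair and each file, keep only the entry with the smallest dist. The function name minimum_per_file is hypothetical, and the sample data is abbreviated from the site pair example in this README.

```python
def minimum_per_file(site_pairs):
    """Keep only the shortest-distance entry per element pair and file."""
    element_pairs = {}
    for pair, files in site_pairs.items():
        element_pairs[pair] = {
            file_id: [min(entries, key=lambda entry: entry["dist"])]
            for file_id, entries in files.items()
        }
    return element_pairs


# Abbreviated from the site_pairs.json example (only "dist" and "formula" kept)
site_pairs = {
    "Co-Co": {
        "250361": [{"dist": 2.529, "formula": "ErCo2"}],
        "1955204": [
            {"dist": 2.46, "formula": "Er2Co17"},
            {"dist": 2.274, "formula": "Er2Co17"},
        ],
    }
}
element_pairs = minimum_per_file(site_pairs)
print(element_pairs["Co-Co"]["1955204"])
```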
Here is a screenshot of element_pairs.xlsx.
A summary text file, summary_element.txt, lists the shortest bonding pairs and
identifies missing pairs across selected folders:
Summary:
Pair: In-In, Count: 4, Distances: 2.736, 2.782, 2.785, 2.793
Pair: Pd-Ge, Count: 4, Distances: 2.449, 2.455, 2.489, 2.672
Pair: Pd-Sb, Count: 4, Distances: 2.505, 2.700, 2.737, 2.793
Pair: Si-Si, Count: 4, Distances: 1.975, 2.289, 2.325, 2.533
Pair: Rh-Ge, Count: 2, Distances: 2.484, 2.495
Pair: Ru-Si, Count: 2, Distances: 2.394, 2.519
Pair: Sb-Sb, Count: 2, Distances: 2.573, 2.793
Pair: Co-Ga, Count: 1, Distances: 2.485
Pair: Co-Sb, Count: 1, Distances: 2.594
Pair: Co-Sn, Count: 1, Distances: 2.737
Missing pairs:
Co-In
Co-Ir
Co-Ni
Co-Pd
Co-Pt
Co-Rh
Co-Si
Fe-Co
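One way to think about the missing pairs list is as the set of possible element pairs minus the pairs that were observed. The sketch below assumes exactly that definition, which this README does not spell out. It also assumes same-element pairs count as possible pairs and does not reproduce the element ordering CBA uses in pair labels such as Fe-Co. The function name is hypothetical.

```python
from itertools import combinations_with_replacement


def find_missing_pairs(elements, observed_pairs):
    """List element pairs that never appear in observed_pairs.

    Pair labels are compared without regard to order, so "Sb-Co"
    and "Co-Sb" count as the same pair.
    """
    observed = {frozenset(pair.split("-")) for pair in observed_pairs}
    missing = []
    for first, second in combinations_with_replacement(elements, 2):
        if frozenset((first, second)) not in observed:
            missing.append(f"{first}-{second}")
    return missing


missing = find_missing_pairs(["Co", "In", "Sb"], ["In-In", "Co-Sb"])
print(missing)
```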
histogram_element_pair.png and histogram_site_pair.png are used to visualize
the data, with colors indicating atomic mixing types.
- To modify the x-axis, run python plot-histogram.py. This script allows you to interactively specify parameters such as the bin width and x-axis range.
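To show how a bin width and an x-axis range determine a histogram, here is a small pure-Python sketch. It is not taken from plot-histogram.py; the function name and parameters are hypothetical, and the distances are the In-In values from the summary example in this README.

```python
def bin_distances(distances, bin_width, x_min, x_max):
    """Count distances per bin of width bin_width over [x_min, x_max)."""
    n_bins = round((x_max - x_min) / bin_width)
    counts = [0] * n_bins
    for dist in distances:
        if x_min <= dist < x_max:
            # Clamp the index to guard against floating-point rounding
            index = min(int((dist - x_min) / bin_width), n_bins - 1)
            counts[index] += 1
    return counts


in_in = [2.736, 2.782, 2.785, 2.793]
counts = bin_distances(in_in, bin_width=0.05, x_min=2.7, x_max=2.8)
print(counts)
```

A narrower bin width separates closely spaced bond lengths, while a tighter x-axis range drops outliers from the plot.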
- Purpose: System Analysis provides an overview of bond fractions acquired from Option 1: Site Analysis, or bond fractions in coordination number geometries.
- Scope: System Analysis is applicable for folders containing either 2 or 3 unique elements.
Four types of folders are applicable for System Analysis:
- Type 1. Binary files, 2 unique elements
- Type 2. Binary files, 3 unique elements
- Type 3. Ternary files, 3 unique elements
- Type 4. Ternary and binary combined, 3 unique elements
Here is an example of CBA detecting folders containing 2 or 3 unique elements.
Available folders containing 2 or 3 unique elements:
1. 20240623_ErCoIn_nested, 3 elements (In, Er, Co), 152 files
2. 20240612_ternary_only, 3 elements (In, Er, Co), 2 files
3. 20240611_ternary_binary_combined, 3 elements (In, Er, Co), 5 files
4. 20240623_teranry_3_unique_elements, 2 elements (Er, Co), 3 files
5. 20240611_binary_2_unique_elements, 2 elements (Er, Co), 4 files
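The folder check can be approximated by collecting element symbols from the formulas in a folder and counting the distinct ones. The sketch below assumes formulas are plain strings such as ErCo2 without parentheses or charges. The function names are hypothetical, and the ternary and quaternary formulas are illustrative rather than taken from the example folders.

```python
import re


def unique_elements(formulas):
    """Collect distinct element symbols from simple formula strings."""
    elements = set()
    for formula in formulas:
        # An element symbol is one uppercase letter plus an optional lowercase one
        elements.update(re.findall(r"[A-Z][a-z]?", formula))
    return elements


def is_system_analysis_candidate(formulas):
    """A folder qualifies when it contains 2 or 3 unique elements."""
    return len(unique_elements(formulas)) in (2, 3)


print(unique_elements(["ErCo2", "Er2Co17"]))
print(is_system_analysis_candidate(["ErCo2", "ErCoIn5"]))
print(is_system_analysis_candidate(["ErCoInSb"]))
```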
For Types 2, 3, and 4:
Customize legend position:
To adjust the legend position in the ternary diagram, modify the values of X_SHIFT = 0.0 and Y_SHIFT = 0.0 in core/configs/ternary.py.
Customize extra lines:
To add extra lines to the ternary diagram based on tags, edit TAGS_IN_FIRST_EXTRA_LINE = ["lt", "ht", "hp", "hp1", "hp2", "hp3"] and TAGS_IN_SECOND_EXTRA_LINE = ["lt", "ht", "hp", "hp1", "hp2", "hp3"] in core/configs/ternary.py.
For Type 1:
All of the individual hexagon figures are also saved in order.
For Types 2, 3, and 4, color maps for each bond type and overall are generated.
The bond count for each .cif file is recorded in system_analysis_files.xlsx.
Average bond lengths, counts, and statistical values are recorded in
system_analysis_main.xlsx.
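As a rough illustration of the kind of values such a summary holds, the sketch below computes a count, a bond fraction, an average, and a standard deviation per bond pair. The definition of the bond fraction as the pair count divided by the total count is an assumption, as are the function name and the choice of population standard deviation. The distances are taken from the summary example in this README.

```python
from statistics import mean, pstdev


def summarize_bonds(distances_by_pair):
    """Compute count, fraction, average, and spread for each bond pair."""
    total = sum(len(distances) for distances in distances_by_pair.values())
    summary = {}
    for pair, distances in distances_by_pair.items():
        summary[pair] = {
            "count": len(distances),
            "fraction": len(distances) / total,  # assumed definition
            "average": mean(distances),
            "std": pstdev(distances),
        }
    return summary


bonds = {
    "Rh-Ge": [2.484, 2.495],
    "Ru-Si": [2.394, 2.519],
    "Co-Ga": [2.485],
}
summary = summarize_bonds(bonds)
print(summary["Rh-Ge"])
```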
- Purpose: This option determines the best coordination geometry using four methods provided in cifkit. Excel and JSON files are saved with nearest neighbor information.
- Customization: The Excel file contains Δ, which is defined as the interatomic distance minus the sum of atomic radii. You may provide your own radii values by modifying the radii.xlsx file.
For each site, the nearest neighbors within the coordination number geometry are
recorded in CN_connections.json.
{
  "250361": {
    "Co": [
      {
        "connected_label": "Co",
        "distance": 2.529,
        "delta": 1.16,
        "mixing": "full_occupancy",
        "neighbor": 1
      },
      {
        "connected_label": "Co",
        "distance": 2.529,
        "delta": 1.16,
        "mixing": "full_occupancy",
        "neighbor": 2
      },
      ...
      {
        "connected_label": "Er",
        "distance": 2.966,
        "delta": -0.603,
        "mixing": "full_occupancy",
        "neighbor": 11
      },
      {
        "connected_label": "Er",
        "distance": 2.966,
        "delta": -0.603,
        "mixing": "full_occupancy",
        "neighbor": 12
      }
    ]
  }
}
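Once loaded (for example with json.load), a structure like the one above can be summarized per site. The sketch below counts how many neighbors of each connected label a site has. The function name is hypothetical, and the sample data contains only the four entries shown above; the entries elided in the example are not filled in.

```python
from collections import Counter


def neighbor_counts(cn_connections):
    """Count neighbors by connected label for every site in every file."""
    result = {}
    for file_id, sites in cn_connections.items():
        result[file_id] = {
            label: Counter(entry["connected_label"] for entry in neighbors)
            for label, neighbors in sites.items()
        }
    return result


# Only the four entries visible in the example; fields trimmed for brevity
cn_connections = {
    "250361": {
        "Co": [
            {"connected_label": "Co", "distance": 2.529, "neighbor": 1},
            {"connected_label": "Co", "distance": 2.529, "neighbor": 2},
            {"connected_label": "Er", "distance": 2.966, "neighbor": 11},
            {"connected_label": "Er", "distance": 2.966, "neighbor": 12},
        ]
    }
}
counts_by_site = neighbor_counts(cn_connections)
print(counts_by_site["250361"]["Co"])
```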
For each .cif file, the nearest neighbor information is written in each sheet
within CN_connections.xlsx.
git clone https://github.com/bobleesj/cif-bond-analyzer.git
cd cif-bond-analyzer
pip install -r requirements.txt
python main.py
If you would like to use Conda with a new environment, run the following:
git clone https://github.com/bobleesj/cif-bond-analyzer.git
cd cif-bond-analyzer
conda create -n cif python=3.12
conda activate cif
pip install -r requirements.txt
python main.py
- Anton Oliynyk
- Emil Jaffal
- Sangjoon Bob Lee
CBA is also designed for experimental materials scientists and chemists.
- If you have any issues or questions, please feel free to reach out or leave an issue.
Here is how you can contribute to the CBA project if you found it helpful:
- Star the repository on GitHub and recommend it to your colleagues who might find CBA helpful as well.
- Fork the repository and consider contributing changes via a pull request.
- If you have any suggestions or need further clarification on how to use CBA, please feel free to reach out to Sangjoon Bob Lee (@bobleesj).
- 20240623 - Implement CN bond fractions, add GitHub CI. See Pull #22.
- 20240330 - Add sequential folder processing and customizable histogram generation. See Pull #16.
- 20240311 - Integrate PEP8 linting with black. See Pull #12.
- 20240310 - Enhance output options to include both element-based and label-based data for Excel, JSON, and histograms. See Pull #11.
- 20240301 - Display atom counts and execution time per file in Terminal; add CSV logging.
- 20240229 - Expand file support to include all CIF files.