'seq' value is '[]', empty for the LBA dataset #62

lzhangUT · 2023-02-10T03:45:28Z

Hi,
I am working with the LBA dataset and trying to reproduce your results.
I downloaded your LBA dataset in LMDB format. Downloading and loading the dataset both work fine, but the 'seq' value in the dataset is '[]' (empty) for every protein.

Why is that?
I tried to generate the sequences myself using your get_chain_sequences function from sequence.py in the protein folder:

    def get_chain_sequences(df):
        """Return list of tuples of (id, sequence) for different chains of monomers in a given dataframe."""
        # Keep only CA of standard residues
        df = df[df['name'] == 'CA'].drop_duplicates()
        df = df[df['resname'].apply(lambda x: Poly.is_aa(x, standard=True))]
        df['resname'] = df['resname'].apply(Poly.three_to_one)
        chain_sequences = []
        for c, chain in df.groupby(['ensemble', 'subunit', 'structure', 'model', 'chain']):
            seq = ''.join(chain['resname'])
            chain_sequences.append((tuple([str(x) for x in c]), seq))
        return chain_sequences

It also returns an empty list of sequences, so I think there is a bug here.

I modified the function slightly so that I can get the protein sequences. However, some proteins have multiple chains. How should multiple chains be processed for training, or which chain should be paired with the ligand SMILES?
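
For illustration only, here is a minimal, self-contained sketch of per-chain sequence extraction from a dataframe with the same columns as above. It is not the atom3d implementation and not necessarily the modification mentioned above. The name chain_sequences_sketch, the local THREE_TO_ONE table (used instead of the Poly helpers), and the whitespace stripping of 'name' and 'resname' are all assumptions; whether padded names are what causes the empty result here is unconfirmed.

```python
import pandas as pd

# Standard three-letter to one-letter amino acid codes (local table, so the
# sketch does not depend on any particular Biopython version).
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

CHAIN_KEYS = ['ensemble', 'subunit', 'structure', 'model', 'chain']


def chain_sequences_sketch(df):
    """Return [(chain_id_tuple, sequence), ...] for each chain in df.

    Hypothetical helper. Assumes df has the columns in CHAIN_KEYS plus
    'name' (atom name) and 'resname' (three-letter residue name).
    """
    # Assumption: names may carry padding or mixed case, so normalize first.
    # assign() returns a new frame, so the caller's dataframe is not modified.
    norm = df.assign(
        name=df['name'].astype(str).str.strip(),
        resname=df['resname'].astype(str).str.strip().str.upper(),
    )
    # Keep only CA atoms of the 20 standard residues, as in the original.
    ca = norm[(norm['name'] == 'CA') & norm['resname'].isin(THREE_TO_ONE)]
    ca = ca.drop_duplicates()

    sequences = []
    # sort=False keeps chains in order of first appearance.
    for key, chain in ca.groupby(CHAIN_KEYS, sort=False):
        seq = ''.join(chain['resname'].map(THREE_TO_ONE))
        sequences.append((tuple(str(x) for x in key), seq))
    return sequences


# Toy data (made up, not from the LBA dataset): two chains, one padded atom
# name, one side-chain atom, and one water that should be filtered out.
toy = pd.DataFrame({
    'ensemble': ['e'] * 5,
    'subunit': ['s'] * 5,
    'structure': ['x.pdb'] * 5,
    'model': [1] * 5,
    'chain': ['A', 'A', 'A', 'B', 'B'],
    'name': [' CA ', 'CB', 'CA', 'CA', 'CA'],
    'resname': ['ALA', 'ALA', 'GLY', 'HOH', 'TRP'],
})
result = chain_sequences_sketch(toy)
```

Because the sketch returns one entry per chain, a multi-chain protein yields several (id, sequence) tuples; how to combine them or which one to pair with the ligand is left open here.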

Thanks for your help.
