Error in make_dataset and question about filtering #6

Open
octavian-ganea opened this issue Sep 6, 2021 · 2 comments


octavian-ganea commented Sep 6, 2021

Hi,

Thanks for these great resources. I have 2 questions:

  1. Could you detail the exact filtering criteria used in prune_pairs.py, and say whether they were already applied to the 42,826 pairs listed in the paper?
  2. I tried to run make_dataset.py on a subset of DIPS but got the error below. Could you please help? Thanks.
$ python src/make_dataset.py ../raw/pdb/ ../interim
2021-09-06 13:35:29,892 INFO 10990: making final data set from interim data
2021-09-06 13:35:33,994 INFO 10990: 2566 requested keys, 0 produced keys, 2566 work keys
2021-09-06 13:35:34,058 INFO 10990: Processing 2566 inputs.
2021-09-06 13:35:34,058 INFO 10990: Sequential Mode.
2021-09-06 13:35:34,058 INFO 10990: Reading ../raw/pdb/17/317d.pdb1.gz
Traceback (most recent call last):
  File "src/make_dataset.py", line 45, in <module>
    main()
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1134, in __call__
    return self.main(*args, **kwargs)
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1059, in main
    rv = self.invoke(ctx)
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 1401, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "miniconda/miniconda3/lib/python3.8/site-packages/click/core.py", line 767, in invoke
    return __callback(*args, **kwargs)
  File "src/make_dataset.py", line 30, in main
    pa.parse_all(input_dir, parsed_dir, num_cpus)
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/parse.py", line 57, in parse_all
    par.submit_jobs(parse, inputs, num_cpus)
  File "miniconda/miniconda3/lib/python3.8/site-packages/parallel.py", line 62, in submit_jobs
    out = [function(*args) for args in inputs]
  File "miniconda/miniconda3/lib/python3.8/site-packages/parallel.py", line 62, in <listcomp>
    out = [function(*args) for args in inputs]
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/parse.py", line 64, in parse
    df = struct.parse_structure(pdb_filename, one_model=False)
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/structure.py", line 61, in parse_structure
    biopy_structure = db.parse_biopython_structure(structure_filename)
  File "miniconda/miniconda3/lib/python3.8/site-packages/atom3/database.py", line 59, in parse_biopython_structure
    biopy_structure = parser.get_structure('pdb', gzip.open(pdb_filename))
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 100, in get_structure
    self._parse(lines)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 121, in _parse
    self.header, coords_trailer = self._get_header(header_coords_trailer)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/PDBParser.py", line 139, in _get_header
    header_dict = _parse_pdb_header_list(header)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/parse_pdb_header.py", line 199, in _parse_pdb_header_list
    pdbh_dict["structure_reference"] = _get_references(header)
  File "miniconda/miniconda3/lib/python3.8/site-packages/Bio/PDB/parse_pdb_header.py", line 38, in _get_references
    if re.search(r"\AREMARK   1", l):
  File "miniconda/miniconda3/lib/python3.8/re.py", line 201, in search
    return _compile(pattern, flags).search(string)
TypeError: cannot use a string pattern on a bytes-like object

@vsomnath

I was able to resolve this error by running gzip -dr DIPS/raw/pdb, which makes sure all files are uncompressed before make_dataset.py runs.
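
For anyone without the gzip command-line tool, the same recursive decompression can be sketched in Python using only the standard library. This is a rough equivalent, not part of the DIPS code: the function name decompress_tree and the remove_original flag are made up for illustration, and the directory you pass in is whatever holds your raw PDB files.

```python
import gzip
import shutil
from pathlib import Path


def decompress_tree(root, remove_original=True):
    """Decompress every *.gz file under root in place.

    Hypothetical helper (not from DIPS or atom3); roughly what
    `gzip -dr root` does. Returns the list of files written.
    """
    written = []
    # sorted() materializes the listing before any file is deleted.
    for gz_path in sorted(Path(root).rglob("*.gz")):
        if not gz_path.is_file():
            continue
        out_path = gz_path.with_suffix("")  # 317d.pdb1.gz -> 317d.pdb1
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if remove_original:
            gz_path.unlink()
        written.append(out_path)
    return written
```

Calling decompress_tree("DIPS/raw/pdb") should leave the directory in the same state as the gzip command above. Pass remove_original=False to keep the compressed copies.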

@octavian-ganea
Author

Yes, same here, using https://github.com/amorehead/DIPS-Plus/blob/main/project/datasets/builder/extract_raw_pdb_gz_archives.py
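
As background on why decompressing helps: the traceback shows atom3 passing gzip.open(pdb_filename) straight to the parser, and gzip.open defaults to binary mode, so the header regex receives bytes instead of str. The sketch below reproduces that mismatch with the standard library only. The helper names first_line and has_remark_1 are invented for illustration, and whether opening the file with mode "rt" inside atom3 would be an adequate fix is an assumption that has not been tested here.

```python
import gzip
import re


def first_line(path, mode):
    """Read the first line of a gzip file using the given open mode."""
    with gzip.open(path, mode) as handle:
        return handle.readline()


def has_remark_1(line):
    """Apply the same str pattern that appears in the traceback."""
    return re.search(r"\AREMARK   1", line) is not None


def check_modes(path):
    """Report how the pattern behaves on binary-mode and text-mode reads."""
    results = {}
    for mode in ("rb", "rt"):
        try:
            results[mode] = has_remark_1(first_line(path, mode))
        except TypeError as err:
            # Binary mode yields bytes, which a str pattern cannot search.
            results[mode] = "TypeError: {}".format(err)
    return results
```

On a gzip-compressed PDB file, check_modes should report a TypeError for "rb" and a plain boolean for "rt".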
