casm-calc fails with duplicate database entries #274

Open
xivh opened this issue Sep 21, 2022 · 0 comments

xivh commented Sep 21, 2022

An error message was stored in my JobDB many times as a jobid: 'pbs_iff: cannot read reply from pbs_server\nNo Permission'. This caused database lookups to fail, e.g.

db.select_regex_id("jobid", 'pbs_iff: cannot read reply from pbs_server\nNo Permission')

because the select_job function in jobdb.py can't handle duplicate entries. I worked around this by writing a variant of select_job that returns every matching row, and passing the first returned job to delete_job:

def select_duplicate_jobs(self, jobid):
    """Return every row whose jobid matches, instead of assuming a single match."""
    if not isinstance(jobid, string_types):
        print("Error in prisms_jobs.JobDB.select_duplicate_jobs(). type(id):", type(jobid), "expected str.")
        sys.exit()
    self.curs.execute("SELECT * FROM jobs WHERE jobid=?", (jobid,))
    dupes = self.curs.fetchall()    #pylint: disable=invalid-name
    if len(dupes) == 0:
        raise JobDBError("Error in prisms_jobs.JobDB.select_duplicate_jobs(). jobid: '"
                         + jobid + "' not found in jobs database.")
    # One CompatibilityRow per matching row, duplicates included
    return [CompatibilityRow(r) for r in dupes]
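
For anyone hitting the same thing, here is a rough standalone sketch of how duplicated jobids could be found and cleared directly with sqlite3. It assumes only what the query above shows (a `jobs` table with a `jobid` column). The two-column toy table and the helper names `find_duplicate_jobids` and `delete_jobid` are made up for illustration and are not part of prisms_jobs.

```python
import sqlite3


def find_duplicate_jobids(conn):
    """Return {jobid: count} for every jobid stored more than once."""
    rows = conn.execute(
        "SELECT jobid, COUNT(*) FROM jobs GROUP BY jobid HAVING COUNT(*) > 1"
    ).fetchall()
    return {jobid: count for jobid, count in rows}


def delete_jobid(conn, jobid):
    """Delete every row with the given jobid and return how many were removed."""
    cur = conn.execute("DELETE FROM jobs WHERE jobid=?", (jobid,))
    conn.commit()
    return cur.rowcount


# Toy in-memory table (assumption: the real jobs table has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (jobid TEXT, jobname TEXT)")

bad_id = "pbs_iff: cannot read reply from pbs_server\nNo Permission"
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("5090221", "a"), (bad_id, "b"), (bad_id, "c"), (bad_id, "d")],
)

dupes = find_duplicate_jobids(conn)
removed = {jobid: delete_jobid(conn, jobid) for jobid in dupes}
remaining = [row[0] for row in conn.execute("SELECT jobid FROM jobs")]
```

Note that this removes every row sharing a duplicated jobid, which is fine when the "jobid" is really an error message, but would be too aggressive if a legitimate job ID were ever duplicated.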

I am also wondering whether this issue could come up if the job IDs on the cluster are reset or lost because the queue crashes. I noticed in the casm-calc output that it often finds an existing JobID, but it seems to be running fine.

Update: actually, they have all failed. Maybe this is a separate issue, but casm-calc reported that a JobID was found, printed out the list of nodes, and then hung there. Deleting the job from the db and resubmitting was successful.

{'jobid': '5090221',
 'jobname': 'SCEL5_5_1_1_0_1_3.1213',
 'rundir': '/home/TaN/casm/irrep_phonon_modes/training_data/SCEL5_5_1_1_0_1_3/1213/calctype.default',
 'jobstatus': '?',
 'auto': 1,
 'taskstatus': 'Error: Not converging',
 'continuation_jobid': '-',
 'qsubstr': '#!/bin/sh\n#PBS -S /bin/sh\n#PBS -N SCEL5_5_1_1_0_1_3.1213\n#PBS -l walltime=10:00:00\n#PBS -l nodes=1:ppn=4\n#PBS -q batch\n#PBS -V\n#PBS -p 0\n\n#auto=True\n\necho "I ran on:"\ncat $PBS_NODEFILE\n\ncd $PBS_O_WORKDIR\npython -c "import casm.vaspwrapper; obj = casm.vaspwrapper.Relax.from_configuration_dir(\'/home/TaN/casm/irrep_phonon_modes/training_data/SCEL5_5_1_1_0_1_3/1213\', \'default\'); obj.run()"\n\n',
 'qstatstr': '-',
 'nodes': 1,
 'procs': 4,
 'walltime': 36000,
 'elapsedtime': None,
 'creationtime': 1663802985,
 'starttime': None,
 'completiontime': None,
 'modifytime': 1663804062}