Questions about asynchronous saving #2

Open
JOjoker-world opened this issue Jul 25, 2024 · 0 comments

```python
"""
Fsync the file after writing to guarantee persistence

Call this in a new process to perform in the background
"""
def _serialize_and_persist(
        self,
        filepath,
        snapshot,
        active_snapshot,
        lock,
        linkpath=None,
        iter_chk=None,
        epoch_chk=None,
        overwrite=True):
    print("[{}] START ASYNC".format(time.time()))

    with lock:
        if active_snapshot.value == 0:
            self.logger.error("Cannot persist. Empty snapshot")
            return

    # Create new stream
    s = torch.cuda.Stream()
    torch.cuda.stream(s)

    # print("Saving : {}".format(filepath))
    torch.save(snapshot, filepath)
    # print("Saved : {}".format(filepath))

    # Clear the snapshot.
    with lock:
        active_snapshot.value = 0

    # Ensure it's persisted
    f = open(filepath, 'a+')
    os.fsync(f.fileno())
    f.close()

    update_stats(
        filepath,
        iter_chk=iter_chk,
        overwrite=overwrite,
        epoch_chk=epoch_chk,
        linkpath=linkpath)
    print("[{}] END ASYNC".format(time.time()))
```
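
A side note on the persistence step: the quoted code reopens the file in `a+` mode only to obtain a descriptor for `os.fsync`. A minimal sketch of an alternative is to flush and fsync the same handle that wrote the data. It assumes the writer accepts an open file object (`torch.save` does). Here `pickle.dump` is a stand-in for `torch.save` so the example runs without PyTorch, and `save_and_fsync` and `load` are hypothetical helper names, not part of CheckFreq.

```python
import os
import pickle
import tempfile


def save_and_fsync(obj, filepath):
    """Write obj to filepath and fsync the same handle before closing it."""
    with open(filepath, "wb") as f:
        pickle.dump(obj, f)   # stand-in for torch.save(obj, f)
        f.flush()             # move Python's userspace buffer into the OS
        os.fsync(f.fileno())  # ask the OS to push the file data to storage


def load(filepath):
    """Read back what save_and_fsync wrote."""
    with open(filepath, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "chk.pkl")
    save_and_fsync({"epoch": 3, "iter": 120}, path)
    print(load(path))
```

Keeping the write and the fsync on one handle avoids a second `open` call, but it is a design choice for illustration, not a claim about why the original code was written the way it was.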

In this code segment, a new stream is created with `s = torch.cuda.Stream()`, then `torch.cuda.stream(s)` is called, and then `torch.save(snapshot, filepath)` runs.
I tested it and found that computation and saving do not overlap. It also seems that CheckFreq does not call this function at all. Could you please advise whether this approach is feasible?
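
One possible explanation for the missing overlap, offered as an assumption rather than a confirmed diagnosis: `torch.cuda.stream(s)` is a context manager, so calling it without `with` leaves the current stream unchanged, and `torch.save` spends much of its time in CPU-side serialization and file I/O, which a CUDA stream does not schedule. Overlap would then have to come from running the whole function in a separate process or thread, as its docstring suggests. The sketch below illustrates that pattern only. `pickle` stands in for `torch.save`, a thread stands in for the separate process, and `persist_snapshot`, `run_with_background_save`, and the `state` dictionary are hypothetical names that mirror `lock` and `active_snapshot` from the quoted code.

```python
import os
import pickle
import tempfile
import threading


def persist_snapshot(snapshot, filepath, lock, state):
    """Serialize the snapshot, then mark the snapshot slot as free."""
    with lock:
        if not state["active"]:
            return  # nothing to persist
    with open(filepath, "wb") as f:
        pickle.dump(snapshot, f)  # stand-in for torch.save(snapshot, f)
    with lock:
        state["active"] = False   # mirrors active_snapshot.value = 0


def run_with_background_save(steps, filepath):
    """Start the save in the background and keep iterating meanwhile."""
    lock = threading.Lock()
    state = {"active": True}
    # Hypothetical snapshot: a private copy that training no longer mutates.
    snapshot = {"weights": list(range(1000)), "step": 0}
    saver = threading.Thread(
        target=persist_snapshot, args=(snapshot, filepath, lock, state))
    saver.start()
    completed = 0
    for _ in range(steps):
        completed += 1  # placeholder for one training iteration
    saver.join()        # wait for the save before reusing the snapshot slot
    return completed, state["active"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "snap.pkl")
    print(run_with_background_save(50, path))
```

With a thread, only the parts of the save that release the GIL (such as file I/O) can run alongside the loop, so a separate process is the closer match to what the docstring describes. Durability steps such as `os.fsync` are left out here to keep the focus on the overlap.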
