
Data set is too big (cannot be held in one machine's memory), and I should break it into small daily sets #5

Open
jackyhawk opened this issue May 12, 2022 · 2 comments

Comments

@jackyhawk

Thanks for the excellent code.

I ran into one question: my data set is too big (it cannot be held in one machine's memory), so I should break it into small daily sets.
That means I would first generate each day's random-walk result (sequences) and then train them as word2vec with other code (such as Gensim).

All I want is the random walk result.

As for the walk result, should I just return before the part shown below,
and then save dw_rw to disk for later training?
[screenshot of the relevant section of the training loop]
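For reference, a minimal sketch of the save-to-disk step being asked about, assuming `dw_rw` is a list of walks (each a list of node ids); the helper names and filename here are hypothetical, not part of this repo:

```python
def save_walks(walks, path):
    """Write walks to a text file, one walk per line, node ids
    space-separated -- the plain-text corpus format that word2vec
    tooling such as gensim's LineSentence can consume later."""
    with open(path, "w") as f:
        for walk in walks:
            f.write(" ".join(str(node) for node in walk) + "\n")

def load_walks(path):
    """Read the walks back as lists of node-id strings."""
    with open(path) as f:
        return [line.split() for line in f]

# hypothetical example: dw_rw as produced by the walk-generation step
dw_rw = [[0, 1, 2, 1], [2, 0, 1, 0]]
save_walks(dw_rw, "walks_day1.txt")
# later, the saved file can be fed to gensim, e.g. (assuming gensim >= 4.0):
# from gensim.models import Word2Vec
# model = Word2Vec(corpus_file="walks_day1.txt", vector_size=128, sg=1, min_count=0)
```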

@xgfs
Owner

xgfs commented May 12, 2022

You will need to handle multiprocessing a bit better than I do in the training loop. One option would be to just run the random-walk generation and write to the file in a single thread. As for the place, it is correct.
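The single-thread option suggested here can be sketched as follows; the adjacency-dict representation and function name are illustrative assumptions, not this repo's actual API. Because each walk is written straight to disk, walks never accumulate in memory and no multiprocessing coordination is needed:

```python
import random

def write_walks_single_thread(adj, num_walks, walk_length, out_path, seed=0):
    """Generate uniform random walks over an adjacency dict and stream
    each finished walk directly to a text file (one walk per line,
    node ids space-separated)."""
    rng = random.Random(seed)
    with open(out_path, "w") as f:
        for _ in range(num_walks):
            for start in adj:
                walk = [start]
                while len(walk) < walk_length:
                    neighbors = adj[walk[-1]]
                    if not neighbors:  # dead end: cut this walk short
                        break
                    walk.append(rng.choice(neighbors))
                f.write(" ".join(map(str, walk)) + "\n")

# toy triangle graph: every node is adjacent to the other two
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
write_walks_single_thread(adj, num_walks=2, walk_length=5, out_path="walks.txt")
```

The resulting file is already in the format word2vec tooling expects, so training can happen in a separate process later.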

@jackyhawk
Author

Thanks very much.
Is there any other repo available that can generate random-walk sequences for big data sets?
I found that when I use a data set with more than 10 million edges, the memory required exceeds my machine's capacity (200 GB).
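One hedged sketch of the "break it into daily sets" idea from the issue title: stream a timestamped edge list from disk and shard it into per-day files, so only one line is ever in memory at a time. The `src dst unix_ts` column layout and file names are assumptions for illustration:

```python
import datetime

def shard_edges_by_day(edge_file, out_prefix):
    """Stream 'src dst unix_ts' lines and route each edge to a per-day
    shard file; scales to edge lists far larger than RAM because the
    full graph is never loaded."""
    handles = {}  # day string -> open file handle
    try:
        with open(edge_file) as f:
            for line in f:
                src, dst, ts = line.split()
                day = datetime.datetime.fromtimestamp(
                    int(ts), datetime.timezone.utc
                ).strftime("%Y-%m-%d")
                if day not in handles:
                    handles[day] = open(f"{out_prefix}_{day}.txt", "w")
                handles[day].write(f"{src} {dst}\n")
    finally:
        for h in handles.values():
            h.close()

# toy edge list: two edges on 1970-01-01, one on 1970-01-02 (UTC)
with open("edges.txt", "w") as f:
    f.write("0 1 1000\n1 2 2000\n2 3 90000\n")
shard_edges_by_day("edges.txt", "edges")
```

If the data spans many days, an LRU cap on the open handles (or pre-sorting by timestamp) avoids hitting file-descriptor limits; each daily shard can then go through the walk-generation step independently.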
