
tweets.2016-09-01 dataset #26

Open
rezwanh001 opened this issue Dec 1, 2019 · 5 comments

Comments

@rezwanh001

""" Creates a vocabulary from a tsv file.
"""

import codecs
import example_helper
from torchmoji.create_vocab import VocabBuilder
from torchmoji.word_generator import TweetWordGenerator

with codecs.open('../../twitterdata/tweets.2016-09-01', 'rU', 'utf-8') as stream:
    wg = TweetWordGenerator(stream)
    vb = VocabBuilder(wg)
    vb.count_all_words()
    vb.save_vocab()

In this code, in order to create a vocabulary, you used the '../../twitterdata/tweets.2016-09-01'
dataset. But where can I find this dataset? Please let me know.
If possible, please share the dataset with me at [email protected].

@KingS770234358

Hello, have you solved this problem?

@rezwanh001
Author

@KingS770234358, this issue is not solved yet.

@KingS770234358

@rezwanh001 As huggingface mentioned in the README file, the code in the 'scripts' folder is used to process the raw data in the 'data' folder. I think 'tweets.2016-09-01' may be the result of that processing.

@KingS770234358

Maybe you should run the script 'convert_all_datasets.py' in the 'scripts' folder.

@anuragvij264

@KingS770234358 I tried running that script and ran into this error:

Converting Olympic
-- Generating ../data/Olympic/own_vocab.pickle 
     done. Coverage: 0.030899113550021062
-- Generating ../data/Olympic/twitter_vocab.pickle 
     done. Coverage: 0.8874630645842128
-- Generating ../data/Olympic/combined_vocab.pickle 
Traceback (most recent call last):
  File "/Users/avij1/Desktop/imp_shit/torchMoji/scripts/convert_all_datasets.py", line 88, in <module>
    data = pickle.load(dataset, fix_imports=True,encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte
     done. Coverage: 0.8874630645842128
Converting PsychExp
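The traceback points at pickle.load(..., encoding='utf-8') failing on byte 0xf0, which is the classic symptom of loading a pickle written under Python 2: Python 2 str objects hold raw bytes, and when those bytes are not valid UTF-8, Python 3's unpickler cannot decode them. This is a sketch of a common workaround (not a verified fix for convert_all_datasets.py itself): load with encoding='latin1', which maps every byte one-to-one, or encoding='bytes', which leaves the fields as raw bytes to decode yourself. The hand-built pickle below is a hypothetical stand-in for the project's data file, crafted to reproduce the same "invalid continuation byte" error.

```python
import pickle

# A minimal pickle equivalent to a Python 2 `str` holding the raw bytes
# b'\xf0ab': 'U' is the SHORT_BINSTRING opcode, \x03 is the length, and
# '.' is STOP. The byte 0xf0 starts a 4-byte UTF-8 sequence, but 'a' is
# not a valid continuation byte, so decoding as UTF-8 fails just like the
# traceback above.
py2_pickle = b'U\x03\xf0ab.'

try:
    pickle.loads(py2_pickle, encoding='utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 load failed:', exc)

# Workaround 1: latin1 defines a character for every byte value, so the
# load always succeeds; re-encode individual fields later if needed.
text = pickle.loads(py2_pickle, encoding='latin1')
print(repr(text))   # a 3-character str, one char per original byte

# Workaround 2: encoding='bytes' keeps Python 2 str objects as raw bytes.
data = pickle.loads(py2_pickle, encoding='bytes')
print(repr(data))   # b'\xf0ab'
```

If this matches the failure in convert_all_datasets.py, changing the encoding argument at the pickle.load call shown in the traceback (line 88) would be the place to try it.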
