Data preprocessing details #4
Comments
Hi,
There are no preprocessing steps. The data for each outlet comes directly from
their RSS feed. If there are any artifacts, they come from the outlet
itself, not from the collection process.
Ben
…On Wed, Apr 6, 2022, 5:28 PM Artidoro Pagnoni ***@***.***> wrote:
Could you give more details on how you preprocess the data? I noticed
underscore characters are present instead of some special characters, for
example.
It would be ideal if you could share the code you used to preprocess the
data. I am comparing another dataset to NELA and I need to apply the same
preprocessing steps to make sure the discriminators don't pick up
preprocessing differences between the datasets.
Thank you for your help!
Thank you for the information! Following up, when you add the "@" signs, what tokenizer do you use?
It seems like the text you provide is tokenized. For example, "Here 's" and "it ’ s" contain spaces, and there are spaces between words and punctuation, which is stylistically uncommon. Do you have any hunch about how these came to be?
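For anyone trying to check a second dataset for the same patterns, here is a minimal sketch that counts the two artifacts described above: contractions split by a space, and a space before punctuation. The regular expressions and the function name `tokenization_artifacts` are assumptions made for illustration; they are not taken from the NELA collection code and will miss other tokenizer conventions.

```python
import re

# Hypothetical patterns for illustration only; not from the NELA code.
# Matches contractions split by whitespace, e.g. "Here 's" or "it ’ s".
SPACED_CONTRACTION = re.compile(r"\b\w+ ['’] ?(?:s|t|re|ve|ll|d|m)\b")
# Matches a word character, a space, then punctuation, e.g. "plan ,".
SPACE_BEFORE_PUNCT = re.compile(r"\w [,.;:!?](?:\s|$)")


def tokenization_artifacts(text):
    """Count simple signs that a text has been run through a tokenizer."""
    return {
        "spaced_contractions": len(SPACED_CONTRACTION.findall(text)),
        "space_before_punct": len(SPACE_BEFORE_PUNCT.findall(text)),
    }


if __name__ == "__main__":
    sample = "Here 's the plan , and it ’ s fine ."
    print(tokenization_artifacts(sample))
```

Running this over a sample of articles from each dataset and comparing the counts gives a rough idea of whether one of them carries tokenizer residue that a discriminator could pick up.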
Hi @artidoro. After looking at your questions I believe there might be some points that need clarification.
Could you give more details on how you preprocess the data? I noticed underscore characters are present instead of some special characters, for example.
It would be ideal if you could share the code you used to preprocess the data. I am comparing another dataset to NELA and I need to apply the same preprocessing steps to make sure the discriminators don't pick up preprocessing differences between the datasets.
Thank you for your help!