MISINFORMATION

Data Container for Misinformation Data from LREC 2022

In this repository you can download close to 1 million samples from our dataset. The samples are divided by disease and stored as JSON files.
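
Since the samples ship as JSON files, a quick first step is to load one and check how it is structured. The following is a minimal sketch, not official starter code: it assumes each file is a single JSON document (not JSON Lines), and it makes no assumption about field names, since the schema is not described here. The helper names `load_samples` and `summarize` are hypothetical, and `dataset_500.json` is only used as an example file name.

```python
import json
from pathlib import Path


def load_samples(path):
    """Load one JSON file and return its top-level object.

    Assumption: the file holds a single JSON document.
    """
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(data):
    """Return (type name, size) of the top-level object; size is None for scalars."""
    if isinstance(data, (list, dict)):
        return type(data).__name__, len(data)
    return type(data).__name__, None


if __name__ == "__main__":
    # Example file name; point this at any JSON file downloaded from the repository.
    path = Path("dataset_500.json")
    if path.exists():
        kind, size = summarize(load_samples(path))
        print(f"{path.name}: top-level {kind} with {size} entries")
    else:
        print(f"{path} not found; download it from the repository first")
```

Inspecting the top-level type first (list or dict) shows how to iterate over the samples before writing any field-specific code.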

All data will be released after the end of the LREC 2022 conference in June. This helps us stay in accordance with citation guidelines while sharing language resources with you directly.

More data is available at the link below, owing to GitHub's space limits:

https://drive.google.com/file/d/1Wa8goi2eo7uKJ8QH2r1RLaZSzk5yruOt/view?usp=sharing

Over time, starter code will be added here to help with accessing the tweets directly. We are also continually working on the annotation pipeline, which will make more parts of our dataset annotated and ready to use.

To access more data, you can do either of the following:

Visit ankitaich.com and use the contact section to email me directly asking for more data.
Email the first author directly, using the contact details in our paper, to request more data.

In your email, specify your name, affiliation, and reason for requesting more data. We'll send the data directly to you.

Our poster and presentation are also available in this repository.

If you use part of our data, or analysis from our paper, please cite the following:

----citation to be updated after the LREC 2022 conference----

@InProceedings{aich-parde:2022:LREC,
  author    = {Aich, Ankit and Parde, Natalie},
  title     = {Telling a Lie: Analyzing the Language of Information and Misinformation during Global Health Events},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {4135--4141},
  abstract  = {The COVID-19 pandemic and other global health events are unfortunately excellent environments for the creation and spread of misinformation, and the language associated with health misinformation may be typified by unique patterns and linguistic markers. Allowing health misinformation to spread unchecked can have devastating ripple effects; however, detecting and stopping its spread requires careful analysis of these linguistic characteristics at scale. We analyze prior investigations focusing on health misinformation, associated datasets, and detection of misinformation during health crises. We also introduce a novel dataset designed for analyzing such phenomena, comprised of 2.8 million news articles and social media posts spanning the early 1900s to the present. Our annotation guidelines result in strong agreement between independent annotators. We describe our methods for collecting this data and follow this with a thorough analysis of the themes and linguistic features that appear in information versus misinformation. Finally, we demonstrate a proof-of-concept misinformation detection task to establish dataset validity, achieving a strong performance benchmark (accuracy = 75%; F1 = 0.7).},
  url       = {https://aclanthology.org/2022.lrec-1.439}
}
