Data Container for Misinformation Data from LREC 2022
This repository hosts close to 1 million samples from our dataset. The samples are divided by disease and stored as JSON files.
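Until the starter code is published, the following minimal Python sketch shows one way to load the per-disease JSON files. It assumes the files have been downloaded into a single local directory and that each file holds a JSON array or object at the top level. The directory name `data` and the helper names `load_samples` and `count_samples` are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path


def load_samples(data_dir):
    """Load every *.json file in data_dir, keyed by file name without extension."""
    samples = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            samples[path.stem] = json.load(handle)
    return samples


def count_samples(samples):
    """Return the number of top-level records in each loaded file."""
    # Assumes each file's top level is a JSON array or object.
    return {name: len(records) for name, records in samples.items()}


if __name__ == "__main__":
    # "data" is a hypothetical directory; point it at your local download.
    for name, total in count_samples(load_samples("data")).items():
        print(f"{name}: {total} records")
```

Keying the result by file name keeps the per-disease split visible without assuming anything about the fields inside each record.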
All data will be released after the end of the LREC 2022 conference in June. This helps us comply with citation guidelines while sharing language resources with you directly.
More data is available at the link below, owing to GitHub's storage limits:
https://drive.google.com/file/d/1Wa8goi2eo7uKJ8QH2r1RLaZSzk5yruOt/view?usp=sharing
Over time, starter code for accessing the tweets directly will be added to this repository. We are also continually working on the annotation pipeline, so that more of the dataset becomes annotated and ready to use.
To access more data, you can do either of the following:
- Visit ankitaich.com and use the contact section to send an email directly to me asking for more data.
- From our paper you can directly email the first author to access more data.
In your email, specify your name, affiliation, and reason for requesting additional data. We'll send the data to you directly.
Our poster and presentation are also available in this repository.
If you use part of our data, or the analysis from our paper, please cite the following:
----citation to be updated after the LREC 2022 conference----
@InProceedings{aich-parde:2022:LREC,
  author    = {Aich, Ankit and Parde, Natalie},
  title     = {Telling a Lie: Analyzing the Language of Information and Misinformation during Global Health Events},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {4135--4141},
  abstract  = {The COVID-19 pandemic and other global health events are unfortunately excellent environments for the creation and spread of misinformation, and the language associated with health misinformation may be typified by unique patterns and linguistic markers. Allowing health misinformation to spread unchecked can have devastating ripple effects; however, detecting and stopping its spread requires careful analysis of these linguistic characteristics at scale. We analyze prior investigations focusing on health misinformation, associated datasets, and detection of misinformation during health crises. We also introduce a novel dataset designed for analyzing such phenomena, comprised of 2.8 million news articles and social media posts spanning the early 1900s to the present. Our annotation guidelines result in strong agreement between independent annotators. We describe our methods for collecting this data and follow this with a thorough analysis of the themes and linguistic features that appear in information versus misinformation. Finally, we demonstrate a proof-of-concept misinformation detection task to establish dataset validity, achieving a strong performance benchmark (accuracy = 75%; F1 = 0.7).},
  url       = {https://aclanthology.org/2022.lrec-1.439}
}