diff --git a/About-us/index.html b/About-us/index.html
index 4d1bd97..49d793e 100755
--- a/About-us/index.html
+++ b/About-us/index.html
@@ -330,6 +330,24 @@
+
+
+
We extend our sincere thanks to Michael Joseph, Manoj Athreya, Sara Kodeiri, Roya Javadi, Fatemeh Tavakoli, Nan Ajmain, Wu Rupert, and the entire team for their valuable assistance in reviewing the data.
diff --git a/dataset.md/index.html b/dataset.md/index.html
index f8ea971..e41a827 100755
--- a/dataset.md/index.html
+++ b/dataset.md/index.html
@@ -350,87 +350,69 @@
Our dataset includes articles from a broad range of reputable news organizations across the political and ideological spectrum, ensuring a comprehensive view of media bias:
The dataset spans from May 6, 2023 to September 6, 2024. Articles were collected using Python scripts incorporating feedparser, newspaper3k, and selenium to enable keyword-based searches, custom date ranges, deduplication of articles, and image downloads.
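The deduplication step mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual collection script: it assumes the canonical link serves as the unique key, and the function names are ours.

```python
import hashlib

def article_id(canonical_link: str) -> str:
    """Derive a stable unique identifier from an article's canonical link."""
    return hashlib.sha256(canonical_link.encode("utf-8")).hexdigest()[:16]

def deduplicate(articles: list[dict]) -> list[dict]:
    """Keep the first occurrence of each article, keyed by canonical link."""
    seen, unique = set(), []
    for art in articles:
        key = article_id(art["canonical_link"])
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique

articles = [
    {"canonical_link": "https://example.com/a", "title": "A"},
    {"canonical_link": "https://example.com/a", "title": "A (syndicated copy)"},
    {"canonical_link": "https://example.com/b", "title": "B"},
]
print(len(deduplicate(articles)))  # 2
```

Hashing the link rather than comparing raw URLs keeps the identifiers fixed-width and avoids storing full URLs in the dedup set.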
-- news_article_analysis (
-unique_id VARCHAR(255) PRIMARY KEY,
-
--- Bias assessment
-text_label VARCHAR(10) CHECK (text_label IN ('Likely', 'Unlikely')),
-image_label VARCHAR(10) CHECK (image_label IN ('Likely', 'Unlikely')),
-text_label_reason TEXT,
-image_label_reason TEXT,
-news_category VARCHAR(255),
-
--- Article content
-title VARCHAR(255),
-canonical_link VARCHAR(255),
-first_paragraph TEXT,
-outlet VARCHAR(100),
-source_url VARCHAR(255),
-topics TEXT,
-text_content TEXT,
-date_published TIMESTAMP,
-
--- Image details
-img_description TEXT,
-image_filename VARCHAR(255),
-
--- Topic modeling
-bertopics TEXT
+The dataset schema is designed for bias assessment and structured analysis of media content, including both textual and image data:
+CREATE TABLE news_article_analysis (
+ unique_id VARCHAR(255) PRIMARY KEY,
+ outlet VARCHAR(255),
+ headline TEXT,
+ article_text TEXT,
+ image_description TEXT,
+ image BLOB,
+ date_published VARCHAR(255),
+ source_url VARCHAR(255),
+ canonical_link VARCHAR(255),
+ new_categories TEXT,
+ news_categories_confidence_scores TEXT,
+ text_label VARCHAR(255),
+ multimodal_label VARCHAR(255)
+)
-)
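As a quick sanity check that the schema is well formed, it can be instantiated directly in SQLite (a sketch; the sample row values are illustrative, and SQLite accepts the VARCHAR/TEXT/BLOB types loosely via type affinity):

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE news_article_analysis (
    unique_id VARCHAR(255) PRIMARY KEY,
    outlet VARCHAR(255),
    headline TEXT,
    article_text TEXT,
    image_description TEXT,
    image BLOB,
    date_published VARCHAR(255),
    source_url VARCHAR(255),
    canonical_link VARCHAR(255),
    new_categories TEXT,
    news_categories_confidence_scores TEXT,
    text_label VARCHAR(255),
    multimodal_label VARCHAR(255)
)
""")
conn.execute(
    "INSERT INTO news_article_analysis (unique_id, outlet, headline, text_label) "
    "VALUES (?, ?, ?, ?)",
    ("1098444910", "CNN", "Example headline", "Unlikely"),
)
row = conn.execute(
    "SELECT outlet, text_label FROM news_article_analysis WHERE unique_id = ?",
    ("1098444910",),
).fetchone()
print(row)  # ('CNN', 'Unlikely')
```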
Access
-Here's the information formatted in MkDocs style:
-Dataset Access
-Train
-https://example.com/datasets/train_data.csv
-
-Val
-https://example.com/datasets/train_data.csv```
-
-### Test
+Dataset Access
+You can access the NewsMediaBias-Plus dataset via the following link:
+NewsMediaBias-Plus Dataset on Hugging Face
+Usage
+To load the full dataset into your Python environment, use the following code:
+from datasets import load_dataset
+
+ds = load_dataset("vector-institute/newsmediabias-plus")
+print(ds) # Displays structure and splits
+print(ds['train'][0]) # Access the first element of the train split
+print(ds['train'][:5]) # Access the first five elements of the train split
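Each split behaves like a sequence of records, and the `datasets` library provides `Dataset.filter` for label-based subsetting. The pattern is sketched below on plain dictionaries so it runs without downloading anything; the field names follow the schema, and the stand-in values are illustrative.

```python
# Stand-in records mirroring the dataset's fields (values are illustrative).
records = [
    {"unique_id": "1", "outlet": "CNN", "text_label": "Unlikely"},
    {"unique_id": "2", "outlet": "The Guardian US", "text_label": "Likely"},
    {"unique_id": "3", "outlet": "Reuters", "text_label": "Likely"},
]

# Equivalent in spirit to: ds["train"].filter(lambda r: r["text_label"] == "Likely")
likely = [r for r in records if r["text_label"] == "Likely"]
print(len(likely))  # 2
```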
-https://example.com/datasets/train_data.csv```
-This format provides clear and organized access to the dataset links for train, validation, and test sets. Users can easily copy the URLs for each dataset split.
+The dataset is also available for download in Parquet format, along with the corresponding images, via Zenodo:
+Download Parquet and Images
+
Sample Data
-Certainly! I'll create a sample dataset with 3 entries based on the schema provided. Here's how you can present it in MkDocs format:
-Sample Data
Article 1: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN
- Unique ID: 1098444910
-- Title: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN
+- Title: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN
- Text: CNN — The woman whose story Alabama Sen. Katie Britt appeared to have shared in the Republican response to the State of the Union as an example of President Joe Biden’s failed immigration policies told CNN she was trafficked before Biden’s presidency and said legislators lack empathy when using the issue of human trafficking for political purposes.
"I hardly ever cooperate with politicians, because it seems to me that they only want an image. They only want a photo — and that to me is not fair," Karla Jacinto told CNN on Sunday.
- Outlet: CNN
-- Source URL: CNN
+- Source URL: CNN
- Topics: 5_bipartisan, border, border deal, border policy, border wall
- Date Published: 2024-03-10
- Image Description:
@@ -452,13 +434,13 @@ Article 2: LA’s graffiti-tagged skyscraper: a work of art – and symbol of city’s wider failings - The Guardian US
- Unique ID: 1148232027
-- Title: LA’s graffiti-tagged skyscraper: a work of art – and symbol of city’s wider failings - The Guardian US
+- Title: LA’s graffiti-tagged skyscraper: a work of art – and symbol of city’s wider failings - The Guardian US
- Text:
An asparagus patch is how the architect Charles Moore described the lackluster skyline of downtown Los Angeles in the 1980s. "The tallest stalk and the shortest stalk are just alike, except that the tallest has shot farther out of the ground." This sprawling city of bungalows has never been known for the quality of its high-rise buildings, and not much has changed since Moore’s day. A 1950s ordinance dictating that every tower must have a flat roof was rescinded in 2014, spawning a handful of clumsy quiffs and crowns atop a fresh crop of swollen glass slabs. It only added further evidence to the notion that architects in this seismic city are probably better suited to staying on the ground.
- Outlet: The Guardian US
-- Source URL: The Guardian US
+- Source URL: The Guardian US
- Topics: affordable housing, public housing, homeowners, housing crisis
- Date Published: 2024-03-17
- Image Description:
diff --git a/images/Vector Logo_Bilingual_FullColour_Horizontal.jpg b/images/Vector Logo_Bilingual_FullColour_Horizontal.jpg
new file mode 100755
index 0000000..764d15a
Binary files /dev/null and b/images/Vector Logo_Bilingual_FullColour_Horizontal.jpg differ
diff --git a/index.html b/index.html
index ab10edc..afe23ae 100755
--- a/index.html
+++ b/index.html
@@ -268,6 +268,39 @@
+
+
+ -
+
+
+ Dataset Access
+
+
+
+
+
-
@@ -427,6 +460,39 @@
+
+
+ -
+
+
+ Dataset Access
+
+
+
+
+
-
@@ -454,44 +520,50 @@
News Media Bias Plus Project
+
Our Mission
-The News Media Bias Plus Project is dedicated to promoting responsible AI development and addressing critical challenges in artificial intelligence, with a special focus on media bias and disinformation. We explore the intersection of AI safety and media integrity, focusing on:
+The News Media Bias Plus Project is a leading initiative in the field of responsible AI, dedicated to advancing the understanding of media bias and disinformation through the lens of artificial intelligence. We focus on the critical intersection between AI safety and media integrity, with the goal of promoting a more balanced and transparent information ecosystem. Our key areas of interest include:
-- Bias Detection: Uncovering and mitigating biases in AI systems and media content
-- Disinformation Challenges: Addressing misinformation and its societal impact
-- Ethical AI: Promoting responsible use of AI in news reporting and production
+- Bias Detection: Identifying and addressing biases in both AI systems and media content.
+- Disinformation Challenges: Combating misinformation and its societal impact.
+- Ethical AI: Advocating for the responsible use of AI in media production and journalism.
Our Framework
-Using state-of-the-art AI methods, we analyze news articles, documents and images to detect and categorize different types of media bias. Our system examines:
+By leveraging cutting-edge AI techniques, we analyze diverse media formats—including news articles, documents, and images—to detect, categorize, and mitigate various forms of media bias. Our comprehensive system assesses:
-- Topic coverage and framing
-- Ideological leanings and sentiment
-- Language patterns and tone
-- Source credibility and transparency
+- Topic Coverage and Framing: How media outlets present and prioritize different subjects.
+- Ideological Leanings and Sentiment: The underlying tone and political inclinations in media.
+- Language Patterns and Tone: Examination of stylistic choices that influence perception.
+- Source Credibility and Transparency: Evaluation of the reliability and openness of information sources.
Key Features
-- Bias Analysis: Compare coverage of specific topics across multiple news sources
-- AI Safety Metrics: Track the use of AI in content analysis and its impact on bias
-- Disinformation Alerts: Detect and flag potential AI-generated fake news or deepfakes
-- Interactive Visualizations: Explore media bias trends and AI influence in journalism
+- Bias Analysis: Comparative analysis of media coverage across multiple sources to highlight differences in bias and framing.
+- AI Safety Metrics: Monitoring and assessing the role of AI in content generation and bias detection, ensuring its responsible deployment.
+- Disinformation Alerts: Identifying potential AI-generated disinformation, including deepfakes and fabricated news.
+- Interactive Visualizations: Engaging tools that allow users to explore media bias trends and the growing influence of AI in journalism.
Why It Matters
-Understanding the role of AI in media bias and disinformation is crucial for:
+In an era where artificial intelligence is increasingly shaping the media landscape, understanding its role in media bias and disinformation is essential for:
-- Promoting media literacy in the age of AI
-- Ensuring responsible AI development in journalism
-- Fostering trust in both AI systems and media institutions
+- Promoting Media Literacy: Empowering individuals to critically evaluate news content and sources.
+- Ensuring Responsible AI Development: Fostering the ethical use of AI in journalism and news production.
+- Building Trust: Strengthening trust in AI systems and media institutions by improving transparency and accountability.
+Dataset Access
+Download from Hugging Face
+NewsMediaBias-Plus Dataset on Hugging Face
+Download Parquet and Images
+
Get Involved & Contact Us
-We welcome your contributions, questions, and feedback. Here's how you can engage with our project:
+We invite researchers, developers, and the broader public to contribute to our efforts in combating media bias and disinformation. You can support our project by:
-- Contribute to our bias and disinformation detection efforts
-- For data access requests or other inquiries, please fill out this form:
+- Collaborating on bias detection and disinformation research.
+- Requesting Data Access or providing feedback by completing the form below:
-Address:
+
Contact Information:
Vector Institute for Artificial Intelligence
Schwartz Reisman Innovation Campus
108 College St., Suite W1140
diff --git a/search/search_index.json b/search/search_index.json
index 5cb4484..5d04c94 100755
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"News Media Bias Plus Project","text":""},{"location":"#our-mission","title":"Our Mission","text":"
The News Media Bias Plus Project is dedicated to promoting responsible AI development and addressing critical challenges in artificial intelligence, with a special focus on media bias and disinformation. We explore the intersection of AI safety and media integrity, focusing on:
- Bias Detection: Uncovering and mitigating biases in AI systems and media content
- Disinformation Challenges: Addressing misinformation and its societal impact
- Ethical AI: Promoting responsible use of AI in news reporting and production
"},{"location":"#our-framework","title":"Our Framework","text":"Using state-of-the-art AI methods, we analyze news articles, documents and images to detect and categorize different types of media bias. Our system examines:
- Topic coverage and framing
- Ideological leanings and sentiment
- Language patterns and tone
- Source credibility and transparency
"},{"location":"#key-features","title":"Key Features","text":" - Bias Analysis: Compare coverage of specific topics across multiple news sources
- AI Safety Metrics: Track the use of AI in content analysis and its impact on bias
- Disinformation Alerts: Detect and flag potential AI-generated fake news or deepfakes
- Interactive Visualizations: Explore media bias trends and AI influence in journalism
"},{"location":"#why-it-matters","title":"Why It Matters","text":"Understanding the role of AI in media bias and disinformation is crucial for:
- Promoting media literacy in the age of AI
- Ensuring responsible AI development in journalism
- Fostering trust in both AI systems and media institutions
"},{"location":"#get-involved-contact-us","title":"Get Involved & Contact Us","text":"We welcome your contributions, questions, and feedback. Here's how you can engage with our project:
- Contribute to our bias and disinformation detection efforts
- For data access requests or other inquiries, please fill out this form:
Loading\u2026 Address: Vector Institute for Artificial Intelligence Schwartz Reisman Innovation Campus 108 College St., Suite W1140 Toronto, ON M5G 0C6
Email: shaina.raza@vectorinstitute.ai
"},{"location":"About-us/","title":"Team","text":""},{"location":"About-us/#project-lead","title":"Project Lead","text":"Dr. Shaina Raza: Applied Machine Learning Scientist, Responsible AI"},{"location":"About-us/#contributors","title":"Contributors","text":"Marcelo Lotif: Senior Software Developer & ML Engineer , Vector Institute Caesar Saleh: Undergraduate Researcher, University of Toronto, Vector Institute Emrul Hasan: Ph.D. Candidate, Toronto Metropolitan University, Vector Institute Veronica Chatrath: Associate Technical Program Manager , Vector Institute Franklin Ogidi: Associate Machine Learning Specialist , Vector Institute Roya Javadi: Machine Learning Software Engineer, Vector Institute Sina Salimian: Research Assistant, Univerity of Calgary. Aditya Jain: Computational Scientist, Meta Dr. Gias Uddin: Professor, York University"},{"location":"Annotation/","title":"Annotation Framework","text":""},{"location":"Annotation/#1-annotation-guidelines-and-procedure","title":"1. Annotation Guidelines and Procedure","text":"This framework outlines a structured approach for annotating news articles, incorporating both text and images. The process begins with human annotators labeling a carefully selected subset of the data. Once this subset is annotated, Large Language Models (LLMs) take over to expand these labels across the entire dataset. By aligning annotations with corresponding text and images, the result is a Silver Standard Dataset.
"},{"location":"Annotation/#2-quality-control","title":"2. Quality Control","text":"To ensure the reliability and consistency of annotations, multiple quality control mechanisms are in place. Cohen's Kappa is employed to measure inter-annotator agreement, highlighting areas that may require further clarification. In addition to automated checks, human reviewers manually evaluate a portion of the annotations. This dual approach maintains the quality and accuracy of the labeled data.
"},{"location":"Annotation/#3-evaluation-pipeline","title":"3. Evaluation Pipeline","text":"The evaluation process is designed to convert the Silver Standard Dataset into a Gold Standard Dataset. Initially, an LLM-based jury provides judgments on the quality of the annotations. These judgments are then reviewed by human experts, who validate, refine, or discard annotations as necessary. This collaborative effort between machines and human reviewers ensures a high-quality final dataset.
"},{"location":"Annotation/#4-system-training-and-testing","title":"4. System Training and Testing","text":"Once the Gold Standard Dataset is established, it serves as the foundation for training and testing models in multi-modal bias detection. This process ensures the model\u2019s performance remains robust across various data types, including both text and images.
"},{"location":"Benchmark/","title":"Benchmarking for Annotation Framework","text":""},{"location":"Benchmark/#purpose","title":"Purpose","text":"The purpose of this benchmarking page is to evaluate the performance of Small Language Models (SLMs) and Large Language Models (LLMs) in our annotation framework. In this context, we refer to SLMs as those with fewer parameters, typically less than 15 million, such as BERT and GPT-2. LLMs, like Llama3, Mistral, Gemma, Phi, have significantly more parameters, often in the hundreds of millions to billions. This relative difference in scale allows us to compare the efficiency to handle more complex tasks and datasets, while SLMs are more efficient for simpler tasks or environments with limited resources.
"},{"location":"Benchmark/#benchmarking-on-texts","title":"Benchmarking on Texts","text":""},{"location":"Benchmark/#small-language-models-slms","title":"Small Language Models (SLMs)","text":"Model Training Method Architecture Classes Carbon Emissions (tCO\u2082e) BERT-base-uncased Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A BERT-large-uncased Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A DistilBERT Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A RoBERTa-base Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A GPT2 Fine-tuning Decoder Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A BART Fine-tuning Encoder-decoder Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A"},{"location":"Benchmark/#large-language-models-llms","title":"Large Language Models (LLMs)","text":""},{"location":"Benchmark/#llama-models","title":"Llama Models","text":"Model Training Method Architecture Classes Carbon Emissions (tCO\u2082e) Llama 3.1-8B-instruct 0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.1-8B Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-1B-Instruct 0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-1B Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-3B-instruct 0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-8B-sequence classifier Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias, Real/Unbias N/A Llama 3 (70B) N/A N/A N/A 1900"},{"location":"Benchmark/#other-llms","title":"Other LLMs","text":"Model Training Method Architecture Classes Carbon Emissions (tCO\u2082e) Mistral-v0.3-instruct 
0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias, Real/Unbias N/A Mistral-v0.3 Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias, Real/Unbias N/A Mistral-large-instruct (IFT) IFT N/A Fake/Bias, Real/Unbias N/A Gemma-2-9b-Instruct 0-shot, 5-shot, IFT Decoder-only, Causal LM Fake/Bias, Real/Unbias N/A Gemma-2-9b Fine-tuning Decoder-only, sequence classification Fake/Bias, Real/Unbias N/A"},{"location":"Benchmark/#benchmarking-on-multimodality","title":"Benchmarking on Multimodality","text":""},{"location":"Benchmark/#small-language-models-slms_1","title":"Small Language Models (SLMs)","text":"Model Training Method Architecture (text-image) BERT + ResNet-34 Fine-tuning Encoder-Encoder SAFE (Text-CNN + Image2Sentence) Fine-tuning Encoder-Encoder SpotFake (XLNET + VGG-19) Fine-tuning Encoder-encoder MCAN (BERT + VGG-19/CNN) Fine-tuning Encoder-encoder FND-CLIP (BERT/ResNet + CLIP) Fine-tuning Encoder-encoder InstructBlipV Fine-tuning Encoder-encoder DistilBERT + CLIP Fine-tuning Encoder-encoder"},{"location":"Benchmark/#large-language-models-llms_1","title":"Large Language Models (LLMs)","text":"Model Training Method Architecture (text-image) google/paligemma-3b-pt-224 Instruction fine-tuning Decoder-encoder microsoft/Phi-3-vision-128k-instruct 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder Pixtral-12B-2409 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder LLaVA-1.6 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder Llama-3.2-11B-Vision-Instruct 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder meta-llama/Llama-3.2-11B-Vision Fine-tuning Decoder-encoder meta-llama/Llama-Guard-3-11B-Vision Inference Decoder-encoder"},{"location":"Publications/","title":"Publications","text":""},{"location":"Publications/#journal-articles","title":"Journal Articles","text":""},{"location":"Publications/#conference-papers","title":"Conference Papers","text":""},{"location":"Publications/#media-coverage","title":"Media 
Coverage","text":"List instances of media coverage, including articles, interviews, and mentions in popular press. Provide links and a brief description of each piece.
"},{"location":"dataset.md/","title":"Dataset Details","text":""},{"location":"dataset.md/#news-sources","title":"News Sources","text":" - CNN, Fox News, CBS News, ABC News, New York Times
- Washington Post, BBC, USA Today, Wall Street Journal
- AP News, Politico, New York Post, Forbes, Reuters
- Bloomberg, Al Jazeera, PBS NewsHour, The Guardian
- Newsmax, HuffPost, CNBC, C-SPAN, The Economist
- Financial Times, Time, Newsweek, The Atlantic
- The New Yorker, The Hill, ProPublica, Axios
- National Review, The Daily Beast, Daily Kos
- Washington Examiner, The Federalist, OANN
- Daily Caller, Breitbart, CBC, Toronto Sun
- Global News, The Globe and Mail, National Post
"},{"location":"dataset.md/#date-range","title":"Date Range","text":" - Date range : 2023-05-06 till 2024-09-06 Python script using feedparser, newspaper3k, and selenium to scrape articles from multiple news sources, supporting keyword searches, custom date ranges, and outputting data to CSV files, with features for deduplication and image downloading.
"},{"location":"dataset.md/#key-features","title":"Key Features","text":" - Scrapes articles from multiple news sources
- Supports keyword-based searches
- Downloads and stores article content and images
- Deduplicates articles based on unique identifiers
- Outputs data to CSV files
"},{"location":"dataset.md/#dataset-schema","title":"Dataset Schema","text":"-- news_article_analysis (\nunique_id VARCHAR(255) PRIMARY KEY,\n\n-- Bias assessment\ntext_label VARCHAR(10) CHECK (text_label IN ('Likely', 'Unlikely')),\nimage_label VARCHAR(10) CHECK (image_label IN ('Likely', 'Unlikely')),\ntext_label_reason TEXT,\nimage_label_reason TEXT,\nnews_category VARCHAR(255),\n\n-- Article content\ntitle VARCHAR(255),\ncanonical_link VARCHAR(255),\nfirst_paragraph TEXT,\noutlet VARCHAR(100),\nsource_url VARCHAR(255),\ntopics TEXT,\ntext_content TEXT,\ndate_published TIMESTAMP,\n\n-- Image details\nimg_description TEXT,\nimage_filename VARCHAR(255),\n\n-- Topic modeling\nbertopics TEXT\n
)
"},{"location":"dataset.md/#access","title":"Access","text":"Here's the information formatted in MkDocs style:
"},{"location":"dataset.md/#dataset-access","title":"Dataset Access","text":""},{"location":"dataset.md/#train","title":"Train","text":"https://example.com/datasets/train_data.csv\n
"},{"location":"dataset.md/#val","title":"Val","text":"https://example.com/datasets/train_data.csv```\n\n### Test\n
https://example.com/datasets/train_data.csv```
This format provides clear and organized access to the dataset links for train, validation, and test sets. Users can easily copy the URLs for each dataset split.
"},{"location":"dataset.md/#sample-data","title":"Sample Data","text":"Certainly! I'll create a sample dataset with 3 entries based on the schema provided. Here's how you can present it in MkDocs format:
"},{"location":"dataset.md/#sample-data_1","title":"Sample Data","text":""},{"location":"dataset.md/#article-1-sex-trafficking-victim-says-sen-katie-britt-telling-her-story-during-sotu-rebuttal-is-not-fair-cnn","title":"Article 1: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN","text":" - Unique ID: 1098444910
- Title: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN
- Text: CNN \u2014 The woman whose story Alabama Sen. Katie Britt appeared to have shared in the Republican response to the State of the Union as an example of President Joe Biden\u2019s failed immigration policies told CNN she was trafficked before Biden\u2019s presidency and said legislators lack empathy when using the issue of human trafficking for political purposes.
\"I hardly ever cooperate with politicians, because it seems to me that they only want an image. They only want a photo \u2014 and that to me is not fair,\" Karla Jacinto told CNN on Sunday.
- Outlet: CNN
- Source URL: CNN
- Topics: 5_bipartisan, border, border deal, border policy, border wall
- Date Published: 2024-03-10
- Image Description:
The image shows a person standing at a podium with a microphone, appearing to be giving a speech or presentation. The individual is wearing a pink blazer with a white shirt underneath. The background is indistinct but suggests an indoor setting with a wooden structure, possibly a room with a high ceiling. There are no visible logos, text, or other identifying features that provide context to the event or the person's identity.
- Text Label: Unlikely
- Text Bias Analysis:
\"failed immigration policies\", \"lack of empathy\", \"despicable\", \"almost entirely preventable\"
- Image Label: Unlikely
- Image Analysis:
The image alone does not provide enough context to analyze potential biases. The choice of the image could be influenced by the event's significance, the person's role, or the visual impact of the pink blazer. Without additional information, it is not possible to determine if the image is biased or Unbiased. The image does not appear to evoke strong emotions as it is a straightforward depiction of a person at a podium. There are no clear indications of stereotypes or oversimplification of complex issues in the image.
"},{"location":"dataset.md/#article-2-las-graffiti-tagged-skyscraper-a-work-of-art-and-symbol-of-citys-wider-failings-the-guardian-us","title":"Article 2: LA\u2019s graffiti-tagged skyscraper: a work of art \u2013 and symbol of city\u2019s wider failings - The Guardian US","text":" - Unique ID: 1148232027
- Title: LA\u2019s graffiti-tagged skyscraper: a work of art \u2013 and symbol of city\u2019s wider failings - The Guardian US
- Text:
An asparagus patch is how the architect Charles Moore described the lackluster skyline of downtown Los Angeles in the 1980s. \"The tallest stalk and the shortest stalk are just alike, except that the tallest has shot farther out of the ground.\" This sprawling city of bungalows has never been known for the quality of its high-rise buildings, and not much has changed since Moore\u2019s day. A 1950s ordinance dictating that every tower must have a flat roof was rescinded in 2014, spawning a handful of clumsy quiffs and crowns atop a fresh crop of swollen glass slabs. It only added further evidence to the notion that architects in this seismic city are probably better suited to staying on the ground.
- Outlet: The Guardian US
- Source URL: The Guardian US
- Topics: affordable housing, public housing, homeowners, housing crisis
- Date Published: 2024-03-17
- Image Description:
The image shows a tall, multi-story building with numerous windows. The building is covered in various graffiti tags and symbols, with words like 'READY', 'SHAKA', 'RAKM', 'TOOL', 'TOLT', 'KERZ', 'SMK', 'DZER', 'MSK', and 'OBER' prominently displayed. The building is situated in an urban environment with other structures visible in the background. The sky is clear, suggesting it might be daytime. The image is taken from a high angle, looking down on the building.
- Text Label: Likely
- Text Bias Analysis:
\"mind-numbingly generic glass boxes\", \"abandoned\", \"doing nothing\", \"if they ain\u2019t gon finish the job\", \"This building has needed love for years\", \"the streets of LA are happy to make something out of it\", \"the developer had ceased paying\"
- Image Label: Likely
- Image Analysis:
The image and accompanying headline from The Guardian US suggest a critical perspective on the state of urban development and the impact of graffiti on architecture. The choice of this image may be intended to highlight the issue of urban decay and the lack of maintenance in certain areas. The graffiti tags could be seen as a form of artistic expression, but within the context of the headline, they are likely to be interpreted as a symbol of the city's wider failings. The image does not provide a balanced view, as it focuses on the negative aspects of the building's appearance. The framing of the image, with the building as the central focus and the surrounding environment in the background, may lead viewers to associate the building's condition with the overall state of the city.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"News Media Bias Plus Project","text":""},{"location":"#our-mission","title":"Our Mission","text":"The News Media Bias Plus Project is a leading initiative in the field of responsible AI, dedicated to advancing the understanding of media bias and disinformation through the lens of artificial intelligence. We focus on the critical intersection between AI safety and media integrity, with the goal of promoting a more balanced and transparent information ecosystem. Our key areas of interest include:
- Bias Detection: Identifying and addressing biases in both AI systems and media content.
- Disinformation Challenges: Combating misinformation and its societal impact.
- Ethical AI: Advocating for the responsible use of AI in media production and journalism.
"},{"location":"#our-framework","title":"Our Framework","text":"By leveraging cutting-edge AI techniques, we analyze diverse media formats\u2014including news articles, documents, and images\u2014to detect, categorize, and mitigate various forms of media bias. Our comprehensive system assesses:
- Topic Coverage and Framing: How media outlets present and prioritize different subjects.
- Ideological Leanings and Sentiment: The underlying tone and political inclinations in media.
- Language Patterns and Tone: Examination of stylistic choices that influence perception.
- Source Credibility and Transparency: Evaluation of the reliability and openness of information sources.
"},{"location":"#key-features","title":"Key Features","text":" - Bias Analysis: Comparative analysis of media coverage across multiple sources to highlight differences in bias and framing.
- AI Safety Metrics: Monitoring and assessing the role of AI in content generation and bias detection, ensuring its responsible deployment.
- Disinformation Alerts: Identifying potential AI-generated disinformation, including deepfakes and fabricated news.
- Interactive Visualizations: Engaging tools that allow users to explore media bias trends and the growing influence of AI in journalism.
"},{"location":"#why-it-matters","title":"Why It Matters","text":"In an era where artificial intelligence is increasingly shaping the media landscape, understanding its role in media bias and disinformation is essential for:
- Promoting Media Literacy: Empowering individuals to critically evaluate news content and sources.
- Ensuring Responsible AI Development: Fostering the ethical use of AI in journalism and news production.
- Building Trust: Strengthening trust in AI systems and media institutions by improving transparency and accountability.
"},{"location":"#dataset-access","title":"Dataset Access","text":""},{"location":"#download-from-huggingface","title":"Download from Huggingface","text":"NewsMediaBias-Plus Dataset on Hugging Face
"},{"location":"#download-parquet-and-images","title":"Download Parquet and Images","text":"Zenodo Record
"},{"location":"#get-involved-contact-us","title":"Get Involved & Contact Us","text":"We invite researchers, developers, and the broader public to contribute to our efforts in combating media bias and disinformation. You can support our project by:
- Collaborating on bias detection and disinformation research.
- Requesting Data Access or providing feedback by completing the form below:
Contact Information: Vector Institute for Artificial Intelligence Schwartz Reisman Innovation Campus 108 College St., Suite W1140 Toronto, ON M5G 0C6
Email: shaina.raza@vectorinstitute.ai
"},{"location":"About-us/","title":"Team","text":""},{"location":"About-us/#project-lead","title":"Project Lead","text":"Dr. Shaina Raza Applied Machine Learning Scientist, Responsible AI"},{"location":"About-us/#contributors","title":"Contributors","text":"Marcelo Lotif Senior Software Developer & ML Engineer, Vector Institute Caesar Saleh Undergraduate Researcher, University of Toronto, Vector Institute Emrul Hasan Ph.D. Candidate, Toronto Metropolitan University, Vector Institute Veronica Chatrath Associate Technical Program Manager, Vector Institute Franklin Ogidi Associate Machine Learning Specialist, Vector Institute Roya Javadi Machine Learning Software Engineer, Vector Institute Sina Salimian Research Assistant, University of Calgary Maximus Powers Ethical Spectacle Research Mark Coatsworth Vector Institute"},{"location":"About-us/#advisors","title":"Advisors","text":"Dr. Gias Uddin Professor, York University Dr. Aditya Jain Computational Scientist, Meta Dr. Arash Afkanpour Advisor"},{"location":"About-us/#acknowledgments","title":"Acknowledgments","text":"We extend our sincere thanks to Michael Joseph, Manoj Athreya, Sara Kodeiri, Roya Javadi, Fatemeh Tavakoli, Nan Ajmain, Wu Rupert, and the entire team for their valuable assistance in reviewing the data.
"},{"location":"Annotation/","title":"Annotation Framework","text":""},{"location":"Annotation/#1-annotation-guidelines-and-procedure","title":"1. Annotation Guidelines and Procedure","text":"This framework outlines a structured approach for annotating news articles, incorporating both text and images. The process begins with human annotators labeling a carefully selected subset of the data. Once this subset is annotated, Large Language Models (LLMs) take over to expand these labels across the entire dataset. By aligning annotations with corresponding text and images, the result is a Silver Standard Dataset.
"},{"location":"Annotation/#2-quality-control","title":"2. Quality Control","text":"To ensure the reliability and consistency of annotations, multiple quality control mechanisms are in place. Cohen's Kappa is employed to measure inter-annotator agreement, highlighting areas that may require further clarification. In addition to automated checks, human reviewers manually evaluate a portion of the annotations. This dual approach maintains the quality and accuracy of the labeled data.
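The inter-annotator agreement check described above can be made concrete with a small pure-Python computation of Cohen's kappa over binary "Likely"/"Unlikely" labels (the two annotator label lists here are invented for illustration, not taken from the dataset):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Inter-annotator agreement, corrected for agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a1 = ["Likely", "Unlikely", "Likely", "Likely"]
a2 = ["Likely", "Unlikely", "Unlikely", "Likely"]
print(round(cohens_kappa(a1, a2), 3))  # 0.5
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, flagging label definitions that may need clarification.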
"},{"location":"Annotation/#3-evaluation-pipeline","title":"3. Evaluation Pipeline","text":"The evaluation process is designed to convert the Silver Standard Dataset into a Gold Standard Dataset. Initially, an LLM-based jury provides judgments on the quality of the annotations. These judgments are then reviewed by human experts, who validate, refine, or discard annotations as necessary. This collaborative effort between machines and human reviewers ensures a high-quality final dataset.
"},{"location":"Annotation/#4-system-training-and-testing","title":"4. System Training and Testing","text":"Once the Gold Standard Dataset is established, it serves as the foundation for training and testing models in multi-modal bias detection. This process ensures the model\u2019s performance remains robust across various data types, including both text and images.
"},{"location":"Benchmark/","title":"Benchmarking for Annotation Framework","text":""},{"location":"Benchmark/#purpose","title":"Purpose","text":"The purpose of this benchmarking page is to evaluate the performance of Small Language Models (SLMs) and Large Language Models (LLMs) in our annotation framework. In this context, we refer to SLMs as models with relatively few parameters, typically up to a few hundred million, such as BERT and GPT-2. LLMs, such as Llama 3, Mistral, Gemma, and Phi, have significantly more parameters, often in the billions. This difference in scale lets us compare how well LLMs handle more complex tasks and datasets, while SLMs remain more efficient for simpler tasks or environments with limited resources.
"},{"location":"Benchmark/#benchmarking-on-texts","title":"Benchmarking on Texts","text":""},{"location":"Benchmark/#small-language-models-slms","title":"Small Language Models (SLMs)","text":"Model Training Method Architecture Classes Carbon Emissions (tCO\u2082e) BERT-base-uncased Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A BERT-large-uncased Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A DistilBERT Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A RoBERTa-base Fine-tuning Encoder-only Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A GPT2 Fine-tuning Decoder Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A BART Fine-tuning Encoder-decoder Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A"},{"location":"Benchmark/#large-language-models-llms","title":"Large Language Models (LLMs)","text":""},{"location":"Benchmark/#llama-models","title":"Llama Models","text":"Model Training Method Architecture Classes Carbon Emissions (tCO\u2082e) Llama 3.1-8B-instruct 0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.1-8B Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-1B-Instruct 0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-1B Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-3B-instruct 0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias/Likely (0), Real/Unbias/Unlikely (1) N/A Llama 3.2-8B-sequence classifier Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias, Real/Unbias N/A Llama 3 (70B) N/A N/A N/A 1900"},{"location":"Benchmark/#other-llms","title":"Other LLMs","text":"Model Training Method Architecture Classes Carbon Emissions (tCO\u2082e) Mistral-v0.3-instruct 
0-shot, 5-shot, IFT Decoder-only autoregressive CausalLM Fake/Bias, Real/Unbias N/A Mistral-v0.3 Fine-tuning Decoder-only autoregressive sequence classification Fake/Bias, Real/Unbias N/A Mistral-large-instruct (IFT) IFT N/A Fake/Bias, Real/Unbias N/A Gemma-2-9b-Instruct 0-shot, 5-shot, IFT Decoder-only, Causal LM Fake/Bias, Real/Unbias N/A Gemma-2-9b Fine-tuning Decoder-only, sequence classification Fake/Bias, Real/Unbias N/A"},{"location":"Benchmark/#benchmarking-on-multimodality","title":"Benchmarking on Multimodality","text":""},{"location":"Benchmark/#small-language-models-slms_1","title":"Small Language Models (SLMs)","text":"Model Training Method Architecture (text-image) BERT + ResNet-34 Fine-tuning Encoder-Encoder SAFE (Text-CNN + Image2Sentence) Fine-tuning Encoder-Encoder SpotFake (XLNET + VGG-19) Fine-tuning Encoder-encoder MCAN (BERT + VGG-19/CNN) Fine-tuning Encoder-encoder FND-CLIP (BERT/ResNet + CLIP) Fine-tuning Encoder-encoder InstructBlipV Fine-tuning Encoder-encoder DistilBERT + CLIP Fine-tuning Encoder-encoder"},{"location":"Benchmark/#large-language-models-llms_1","title":"Large Language Models (LLMs)","text":"Model Training Method Architecture (text-image) google/paligemma-3b-pt-224 Instruction fine-tuning Decoder-encoder microsoft/Phi-3-vision-128k-instruct 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder Pixtral-12B-2409 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder LLaVA-1.6 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder Llama-3.2-11B-Vision-Instruct 0-shot, 5-shot, Instruction fine-tuning Decoder-encoder meta-llama/Llama-3.2-11B-Vision Fine-tuning Decoder-encoder meta-llama/Llama-Guard-3-11B-Vision Inference Decoder-encoder"},{"location":"Publications/","title":"Publications","text":""},{"location":"Publications/#journal-articles","title":"Journal Articles","text":""},{"location":"Publications/#conference-papers","title":"Conference Papers","text":""},{"location":"Publications/#media-coverage","title":"Media 
Coverage","text":"List instances of media coverage, including articles, interviews, and mentions in popular press. Provide links and a brief description of each piece.
"},{"location":"dataset.md/","title":"Dataset Details","text":""},{"location":"dataset.md/#news-sources","title":"News Sources","text":"Our dataset includes articles from a broad range of reputable news organizations across the political and ideological spectrum, ensuring a comprehensive view of media bias:
- Major U.S. News Outlets: CNN, Fox News, CBS News, ABC News, New York Times, Washington Post, USA Today, Wall Street Journal, AP News, Politico, New York Post, Forbes, Reuters, Bloomberg
- Global & Alternative News Sources: BBC, Al Jazeera, PBS NewsHour, The Guardian, Newsmax, HuffPost, CNBC, C-SPAN, The Economist, Financial Times, Time, Newsweek, The Atlantic, The New Yorker, The Hill, ProPublica, Axios
- Conservative & Progressive News Outlets: National Review, The Daily Beast, Daily Kos, Washington Examiner, The Federalist, OANN, Daily Caller, Breitbart
- Canadian News Sources: CBC, Toronto Sun, Global News, The Globe and Mail, National Post
"},{"location":"dataset.md/#date-range","title":"Date Range","text":"The dataset spans from May 6, 2023 to September 6, 2024. Articles were collected using Python scripts incorporating feedparser, newspaper3k, and selenium to enable keyword-based searches, custom date ranges, deduplication of articles, and image downloads.
"},{"location":"dataset.md/#key-features","title":"Key Features","text":" - Multi-source Scraping: Collects articles from diverse media outlets.
- Keyword-based Search: Allows focused scraping on specific topics or terms.
- Comprehensive Data Collection: Captures both text and images from articles.
- Deduplication: Ensures only unique articles are included in the dataset.
- Structured Output: Outputs data in CSV format for easy analysis and processing.
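The deduplication step listed above can be sketched by hashing each article's normalized title together with its canonical link and keeping only first occurrences (a minimal illustration; the dict field names and the hashing choice are assumptions, not the project's actual implementation):

```python
import hashlib

def dedupe_articles(articles):
    """Keep the first occurrence of each article, keyed by a hash of
    its normalized title and canonical link."""
    seen, unique = set(), []
    for art in articles:
        key_src = (art["title"].strip().lower() + "|" + art["canonical_link"]).encode("utf-8")
        key = hashlib.sha256(key_src).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique

articles = [
    {"title": "Border Deal Collapses", "canonical_link": "https://example.com/a"},
    {"title": "Border Deal Collapses", "canonical_link": "https://example.com/a"},  # duplicate
    {"title": "Housing Crisis Deepens", "canonical_link": "https://example.com/b"},
]
print(len(dedupe_articles(articles)))  # 2
```

In practice the scraped records would come from feedparser/newspaper3k output before being written to CSV, with the same keying idea applied across outlets.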
"},{"location":"dataset.md/#dataset-schema","title":"Dataset Schema","text":"The dataset schema is designed for bias assessment and structured analysis of media content, including both textual and image data:
CREATE TABLE news_article_analysis (\n    unique_id VARCHAR(255) PRIMARY KEY,\n    outlet VARCHAR(255),\n    headline TEXT,\n    article_text TEXT,\n    image_description TEXT,\n    image BLOB,\n    date_published VARCHAR(255),\n    source_url VARCHAR(255),\n    canonical_link VARCHAR(255),\n    new_categories TEXT,\n    news_categories_confidence_scores TEXT,\n    text_label VARCHAR(255),\n    multimodal_label VARCHAR(255)\n)\n
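For local exploration, the schema above can be instantiated with Python's built-in sqlite3 module, which accepts the VARCHAR declarations but stores them with TEXT affinity (a sketch only; the inserted row is a hypothetical example, not a real dataset record):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE news_article_analysis (
    unique_id VARCHAR(255) PRIMARY KEY,
    outlet VARCHAR(255),
    headline TEXT,
    article_text TEXT,
    image_description TEXT,
    image BLOB,
    date_published VARCHAR(255),
    source_url VARCHAR(255),
    canonical_link VARCHAR(255),
    new_categories TEXT,
    news_categories_confidence_scores TEXT,
    text_label VARCHAR(255),
    multimodal_label VARCHAR(255)
)
""")
# Insert a hypothetical row (values are illustrative only).
conn.execute(
    "INSERT INTO news_article_analysis (unique_id, outlet, headline, text_label) "
    "VALUES (?, ?, ?, ?)",
    ("0000000001", "Example Outlet", "Example headline", "Unlikely"),
)
row = conn.execute(
    "SELECT outlet, text_label FROM news_article_analysis WHERE unique_id = ?",
    ("0000000001",),
).fetchone()
print(row)  # ('Example Outlet', 'Unlikely')
```

SQLite's PRIMARY KEY constraint also enforces the uniqueness of unique_id, which mirrors the deduplication guarantee described under Key Features.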
"},{"location":"dataset.md/#access","title":"Access","text":""},{"location":"dataset.md/#dataset-access","title":"Dataset Access","text":"You can access the NewsMediaBias-Plus dataset via the following link:
NewsMediaBias-Plus Dataset on Hugging Face
"},{"location":"dataset.md/#usage","title":"Usage","text":"To load the full dataset into your Python environment, use the following code:
from datasets import load_dataset\n\nds = load_dataset(\"vector-institute/newsmediabias-plus\")\nprint(ds) # Displays structure and splits\nprint(ds['train'][0]) # Access the first element of the train split\nprint(ds['train'][:5]) # Access the first five elements of the train split\n
The dataset is also available for download in Parquet format, along with the corresponding images, via Zenodo:
"},{"location":"dataset.md/#download-parquet-and-images","title":"Download Parquet and Images","text":"Zenodo Record
"},{"location":"dataset.md/#sample-data","title":"Sample Data","text":""},{"location":"dataset.md/#article-1-sex-trafficking-victim-says-sen-katie-britt-telling-her-story-during-sotu-rebuttal-is-not-fair-cnn","title":"Article 1: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN","text":" - Unique ID: 1098444910
- Title: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN
- Text: CNN \u2014 The woman whose story Alabama Sen. Katie Britt appeared to have shared in the Republican response to the State of the Union as an example of President Joe Biden\u2019s failed immigration policies told CNN she was trafficked before Biden\u2019s presidency and said legislators lack empathy when using the issue of human trafficking for political purposes.
\"I hardly ever cooperate with politicians, because it seems to me that they only want an image. They only want a photo \u2014 and that to me is not fair,\" Karla Jacinto told CNN on Sunday.
- Outlet: CNN
- Source URL: CNN
- Topics: 5_bipartisan, border, border deal, border policy, border wall
- Date Published: 2024-03-10
- Image Description:
The image shows a person standing at a podium with a microphone, appearing to be giving a speech or presentation. The individual is wearing a pink blazer with a white shirt underneath. The background is indistinct but suggests an indoor setting with a wooden structure, possibly a room with a high ceiling. There are no visible logos, text, or other identifying features that provide context to the event or the person's identity.
- Text Label: Unlikely
- Text Bias Analysis:
\"failed immigration policies\", \"lack of empathy\", \"despicable\", \"almost entirely preventable\"
- Image Label: Unlikely
- Image Analysis:
The image alone does not provide enough context to analyze potential biases. The choice of the image could be influenced by the event's significance, the person's role, or the visual impact of the pink blazer. Without additional information, it is not possible to determine if the image is biased or Unbiased. The image does not appear to evoke strong emotions as it is a straightforward depiction of a person at a podium. There are no clear indications of stereotypes or oversimplification of complex issues in the image.
"},{"location":"dataset.md/#article-2-las-graffiti-tagged-skyscraper-a-work-of-art-and-symbol-of-citys-wider-failings-the-guardian-us","title":"Article 2: LA\u2019s graffiti-tagged skyscraper: a work of art \u2013 and symbol of city\u2019s wider failings - The Guardian US","text":" - Unique ID: 1148232027
- Title: LA\u2019s graffiti-tagged skyscraper: a work of art \u2013 and symbol of city\u2019s wider failings - The Guardian US
- Text:
An asparagus patch is how the architect Charles Moore described the lackluster skyline of downtown Los Angeles in the 1980s. \"The tallest stalk and the shortest stalk are just alike, except that the tallest has shot farther out of the ground.\" This sprawling city of bungalows has never been known for the quality of its high-rise buildings, and not much has changed since Moore\u2019s day. A 1950s ordinance dictating that every tower must have a flat roof was rescinded in 2014, spawning a handful of clumsy quiffs and crowns atop a fresh crop of swollen glass slabs. It only added further evidence to the notion that architects in this seismic city are probably better suited to staying on the ground.
- Outlet: The Guardian US
- Source URL: The Guardian US
- Topics: affordable housing, public housing, homeowners, housing crisis
- Date Published: 2024-03-17
- Image Description:
The image shows a tall, multi-story building with numerous windows. The building is covered in various graffiti tags and symbols, with words like 'READY', 'SHAKA', 'RAKM', 'TOOL', 'TOLT', 'KERZ', 'SMK', 'DZER', 'MSK', and 'OBER' prominently displayed. The building is situated in an urban environment with other structures visible in the background. The sky is clear, suggesting it might be daytime. The image is taken from a high angle, looking down on the building.
- Text Label: Likely
- Text Bias Analysis:
\"mind-numbingly generic glass boxes\", \"abandoned\", \"doing nothing\", \"if they ain\u2019t gon finish the job\", \"This building has needed love for years\", \"the streets of LA are happy to make something out of it\", \"the developer had ceased paying\"
- Image Label: Likely
- Image Analysis:
The image and accompanying headline from The Guardian US suggest a critical perspective on the state of urban development and the impact of graffiti on architecture. The choice of this image may be intended to highlight the issue of urban decay and the lack of maintenance in certain areas. The graffiti tags could be seen as a form of artistic expression, but within the context of the headline, they are likely to be interpreted as a symbol of the city's wider failings. The image does not provide a balanced view, as it focuses on the negative aspects of the building's appearance. The framing of the image, with the building as the central focus and the surrounding environment in the background, may lead viewers to associate the building's condition with the overall state of the city.
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index ff0bd0b..8645a5a 100755
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ