Deployed 2952c49 with MkDocs version: 1.6.0
llm-work committed Oct 21, 2024
1 parent a95191c commit 8539db1
Showing 6 changed files with 204 additions and 96 deletions.
74 changes: 64 additions & 10 deletions About-us/index.html
@@ -330,6 +330,24 @@
</span>
</a>

</li>

<li class="md-nav__item">
<a href="#advisors" class="md-nav__link">
<span class="md-ellipsis">
Advisors
</span>
</a>

</li>

<li class="md-nav__item">
<a href="#acknowledgments" class="md-nav__link">
<span class="md-ellipsis">
Acknowledgments
</span>
</a>

</li>

</ul>
@@ -382,6 +400,24 @@
</span>
</a>

</li>

<li class="md-nav__item">
<a href="#advisors" class="md-nav__link">
<span class="md-ellipsis">
Advisors
</span>
</a>

</li>

<li class="md-nav__item">
<a href="#acknowledgments" class="md-nav__link">
<span class="md-ellipsis">
Acknowledgments
</span>
</a>

</li>

</ul>
@@ -402,38 +438,56 @@
<h1 id="team">Team</h1>
<h2 id="project-lead">Project Lead</h2>
<div style="display: flex; align-items: center; margin-bottom: 20px;">
<span style="font-weight: bold; margin-right: 10px;">Dr. Shaina Raza:</span>
<span style="font-weight: bold; margin-right: 10px;">Dr. Shaina Raza</span>
<span>Applied Machine Learning Scientist, Responsible AI</span>
</div>

<h2 id="contributors">Contributors</h2>
<div style="display: flex; flex-wrap: wrap; gap: 10px;">
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Marcelo Lotif:</strong> Senior Software Developer & ML Engineer , Vector Institute
<strong>Marcelo Lotif</strong><br> Senior Software Developer & ML Engineer, Vector Institute
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Caesar Saleh</strong><br> Undergraduate Researcher, University of Toronto, Vector Institute
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Caesar Saleh:</strong> Undergraduate Researcher, University of Toronto, Vector Institute
<strong>Emrul Hasan</strong><br> Ph.D. Candidate, Toronto Metropolitan University, Vector Institute
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Emrul Hasan:</strong> Ph.D. Candidate, Toronto Metropolitan University, Vector Institute
<strong>Veronica Chatrath</strong><br> Associate Technical Program Manager, Vector Institute
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Veronica Chatrath:</strong> Associate Technical Program Manager , Vector Institute
<strong>Franklin Ogidi</strong><br> Associate Machine Learning Specialist, Vector Institute
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Franklin Ogidi:</strong> Associate Machine Learning Specialist , Vector Institute
<strong>Roya Javadi</strong><br> Machine Learning Software Engineer, Vector Institute
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Roya Javadi:</strong> Machine Learning Software Engineer, Vector Institute
<strong>Sina Salimian</strong><br> Research Assistant, University of Calgary
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Sina Salimian:</strong> Research Assistant, Univerity of Calgary.
<strong>Maximus Powers</strong><br> Ethical Spectacle Research
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Aditya Jain:</strong> Computational Scientist, Meta
<strong>Mark Coatsworth</strong><br> Vector Institute
</div>
</div>

<h2 id="advisors">Advisors</h2>
<div style="display: flex; flex-wrap: wrap; gap: 10px;">
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Dr. Gias Uddin</strong><br> Professor, York University
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Dr. Gias Uddin:</strong> Professor, York University
<strong>Dr. Aditya Jain</strong><br> Computational Scientist, Meta
</div>
<div style="flex: 1 1 300px; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
<strong>Dr. Arash Afkanpour</strong><br> Advisor
</div>
</div>

<h2 id="acknowledgments">Acknowledgments</h2>
<p>We extend our sincere thanks to Michael Joseph, Manoj Athreya, Sara Kodeiri, Roya Javadi, Fatemeh Tavakoli, Nan Ajmain, Wu Rupert, and the entire team for their valuable assistance in reviewing the data.</p>



110 changes: 46 additions & 64 deletions dataset.md/index.html
@@ -350,87 +350,69 @@

<h1 id="dataset-details">Dataset Details</h1>
<h2 id="news-sources">News Sources</h2>
<p>Our dataset includes articles from a broad range of reputable news organizations across the political and ideological spectrum, ensuring a comprehensive view of media bias:</p>
<ul>
<li>CNN, Fox News, CBS News, ABC News, New York Times</li>
<li>Washington Post, BBC, USA Today, Wall Street Journal</li>
<li>AP News, Politico, New York Post, Forbes, Reuters</li>
<li>Bloomberg, Al Jazeera, PBS NewsHour, The Guardian</li>
<li>Newsmax, HuffPost, CNBC, C-SPAN, The Economist</li>
<li>Financial Times, Time, Newsweek, The Atlantic</li>
<li>The New Yorker, The Hill, ProPublica, Axios</li>
<li>National Review, The Daily Beast, Daily Kos</li>
<li>Washington Examiner, The Federalist, OANN</li>
<li>Daily Caller, Breitbart, CBC, Toronto Sun</li>
<li>Global News, The Globe and Mail, National Post</li>
<li><strong>Major U.S. News Outlets:</strong> CNN, Fox News, CBS News, ABC News, New York Times, Washington Post, USA Today, Wall Street Journal, AP News, Politico, New York Post, Forbes, Reuters, Bloomberg</li>
<li><strong>Global &amp; Alternative News Sources:</strong> BBC, Al Jazeera, PBS NewsHour, The Guardian, Newsmax, HuffPost, CNBC, C-SPAN, The Economist, Financial Times, Time, Newsweek, The Atlantic, The New Yorker, The Hill, ProPublica, Axios</li>
<li><strong>Conservative &amp; Progressive News Outlets:</strong> National Review, The Daily Beast, Daily Kos, Washington Examiner, The Federalist, OANN, Daily Caller, Breitbart</li>
<li><strong>Canadian News Sources:</strong> CBC, Toronto Sun, Global News, The Globe and Mail, National Post</li>
</ul>
<h2 id="date-range">Date Range</h2>
<ul>
<li>Date range : 2023-05-06 till 2024-09-06
Python script using feedparser, newspaper3k, and selenium to scrape articles from multiple news sources, supporting keyword searches, custom date ranges, and outputting data to CSV files, with features for deduplication and image downloading.</li>
</ul>
<p>The dataset spans from <strong>May 6, 2023</strong> to <strong>September 6, 2024</strong>. Articles were collected using Python scripts incorporating <strong>feedparser</strong>, <strong>newspaper3k</strong>, and <strong>selenium</strong> to enable keyword-based searches, custom date ranges, deduplication of articles, and image downloads.</p>
<h2 id="key-features">Key Features</h2>
<ul>
<li>Scrapes articles from multiple news sources</li>
<li>Supports keyword-based searches</li>
<li>Downloads and stores article content and images</li>
<li>Deduplicates articles based on unique identifiers</li>
<li>Outputs data to CSV files</li>
<li><strong>Multi-source Scraping:</strong> Collects articles from diverse media outlets.</li>
<li><strong>Keyword-based Search:</strong> Allows focused scraping on specific topics or terms.</li>
<li><strong>Comprehensive Data Collection:</strong> Captures both text and images from articles.</li>
<li><strong>Deduplication:</strong> Ensures only unique articles are included in the dataset.</li>
<li><strong>Structured Output:</strong> Outputs data in CSV format for easy analysis and processing.</li>
</ul>
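<p>As a rough illustration of the deduplication and CSV output features listed above, the following is a minimal sketch and not the project's actual scraper. The helper names <code>deduplicate</code> and <code>to_csv</code> are hypothetical, and the sample rows reuse only the IDs, outlets, and dates shown in the sample data on this page.</p>

```python
import csv
import io


def deduplicate(articles, key="unique_id"):
    """Keep the first article seen for each unique identifier (hypothetical helper)."""
    seen = set()
    unique = []
    for article in articles:
        identifier = article[key]
        if identifier in seen:
            continue  # skip repeats of an identifier we already kept
        seen.add(identifier)
        unique.append(article)
    return unique


def to_csv(articles, fieldnames):
    """Serialize article dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(articles)
    return buffer.getvalue()


# Illustrative rows only; the third row repeats the first identifier.
sample = [
    {"unique_id": "1098444910", "outlet": "CNN", "date_published": "2024-03-10"},
    {"unique_id": "1148232027", "outlet": "The Guardian US", "date_published": "2024-03-17"},
    {"unique_id": "1098444910", "outlet": "CNN", "date_published": "2024-03-10"},
]

unique_articles = deduplicate(sample)
csv_text = to_csv(unique_articles, ["unique_id", "outlet", "date_published"])
print(csv_text)
```

<p>Keeping the first occurrence is one possible policy; a real pipeline might instead prefer the most recently scraped copy.</p>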
<hr />
<h1 id="dataset-schema">Dataset Schema</h1>
<pre><code>-- news_article_analysis (
unique_id VARCHAR(255) PRIMARY KEY,

-- Bias assessment
text_label VARCHAR(10) CHECK (text_label IN ('Likely', 'Unlikely')),
image_label VARCHAR(10) CHECK (image_label IN ('Likely', 'Unlikely')),
text_label_reason TEXT,
image_label_reason TEXT,
news_category VARCHAR(255),

-- Article content
title VARCHAR(255),
canonical_link VARCHAR(255),
first_paragraph TEXT,
outlet VARCHAR(100),
source_url VARCHAR(255),
topics TEXT,
text_content TEXT,
date_published TIMESTAMP,

-- Image details
img_description TEXT,
image_filename VARCHAR(255),

-- Topic modeling
bertopics TEXT
<p>The dataset schema is designed for bias assessment and structured analysis of media content, including both textual and image data:</p>
<pre><code class="language-sql">-- news_article_analysis (
unique_id VARCHAR(255) PRIMARY KEY,
outlet VARCHAR(255),
headline TEXT,
article_text TEXT,
image_description TEXT,
image BLOB,
date_published VARCHAR(255),
source_url VARCHAR(255),
canonical_link VARCHAR(255),
new_categories TEXT,
news_categories_confidence_scores TEXT,
text_label VARCHAR(255),
multimodal_label VARCHAR(255)
)
</code></pre>
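<p>The listing above opens with <code>--</code>, which SQL treats as a comment, so it would not run as written. The sketch below assumes the intent is a <code>CREATE TABLE</code> statement and tries it against an in-memory SQLite database; the choice of SQLite is an assumption made only for illustration, and the inserted row reuses values from the sample data on this page with the label columns left empty.</p>

```python
import sqlite3

# Column names and types are copied from the schema listing; the
# CREATE TABLE wrapper is an assumption about the intended statement.
SCHEMA = """
CREATE TABLE news_article_analysis (
    unique_id VARCHAR(255) PRIMARY KEY,
    outlet VARCHAR(255),
    headline TEXT,
    article_text TEXT,
    image_description TEXT,
    image BLOB,
    date_published VARCHAR(255),
    source_url VARCHAR(255),
    canonical_link VARCHAR(255),
    new_categories TEXT,
    news_categories_confidence_scores TEXT,
    text_label VARCHAR(255),
    multimodal_label VARCHAR(255)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Insert one illustrative row; columns not listed here stay NULL.
conn.execute(
    "INSERT INTO news_article_analysis "
    "(unique_id, outlet, date_published, source_url) VALUES (?, ?, ?, ?)",
    ("1098444910", "CNN", "2024-03-10", "https://www.cnn.com"),
)

# PRAGMA table_info yields one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(news_article_analysis)")]
rows = conn.execute(
    "SELECT unique_id, outlet FROM news_article_analysis WHERE outlet = ?",
    ("CNN",),
).fetchall()
print(columns)
print(rows)
```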
<p>)</p>
<h2 id="access">Access</h2>
<p>Here's the information formatted in MkDocs style:</p>
<h2 id="dataset-access">Dataset Access</h2>
<h3 id="train">Train</h3>
<pre><code>https://example.com/datasets/train_data.csv
</code></pre>
<h3 id="val">Val</h3>
<pre><code>https://example.com/datasets/train_data.csv```

### Test
<h3 id="dataset-access">Dataset Access</h3>
<p>You can access the <strong>NewsMediaBias-Plus</strong> dataset via the following link:</p>
<p><a href="https://huggingface.co/datasets/vector-institute/newsmediabias-plus">NewsMediaBias-Plus Dataset on Hugging Face</a></p>
<h3 id="usage">Usage</h3>
<p>To load the full dataset into your Python environment, use the following code:</p>
<pre><code class="language-python">from datasets import load_dataset

ds = load_dataset(&quot;vector-institute/newsmediabias-plus&quot;)
print(ds) # Displays structure and splits
print(ds['train'][0]) # Access the first element of the train split
print(ds['train'][:5]) # Access the first five elements of the train split
</code></pre>
<p>https://example.com/datasets/train_data.csv```</p>
<p>This format provides clear and organized access to the dataset links for train, validation, and test sets. Users can easily copy the URLs for each dataset split.</p>
<p>The dataset is also available for download in Parquet format, along with the corresponding images, via Zenodo:</p>
<h3 id="download-parquet-and-images">Download Parquet and Images</h3>
<p><a href="https://zenodo.org/records/13961155">Zenodo Record</a></p>
<h2 id="sample-data">Sample Data</h2>
<p>Certainly! I'll create a sample dataset with 3 entries based on the schema provided. Here's how you can present it in MkDocs format:</p>
<h2 id="sample-data_1">Sample Data</h2>
<h2 id="article-1-sex-trafficking-victim-says-sen-katie-britt-telling-her-story-during-sotu-rebuttal-is-not-fair-cnn">Article 1: Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN</h2>
<ul>
<li><strong>Unique ID</strong>: 1098444910</li>
<li><strong>Title</strong>: <a href="https://www.cnn.com/2024/03/10/politics/katie-britt-sex-trafficking-victim-interview/index.html">Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN</a></li>
<li><strong>Title</strong>: <a href="https://www.cnn.com/2024/03/10/politics/katie-britt-sex-trafficking-victim-interview/index.html" target="_blank">Sex trafficking victim says Sen. Katie Britt telling her story during SOTU rebuttal is 'not fair' - CNN</a></li>
<li><strong>Text</strong>: CNN — The woman whose story Alabama Sen. Katie Britt appeared to have shared in the Republican response to the State of the Union as an example of President Joe Biden’s failed immigration policies told CNN she was trafficked before Biden’s presidency and said legislators lack empathy when using the issue of human trafficking for political purposes. <blockquote>
<p>"I hardly ever cooperate with politicians, because it seems to me that they only want an image. They only want a photo — and that to me is not fair," Karla Jacinto told CNN on Sunday.</p>
</blockquote>
</li>
<li><strong>Outlet</strong>: CNN</li>
<li><strong>Source URL</strong>: <a href="https://www.cnn.com">CNN</a></li>
<li><strong>Source URL</strong>: <a href="https://www.cnn.com" target="_blank">CNN</a></li>
<li><strong>Topics</strong>: 5_bipartisan, border, border deal, border policy, border wall</li>
<li><strong>Date Published</strong>: 2024-03-10</li>
<li><strong>Image Description</strong>: <blockquote>
@@ -452,13 +434,13 @@ <h2 id="article-1-sex-trafficking-victim-says-sen-katie-britt-telling-her-story-
<h2 id="article-2-las-graffiti-tagged-skyscraper-a-work-of-art-and-symbol-of-citys-wider-failings-the-guardian-us">Article 2: LA’s graffiti-tagged skyscraper: a work of art – and symbol of city’s wider failings - The Guardian US</h2>
<ul>
<li><strong>Unique ID</strong>: 1148232027</li>
<li><strong>Title</strong>: <a href="https://www.theguardian.com/us-news/2024/mar/17/los-angeles-graffiti-abandoned-skyscraper-downtown">LA’s graffiti-tagged skyscraper: a work of art – and symbol of city’s wider failings - The Guardian US</a></li>
<li><strong>Title</strong>: <a href="https://www.theguardian.com/us-news/2024/mar/17/los-angeles-graffiti-abandoned-skyscraper-downtown" target="_blank">LA’s graffiti-tagged skyscraper: a work of art – and symbol of city’s wider failings - The Guardian US</a></li>
<li><strong>Text</strong>: <blockquote>
<p>An asparagus patch is how the architect Charles Moore described the lackluster skyline of downtown Los Angeles in the 1980s. "The tallest stalk and the shortest stalk are just alike, except that the tallest has shot farther out of the ground." This sprawling city of bungalows has never been known for the quality of its high-rise buildings, and not much has changed since Moore’s day. A 1950s ordinance dictating that every tower must have a flat roof was rescinded in 2014, spawning a handful of clumsy quiffs and crowns atop a fresh crop of swollen glass slabs. It only added further evidence to the notion that architects in this seismic city are probably better suited to staying on the ground.</p>
</blockquote>
</li>
<li><strong>Outlet</strong>: The Guardian US</li>
<li><strong>Source URL</strong>: <a href="https://www.theguardian.com">The Guardian US</a></li>
<li><strong>Source URL</strong>: <a href="https://www.theguardian.com" target="_blank">The Guardian US</a></li>
<li><strong>Topics</strong>: affordable housing, public housing, homeowners, housing crisis</li>
<li><strong>Date Published</strong>: 2024-03-17</li>
<li><strong>Image Description</strong>: <blockquote>