ProText Analyzer

Note

I did not use the NLTK package for some tasks. Instead, I used:

TextBlob for sentiment analysis

spaCy for various text processing tasks

Syllapy for counting syllables in words

Project Structure

🗂 Directories and Files

📝 Cleaned Articles

cleaned_articles: Contains cleaned articles ready for analysis.

📂 Extracted Articles

extracted_articles: Holds raw articles extracted for the project.

📚 Master Dictionary

master_dictionary: Collection of files for sentiment analysis.
- cleaned_negative_words.txt: List of cleaned negative words.
- cleaned_positive_words.txt: List of cleaned positive words.
- negative-words.txt: Raw negative words for sentiment analysis.
- positive-words.txt: Raw positive words for sentiment analysis.

📑 Project Introduction

project_introduction: Overview and objectives of the project.

🧪 Test Assessment

test_assessment: Contains test assignments and notebooks.
- dataextraction.ipynb: Jupyter Notebook for data extraction tasks.
- testassessment.ipynb: Jupyter Notebook for additional test assessments.

💻 Code and Markdown

testassignment: Code and markdown files related to assignments.
- Code + Markdown/: Contains code snippets and explanations.
- Run All/: Script to execute all code cells in notebooks.

🚫 Stop Words

Stop Words: Directory with various stop words files for preprocessing.

📊 Text Analysis

text_analysis: Files for performing text analysis.
- textanalysis.ipynb: Jupyter Notebook for text analysis.
- sentiment_analysis.log: Log file for sentiment analysis results.
- textblob_sentiment_result.csv: CSV file with sentiment analysis results.

📈 Additional Files

additional_files: Summary results and metrics.
- analysis_results.csv: Various analysis results.
- final_text_analysis_results.xlsx: Final compiled analysis results.

Blackcoffer Test Assignment

Assignment Overview

Objective: Extract textual data from provided URLs and perform text analysis.
Data Extraction:
- Input from Input.xlsx
- Tools: Python, BeautifulSoup, Selenium, Scrapy.
Data Analysis:
- Output in CSV or Excel format.
- Variables include Positive Score, Negative Score, Polarity Score, etc.
Timeline: Duration of 6 days.
Submission: Via Google Form with required files.

Methodology

Sentiment Analysis: Clean text using stop words, create dictionaries of positive/negative words, and extract variables.
Readability Analysis: Calculate average sentence length, percentage of complex words, and Fog Index.
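The dictionary-based scoring described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function name `score_sentiment`, the tokenizing regex, and the inline word sets in the usage example are hypothetical, and the polarity and subjectivity formulas (with a small constant to avoid division by zero) are an assumption about how the derived variables are defined.

```python
import re


def score_sentiment(text, positive_words, negative_words, stop_words=frozenset()):
    """Return dictionary-based sentiment scores for a piece of text."""
    # Lowercase the text and keep alphabetic tokens only (a simplifying assumption),
    # then drop stop words before counting.
    tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stop_words]

    positive = sum(1 for t in tokens if t in positive_words)
    negative = sum(1 for t in tokens if t in negative_words)

    eps = 1e-6  # assumed smoothing constant; avoids division by zero
    return {
        "positive_score": positive,
        "negative_score": negative,
        "polarity_score": (positive - negative) / (positive + negative + eps),
        "subjectivity_score": (positive + negative) / (len(tokens) + eps),
    }


if __name__ == "__main__":
    # Hypothetical word sets; the project would load them from master_dictionary.
    print(score_sentiment(
        "The service was good but the food was bad and terrible.",
        positive_words={"good"},
        negative_words={"bad", "terrible"},
        stop_words={"the", "was", "but", "and"},
    ))
```

In practice the word sets would come from the files in master_dictionary and the stop word lists from the Stop Words directory.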

Objective:
The ProText-Analyzer project extracts article content from provided URLs and performs text analysis tasks such as sentiment scoring and readability measurement. The results are saved in a structured CSV or Excel file, ready for review and further use.

Project Overview

The goal of ProText-Analyzer is to:

Extract Textual Data: Fetch the article content from URLs provided in the Input.xlsx file.
Perform Textual Analysis: Calculate the following metrics:
- Sentiment scores (positive, negative, polarity, subjectivity)
- Readability scores (Fog Index, Avg. Sentence Length)
- Word count, syllable count, and other word statistics

Technologies Used

Python 🐍
- Libraries:
  - TextBlob for sentiment analysis
  - spaCy for text processing tasks (tokenization, POS tagging, etc.)
  - Syllapy for syllable counting
  - BeautifulSoup for HTML parsing during data extraction
  - Requests for handling HTTP requests
  - Pandas for data management
Excel/CSV for input/output handling

Installation

Clone the repository to your local machine:

git clone https://github.com/rubydamodar/ProText-Analyzer.git
cd ProText-Analyzer

Install the required Python libraries:
```
pip install -r requirements.txt
```

Data Extraction Process

The ProText-Analyzer extracts the article title and body from each URL listed in the Input.xlsx file and stores the text for further analysis.

Process Overview:

Read Input File: Load the URLs and their associated IDs from Input.xlsx.
Extract Article Content:
- Fetch HTML content using requests.
- Parse the HTML using BeautifulSoup to extract the article's title and body.
- Save the extracted content into text files named after the URL_ID.

File Management:

Each article's content is saved to its own text file, which the analysis step reads later.
Network failures and file I/O errors are caught and handled during extraction.
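The extraction and file-handling steps above might look something like the sketch below. The helper names (`fetch_html`, `parse_article`, `save_article`), the choice of `<h1>` for the title and `<p>` tags for the body, and the default output directory are assumptions for illustration; the real selectors depend on the markup of the target pages.

```python
from pathlib import Path


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, or None on a network error."""
    # Third-party import kept local so save_article works without requests installed.
    import requests

    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Could not fetch {url}: {exc}")
        return None
    return response.text


def parse_article(html):
    """Return (title, body) from an HTML string (assumed <h1> title, <p> body)."""
    from bs4 import BeautifulSoup  # third-party, imported locally as above

    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1")
    title = title_tag.get_text(strip=True) if title_tag else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return title, "\n".join(paragraphs)


def save_article(url_id, title, body, out_dir="extracted_articles"):
    """Write one article to <out_dir>/<url_id>.txt; return the path, or None on failure."""
    path = Path(out_dir) / f"{url_id}.txt"
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{title}\n\n{body}", encoding="utf-8")
    except OSError as exc:
        print(f"Could not write {path}: {exc}")
        return None
    return path
```

Each row of Input.xlsx would supply the URL_ID and the URL passed to these helpers. Returning None on failure lets a calling loop skip a bad URL and continue with the rest.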

Text Analysis Process

The extracted text undergoes several analysis steps to compute the following variables:

Sentiment Analysis:
- Implemented using TextBlob to compute Positive Score, Negative Score, Polarity Score, and Subjectivity Score.
- Text is cleaned by removing stop words and irrelevant characters.
Readability Analysis:
- Calculated using the Gunning Fog Index.
- Additional metrics: Average Sentence Length and Percentage of Complex Words.
Word-Level Metrics:
- Word Count, Complex Word Count, Syllable Count per Word (via syllapy), Personal Pronouns Count (using regex), and Average Word Length.
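As a rough illustration of the readability and word-level metrics listed above, the sketch below uses the standard Gunning Fog formula, 0.4 × (average sentence length + percentage of complex words), where a complex word has three or more syllables. The function names are hypothetical. To stay dependency-free it splits sentences with a regex and estimates syllables by counting vowel groups; these are stand-ins for spaCy and syllapy, which the project itself uses, so the numbers will differ somewhat.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels (stand-in for syllapy)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_metrics(text):
    """Return readability and word-level metrics, or None for empty text."""
    # Naive sentence split on ., ! and ? (stand-in for spaCy sentence segmentation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None

    # Gunning Fog treats words of three or more syllables as complex.
    complex_count = sum(1 for w in words if count_syllables(w) >= 3)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * complex_count / len(words)

    return {
        "word_count": len(words),
        "complex_word_count": complex_count,
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex,
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }


if __name__ == "__main__":
    print(readability_metrics("The cat sat. Beautiful animals wander everywhere."))
```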

Output Structure

The results are saved in Excel/CSV format as per the structure outlined in Output Data Structure.xlsx. The following variables are included:

Positive Score
Negative Score
Polarity Score
Subjectivity Score
Average Sentence Length
Complex Word Count
Word Count
Syllable Count
Personal Pronouns Count
Average Word Length
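One possible way to write these variables to a CSV file is sketched below. It uses the standard-library `csv` module rather than pandas to stay self-contained; the `write_results` helper and the leading `URL_ID` column are assumptions, and the authoritative column order is whatever Output Data Structure.xlsx specifies.

```python
import csv

OUTPUT_COLUMNS = [
    "URL_ID",  # assumed identifier column, matching the extracted file names
    "Positive Score",
    "Negative Score",
    "Polarity Score",
    "Subjectivity Score",
    "Average Sentence Length",
    "Complex Word Count",
    "Word Count",
    "Syllable Count",
    "Personal Pronouns Count",
    "Average Word Length",
]


def write_results(rows, path):
    """Write a list of per-article dicts to a CSV file with a fixed column order."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # Missing values are written as empty cells; unknown keys raise ValueError.
        writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

Fixing the column list in one place keeps every output file in the same order regardless of how the per-article dictionaries were built.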

How to Run

Data Extraction: Run the script to extract article data from the URLs:
```
python data_extraction.py
```
Text Analysis: Run the text analysis script to process the extracted articles:
```
python text_analysis.py
```

The results will be saved in the output directory in .csv or .xlsx format.

Challenges and Solutions

Error Handling: Network and file-related errors are caught and handled so that a single failure does not stop the run.
Text Processing: spaCy handles tokenization and POS tagging, and syllapy handles syllable counting.
Personal Pronouns: A regex captures personal pronouns while excluding the abbreviation "US", which would otherwise be mistaken for the pronoun "us".
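The pronoun-counting approach could be sketched as below. The pronoun list (I, we, my, ours, us) and the names `PRONOUN_RE` and `count_personal_pronouns` are assumptions for illustration; the project's actual pattern may differ.

```python
import re

# Word boundaries keep "us" from matching inside words such as "bus".
PRONOUN_RE = re.compile(r"\b(i|we|my|ours|us)\b", re.IGNORECASE)


def count_personal_pronouns(text):
    """Count personal pronouns, skipping the all-caps abbreviation "US"."""
    matches = PRONOUN_RE.findall(text)
    # Matching is case-insensitive, so filter out the exact string "US" afterwards.
    return sum(1 for m in matches if m != "US")


if __name__ == "__main__":
    sample = "We told US officials that it helped us and I agree with my team."
    print(count_personal_pronouns(sample))
```

Filtering after the match keeps the pattern short while still counting "Us" at the start of a sentence.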

Contributing

We welcome contributions to enhance ProText-Analyzer! To contribute:

Fork the repository.
Create a new branch for your changes.
Submit a pull request with a detailed description of your changes.

License

This project is licensed under the MIT License.

Project Maintainer

Ruby Poddar
Email: rubypoddarr@gmail.com
