# 📝 ProText-Analyzer

**Objective**:
The **ProText-Analyzer** project extracts article content from provided URLs and performs text analysis tasks such as sentiment scoring and readability measurement. The results are written to a structured output file, ready for review and further use.

## Project Overview

The goal of **ProText-Analyzer** is to:
1. **Extract Textual Data**: Fetch the article content from URLs provided in the `Input.xlsx` file.
2. **Perform Textual Analysis**: Calculate the following metrics:
- Sentiment scores (positive, negative, polarity, subjectivity)
- Readability scores (Fog Index, Avg. Sentence Length)
- Word count, syllable count, and other word statistics

---

## Technologies Used

- **Python** 🐍
- Libraries:
  - `TextBlob` for sentiment analysis
  - `spaCy` for text processing tasks (tokenization, POS tagging, etc.)
  - `syllapy` for syllable counting
  - `BeautifulSoup` for HTML parsing during data extraction
  - `requests` for handling HTTP requests
- **Pandas** for data management
- **Excel/CSV** for input/output handling

---

## Installation

1. Clone the repository to your local machine:
```bash
git clone https://github.com/rubydamodar/ProText-Analyzer.git
cd ProText-Analyzer
```

2. Install the required Python libraries:
```bash
pip install -r requirements.txt
```
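
3. *(If needed)* Download the language resources used by `spaCy` and `TextBlob`. Whether this step is required depends on your environment, and `en_core_web_sm` is an assumption about which spaCy model the scripts load:
```bash
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
```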

---

## Data Extraction Process

The **ProText-Analyzer** extracts the article title and body from each URL listed in the `Input.xlsx` file and stores the text for further analysis.

### Process Overview:
1. **Read Input File**: Load the URLs and their associated IDs from `Input.xlsx`.
2. **Extract Article Content**:
- Fetch HTML content using `requests`.
- Parse the HTML using `BeautifulSoup` to extract the article's title and body.
- Save the extracted content into text files named after the `URL_ID`.

### File Management:
- Each article's content is saved to its own text file named after its `URL_ID`, keeping the extraction output organized for the analysis step.
- Network and file I/O errors are handled explicitly so that problems with a single URL can be diagnosed without derailing the whole run.
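
The extraction and file-saving steps above can be sketched as follows; the function names and the `<p>`-based body heuristic are illustrative, not the exact implementation in `data_extraction.py`:

```python
import requests
from bs4 import BeautifulSoup

def parse_article(html: str) -> tuple[str, str]:
    """Return (title, body) from raw HTML; the body here is simply all <p> text."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return title, body

def extract_article(url: str) -> tuple[str, str]:
    """Fetch a URL and parse it; raises requests.HTTPError on bad status codes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_article(response.text)

def save_article(url_id: str, title: str, body: str) -> None:
    """Write the extracted text to <URL_ID>.txt; report file errors instead of crashing."""
    try:
        with open(f"{url_id}.txt", "w", encoding="utf-8") as f:
            f.write(title + "\n\n" + body)
    except OSError as exc:
        print(f"Could not save {url_id}: {exc}")
```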

---

## Text Analysis Process

The extracted text undergoes several analysis steps to compute the following variables:

1. **Sentiment Analysis**:
- Implemented using `TextBlob` to compute **Positive Score**, **Negative Score**, **Polarity Score**, and **Subjectivity Score**.
- Text is cleaned by removing stop words and irrelevant characters.

2. **Readability Analysis**:
   - Based on the Gunning Fog formula: `Fog Index = 0.4 × (Average Sentence Length + Percentage of Complex Words)`.
   - Reported metrics: **Average Sentence Length**, **Percentage of Complex Words**, and **Fog Index**.

3. **Word-Level Metrics**:
- **Word Count**, **Complex Word Count**, **Syllable Count per Word** (via `syllapy`), **Personal Pronouns Count** (using regex), and **Average Word Length**.

---

## Output Structure

The results are saved in **Excel/CSV** format as per the structure outlined in `Output Data Structure.xlsx`. The following variables are included:
- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Complex Word Count
- Word Count
- Syllable Count
- Personal Pronouns Count
- Average Word Length
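
A sketch of writing the output with `pandas`; the column headers and sample values below are placeholders mirroring the list above, and the authoritative header names come from `Output Data Structure.xlsx`:

```python
import pandas as pd

# One placeholder row with hypothetical values; real rows come from the analysis step.
rows = [{
    "URL_ID": "article_001",   # hypothetical ID
    "POSITIVE SCORE": 12,
    "NEGATIVE SCORE": 3,
    "POLARITY SCORE": 0.42,
    "SUBJECTIVITY SCORE": 0.55,
    "AVG SENTENCE LENGTH": 18.7,
    "COMPLEX WORD COUNT": 40,
    "WORD COUNT": 350,
    "SYLLABLE COUNT": 560,
    "PERSONAL PRONOUNS": 4,
    "AVG WORD LENGTH": 4.8,
}]
df = pd.DataFrame(rows)
df.to_csv("Output.csv", index=False)
# df.to_excel("Output.xlsx", index=False)  # requires the openpyxl engine
```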

---

## How to Run

1. **Data Extraction**:
Run the script to extract article data from the URLs:
```bash
python data_extraction.py
```

2. **Text Analysis**:
Run the text analysis script to process the extracted articles:
```bash
python text_analysis.py
```

The results will be saved in the output directory in `.csv` or `.xlsx` format.

---

## Challenges and Solutions

- **Error Handling**: Implemented robust error handling to manage potential network and file-related issues.
- **Text Processing**: Utilized advanced tools like `spaCy` for precise text tokenization and POS tagging, and `syllapy` for syllable counting.
- **Personal Pronouns**: A regular expression counts personal pronouns while excluding the capitalized country abbreviation "US", which would otherwise be miscounted as the pronoun "us".
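
The pronoun-counting idea can be illustrated with a short regex sketch (the exact pattern used in the project may differ):

```python
import re

def count_personal_pronouns(text: str) -> int:
    """Count I, we, my, ours, us -- but skip the country abbreviation 'US'."""
    matches = re.findall(r"\b(I|we|my|ours|us)\b", text, flags=re.IGNORECASE)
    # 'US' in capitals is almost always the country, so exclude that exact form.
    return sum(1 for m in matches if m != "US")
```

For example, `count_personal_pronouns("We met in the US, and they told us.")` returns 2: "We" and "us" are counted, while "US" is skipped.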

---

## Contributing

We welcome contributions to enhance **ProText-Analyzer**! To contribute:
1. Fork the repository.
2. Create a new branch for your changes.
3. Submit a pull request with a detailed description of your changes.

---

## License

This project is licensed under the [MIT License](LICENSE).

---

### Project Maintainer
Ruby Poddar
Email: [email protected]

