A Conditional Random Field based web profile scraper for U.S. research professors in the field of Computer Science.
- Install the necessary module dependencies
pip install -r requirements.txt
- Download TCRF_TrainingData.zip from the Forward Shared Data Drive. Uncompress the zip file into a folder named data/
To find professor profile information, use the command line to search for a professor.
- We will run main.py, and this will begin a command-line based search.
- This search will pull at most 3 URLs related to the professor and scrape them.
- Finally, it will present the data merged together, with each piece labeled with a category of data (e.g. Bio, Awards).
rohan-prakash-professor-profile-scrapper/
- requirements.txt
- data/
  - 898 labeled homepage html .txt files
- src/
  - main.py
  - webProfiling.py
  - homepage_finding/
    - homepage_finder.py
  - homepage_extraction/
    - Models/
      - scikit_learn_sgd
      - scikit_learn_svm
      - text_classifier
      - vectorizer
      - corpus
    - homepage_extractor.py
    - parse_data.py
    - random_forest.py
    - train_classifier.py
    - Test/
      - test_url_bank.txt
  - data_consolidator/
    - consolidate_data.py
data/
: This folder of 898 labeled profile webpages is used to train the TCRF model for homepage extraction.
src/
: Contains all module code for homepage finding, extraction, and consolidation; the focus of this iteration is extraction.
src/main.py
: The driver for the professor search and scraping pipeline. Connects the modules together into the search pipeline.
src/webProfiling.py
: A CRF-based extraction implementation attempt. This should be moved into src/homepage_extraction once a model can be trained and generated for use.
src/homepage_finding/homepage_finder.py
: Individual module function that filters Google search results down to at most three professor homepages.
src/homepage_extraction/
: Contains multiple modeling approaches to the web profile extraction problem. Models/ contains a few models used during development; the TCRF model should be placed here once trained.
src/data_consolidator/consolidate_data.py
: Individual module function that merges the data extracted from each mined professor homepage. In progress, currently halted at using sentence similarities.
- Finds the top 3 webpage search results for a given professor query and filters them to find the best homepages.
homepage_finder(query=(professor_name, institution)):
...
return [homepage1_url, homepage2_url, homepage3_url]
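The repository's actual search call is not shown here, so the sketch below assumes the candidate result URLs have already been retrieved from a search API; it only illustrates the domain filtering and three-URL cap described above. The function name `filter_candidate_urls` and the `REPUTABLE_DOMAINS` list are hypothetical.

```python
# A minimal sketch of the filtering step in homepage finding.
# Assumes candidate URLs were already fetched from a search engine;
# the function name and domain list are illustrative, not the repo's code.
from urllib.parse import urlparse

REPUTABLE_DOMAINS = (".edu",)  # assumption: restrict to academic domains


def filter_candidate_urls(candidate_urls, max_results=3):
    """Keep at most `max_results` URLs whose host ends in a reputable domain."""
    homepages = []
    for url in candidate_urls:
        host = urlparse(url).netloc.lower()
        if host.endswith(REPUTABLE_DOMAINS):
            homepages.append(url)
        if len(homepages) == max_results:
            break
    return homepages


# Example with hard-coded search results for illustration.
print(filter_candidate_urls([
    "https://cs.example.edu/~prof",
    "https://www.linkedin.com/in/prof",
    "https://pages.example.edu/prof/home",
]))
```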
- Extracts information from a given homepage and classifies it into 4 data categories plus an unrelated category. Leverages ML models; the latest iteration is intended to use the TCRF model.
extract_homepage(homepage_url):
...
return {'edu':..., 'bio':...,'research':..., 'award':...}
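As a rough illustration of the classifier-based path (not the intended TCRF one), the sketch below assumes a pickled scikit-learn vectorizer and classifier like those stored under src/homepage_extraction/Models/. The file names, pickle format, and label strings are assumptions made for illustration.

```python
# Sketch: classify visible text blocks of a homepage with a saved
# scikit-learn vectorizer + classifier. Paths, pickle format, and labels
# are assumptions; the TCRF-based extraction is not shown here.
import pickle

import requests
from bs4 import BeautifulSoup

CATEGORIES = ("edu", "bio", "research", "award")  # plus an implicit "unrelated" label


def extract_homepage(homepage_url, vectorizer_path="Models/vectorizer",
                     model_path="Models/scikit_learn_sgd"):
    with open(vectorizer_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(model_path, "rb") as f:
        classifier = pickle.load(f)

    # Pull visible text blocks (paragraphs and list items) from the page.
    soup = BeautifulSoup(requests.get(homepage_url, timeout=10).text, "html.parser")
    blocks = [tag.get_text(" ", strip=True) for tag in soup.find_all(["p", "li"])]
    blocks = [b for b in blocks if b]

    # Classify each block and bucket it by predicted category,
    # dropping anything predicted as unrelated.
    data = {category: [] for category in CATEGORIES}
    if blocks:
        labels = classifier.predict(vectorizer.transform(blocks))
        for block, label in zip(blocks, labels):
            if label in data:
                data[label].append(block)
    return data
```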
- Aims to merge the data packets (dictionaries) from each homepage into a single data packet. Currently looks to reduce redundancy based on sentence similarities.
consolidate_data(data_store={'edu':..., 'bio':...,'research':..., 'award':...}):
...
return final_data_packet={'edu':..., 'bio':...,'research':..., 'award':...}
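A minimal sketch of the sentence-similarity merge, assuming the SentenceTransformers library mentioned below with the general-purpose all-MiniLM-L6-v2 model. The list-of-packets signature, the 0.8 threshold, and the model choice are assumptions rather than the module's actual interface.

```python
# Sketch: merge per-homepage data packets, dropping near-duplicate sentences
# via cosine similarity of sentence embeddings. Threshold and model are
# assumptions; re-encoding on every comparison is naive but fine for a sketch.
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")


def consolidate_data(data_packets, threshold=0.8):
    final_data_packet = {}
    for packet in data_packets:
        for category, sentences in packet.items():
            kept = final_data_packet.setdefault(category, [])
            for sentence in sentences:
                if kept:
                    scores = util.cos_sim(
                        _model.encode(sentence, convert_to_tensor=True),
                        _model.encode(kept, convert_to_tensor=True),
                    )
                    if scores.max().item() >= threshold:
                        continue  # too similar to something already kept
                kept.append(sentence)
    return final_data_packet
```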
Given a search query of a professor and their current institution, we can find some of their most fundamental biographical information from reputable online sources. The key is to find reputable facts about a professor and their interests. We thus turn to modern NLP models to determine what is useful information and how to classify it into categories.
This process is split into a three-stage pipeline.
We begin by looking for possible homepage URLs on Google; these are likely to be among the top results. We then filter based on reputable website URL domains (e.g. .edu). We cap the number of URLs at 3, as too many increases redundancy and the volatility of factual information.
Following this, each URL is scraped and mined for pertinent information that fits into the 4 data categories. This is where the NLP models are used; in particular, the TCRF works most effectively, as it takes into account the HTML tree structure that most homepages fit into. Once the data is extracted, it moves on to consolidation.
In the final stage we aim to reduce overlapping information for better data management. This is currently carried out through sentence similarity using SentenceTransformers. This can be improved, as it is a trivial module at its current stage.
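To show how the three stages fit together, here is a hedged sketch of a driver in the spirit of src/main.py. The import paths are guesses based on the directory layout above, and main.py's real argument handling and signatures may differ.

```python
# Sketch: chaining the three pipeline stages. Import paths are guessed from
# the repo layout; the calls use the interfaces documented above.
from homepage_finding.homepage_finder import homepage_finder
from homepage_extraction.homepage_extractor import extract_homepage
from data_consolidator.consolidate_data import consolidate_data


def run_pipeline(professor_name, institution):
    urls = homepage_finder(query=(professor_name, institution))  # stage 1: find up to 3 homepages
    packets = [extract_homepage(url) for url in urls]             # stage 2: scrape and classify
    return consolidate_data(packets)                              # stage 3: merge and de-duplicate


if __name__ == "__main__":
    print(run_pipeline(input("Professor name: "), input("Institution: ")))
```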
Despite spending the last 2 weeks of the Fall iteration of EduToday on this, I was unable to upgrade my profile extraction model to leverage a TCRF model to mine and classify professor information. It took me time to learn the concept, but I got stuck training the general CRF implementation in order to obtain a TCRF connection structure. The remaining steps are:
- Initializing and training a CRF model with the files in data/ (downloaded from the aminer/datasets link below); a sketch of this general CRF training step follows this list
- Configuring the CRF initialization function in src/webProfiling.py to create parent-child connections as well as child-child connections
- Applying the newly trained TCRF model for classification
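Since the sticking point was training the general CRF, here is a hedged sketch of that step using sklearn-crfsuite on a depth-first linearisation of HTML nodes. Note that sklearn-crfsuite only supports linear-chain CRFs, so the parent-child and child-child TCRF connections described above are not expressible with it; the feature set, label names, and toy training sequence are illustrative assumptions, not the actual data/ format.

```python
# Sketch: train a plain linear-chain CRF over linearised HTML nodes with
# sklearn-crfsuite. The tree-structured (TCRF) connections still need a
# different approach; features and labels here are toy assumptions.
import sklearn_crfsuite


def node_features(node_text, depth):
    """Toy features for one HTML node in a depth-first traversal."""
    return {
        "text_lower": node_text.lower()[:30],
        "depth": depth,
        "has_digit": any(ch.isdigit() for ch in node_text),
    }


# Each training sequence would be one labeled homepage from data/,
# linearised into (text, depth) pairs with per-node labels.
X_train = [[node_features("Ph.D. in Computer Science", 3),
            node_features("Best Paper Award 2019", 3)]]
y_train = [["edu", "award"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```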
Additional work on homepage finding and data consolidation includes the following:
- If I had more time, I would refine homepage finding to check whether headings in the HTML indicate that a page is a homepage
- A better way to reduce redundant information from multiple homepage scrapes
- Possibly use sentence similarity checking to summarize the data
Some useful references stem from 'A Combination Approach to Web User Profiling'.
The authors suggested the use of a TCRF for this problem and linked aminer.org, which addresses a very similar problem using the TCRF model. Below is the webpage used to store the training data for the model.