BenjiFischman/GSScrapeNode

GSScrapeNode

A web scraper for Google Scholar utilizing Node.js



Introduction

A research project needed a way to collect the titles of all articles that cite a seed article, but Google does not provide a way to do this. GSScrape was created to scrape Google Scholar for those titles automatically. As of May 24, 2016, GSScrape is only semi-automatic: the user must navigate to each results page and click a button to extract its titles to a file.

GSScrapeNode is a version of GSScrape implemented as a Node.js-based webserver plus a Google Chrome extension. It is built this way because extracting titles from a Google Scholar web page is trivial, but writing them to a file is not; the file output is what created the need for a webserver. In a nutshell, the extension fetches the titles from a Google Scholar results page and posts them to a local webserver, which then writes the titles to a file.
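The extension half of that flow might look roughly like the sketch below. It is not the code shipped in this repo: the h3.gs_rt selector, the /titles path, and the { titles: [...] } request body are assumptions made for illustration, and Google Scholar's markup may differ or change. The document and fetch objects are passed in as parameters so the functions can be tried outside a browser.

```javascript
// Hypothetical sketch of the extension side; not the repository's actual code.

// Collect result titles from a document-like object.
// ASSUMPTION: each result title sits in an <h3 class="gs_rt"> element.
function extractTitles(doc) {
  return Array.from(doc.querySelectorAll('h3.gs_rt'))
    .map((node) => node.textContent.replace(/\s+/g, ' ').trim())
    .filter((title) => title.length > 0);
}

// POST the titles as JSON to the local webserver.
// ASSUMPTION: the /titles path and the { titles: [...] } body shape are
// illustrative; match them to whatever the server actually expects.
function postTitles(titles, fetchFn, url = 'http://localhost:8080/titles') {
  return fetchFn(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ titles }),
  });
}
```

In the browser, the two pieces would be combined as postTitles(extractTitles(document), fetch).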

Installation

  1. You'll need to be running Linux with both npm and nodejs installed; both are available through apt-get. You will also need Google Chrome.
  2. Clone this repo to any location. That path is referred to as "scraper_path" in the following steps.
  3. In Chrome, load the unpacked extension located in scraper_path/extension.
  4. In the scraper_path directory, run the command npm install. This installs the express and body-parser node modules.
  5. Next, run nodejs server.js
    You should see a message indicating the server is listening on localhost:8080. You may change the listening port as needed.
  6. Navigate to a Google Scholar page with results and click the GSScrapeNode extension button. In scraper_path, there will be a new file called titles.txt that will hold the titles from that page. You may do this on as many pages as needed, and all titles will be appended to titles.txt, one title per line.
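
To make steps 5 and 6 concrete, here is a minimal sketch of what the server side does: accept posted titles and append them to a file, one per line. It is not the repo's server.js. It uses Node's built-in http module in place of express and body-parser so that it runs without any installed modules, and the /titles path, the { titles: [...] } body shape, and the function names are all hypothetical.

```javascript
// Hypothetical sketch of the server side; not the repository's server.js.
// Uses only Node.js built-ins instead of express and body-parser.
const fs = require('fs');
const http = require('http');

// Turn an array of titles into newline-terminated text, one title per line.
// Internal whitespace is collapsed and empty entries are dropped.
function formatTitles(titles) {
  return titles
    .map((t) => String(t).replace(/\s+/g, ' ').trim())
    .filter((t) => t.length > 0)
    .map((t) => t + '\n')
    .join('');
}

// Append the titles to the output file, creating the file if needed.
function appendTitles(filePath, titles) {
  fs.appendFileSync(filePath, formatTitles(titles), 'utf8');
}

// Build (but do not start) a server that accepts POST /titles.
// ASSUMPTION: the request body is JSON of the form { "titles": ["...", "..."] }.
function createTitleServer(filePath) {
  return http.createServer((req, res) => {
    if (req.method !== 'POST' || req.url !== '/titles') {
      res.writeHead(404);
      res.end();
      return;
    }
    let body = '';
    req.setEncoding('utf8');
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      try {
        const parsed = JSON.parse(body);
        if (!Array.isArray(parsed.titles)) {
          throw new Error('titles must be an array');
        }
        appendTitles(filePath, parsed.titles);
        res.writeHead(204); // success, nothing to return
      } catch (err) {
        res.writeHead(400); // malformed JSON or unexpected shape
      }
      res.end();
    });
  });
}

// To start it: createTitleServer('titles.txt').listen(8080, 'localhost');
```

Because the file is opened in append mode on every request, scraping several results pages in a row simply keeps extending titles.txt.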
