Harvester Lite

Scrapes the Google Storage website and API for USGS satellite TIFFs. From each TIFF file we extract the relevant metadata, push it to UNIS, and then send the file to DLT to be stored.

TODO

Hook up with Lib-DLT and automate the upload process to DLT.
Run as a daemon? Make the app 'listen' for new files? Or just add command line options for periodic polling? Currently it polls everything from the last week.
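
One way the command-line polling option could look is sketched below. This is only an illustration, not code from this repo: the --days and --interval flags, the recent_only helper, and the shape of the file records (a dict with an 'updated' timestamp) are all assumptions.

```python
import argparse
from datetime import datetime, timedelta, timezone


def build_parser():
    """Hypothetical CLI options for periodic polling (not part of app.py)."""
    parser = argparse.ArgumentParser(description="Poll for new USGS TIFFs")
    parser.add_argument("--days", type=int, default=7,
                        help="only consider files updated within this many days")
    parser.add_argument("--interval", type=int, default=3600,
                        help="seconds to wait between polls")
    return parser


def recent_only(files, now=None, window=timedelta(days=7)):
    """Return the records whose 'updated' timestamp falls inside the window.

    `files` is assumed to be a list of dicts with a timezone-aware
    'updated' datetime; the real listing may look different.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - window
    return [f for f in files if f["updated"] >= cutoff]
```

A polling loop would then call recent_only(listing, window=timedelta(days=args.days)) and sleep for args.interval seconds between runs. The default window of 7 days mirrors the current "last week" behavior.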

IMPORTANT

After install, just run app.py. The app no longer uses Selenium to scrape HTML; it uses Google's poorly documented Python libraries instead. The web scraping code is left in place in case someone wants to try it or I ever decide to use Selenium for this again.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

What you need to have installed before installing the software:

Python 3.4 or later

Installing

Run python3 setup.py build install (ideally inside a local Python virtual environment). Then run python3 app.py to begin fetching USGS Landsat data.

Everything below is deprecated but still functional, leaving in my personal repo (should be ripped out before heading to data-logistics repo)

You need chromedriver on your PATH to run headless. To run a full-featured browser, you need geckodriver installed and on your PATH.

On Mac just use brew install geckodriver and brew install chromedriver.

No workflow has been found for Windows. If you need to use Windows, you'll have to look up how to get geckodriver and chromedriver in the right place to run Selenium.

For CentOS deployment run the script webdriver/centos_chromedriver.sh. CentOS deployment only needs the chromedriver since there is generally no GUI to run the full browser from a terminal.
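
To confirm the drivers are actually reachable before running the deprecated Selenium code, a quick check like the following can help. It is a minimal sketch using only the standard library; the find_drivers helper is a name made up for this example and is not part of the repo.

```python
import shutil


def find_drivers(names=("chromedriver", "geckodriver")):
    """Map each driver name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_drivers().items():
        # A missing driver shows up as None rather than raising an error.
        print(name, "->", path or "NOT FOUND on PATH")
```

On a CentOS box only chromedriver is expected to resolve, so a missing geckodriver there is not a problem.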

CentOS also needs the protobuf library, but for whatever reason it doesn't install correctly on CentOS using pip. See https://blog.jeffli.me/blog/2016/12/08/install-protocol-buffer-from-source-in-centos-7/ for how to install it from source. It takes forever. If protobuf was already installed as part of your google-cloud Python installation, run the app to check whether it works. If the protobuf module can't be found even though protobuf shows up in your Python environment, then unfortunately you will need to install the binaries D: .
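
A faster way to test this than running the whole app is to try the import directly. The sketch below assumes the module in question is google.protobuf (the import name used by the protobuf package); the can_import helper is made up for this example.

```python
import importlib


def can_import(name):
    """Return True if the module imports cleanly, False on ImportError."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # If this prints False while pip still lists protobuf, you are in the
    # "shows up in the environment but can't be found" case described above.
    print("google.protobuf importable:", can_import("google.protobuf"))
```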

Deployment

TODO: Build out a Daemon that automatically gets and pushes TIFFS. Build out something that can 'listen' to file changes?

Built With

BS4 - Granular HTML scraping
Selenium Web Driver - Selenium Python web driver for scraping JavaScript-driven sites.
DLT - DLT Upload code ripped from.
