This is the project we run to preprocess the organizations from the GKG database. Here are basic descriptions of the files in this repository.
- start_up.sh - a bash script that installs/downloads the required packages.
- credentialkey.txt - the AWS credential key that gives you access to the S3 bucket.
- Start an EC2 virtual machine.
- After you accept the collaboration invitation, fork this repository. You can think of a fork as using this repository as a template. Then clone your forked repository to your virtual machine, as sketched below.
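A minimal sketch of the clone step, assuming your fork lives under your own GitHub account and keeps the repository name GKGPreprocessing (the `[username]` placeholder is ours, not from the original):

```bash
# Clone your fork onto the EC2 machine (replace [username] with your GitHub username).
git clone https://github.com/[username]/GKGPreprocessing.git
```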
- cd into the cloned folder and install the packages:

```bash
bash start_up.sh
```
- The process takes at most 2 minutes; after it stops you will see:

```
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ ls
README.md            jdk-8u141-linux-x64.rpm  start_up.sh
TemplateCode.ipynb   stanford-ner-2018-10-16
credentialkey.txt    stanford-ner-2018-10-16.zip
```
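For reference, here is a minimal sketch of what start_up.sh plausibly does, inferred from the files listed above; the download URLs and exact commands are assumptions, not taken from the script itself:

```bash
#!/bin/bash
# Fetch the Java 8 RPM and the Stanford NER distribution (URLs are assumptions).
wget https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip

# Install Java 8, which the Stanford NER jar needs to run.
sudo yum -y localinstall jdk-8u141-linux-x64.rpm

# Unpack the NER model and classifier files.
unzip -o stanford-ner-2018-10-16.zip
```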
- Here jdk-8u141-linux-x64.rpm and stanford-ner-2018-10-16.zip are files we downloaded from the web, and stanford-ner-2018-10-16 is the folder we unzipped from stanford-ner-2018-10-16.zip. This folder contains an NER (Stanford Named Entity Recognizer) model trained by Stanford University (here is the description link). In this project, we will use this NLP model to recognize whether an organization name in the GKG dataset is a company or not.
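To see the tagger in action before wiring it into the pipeline, you can run it from the command line. This is a sketch: the 3-class English classifier path below ships with the stanford-ner-2018-10-16 distribution, but sample.txt and the memory setting are placeholders of our own choosing:

```bash
# Tag a small text file; ORGANIZATION labels mark likely companies/institutions.
echo "Apple Inc. hired researchers from Stanford University." > sample.txt
java -mx600m -cp stanford-ner-2018-10-16/stanford-ner.jar \
  edu.stanford.nlp.ie.crf.CRFClassifier \
  -loadClassifier stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz \
  -textFile sample.txt
```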
- Now we need to add the AWS access key to our virtual machine. Since we have a huge database, we cannot store it in the GitHub repository, so we use an S3 bucket. Because the bucket is private, we need credentials to access it. Run the following command:

```bash
aws configure
```
- And you will see:

```
AWS Access Key ID [None]:
```
- Copy in the AWSAccessKeyId, which you can find in credentialkey.txt:

```
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ aws configure
AWS Access Key ID [None]: AKIAJN5VRXD7Q3XYM4OA
```
- Press enter and you will see `AWS Secret Access Key [None]:`. Paste the AWSSecretKey in and press enter.
- And you will see:

```
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ aws configure
AWS Access Key ID [None]: AKIAJN5VRXD7Q3XYM4OA
AWS Secret Access Key [None]: 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
Default region name [None]:
```
- You can leave the region empty: press enter and you will see:

```
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ aws configure
AWS Access Key ID [None]: AKIAJN5VRXD7Q3XYM4OA
AWS Secret Access Key [None]: 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
Default region name [None]:
Default output format [None]:
```
- Ignore this one as well and press enter; then you are all set.
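As a quick sanity check (our suggestion, not a step from the original instructions), you can confirm the credentials work by listing the buckets this key can see:

```bash
# Should print the private bucket(s) without an access-denied error.
aws s3 ls
```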
- Now open a Jupyter notebook server, and follow the instructions in TemplateCode.ipynb (the notebook listed above).
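A minimal way to start the server on the EC2 machine; the port and the open-IP binding are our assumptions, and your instance's security group must allow the chosen port:

```bash
# Run without a local browser and listen on all interfaces so you can
# reach the notebook from your own machine.
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```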
- Go to your AWS Management Console and search for S3.
- Once you click into the S3 Management Console, you will see several buckets. Create your own result bucket.
- Click Create bucket and type in your first name + "result"; for example, mine is fanresult. Keep clicking Next until your bucket is created. Your bucket path is s3://[bucketname]/; for example, mine is s3://fanresult/.
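If you prefer the terminal, the same bucket can be created from the virtual machine instead of the console (the bucket name is just the example from above):

```bash
# "mb" stands for "make bucket".
aws s3 mb s3://fanresult/
```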
- In your virtual machine, you can copy your results directly to your bucket using the following command:

```bash
aws s3 cp [filename] s3://[bucketname]
```
- For example, if I want to copy result_1.csv to my bucket:

```bash
aws s3 cp result_1.csv s3://fanresult/
```
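After the upload, you can confirm the file landed and pull it back down later if needed; this sketch reuses the example bucket and file names from above:

```bash
# List the bucket contents to verify the upload.
aws s3 ls s3://fanresult/

# Copy the result back to the current directory.
aws s3 cp s3://fanresult/result_1.csv .
```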