GKGPreprocessing

Introduction:

This project preprocesses the organization names from the GKG (Global Knowledge Graph) database. Below are brief descriptions of the files in this repository.

  • start_up.sh - a bash script that installs and downloads the required packages.
  • credentialkey.txt - the AWS credential key that grants access to the S3 bucket.

Step 1:

  • Start an EC2 virtual machine.

  • After you accept the collaboration invitation, fork this repository. You can think of a fork as your own copy of this repository to use as a template. Clone your forked repository to your virtual machine.

  • cd into the cloned folder and install the required packages:

bash start_up.sh
  • The process takes at most 2 minutes; after it finishes, you will see:
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ ls
README.md           jdk-8u141-linux-x64.rpm      start_up.sh
TemplateCode.ipynb  stanford-ner-2018-10-16
credentialkey.txt   stanford-ner-2018-10-16.zip
  • Here jdk-8u141-linux-x64.rpm and stanford-ner-2018-10-16.zip are files downloaded from the web, and stanford-ner-2018-10-16 is the folder unzipped from stanford-ner-2018-10-16.zip. It contains the NER (Stanford Named Entity Recognizer) model trained by Stanford University (here is the description link). In this project, we use this NLP model to recognize whether an organization name in the GKG dataset is a company or not, as sketched below.
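  • To illustrate (this snippet is not part of the repository; it assumes NLTK is installed and that the unzipped folder has the standard Stanford NER layout), the model can be called from Python roughly like this:

from nltk.tag import StanfordNERTagger

# Point NLTK at the unzipped Stanford NER distribution; these paths assume
# the standard layout of stanford-ner-2018-10-16.
st = StanfordNERTagger(
    "stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz",
    "stanford-ner-2018-10-16/stanford-ner.jar",
)

# Tokens tagged ORGANIZATION are candidate company names.
print(st.tag("Google is based in Mountain View".split()))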

  • Now we need to add the AWS access key to our virtual machine. Since the database is huge, we cannot store it in the GitHub repository, so we use an S3 bucket instead. Because the bucket is private, we need credentials to access it. Run the following command:

aws configure
  • And you will see:
AWS Access Key ID [None]:
  • Paste in the AWSAccessKeyId, which you can find in credentialkey.txt:
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ aws configure
AWS Access Key ID [None]:AKIAJN5VRXD7Q3XYM4OA
  • Press Enter and you will see AWS Secret Access Key [None]:; paste in the AWSSecretKey and press Enter.

  • And you will see:

(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ aws configure
AWS Access Key ID [None]: AKIAJN5VRXD7Q3XYM4OA
AWS Secret Access Key [None]: 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
Default region name [None]:
  • You can leave the default region blank and press Enter, and you will see:
(base) [ec2-user@ip-172-31-46-169 GKGPreprocessing]$ aws configure
AWS Access Key ID [None]: AKIAJN5VRXD7Q3XYM4OA
AWS Secret Access Key [None]: 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
Default region name [None]:
Default output format [None]:
  • Leave the output format blank and press Enter; you are all set.
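  • As a quick sanity check (this step is not in the original instructions), you can confirm the credentials work from Python with boto3, which ships with most EC2 Python setups:

import boto3

# List the buckets visible to the configured credentials; the private GKG
# bucket should appear here if aws configure succeeded.
s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])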

  • Now open a Jupyter notebook server and follow the instructions in TemplateCode.ipynb.

Step 2:

  • Go to your AWS Management Console and search for S3.

  • Once you click into the S3 Management Console, you will see several buckets. Create your own result bucket.

  • Click Create bucket and type in your first name + result; for example, mine is fanresult. Keep clicking Next until your bucket is created. Your bucket path is s3://[bucketname]/; for example, mine is s3://fanresult/.
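  • If you prefer to create the bucket from code instead of the console (an alternative to the steps above, not part of the original workflow), boto3 can do it as well:

import boto3

# Create the result bucket programmatically; outside us-east-1 you must also
# pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3 = boto3.client("s3")
s3.create_bucket(Bucket="fanresult")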

  • From your virtual machine, you can copy your results directly to your bucket with the following command:

aws s3 cp [filename] s3://[bucketname]
  • For example, if I want to copy result_1.csv to my bucket:
aws s3 cp result_1.csv s3://fanresult/
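  • Equivalently, from inside the notebook you can upload a result file with boto3 (a sketch, reusing the example names above):

import boto3

# Upload result_1.csv to the fanresult bucket under the same key name.
s3 = boto3.client("s3")
s3.upload_file("result_1.csv", "fanresult", "result_1.csv")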
