skillexplore

A Final Year Project that aims to scrape IT job listings from Kenyan online job boards for analysis and data mining, extracting a factual list of skills and requirements for specific fields in the tech space.

I found that some companies ask for an overwhelming number of technologies, or post poorly written job descriptions and requirements lists drafted by HR representatives who do not know which skills a given role actually needs. This can lead a fresh graduate (and even an experienced job-seeker) to misunderstand the role and what is required, and ultimately to fail the interview.

So I decided to scrape, data mine and compile my findings on job skills and real-time market demand into a single interface. The goal is to help other students decide what to invest their time in learning while on campus, giving them the best chance of landing a tech job in today's competitive market.

1. Scraping Module (Data Collection)

  • Job listings were scraped using a headless Selenium instance.

  • Only listings that were in the IT space were crawled from:

    1. Brighter Monday
    2. My Job Mag
    3. LinkedIn [to be added]
  • The job listings were scraped from the general listings page and stored in a data object as URLs.

  • For example, in the listing below, the link in 'Data Science Interns at Nakala Analytics Ltd' would be scraped and stored in a Data Object.

  • [Screenshot: example url.png]

  • These listings were then imported by a URL crawler script that would visit each link and scrape the job listing for the following fields:

    1. Company
    2. Job Title / Role
    3. Job Description
    4. Location
    5. Nature --> [internship, full-time, remote, part-time]
    6. Salary (if available)
    7. Date Posted
  • The data was then pre-processed by:

    1. Removing newline characters "\n"
    2. Removing tab characters and runs of extra spaces.
    3. Removing special HTML characters/entities.
  • Further cleaning steps such as lemmatisation, data splitting and transformation into a pandas dataframe will be done in the NLP module. A rough sketch of the crawl and clean-up flow is shown below.
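
To make the flow concrete, here is a minimal sketch of the crawl and clean-up described above, assuming Chrome and invented CSS selectors. JOB_BOARD_URL, `a.job-title-link` and the per-field selectors are placeholders, not the selectors used in this repo.

```python
# Minimal sketch: collect listing URLs, visit each one, extract fields, clean text.
import html
import re

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

JOB_BOARD_URL = "https://www.brightermonday.co.ke/jobs/it-software"  # hypothetical listings page


def make_headless_driver():
    """Start a headless Chrome instance for crawling."""
    options = Options()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def collect_listing_urls(driver, page_url):
    """Scrape the general listings page and keep each job-post link."""
    driver.get(page_url)
    anchors = driver.find_elements(By.CSS_SELECTOR, "a.job-title-link")  # placeholder selector
    return [a.get_attribute("href") for a in anchors]


def scrape_job_page(driver, url):
    """Visit one listing URL and pull out the fields listed above."""
    driver.get(url)

    def text_of(selector):
        elements = driver.find_elements(By.CSS_SELECTOR, selector)
        return clean_text(elements[0].text) if elements else None

    return {
        "company": text_of(".company-name"),       # placeholder selectors
        "job_title": text_of(".job-title"),
        "description": text_of(".job-description"),
        "location": text_of(".job-location"),
        "nature": text_of(".job-type"),
        "salary": text_of(".job-salary"),
        "date_posted": text_of(".posted-date"),
    }


def clean_text(raw):
    """Pre-processing: decode HTML entities, drop newlines, collapse tabs/large spaces."""
    text = html.unescape(raw)               # e.g. "&amp;" -> "&"
    text = text.replace("\n", " ")          # remove newline characters
    text = re.sub(r"[\t ]{2,}", " ", text)  # collapse tabs and runs of spaces
    return text.strip()
```

The URLs gathered by `collect_listing_urls` are later handed to `scrape_job_page`, mirroring the two-stage crawl described above.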

Pinch points

  • The speed of scraping needs to be improved. This can (and will) be done by running several Selenium web-crawler instances on different URLs in parallel: the 'urls' list can be split into fragments, with a separate instance working on each fragment (see the sketch below).
  • LinkedIn needs login credentials to allow one to view job listings.
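
A rough sketch of that planned speed-up, reusing the hypothetical `make_headless_driver` and `scrape_job_page` helpers from the sketch above; the fragment count and worker count are arbitrary choices, not settings from this repo.

```python
# Split the 'urls' list into fragments and crawl them with parallel Selenium instances.
from concurrent.futures import ThreadPoolExecutor


def split_into_fragments(urls, n_fragments):
    """Split the URL list into roughly equal fragments."""
    if not urls:
        return []
    size = -(-len(urls) // n_fragments)  # ceiling division
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def crawl_fragment(fragment):
    """Each worker owns its own headless driver and works through one fragment."""
    driver = make_headless_driver()
    try:
        return [scrape_job_page(driver, url) for url in fragment]
    finally:
        driver.quit()


def crawl_in_parallel(urls, workers=4):
    """Run one crawler per fragment and flatten the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(crawl_fragment, split_into_fragments(urls, workers))
    return [job for fragment in results for job in fragment]
```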

2. NLP Module

  • An NLP model is used to differentiate qualification/requirements sentences from general description sentences and to extract the key skills required in each listing.
  • This module should be able to take in the data from the Scraping Module and perform the following tasks (a toy sketch of the intended interface follows this list):
    1. Clean it further if need be.
    2. Take in a job listing and return the list of skills associated with it.
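
The sketch below illustrates only the intended interface; the cue words and the tiny skill vocabulary are invented examples, not the project's trained model.

```python
# Toy interface: keep requirement-style sentences, then match them against a known skill vocabulary.
import re

SKILL_VOCAB = {"python", "sql", "docker", "machine learning", "react"}          # invented examples
REQUIREMENT_CUES = ("required", "requirements", "must have", "experience with", "proficient")


def requirement_sentences(description):
    """Keep only sentences that look like qualifications/requirements."""
    sentences = re.split(r"(?<=[.!?])\s+", description)
    return [s for s in sentences if any(cue in s.lower() for cue in REQUIREMENT_CUES)]


def extract_skills(description):
    """Return the known skills mentioned in the requirement sentences."""
    text = " ".join(requirement_sentences(description)).lower()
    return sorted(skill for skill in SKILL_VOCAB if skill in text)


# Example:
# extract_skills("We build dashboards. Experience with Python and SQL is required.")
# -> ['python', 'sql']
```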

The Dataset.

The dataset is collected from the Singaporean government website mycareersfuture.sg and consists of over 20,000 richly-structured job posts. Detailed statistics of the dataset are shown below:

Mycareersfuture.sg Dataset Stats

  • Number of job posts: 20,298
  • Number of distinct skills: 2,548
  • Number of skills with 20 or more mentions: 1,209
  • Average skill tags per job post: 19.98
  • Average token count per job post: 162.27
  • Maximum token count in a job post: 1,127

This dataset includes the following fields:

  1. company_name
  2. job_title
  3. employment_type
  4. seniority
  5. job_category
  6. location
  7. salary
  8. min_experience
  9. skills_required
  10. requirements_and_role
  11. job_requirements
  12. company_info
  13. posting_date
  14. expiry_date
  15. no_of_applications
  16. job_id
  • For my own use, and to make the large dataset more manageable, I dropped several fields that I considered unnecessary for training (see the pandas sketch after this list). These included:
    1. company_name
    2. location (all are either Singapore or remote)
    3. company_info
    4. posting_date
    5. expiry_date
    6. no_of_applicants
    7. job_id
    8. seniority
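
A minimal pandas sketch of this step. The file name is a hypothetical export of the dataset, and `errors="ignore"` guards against column labels that differ slightly between exports (e.g. no_of_applications vs no_of_applicants).

```python
import pandas as pd

# Columns dropped before training, as listed above.
DROPPED_FIELDS = [
    "company_name", "location", "company_info", "posting_date",
    "expiry_date", "no_of_applications", "job_id", "seniority",
]

df = pd.read_json("mycareersfuture_jobs.json")          # hypothetical file name
df = df.drop(columns=DROPPED_FIELDS, errors="ignore")   # keep only the fields used for training
```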
