Aims to scrape IT job listings from Kenyan online job boards for analysis and data mining, in order to extract a factual list of the skills and requirements demanded in specific fields of the tech space.
I found that some companies ask for an overwhelming number of technologies, or post poorly written job descriptions/requirements lists drawn up by HR representatives with little knowledge of which skills a given role actually needs. This can lead a fresh graduate (and even an experienced job-seeker) to misunderstand the role and its requirements, and ultimately to flunk the interview.
So I decided to scrape, data-mine and compile my findings on job skills and real-time market demand into a single interface, to help students decide what to invest their time in learning while on campus. The aim is to give them the best possible chance of landing a tech job in today's competitive market.
Job listings were scraped using a headless Selenium instance.
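For reference, a minimal sketch of such a headless setup (the browser, options and URL here are assumptions rather than the project's exact configuration):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome with no visible window
options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=options)
driver.get("https://www.brightermonday.co.ke/jobs")
print(driver.title)
```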
Only listings that were in the IT space were crawled from:
- Brighter Monday
- My Job Mag
- LinkedIn [to be added]
The job listing URLs were scraped from each board's general listings page and stored in a data object.
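A sketch of that URL-collection step, reusing the `driver` from the sketch above; the CSS selector is a hypothetical placeholder, since each board needs its own:

```python
from selenium.webdriver.common.by import By

# "a.job-listing-link" is a hypothetical selector, not a real one from any board.
urls = [
    anchor.get_attribute("href")
    for anchor in driver.find_elements(By.CSS_SELECTOR, "a.job-listing-link")
]
```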
For example, from a listing such as 'Data Science Interns at Nakala Analytics Ltd', the link would be scraped and stored in the data object.
These listings were then imported by a URL crawler script that would visit each link and scrape the job listing for the following fields (see the sketch after this list):
- Company
- Job Title / Role
- Job Description
- Location
- Nature --> [internship, full-time, remote, part-time]
- Salary (if available)
- Date Posted
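A minimal sketch of the per-listing crawl; all selector strings are hypothetical placeholders, and in practice each job board needs its own mapping:

```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def scrape_listing(driver, url):
    """Visit one job listing and pull out the fields of interest."""
    driver.get(url)

    def text_or_none(selector):
        # Selectors below are illustrative only.
        try:
            return driver.find_element(By.CSS_SELECTOR, selector).text
        except NoSuchElementException:
            return None  # e.g. salary is often absent

    return {
        "company": text_or_none(".company-name"),
        "job_title": text_or_none(".job-title"),
        "job_description": text_or_none(".job-description"),
        "location": text_or_none(".location"),
        "nature": text_or_none(".employment-type"),
        "salary": text_or_none(".salary"),
        "date_posted": text_or_none(".date-posted"),
    }

listings = [scrape_listing(driver, url) for url in urls]
```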
The data was then pre-processed.
Further cleaning steps, such as lemmatisation, data splitting and transformation into a pandas dataframe, will be done in the NLP module.
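As an illustration of the kind of cleaning planned there, here is a sketch using NLTK's lemmatiser and pandas (the actual NLP module may use different tooling):

```python
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-off download of the lemmatiser data
lemmatizer = WordNetLemmatizer()

df = pd.DataFrame(listings)  # `listings` from the scraping sketch above
df["job_description"] = df["job_description"].fillna("").apply(
    lambda text: " ".join(lemmatizer.lemmatize(tok) for tok in text.lower().split())
)
```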
- The speed of scraping needs to be improved. This can (and will) be done by running several Selenium web-crawler instances on different URLs in parallel: the 'urls' list can be split into fragments and a separate instance set to work on each segment (see the sketch after this list).
- LinkedIn needs login credentials to allow one to view job listings.
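A sketch of that parallelisation idea, assuming each worker process owns its own headless driver; the worker count and chunk sizes are arbitrary choices here:

```python
from concurrent.futures import ProcessPoolExecutor

def chunked(seq, n):
    """Split `seq` into at most `n` roughly equal fragments."""
    if not seq:
        return []
    size = -(-len(seq) // n)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def scrape_chunk(url_chunk):
    """Each worker owns one headless driver and scrapes its share of URLs."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        # `scrape_listing` and `urls` come from the earlier sketches.
        return [scrape_listing(driver, url) for url in url_chunk]
    finally:
        driver.quit()

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        listings = [row for rows in pool.map(scrape_chunk, chunked(urls, 4)) for row in rows]
```

Processes rather than threads are the natural fit here, since each Selenium driver is its own browser process anyway and this sidesteps thread-safety concerns.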
- An NLP model is used to differentiate qualification/requirements sentences from general description sentences and to extract the key skills required in each listing.
- This module should be able to take in the data from the Scraping Module and perform the following tasks:
- Clean it further if need be.
- Take in a job listing and return a list of skills associated with it (see the sketch below).
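As a stand-in for the eventual model, here is a minimal keyword-matching sketch; the seed vocabulary and matching rule are assumptions, since the real module is meant to use a trained NLP model:

```python
# Hypothetical seed vocabulary; the real skill list comes from the
# training dataset described below.
SKILLS = {"python", "sql", "pandas", "machine learning", "docker", "excel"}

def extract_skills(listing_text: str) -> list[str]:
    """Return every known skill mentioned anywhere in the listing text."""
    text = listing_text.lower()
    return sorted(skill for skill in SKILLS if skill in text)

print(extract_skills("We need a Data Science intern with Python and SQL."))
# ['python', 'sql']
```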
The dataset is collected from the Singapore government website mycareersfuture.sg and consists of over 20,000 richly-structured job posts. The detailed statistics of the dataset are shown below:
| Mycareersfuture.sg Dataset | Stats |
|---|---|
| Number of job posts | 20,298 |
| Number of distinct skills | 2,548 |
| Number of skills with 20 or more mentions | 1,209 |
| Average skill tags per job post | 19.98 |
| Average token count per job post | 162.27 |
| Maximum token count in a job post | 1,127 |
This dataset includes the following fields:
- company_name
- job_title
- employment_type
- seniority
- job_category
- location
- salary
- min_experience
- skills_required
- requirements_and_role
- job_requirements
- company_info
- posting_date
- expiry_date
- no_of_applications
- job_id
- For my use, and to make the large dataset more manageable, I dropped several fields that I considered unnecessary for training (see the snippet after this list). These included:
- company_name
- location (all are either Singapore or remote)
- company_info
- posting_date
- expiry_date
- no_of_applications
- job_id
- seniority
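The drop itself is a one-liner in pandas; the filename and the `jobs_df` variable are hypothetical:

```python
import pandas as pd

jobs_df = pd.read_csv("mycareersfuture_jobs.csv")  # hypothetical filename
jobs_df = jobs_df.drop(columns=[
    "company_name", "location", "company_info", "posting_date",
    "expiry_date", "no_of_applications", "job_id", "seniority",
])
```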