Dependencies:
- `ejs` (JavaScript HTML templates)
- `mongoose` (MongoDB interaction)
- `request` (the module used to make the HTTP requests)
- `sse-pusher` (for sending Server-Sent Events)

Supported sites:
- ejobs (ro)
- bestjobs (ro)
- hipo (ro)
- olx (ro)
Node.js local application which gathers job listings based on a list of **queries** and **locations**. It searches a list of predefined sites, which can be extended by manually defining new ones in the code.

The application makes an HTTP request to those websites using the **query**, **location** and **page** arguments, receives an HTML string as a response, and parses it into an array of `{name: String, url: String}` results using regular expressions.

It starts at page 1 for each query/location combination and keeps incrementing the page until no results are found, at which point it moves on to the next query/location combination.

When a request fails, it retries up to 3 times; if all attempts fail, it skips the current query/location and tries the next one.

The search continues indefinitely, cycling back and forth through the query/location combinations; each repetition is delayed to avoid excessive HTTP requests to the destination sites.

Site searches run in parallel.
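The parsing step described above can be sketched as follows; the HTML snippet and the exact patterns are invented for illustration, but they follow the same `items`/`href`/`name` RegExp pattern the site definitions use:

```javascript
// Hypothetical sketch of the regex-based parsing step: an "items" RegExp
// finds each job listing block, then "href" and "name" extract the fields.
const html =
  '<a class="job-title" href="/jobs/1">Node Developer</a>' +
  '<a class="job-title" href="/jobs/2">QA Engineer</a>';

const items = /<a class="job-title".*?<\/a>/g; // one match per listing
const href = /href="(.*?)"/;                   // capture group 1: the link
const name = />(.*?)<\/a>/;                    // capture group 1: the title

const results = (html.match(items) || []).map(function (item) {
  return {
    name: item.match(name)[1],
    url: item.match(href)[1]
  };
});

console.log(results);
// [ { name: 'Node Developer', url: '/jobs/1' },
//   { name: 'QA Engineer', url: '/jobs/2' } ]
```

A page with no matches yields an empty array, which is the signal the search loop uses to move on to the next query/location combination.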
-- Using VirtualBox (or another virtualization application that supports importing `.ova` files): import the provided VM (Lubuntu x64 16, with Node.js and MongoDB already installed), then follow the instructions in the file on the desktop (~/Desktop/).

-- Manual setup: make sure you have Node.js, MongoDB and git installed
- clone this GitHub repository: `git clone https://github.com/Slitthe/jobs-gatherer.git`
- change directory to `jobs-gatherer` (the cloned git repo directory)
- install the Node dependencies by running `npm install` in the cloned repository's root directory
- start the `mongod` service
- run the `app.js` file (found in the root directory of the cloned repository) using `node`
- visit http://localhost:3000/ or http://127.0.0.1:3000/ to start the actual application with the desired sites selected
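The manual setup steps above can be collected into a short shell session (how you start the `mongod` service may differ by distribution; `sudo service mongod start` is one common variant):

```shell
# Manual setup, assuming Node.js, MongoDB and git are already installed
git clone https://github.com/Slitthe/jobs-gatherer.git
cd jobs-gatherer
npm install                  # installs ejs, mongoose, request, sse-pusher
sudo service mongod start    # start MongoDB (command may vary per system)
node app.js                  # then visit http://localhost:3000/
```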
It is possible to add additional sites by manually defining them in the `user_modules/sites.js` file:

- Create a URL constructor for that site, which lives inside the `getUrls` method of the `user_modules/sites.js` file:
  - add a base endpoint in the `urls` object, inside the `getUrls` function (the property name should be the name of the site)
  - create the actual URL constructor as a method of the `siteUrls` object inside the `reqUrls` function (the method needs to return a request URL given the `query`, `location` and `page` arguments)
- Define the regular expressions needed to parse the HTML response for that site, in the `parse` method of `user_modules/sites.js`:
  - add a property to the `expression` object in the `parse` function, which should be the name of that site
  - define three RegExps for that site: `items`, `href`, `name`
    - `items` should be a RegExp which captures the `href` and `title` of the job result, example: `/job-title.*?<\/a>/`
    - `href` should capture the link of that listing in its capture group 1 (`$1`), example: `/href="(.*?)"/`
    - `name` should capture the name of that result in its capture group 1 (`$1`), example: `/>(.*?)<\/a>/`
  - (optionally) some sites may require you to further specify in which container the actual results live.
    - This is the case for sites that display other 'relevant' similar results when there are no actual results for your specific query.
    - For these cases you can define an array of regular expressions in a property named `wrappers` (this array should be a property of the same object which holds the other RegExps: `items`, `href` and `name`). The regular expressions are applied in order (first --> last array item).
    - These expressions are used to pinpoint the HTML portion where the relevant results are.
    - For example, the actual relevant results might be in a `<div class='relevant-results'>`, whereas the 'suggested' results might be in a `<div class='suggestions-results'>`. In this case you'd define a RegExp to capture only the results inside `relevant-results`; if that container is not found, it's as if the page had no results, so the search service skips to the next query/location combination to make the next search request.
- Add the name of the site in the `user_modules/sites.js` file, as a string in the `exportData.sites` array (same name as used in the other defining places, `parse` and `getUrls`).
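Put together, a new site definition might look like the following sketch. The site name `examplejobs`, its URL format and all the RegExps here are hypothetical; in the real module these objects live inside the `getUrls`, `reqUrls` and `parse` functions rather than at the top level:

```javascript
// All names below are hypothetical: the "examplejobs" site, its URL format
// and its RegExps are invented for illustration.

// 1) Base endpoint (in the real module: the `urls` object inside getUrls).
const urls = {
  examplejobs: 'https://www.examplejobs.ro/search'
};

// 2) URL constructor (in the real module: a method of `siteUrls` inside reqUrls).
const siteUrls = {
  examplejobs: function (query, location, page) {
    return urls.examplejobs +
      '?q=' + encodeURIComponent(query) +
      '&where=' + encodeURIComponent(location) +
      '&page=' + page;
  }
};

// 3) Parsing RegExps (in the real module: the `expression` object inside parse).
const expression = {
  examplejobs: {
    items: /<a class="job-title".*?<\/a>/g, // one match per listing
    href: /href="(.*?)"/,                   // $1 = the listing's link
    name: />(.*?)<\/a>/,                    // $1 = the listing's title
    // optional: narrow parsing to the container holding the relevant results
    wrappers: [/<div class="relevant-results">[\s\S]*?<\/div>/]
  }
};

// 4) Register the site name (in the real module: the exportData.sites array).
const sites = ['ejobs', 'bestjobs', 'hipo', 'olx', 'examplejobs'];

console.log(siteUrls.examplejobs('node developer', 'Cluj', 1));
```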
After you run the `app.js` file with `node`, start the actual search service by visiting `/start`.
At this point, you can also delete database items by visiting `/debugging`. This is mainly for debugging purposes.
Once the search service is running, you cannot access `/start` and `/debugging` until you restart the node application.
ITEMS DISPLAY
The first level of item categorization is the site the result came from. Each site has its own route; for example, `/ejobs` will lead you to the results from the ejobs site.
At this point, you can view the results for each site and categorize them as saved, deleted or default. This categorization mainly affects how items are displayed in their type container (moving an item to the deleted category doesn't actually delete it from the DB); the exception is saved results, which are exempt from their expiration date.
Items are further categorized by their location. Only the locations in the location list are displayed, even if items with other location values are present in the database. To display them again, you have to add that location back into the location list.
SETTINGS CHANGE
This page lives at the `/settings` route.
It serves as a settings change page, as well as a status info display.
You can:
- modify the items from the queries and locations list
- start/stop the search service (stopping it is necessary to modify the queries/locations lists)
The live status for each site is also displayed on this page. Whenever a new search is made, the arguments used for the search are updated in the Live search status table.
The app uses MongoDB to save the results of the searches, as well as the list of queries/locations and the current or last known position of the search (so it can resume from that place).

Expiry date: each item has a created/updated timestamp, which is used to determine its expiry date, at which time that result is permanently removed from the DB (default expiry value: 7 days). Results with a category of `saved` are exempt from this behaviour.
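The expiry rule could be sketched like this; it is an assumed implementation, not the app's actual code, and the field names `category` and `updatedAt` are illustrative:

```javascript
// Assumed sketch of the expiry rule: an item expires 7 days after its
// timestamp unless its category is "saved".
const EXPIRY_MS = 7 * 24 * 60 * 60 * 1000; // default expiry: 7 days

function isExpired(item, now) {
  if (item.category === 'saved') return false; // saved items never expire
  return now - item.updatedAt.getTime() > EXPIRY_MS;
}

const now = Date.now();
const eightDaysAgo = new Date(now - 8 * 24 * 60 * 60 * 1000);

console.log(isExpired({ category: 'default', updatedAt: eightDaysAgo }, now)); // true
console.log(isExpired({ category: 'saved', updatedAt: eightDaysAgo }, now));   // false
```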
Note: Any data sanitization is only meant to prevent the server from crashing. It does not address the security of the application, as it is meant to be run locally.