The command line can be particularly useful for searching through multiple and/or large documents. The `grep` command in particular allows you to search for patterns and grab the documents, lines or sentences containing them.
The dataset of MPs' expenses claims is rather large - over 67MB. Rather than work with this in Excel, we might decide to create a smaller file using the command line and `grep`. Let's say we were interested in one MP called Adrian Bailey: we can write a command which identifies any lines in that CSV containing his name, and then puts the matching lines into a new CSV called 'baileysubset.csv'. Here it is:
grep ".*Adrian Bailey.*" Individual_claims_for_16_17.csv > baileysubset.csv
After `grep` itself we need to identify the pattern we are looking for. This is placed inside quotation marks and can include special regex symbols:
".*Adrian Bailey.*"
In this case, `.*` means 'none or more of any character', so we are looking for any line that contains none or more of any character, followed by his name, followed by none or more of any character.
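In fact, `grep` looks for the pattern anywhere in a line, so strictly speaking the `.*` at each end is optional here - this simpler version would grab the same lines:

grep "Adrian Bailey" Individual_claims_for_16_17.csv > baileysubset.csv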
Back in our original command, we then specify the file we are looking in:
Individual_claims_for_16_17.csv
Left as it is, this would then print the results to Terminal or PowerShell itself - but we want to store the results in a new file. To do that we add the `>` operator, followed by a name for the file we want to put the results in:
> baileysubset.csv
When the command runs, a new file should appear with that name in the directory that you have moved to (type `pwd` to find out where you are - this should be the same location as the CSV file you are searching/grepping).
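To check it worked, two standard Unix commands give a quick sanity check - `wc -l` counts the lines in the new file and `head` previews the first few (both work in Terminal; on Windows you may need something like Git Bash):

wc -l baileysubset.csv
head baileysubset.csv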
The ads for police on Reed.co.uk include page numbering like so: https://www.reed.co.uk/jobs?cached=True&pageno=1&keywords=police
We can use `curl` to cycle through the 10 most recent pages and save them locally:
curl "https://www.reed.co.uk/jobs?cached=True&pageno=[1-10]&keywords=police" -o "#1.html"
Then we can use `grep` to find the location of all those jobs:
grep '<li class="location">.*' *.html > locations.txt
Or, indeed, we could save it in a CSV:
grep '<li class="location">.*' *.html > locations.csv
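Either way, the matches will still include the HTML tag itself. As a rough sketch - the exact tag depends on Reed's markup, which may have changed - you could strip it out with `sed` before saving, adding `-h` so that `grep` doesn't prefix each match with the name of the file it came from:

grep -ho '<li class="location">[^<]*' *.html | sed 's/<li class="location">//' > locations.csv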
This webpage is one example of where this can be useful. It contains links to a number of Excel spreadsheets, but the links are generated by JavaScript, so cannot be scraped using Google Sheets functions like IMPORTHTML or IMPORTXML.
You can instead, however, download the webpage and use grep to find the links to the spreadsheets buried in that JavaScript.
I've copied the JSON in the source HTML into a text file here.
In Terminal, navigate to the folder containing that file and type the following command:
grep "[0-9a-zA-Z_-]*\.xlsx" TRAFIKKSTATISTIKK.txt -o
The bit after `grep` is regex: square brackets indicate that you want to look for any of the characters within: `0-9` means any digit; `a-z` means any lower case character; `A-Z` any upper case character; and `_` and `-` are included because some file names contain those as well.
The `*` means none or more of those characters, followed by `\.`, which means a full stop. The backslash is used to indicate that the period `.` should be taken literally - in regex the period otherwise has a special meaning (it matches any character). That is followed by `xlsx`. So this is looking for a series of any of the specified characters, followed by `.xlsx`. This will find our file names.
That regex is followed by the name of the file(s) to search, and then `-o`. The `-o` flag means we want to bring back the match only. If we don't include it, grep will bring back any lines containing a match - but as our file is all one single line, that would mean the whole line.
The series of characters does not include `/`, so the forward slash that comes before each file name in a link acts as the cut-off point: our matches will begin just after it.
The results can then be saved as a new file.
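For example, reusing the `>` operator from earlier (filenames.txt is just a suggested name):

grep -o "[0-9a-zA-Z_-]*\.xlsx" TRAFIKKSTATISTIKK.txt > filenames.txt

And if the same spreadsheet is linked more than once in the page, piping the matches through `sort -u` first will remove the duplicates:

grep -o "[0-9a-zA-Z_-]*\.xlsx" TRAFIKKSTATISTIKK.txt | sort -u > filenames.txt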
In this example I'm using the Lords Register of Interests - elsewhere in this repo I explain how to download those pages using `curl`, but you can also do this manually. Either way you'll need a few HTML pages first, and to navigate into the folder containing those.
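For example, assuming the pages were saved into a folder called lordsregister (a name I've made up for illustration), you would move into it and check the pages are there:

cd lordsregister
ls *.html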
If we wanted the names, the HTML before the information we want would look like this (note the space):
Lord
And the HTML after is:
</td></tr><tr><td class="lordsinterestcategory">
So we might write the following - note that `[A-E].html` here is a shell wildcard rather than regex: it tells grep to search any files in the folder named A.html through E.html:
grep "Lord .*</td></tr><tr><td class=" [A-E].html -o > results.txt
Other options include:
grep "Police and Crime Commissioner" [A-E].html -o > results.txt
grep "Chairman, [A-Za-z]*" [A-E].html -o > results.txt
grep "Chairman, \w*\s\w*\s\w*\s\w*\s\w*" [A-E].html -o > results.txt