The command line can be particularly useful for searching through multiple and/or large documents. The `grep` command in particular allows you to search for patterns and grab the documents, lines or sentences containing them.
The dataset of MPs' expenses claims is rather large - over 67MB. Rather than work with this in Excel, we might decide to create a smaller file using the command line and `grep`. Let's say we were interested in one MP called Adrian Bailey: we can write a command which identifies any lines in that CSV containing his name, and then puts the matching lines into a new CSV called 'baileysubset.csv'. Here it is:
grep ".*Adrian Bailey.*" Individual_claims_for_16_17.csv > baileysubset.csv
After `grep` itself we need to identify the pattern we are looking for. This is placed inside quotation marks and can include special regex symbols:
".*Adrian Bailey.*"
In this case, `.*` means 'none or more of any character', so we are looking for any line that contains none or more of any character, followed by his name, followed by none or more of any character.
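In fact, `grep` looks for the pattern anywhere in a line, so strictly speaking the `.*` at each end is optional here - this simpler version would grab the same lines:

grep "Adrian Bailey" Individual_claims_for_16_17.csv > baileysubset.csv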
Back in our original command, we then specify the file we are looking in:
Individual_claims_for_16_17.csv
Left as it is, this would then print the results to Terminal or PowerShell itself - but we want to store the results in a new file. To do that we add the `>` operator, followed by a name for the file we want to put the results in:
> baileysubset.csv
When the command runs, a new file should appear with that name in the directory that you have moved to (type `pwd` to find out where you are - this should be the same location as the CSV file you are searching/grepping).
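To check it worked, two standard Unix commands give a quick sanity check - `wc -l` counts the lines in the new file and `head` previews the first few (both work in Terminal; on Windows you may need something like Git Bash):

wc -l baileysubset.csv
head baileysubset.csv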
The ads for police on Reed.co.uk include page numbering like so: https://www.reed.co.uk/jobs?cached=True&pageno=1&keywords=police
We can use `curl` to cycle through the 10 most recent pages and save them locally:
curl "https://www.reed.co.uk/jobs?cached=True&pageno=[1-10]&keywords=police" -o "#1.html"
Then we can use `grep` to find the location of all those jobs:
grep '<li class="location">.*' *.html > locations.txt
Or, indeed, we could save it in a CSV:
grep '<li class="location">.*' *.html > locations.csv
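Either way, the matches will still include the HTML tag itself. As a rough sketch - the exact tag depends on Reed's markup, which may have changed - you could strip it out with `sed` before saving, adding `-h` so that `grep` doesn't prefix each match with the name of the file it came from:

grep -ho '<li class="location">[^<]*' *.html | sed 's/<li class="location">//' > locations.csv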
This webpage is one example of where this can be useful. It contains links to a number of Excel spreadsheets, but the links are generated by JavaScript, so cannot be scraped using Google Sheets functions like IMPORTHTML or IMPORTXML.
You can instead, however, download the webpage and use grep to find the links to the spreadsheets buried in that JavaScript.
I've copied the JSON in the source HTML into a text file here.
In Terminal, navigate to the folder containing that file and type the following command:
grep "[0-9a-zA-Z_-]*\.xlsx" TRAFIKKSTATISTIKK.txt -o
The bit after `grep` is regex: square brackets indicate that you want to look for any of the characters within: `0-9` means any digit; `a-z` means any lower case character; `A-Z` any upper case character; and `_` and `-` are included because some file names contain those as well.
The `*` means none or more of those characters, followed by `\.`, which means a full stop. The backslash is used to indicate that the period `.` should be taken literally - in regex the period otherwise has a special meaning (it matches any character). That is followed by `xlsx`. So this is looking for a series of any of the specified characters, followed by `.xlsx`. This will find our file names.
That regex is followed by the name of the file(s) to search, and then `-o`. The `-o` flag means we want to bring back the match only. If we don't include it, grep will bring back any lines containing a match - but as our file is all one single line, that would mean the whole line.
The series of characters does not include `/`, so the forward slash that comes before each file name in a link acts as the cut-off point: our matches will begin just after it.
The results can then be saved as a new file.
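For example, reusing the `>` operator from earlier (filenames.txt is just a suggested name):

grep -o "[0-9a-zA-Z_-]*\.xlsx" TRAFIKKSTATISTIKK.txt > filenames.txt

And if the same spreadsheet is linked more than once in the page, piping the matches through `sort -u` first will remove the duplicates:

grep -o "[0-9a-zA-Z_-]*\.xlsx" TRAFIKKSTATISTIKK.txt | sort -u > filenames.txt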
In this example I'm using the Lords Register of Interests - elsewhere in this repo I explain how to download those pages using `curl`, but you can also do this manually. Either way you'll need a few HTML pages first, and to navigate into the folder containing those.
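For example, assuming the pages were saved into a folder called lordsregister (a name I've made up for illustration), you would move into it and check the pages are there:

cd lordsregister
ls *.html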
If we wanted the names, the HTML before the information we want would look like this (note the space):
Lord
And the HTML after is:
</td></tr><tr><td class="lordsinterestcategory">
So we might write the following - note that `[A-E].html` here is a shell wildcard rather than regex: it tells grep to search any files in the folder named A.html through E.html:
grep "Lord .*</td></tr><tr><td class=" [A-E].html -o > results.txt
Other options include:
grep "Police and Crime Commissioner" [A-E].html -o > results.txt
grep "Chairman, [A-Za-z]*" [A-E].html -o > results.txt
grep "Chairman, \w*\s\w*\s\w*\s\w*\s\w*" [A-E].html -o > results.txt