-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathproject_management.Rmd
570 lines (423 loc) · 35.2 KB
/
project_management.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
---
pagetitle: "Package and project management"
author: "Dainius Masiliūnas, Loïc Dutrieux, Jan Verbesselt, Johannes Eberenz"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
rmdformats::html_clean:
highlight: zenburn
---
```{css, echo=FALSE}
@import url("https://netdna.bootstrapcdn.com/bootswatch/3.0.0/simplex/bootstrap.min.css");
.main-container {max-width: none;}
div.figcaption {display: none;}
pre {color: inherit; background-color: inherit;}
code[class^="sourceCode"]::before {
content: attr(class);
display: block;
text-align: right;
font-size: 70%;
}
code[class^="sourceCode r"]::before { content: "R Source";}
code[class^="sourceCode python"]::before { content: "Python Source"; }
code[class^="sourceCode bash"]::before { content: "Bash Source"; }
pre[class^="sourceCode r"] {background-color: #333366;}
pre[class^="sourceCode python"] {background-color: #555533;}
```
<font size="6">[WUR Geoscripting](https://geoscripting-wur.github.io/)</font> <img src="https://www.wur.nl/upload/854757ab-168f-46d7-b415-f8b501eebaa5_WUR_RGB_standard_2021-site.svg" alt="WUR logo" style="height: 35px; margin:inherit;"/>
# Learning objectives
* Learn about graphical interfaces for programming in R and Python
* Understand the structure of a package in R and Python
* Adopt some good scripting/programming habits
* Learn how to find help with solving programming problems
# Graphical user interfaces (GUIs) and integrated development environments (IDEs)
It's time for us to actually delve into R and Python programming!
But how do we do that efficiently?
We could write scripts in Notepad and run them, but that is not particularly efficient, as there are many graphical user interfaces (GUIs) that can help us write code faster.
Comprehensive GUIs that help with writing, debugging and packaging code are called integrated development environments (IDEs).
## R
There are multiple IDEs for R.
The most popular one is *RStudio*, as it is developed by a company that is very active in contributing to R (Hadley Wickham and others).
RStudio is cross-platform and open-source.
It comes in two types: *RStudio Desktop* is a regular desktop app, and *RStudio Server* is meant for running on a remote server, to which you can connect through your web browser.
Interestingly enough, the desktop version is in fact just *RStudio Server* running locally, with an integrated web browser based on *Google Chrome*.
Even though *RStudio* is very popular, it is not the only IDE for R.
*RKWard* is an alternative IDE, aimed at easier learning of R for people coming from other statistical software, such as *SPSS*.
It includes menus for common statistical analysis algorithms, such as getting descriptive statistics, running statistical tests, and making plots.
It also includes an editor for data tables.
*RKWard* is an open-source native desktop app that is developed on Linux, but has also been ported to other platforms, including Windows.
Furthermore, there are cross-language GUIs/IDEs, such as *Jupyter*, whose name is a portmanteau of **Ju**lia, **Pyt**hon and **R**.
*Jupyter* can run different language interpreters, what it calls kernels, for each script that is open, therefore scripts in multiple languages can be edited at the same time in the same interface.
One thing to keep in mind is that there is a distinction between a programming language, such as R, and its IDEs, such as *RStudio*.
Which IDE you use is up to your own personal preference.
Which IDE you used to develop code does not matter, because in the end you just have an R script that can be run on any of the IDEs, or directly from the command line.
What does matter is that the script is written in the R language, as the users will need to have the R interpreter installed in order to run the script.
Likewise, the packages that you use in your script are also important, as the users will need to install them before they can run your script.
Therefore, whenever you refer to scripts and packages, for instance when writing a thesis, you need to specify what language the scripts are written in, and what packages (and ideally what versions) you used.
You should not mention which IDE you used, as that is irrelevant to the readers.
For example, you should not write "the scripts are written in RStudio", but rather, you should write "the scripts are written in R, using package `lattice` version 0.22".
In fact, you should also cite the authors of the language and the packages you used.
R includes a handy function `citation()` to help you cite:
```{r}
citation()
```
It also works on packages:
```{r}
citation("lattice")
```
There are more IDEs, such as R Commander, but we will not be talking about them within the scope of this course.
Nevertheless, feel free to use whichever IDE is your favourite!
```{block type="alert alert-info"}
As a sidenote, IDEs can change how certain commands in R work.
For instance, there are two commands in R that load packages: `library()` and `require()`.
Most IDEs make no difference between the two; `require()` just gives a warning if it can't load a package, whereas `library()` stops with an error.
However, in *RKWard*, running `require()` on a package that does not exist will result in its package management dialogue opening with the required packages preselected for installation, and closing the dialogue will result in a successful loading of the package (if it was installed successfully).
Therefore, do not use `require("package")` to check if `package` is installed, otherwise it will be installed twice under RKward.
Instead, you can use `"package" %in% installed.packages()`.
```
Next, let's look a bit more in detail at the IDEs described above and their configuration options.
### RStudio
In the virtual machines provided in the course, RStudio is already installed for you.
In case you are working on your own computer and would like to know how to install *R* and *RStudio*, see the [RStudio website](https://www.rstudio.com/products/rstudio/download/).
If you are new to *RStudio* take a look at the following summary YouTube video about how to use RStudio: [Intro to RStudio (6 min)](https://www.youtube.com/watch?v=FIrsOBy5k58).
In it you will learn how to navigate the RStudio environment and even run some code.
This will be helpful for later in the course.
Additionally, the first time you open *RStudio* it's helpful to change some global settings.
To do this go to *Tools* → *Global Options...* in the top menu bar of *RStudio*.
The following box should appear.
Set the settings to match that of the screenshot, namely, *Never save the workspace to .RData file on exit* and uncheck the boxes under *History* and *Restore .RData*.
The effect is that RStudio will stop prompting you on exit about whether you want to save the workspace and reload it on next start, which will save you time answering prompts and save time opening RStudio.
Most importantly, it will make sure that restarting RStudio leaves you with a clean environment, equal to what another user of your code would start with, so you can properly test your code, and restart RStudio in case something goes wrong to get back to a clean state.
![Global Options](../Scripting4GeoIntro/figs/global_options.png)
To make a new script, click on *File* -> *New File* -> *R Script*, and a new editor pane will open, allowing you to write code.
When you write a line, you can run each individual line that your text cursor is on by hitting *Ctrl+Enter*.
To install packages, aside from using the R console below (`install.packages()` function), you can install them through the graphical *Packages* pane at the lower right corner of the screen, where you can search for and install packages and their updates.
### RKWard
RKWard is also preinstalled on the virtual machine.
On your own computer, you can get it from the [RKWard website](https://rkward.kde.org/Download_RKWard.html).
When you launch RKWard, it will start a first-start setup wizard, that suggests installing some additional packages.
Feel free to simply dismiss it by clicking *Cancel*.
To start a new file, click *Create* -> *Script File*.
Typically you can also use *Ctrl+Enter* to run each line that your cursor is on.
RKWard sometimes gives too many autocompletion tooltips, you can disable them by going to *Settings* -> *Configure Script Editor* and unchecking *Function call tip*.
You can install packages, aside from using the `install.packages()` function in the R console below, through *Settings* -> *Manage R packages and plugins...*.
You will be prompted to select a mirror, just use the top one (*0-Cloud*) to automatically select the best one.
The tab *Install / Update / Remove R packages* lists all packages that are available on CRAN, packages that are installed and are the latest version, and packages that are installed but can be updated.
To install packages, click the checkbox next to their names, and to remove packages, uncheck them.
The changes will only be applied once you click *Apply*.
## Jupyter
*Jupyter* comes in two types as well: *Jupyter Notebook* and *Jupyter Lab*.
*Jupyter Notebook* is an older interface that is based on the notebook concept, where you combine text with code in a single (`.ipynb`) file.
*Jupyter Notebooks* are extremely popular in the Python community, as they are easy for writing documentation and tutorials (vignettes).
They are displayed automatically in GitLab and GitHub.
However, *Jupyter Notebooks* are less popular in R, because *RMarkdown* provides a similar solution for R (and recently *Quarto* extends *RMarkdown* beyond R, therefore directly competing with *Jupyter Notebooks*).
An issue with *Jupyter Notebooks* is that, due to mixing of text and code, you cannot run a notebook from the terminal, or in fact use any IDE other than *Jupyter Notebook*.
They are also not well compatible with *Git*, as multiple developers cannot edit the same file without causing spurious conflicts, albeit there are workarounds.
*Jupyter Lab* is a newer interface that allows directly editing R and Python files, in addition to opening *Jupyter Notebooks*.
Its advantage is that it can edit multiple scripts in multiple languages at the same time, but its disadvantage is that it cannot provide functionality that is specific to any one language, therefore it is relatively barebones.
Like *RStudio Server*, *Jupyter* is mainly designed to be run on a server.
It can be run locally by launching a server and then connecting to it through a web browser.
However, since *Jupyter* is written in Python, it requires specific Python setup to install.
We will learn more about how to run *Jupyter* in the Python part of the course.
The fact that *Jupyter* is open source and designed to work as a server has created opportunities for Google to create something of a Google Docs version of *Jupyter*, that is called *Google Collaboratory*, or *Google Colab* for short.
Google provides every user of *Google Colab* with computer resources to run Python or R.
It is therefore the easiest way to run Python without changing anything on your own computer.
If anything goes wrong, you can always create a new *Colab* notebook and start over.
First, go to the [Google Colab website](https://colab.research.google.com) and click on *+ New notebook* to create a *Colab* notebook.
The default kernel type in *Google Colab* is Python, but you can change it to R by going to *Runtime* -> *Change runtime type* and selecting *R*.
Click *Connect* at the top right to start the kernel, so that you can run code blocks.
Note that *Google Colab* has a special character, the exclamation point `!`, which allows you to run Bash commands within Python code blocks.
This is very useful and important for installing Python packages, as they are typically installed using the `pip` command from Bash.
You can try running a code block like this to install and load the rasterio package:
```{python, eval=FALSE}
!pip install rasterio
import rasterio
```
## Python
Aside from Jupyter, you can use other IDEs for Python.
A simple to install IDE is called [Spyder](https://www.spyder-ide.org/), as it is its own Python package.
You can also use more complex IDEs for Python scripting, such as [PyCharm](https://www.jetbrains.com/pycharm/) and Microsoft [Visual Studio Code](https://code.visualstudio.com/), but they are more complicated to install.
You will learn more about Python IDEs later in the course. For now, we will stick with using *Google Colab* to run Python code.
# The overall system architecture
![System architecture graph](../IntroToRaster/figs/geoscripting-system-overview.svg){width=100%}
To recap, the overall system architecture is comprised of integrated development environments (IDE), engines, packages, bindings, and libraries. Let's start with the engine, the core program that executes the foundation or crucial tasks of the programming language. The most relevant engines in this course are Python and R. To interact with the engine, we can either code in the command line directly or we can use an IDE, which provides many tools and features for working with the engine integrated in a single environment. An IDE allows the developer to write code, test it, and debug it all in a single software application.
We then install packages in the IDE, these are collections of related code files, libraries, and resources that are related and compatible with each other. Libraries are pre-written code that provide specific functions and services. The idea of libraries and packages is to make coding more efficient by the reuse of common functions.
# What are packages?
Packages extend the functionality of a programming language by providing new functions (and/or datasets).
There is a huge variety of packages in both R and Python.
But how are packages made?
In essence, packages are nothing more than bundles of scripts!
Typically, these scripts provide functions: small pieces of scripts that do one particular task, typically with a name and input arguments.
By loading a package, we simply place these functions in what is called the global environment: a part of your computer memory that holds your variables, function definitions, datasets etc. for the currently running session of R or Python.
Packages are formalised by a set of rules that all developers have to abide by.
One important rule is that the action of merely loading a package should not cause harm to the user (or their environment): functions can be loaded into the global environment, but should not overwrite what is already there, or run any code (which could potentially delete files etc.).
Therefore, packages typically consist of script files with nothing other than function definitions.
Running those functions is left up to the users in their own scripts.
How to run them is documented using examples, either in comments next to the functions, or in an external help file, but never directly as executable code.
In addition to the scripts (and their documentation), packages typically include some metadata, such as the name of the author, package version, what other packages are needed to run this package, etc.
In R, this metadata is mandatory, whereas in Python it's to some extent optional.
Packages also typically follow a particular structure, i.e. where files are located and how.
Let's take a closer look at the package structure in both Python and R.
# Package structure
## In Python
A simple package (module) in Python is simply a directory with a Python file in it.
In Colab, this code will create a directory and a Python file containing a Hello World example:
```{bash, eval=FALSE}
! mkdir -p MyPackage
! echo "def helloworld():" > MyPackage/helloworld.py
! echo " print('hello world')" >> MyPackage/helloworld.py
```
Now we can directly make use of our new `helloworld()` function in Python:
```{python, eval=FALSE}
import MyPackage.helloworld as mphw
mphw.helloworld()
```
In the import name, `MyPackage` is the name of the directory, and `helloworld` is the name of the file.
We rename `MyPackage.helloworld` to `mphw` for brevity, but it's optional.
Well, that was easy!
Though our package is technically not a complete package yet, as it can only be run if we copy the file to every code project that wants to use our function.
A true package can be installed and then run from any place on your computer.
Thankfully that's also easy to do!
All we need is to provide a bit more structure and metadata.
The metadata file in Python is called `pyproject.toml`, and it has to be in a subdirectory with the name of the package, as the name will appear in the list of packages.
The actual functions need to be in a subdirectory of that subdirectory, with the name of the package, as the package will appear in Python.
Typically, both of these names should be the same, otherwise it will result in user confusion.
Lastly, if we want to be able to just import the package to get our functions, without having to specify the name of the file that holds the functions we want to import, we should call the script with our functions `__init__.py`.
In summary, the package structure looks like this:
```
MyPackage
├── MyPackage
│ └── __init__.py
└── pyproject.toml
```
Let's make our files match this structure!
First, make the subdirectory `MyPackage/MyPackage`, and then rename our `helloworld.py` to `MyPackage/MyPackage/__init__.py`:
```{bash, eval=FALSE}
! mkdir MyPackage/MyPackage
! mv MyPackage/helloworld.py MyPackage/MyPackage/__init__.py
```
Next, let's add the project metadata.
We only need to specify the name of the package and its version.
To prevent confusion, use the same name as the name of the two directories.
```{bash, eval=FALSE}
! echo '[project]' > MyPackage/pyproject.toml
! echo 'name = "MyPackage"' >> MyPackage/pyproject.toml
! echo 'version = "0.1"' >> MyPackage/pyproject.toml
```
Bam, we now have a true package! We can install it with `pip`:
```{bash, eval=FALSE}
! pip install ./MyPackage
```
And now we can use our function from any directory simply as:
```{python, eval=FALSE}
import MyPackage
MyPackage.helloworld()
```
## In R
Packages in R are also nothing more than an organised collection of R scripts.
Compared to Python, there are a few more files necessary to make an R package, but thankfully, we also have tools that help to create one!
Once again, let's make a Hello World function and put it into a file:
```{bash}
echo 'HelloWorld <- function() print("Hello world")' > HelloWorld.r
```
Let's make it into a package! We use the function `package.skeleton()` to autocreate a package structure:
```{r}
package.skeleton(name = "MyPackage", code_files = "HelloWorld.r")
```
Let's follow the instructions to read the file `MyPackage/Read-and-delete-me`.
And then delete it:
```{bash}
rm MyPackage/Read-and-delete-me
```
Let's see what is the R package structure:
```{bash}
tree -n MyPackage
```
It's a little bit more complicated than Python, but not by much.
The script files are stored in a directory called `R` (as R packages can also include C++ code etc.), and our script file is already there.
Next we have the `DESCRIPTION` file, which, like the `pyproject.toml` file we saw for Python, includes metadata about the package.
You can open it and edit it.
Lastly, we have the `NAMESPACE` file, which states which functions will be accessible to package users (as opposed to only internal to the package), and a `man` directory, which includes the documentation manual entries.
We have two entries: `HelloWorld.Rd`, which is the description of our `HelloWorld` function, and `MyPackage-package.Rd`, which is a description of the package itself.
These documentation files are written in an [R variant of LaTeX](https://cran.r-project.org/doc/manuals/R-exts.html#Rd-format).
Let's install our package!
First we need to build it, from the terminal:
```{bash}
R CMD build MyPackage
```
As you can see, the package is just a zip of our package files, plus some metadata cache.
Now we can use R itself to install our new package:
```{r}
install.packages("./MyPackage_1.0.tar.gz")
```
Perfect, we can immediately load it and use it:
```{r}
library(MyPackage)
HelloWorld()
```
As a bonus, we can also see the documentation we had in the `man` files:
```{r, eval=FALSE}
?HelloWorld
```
That was also pretty easy!
To summarise, we have learned that packages in both Python and R are nothing more than scripts that contain functions, in a particular directory structure.
If we start developing our projects with that structure in mind, we can make packages out of our scripts very easily!
```{block type="alert alert-info"}
**Tip**: Once you have a package, you can also have it uploaded to a package repository, so that others (or you from another computer) can easily download and run the code in the package.
However, it does require you to have your code in good working order and be well documented, since, after all, it's going to be public for everyone to see!
The package repository for Python is called *PyPI*, The Python Package Index.
PyPI is developer-friendly, which means that you can easily upload your packages on PyPI without much hassle, as long as your project is in a reasonable shape and declares all its dependencies properly.
See the [PyPI documentation](https://packaging.python.org/en/latest/tutorials/packaging-projects/) to learn more about how to submit packages to PyPI.
The package repository for R is called *CRAN*, the Comprehensive R Archive Network.
Unlike PyPI, CRAN is end-user-friendly.
That means that submitting packages to CRAN is not nearly as easy, because it has much higher quality requirements for any (new) packages.
All packages submitted to CRAN have to pass through both automatic tests (using `R CMD check` command), as well as human review.
Your package will be rejected if it depends on a package version that is not currently on CRAN, if you fail to document any function parameter, make any typo anywhere or even fail to use quotes correctly.
Therefore, the package submission process for CRAN is more akin to peer review for publishing a scientific paper.
However, it ensures that packages always work together with other packages, which immensely simplifies package dependency management and allows users to simply use `install.packages()` and expect the new package to just work.
Packages are also regularly removed from CRAN if they stop working due to updates in their dependencies, making sure that all packages on CRAN stay compatible with each other.
See [CRAN documentation](https://cran.r-project.org/web/packages/policies.html) for more information about package submission requirements.
```
```{bash include=FALSE}
rm -rf MyPackage MyPackage_1.0.tar.gz HelloWorld.r
```
```{r include=FALSE}
remove.packages("MyPackage")
```
# Good programming habits
## Project structure
Making packages is a great way to stay organized, keep track of what you are doing and be able to use it quickly and properly at any time. Packages are:
* Easy to share with others
* Dependencies are automatically imported and functions are sourced (reduces the risk of having broken dependencies)
* Documentation is attached to the functions and cannot be lost (or forgotten)
For these reasons, if you build a package, next year you will still be able to run the functions you wrote yesterday. Which is often not the case for stand alone functions that are poorly documented and may depend on many other functions ... that you cannot find any more. So to summarize, packages are not only a way to extend functionality, they are also the standard way to archive and save functions.
To make it easy for you to make a package out of your code, you need to stick with a project structure that packages have.
Another benefit of maintaining a consistent project structure is that it will make it easier for you to switch from one project to the other and immediately understand how things work.
To practice keeping a consistent project structure, in this course we will be following the structure below:
![Project Structure Schema](figs/projectStruture.png)
* A `main` script at the root of the project. This script performs step by step the different operations of your project. It is the only non-generic part of your project (it contains executable code outside functions, including paths, already set variables, etc.). The file extension of this file will depend on what language you are using for your project. We typically call these files `main.r` and `main.py`, but it could also be e.g. `task1.r`.
* As we will be working with multiple languages throughout this course we will keep things organized by placing the scripts into their respective language sub-directories (`R/`, `Python/`, and `Bash/`). These directories should contain the functions you have defined as part of your project. These functions should be as generic as possible and are *sourced* and called by the `main` script. The way this is done depends on the language used by the `main` script. For example, in R you would write `source("R/myfunction.R")`. Whereas in Python, you would use `import Python.myfunction`.
* Each file in the `R` and `Python` directory should ideally consist of a single function with the same name as the file itself, to make it easy to find. In Python it is common to combine multiple functions in one file, because typically you refer to the file (module) name in addition to the function name (through `import MyPackage`), but it is still good practice to keep each function in its own file (so you don't get confused where each function comes from even if you do `from MyPackage import *`).
* A `data/` subdirectory: This directory contains data sets of the project. Since Git is not as efficient with non-text files, and GitLab has storage limits, you should only put small data sets in that directory (<2-3 MB). Typically, you do not include it in your git repository at all; rather, this directory is created from your `main` script, and is used to store data downloaded from the internet. It can be safely removed after the script is finished running.
* An `output/` subdirectory (when applicable), where you place the final result of running your script. This should also not be tracked by git: your scripts create the output, so there is no need to store it.
* A `README.md` file should be included, this file should contain a description of your project, along with a description of what other packages your script needs to function correctly, and any other instructions needed to correctly run your code.
* Finally, you should include a `LICENSE.txt` file with the software licence which you would like your code to have.
```{block type="alert alert-danger"}
**Warning**: **Never** upload large (raster) files to git! If your repository exceeds 100 MiB in size, it will no longer be loaded by CodeGrade. Deleting the files will have no effect, because Git stores the history forever. **Fixing a repository like that is extremely complicated and time-consuming!**
**Note**: Without a `LICENSE` file, your code is copyright, which means that nobody has the right to even read it, let alone run it! Make sure to always specify a license that would at least allow that.
```
### Example `main` file
Typically the header of your main script will look like the following:
#### In R (`main.R`)
```{r, eval=FALSE}
# Team Teamname (John Doe and Jane Smith)
# January 2020
# Import packages
library(terra)
library(sf)
# Source functions
source('R/download_data.R')
source('R/function2.R')
# Use sourced function to download data to the directory "data"
download_data("data")
# Load datasets
postboxes <- st_read('data/postbox_locations.gpkg')
# Then the actual commands
```
#### In Python (`main.py`)
```{python, eval=FALSE}
# Team Teamname (John Doe and Jane Smith)
# January 2020
# Import packages
import geopandas as gpd
import matplotlib.pyplot at plt
# Import functions
import Python.download as download
# Use imported function to download data to the directory "data"
download.download_data("data")
# Load datasets
postboxes = gpd.read_file('data/postbox_locations.gpkg')
# Then the actual commands
```
## Working directory, relative and absolute file paths
*At the end of the following section you should be able to explain the difference between the following:*
* relative path
* absolute path
* working directory,
* And the following special directories:
* ` . `
* ` .. `
* ` / ` or the **root** directory
In the R and Python examples above we load the datasets by indicating the file location from the data directory `"data/postbox_locations.gpkg"`. However, you may have many data folders on your computer, for all types of different projects. So how does the system know to look in the correct one? Moreover, if you share your script with a friend the location of their project and data folders will be different than that of your setup. It would be a nuisance if they had to change all references to these files in their script. To deal with these issues, we use **relative file paths**.
In **relative** file paths, we don't include the location of the project (**working**) directory itself, these paths are *relative* to the *working directory* (the "Project_Structure" folder). In the example above, the **relative** file path for the post box locations file would be `data/postbox_locations.gpkg`, whereas the **absolute** file path would be `"/home/osboxes/Geoscripting/Project_Structure/data/postbox_locations.gpkg"`. The **absolute** file path refers to a file from the **root** of the entire file system. On Linux (and other UNIX-like systems like macOS), absolute file paths **always** start with ` / `.
```{block, type="alert alert-danger"}
**Note**: on Windows, you might see a backslash (`\\`) being used as a path separator instead of a slash (`/`). **Don't do this**! In many languages, including R, a backslash denotes an [escape sequence](https://en.wikipedia.org/wiki/Escape_sequence). In addition, a backslash is not a valid path separator on non-Windows platforms, whereas both a slash and a backslash are valid on Windows. So save yourself the trouble and **always use a slash as a path separator**!
```
So what is the **working directory**? By convention, the working directory is the same as the location of the script which you are working on. This means that you can simply assume that whoever runs your script, will run it from the directory that your script is located.
```{block, type="alert alert-danger"}
**Note**: this also means that when you test others' code, you should also make sure to run it from the directory that the script is located, unless stated otherwise!
```
To refer to directories or files that are within the working directory, we simply use their names. So to refer to a directory called `R` in our working directory, we type `R`. To refer to a file or directory within another directory, we type the name of the directory, a slash, and then the name of the file/directory, for instance, `R/function1.R`.
If we want to refer to a file or directory that is above the indicated directory, we use the special directory ` .. `. For instance, if our `main.R` is not in our project root, but located in the sub-directory `demo` (therefore our working directory is `demo`), we would refer to our `function1.R` file as `../R/function1.R`. Another special directory is ` . `, which refers to the indicated directory itself.
When making Git repository, you want to make sure that all your code is *portable* and self-contained, i.e. you can run it from any computer, and ideally using any operating system. That means that as a rule of thumb you should **always use relative file paths** in your scripts.
```{block, type="alert alert-success"}
> **Question 1**: what would be the location of this file: `././R/./.././R/./././function2.R`? How about `/./R/./.././R/./././function2.R`? What would be the meaning of `C:/Windows/cmd.exe` on Linux? Is it a relative or absolute file path?
```
## Documentation
As we have seen in the package examples, documentation of code is extremely important to understand what your code is doing.
It's not only for others reading your code, but also for yourself, when you have to revisit code you haven't touched in a while.
If you follow the good scripting habits mentioned above, you will already have your code divided into reusable functions, that are in their own function files.
Each function should have enough comments to understand each step the function is performing.
Comments are any words after the `#` character, in either language.
Same goes for the main script, it should also have each step described.
But you don't need to make it redundant: if it's already clear from the function name what it does, it's useless to repeat it in a comment.
Rather, you would describe sections of code at once, e.g. what a loop does by itself, rather than its indvidual steps.
In addition, if you use code from someone else, make sure to obey the license of the code you are using, and give credit to the original author using comments.
In addition to comments of your code, each function should also have a description and an explanation for all of its arguments.
In both R and Python, there is a special way to add this information in comments next to the function definition, so that help files can be generated out of them automatically.
The systems have different conventions, but effectively achieve the same purpose.
In Python, the system is called [Docstrings](https://peps.python.org/pep-0257/).
It consists of strings using triple quotes, that effectively acts as a multi-line comment.
Here's an example of a Python function documented with a Dosctring using the official reStructuredText (Sphinx) format:
```{python}
def hello(name):
"""Greet a person.
Give a customised "Hello" greeting for the input name.
:param name: A string containing the name of the person to greet.
:return: A greeting string.
"""
return("Hello " + name)
```
The Docstring will be automatically printed when you run `help(hello)`, and will be converted into a nice webpage using tools such as [Sphinx](https://sphinx-rtd-tutorial.readthedocs.io/en/latest/install.html).
In R, the system is called [roxygen2](https://roxygen2.r-lib.org/articles/rd.html) and is provided by the `roxygen2` package.
Here's an equivalent example in R, using roxygen2:
```{r}
#' Greet a person
#'
#' Give a customised "Hello" greeting for the input name.
#'
#' @param name A string containing the name of the person to greet.
#' @returns A greeting string.
hello <- function(name) {
return(paste("Hello", name))
}
```
The roxygen2 package was created to make it easier to write documentation in R, so that documentation does not have to be separate from the code.
The `roxygenize()` function generates the documentation files from the comments automatically, and then the documentation appears when running `?hello` in our example.
## Summary
Increasing your scripting/programming efficiency goes through adopting good scripting habits. Following the guidelines above will ensure that your work:
* Can be understood and used by others.
* Can be understood and reused by you in the future.
* Can be debugged with minimal effort.
* Can be re-used across different projects.
* Is easily accessible by others.
The summary of good practice guidelines below is not exhaustive, but already constitutes a good basis that will help you getting more efficient now and in the future when working on scripting projects:
* Comment your code.
* Write functions for code you need more than once:
* Document your functions.
* Keep consistent style.
* Make your own packages, or at least keep a similar directory structure across your projects.
* Use version control to develop/maintain your projects and packages.
<!-- Move the following sections from into to linux to here, once we move Linux out:
# How to find help
## Reproducible examples
-->