---
title: Environmental Informatics
subtitle: ESM 296-3W (winter 2016)
layout: default
---
Environmental Informatics is an introduction to the management and analysis of environmental information, providing students with the necessary computational background for more advanced Bren courses. Topics include: the basic computing environment (hardware and operating systems); programming language concepts; program design; data organization; software tools; generic analytical techniques (relational algebra, graphics & visualization, etc.); and specific characteristics of environmental information. We'll focus on using the R environment for data reading, manipulation, analysis and visualization. An emphasis will be placed on reproducibility, including version control with git and GitHub.
Topics will be presented in weekly 3-hour modules mixing lectures and hands-on examples, using students' own computers. There are no prerequisites.
- Naomi Tague [NT] [email protected] Office hours: TBD
- Ben Best [BB] [email protected] Office hours: Tuesdays 11:30 - 1pm in Bren 4524
- Forum at env-info.slack.com
- Stickies, aka post-it notes, available to pick up and return at the front table; stick one up off the top of your laptop screen to flag that you need help
- Issues for ucsb-bren/env-info Github repository
- Feedback using Google Forms
Each week you will be given an assignment in class, and we will spend some time working on it there. The completed assignment will be turned in via Github, due by 5pm the day before the next class (i.e. Thursday). Most assignments will have individual and group components.
- 70% assignments (7 assignments @ 10% each)
- 20% final project (paper + presentation)
- 10% participation
You will work in small groups for the final project. There will also be a short paper accompanied by an in-class presentation to be submitted the final week of class. This project will provide a review of several examples of innovative applications of data analysis or computing that illustrate how the strategic use of informatics can change how we think about or approach solving environmental problems. One of these examples will present the cumulative analysis of your own small group's work applying analytical techniques to a custom dataset. The other examples should come from the literature and/or online sources on a related topic.
- presentation: 20 min (incl. 5 min for questions). Please spend the first half of the presentation in a traditional format, explaining the overall scientific question, background on the data and methods, and finally your results, hopefully with aesthetically pleasing interactive visualizations. Spend the second half presenting the biggest obstacles and solutions based on your experience, preferably diving into the data and code to share lessons learned with your fellow students.
- paper: 5 pages. The "live" paper will live on your class group website <org>.github.io, to be rendered from Rmarkdown to HTML, and contain the following headers:
- Introduction: Introduce your topic, providing specific scientific questions and problem statements.
- Innovative Examples: Provide at least 3 examples of innovative approaches to solving, or at least communicating, the problem. Figures and tables are especially welcome.
- Methods: Introduce the goals of your analysis. Summarize the methods of your analysis. Reference a Github release of your organization's repository, which should contain all the commented code for your analysis.
- Results: Present the tables and figures from your analysis along with text that summarizes these findings. You're encouraged to use sortable tables and interactive figures.
- Discussion: Discuss the implications of your findings. Include recommendations for future analysis.
- References: Mention at least 5 references above and provide the proper citations here. These can be any combination of scientific literature and/or websites.
Due: 5pm Friday, March 18th. Submit the URL of your final project (i.e. rendered Rmarkdown as HTML at http://<org>.github.io or a Shiny app published to http://<user>.shinyapps.io) into GauchoSpace for the final presentation and final paper.
Listed by week...
Environmental science and management is increasingly a group enterprise involving many stakeholders from various disciplines. Environmental science also increasingly requires collection, processing, analysis and interpretation of large data sets. There are a variety of tools that help make collaborative data analysis easier. We'll focus this first week on getting you up to speed with the basics of operating two technologies that are currently the most popular and intuitive:
- Git is the most popular file versioning software, which allows you to play nicely with others when it comes to code and data. Github is the most popular online site for hosting git repositories, and has many bonus features for rendering formats (md, csv, geojson, ...) and handling project management (issues, wiki, ...).
- Rmarkdown enables you to weave rendered chunks of R code in with formatted text (as markdown), making it easy to generate tables, figures, formulas and references in a variety of outputs: documents, PDFs, websites or interactive online applications (see the minimal sketch after this list).
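As a taste, here's a minimal sketch of rendering an Rmarkdown document to HTML from the R console (the file name `analysis.Rmd` is hypothetical):

```r
# install.packages("rmarkdown")  # one-time install, if needed
library(rmarkdown)

# Render analysis.Rmd (a hypothetical file) to HTML; code chunks are
# executed and their output is woven into the resulting document
render("analysis.Rmd", output_format = "html_document")
```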
Programming is a general term used for developing sets of instructions for data generation, analysis, interpretation and visualization. We will introduce some basic programming concepts: data types, flow control and functions. We will also cover programming "best practices". While the specific syntax here applies to R, the concepts are universal to all programming languages.
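For a flavor of these concepts in R, here's a minimal sketch (the temperature values are made up for illustration):

```r
# Data types: a numeric vector of (made-up) temperatures in Celsius
temps_c <- c(18.5, 21.0, 16.3)

# A function: convert Celsius to Fahrenheit
c_to_f <- function(c) {
  c * 9 / 5 + 32
}

# Flow control: loop over values and branch on a condition
for (t in temps_c) {
  if (t > 20) {
    message(t, " C is warm (", c_to_f(t), " F)")
  } else {
    message(t, " C is cool (", c_to_f(t), " F)")
  }
}
```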
Getting your data into the format you require is often one of the most frustrating and time-consuming tasks involved in data analysis. Fortunately there are tools that make this easier. You will become inculcated into the "Hadley"-verse of R packages, which represent a wonderful new paradigm of data science that embraces readability of code. We'll focus on these R packages in particular:
- readr: read and write tabular data with sensible defaults (i.e. no factors). We'll also cover related packages such as rgdal to read and write spatial data.
- dplyr is your main data wrangling tool, with a piping idiom (`%>%`) that encourages very readable SQL-like sequential statements: `select`, `filter`, `arrange`, `group_by`, `summarize`. The other beauty of dplyr is that you can initially write for a simple CSV, then scale up the back end to work with databases (such as sqlite, MySQL, PostgreSQL or even Google BigQuery); dplyr translates the backend functions automatically, so there's no need to rewrite the rest of your code (the concept of "middleware"). See the sketch after this list.
See also the data wrangling cheat sheet with dplyr, tidyr.
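Here's a minimal sketch of such a pipeline (the file `gauges.csv` and its columns `site` and `flow_cfs` are hypothetical):

```r
library(readr)
library(dplyr)

# Read a hypothetical CSV of stream gauge measurements
# (columns assumed: site, date, flow_cfs)
d <- read_csv("gauges.csv")

# A readable SQL-like pipeline: filter, group, summarize, sort
d %>%
  filter(!is.na(flow_cfs)) %>%
  group_by(site) %>%
  summarize(
    n        = n(),
    mean_cfs = mean(flow_cfs)) %>%
  arrange(desc(mean_cfs))
```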
Data comes in a wide variety of formats. Literally. You'll learn about "wide" vs "narrow" formats with the tidyr package, as well as how to handle dates/times with lubridate, and strings with stringr. We'll throw in a bit about regular expressions for good measure.
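A minimal sketch of these ideas (the table below is made up for illustration):

```r
library(tidyr)
library(lubridate)
library(stringr)

# A made-up "wide" table: one measurement column per year
wide <- data.frame(
  site   = c("A", "B"),
  `2014` = c(1.2, 3.4),
  `2015` = c(2.1, 4.3),
  check.names = FALSE)

# Wide -> narrow with tidyr::gather: one row per site-year
narrow <- gather(wide, year, value, `2014`:`2015`)

# Parse a date with lubridate; match strings with a regular expression
ymd("2016-01-08")                            # -> a Date object
str_detect(c("site_A", "plot_B"), "^site_")  # -> TRUE FALSE
```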
Visualization allows you to find patterns in your data. Good visualization allows you to communicate what you learn from data to others. New tools provide users with efficient and flexible ways to generate elegant, informative visualizations of their data. We will introduce you to 'best practices' and R's powerful visualization "grammar", ggplot2, which allows you to quickly generate some pretty fancy plots and tailor them to your audience. See the ggplot2 cheat sheet.
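To give a flavor of the grammar, here's a minimal sketch using R's built-in iris data:

```r
library(ggplot2)

# Map variables to aesthetics, then add layers
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # linear fit per species
  labs(
    title = "Sepal dimensions by species",
    x     = "Sepal length (cm)",
    y     = "Sepal width (cm)")
```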
The majority of exciting interactive application development is happening these days on the web, and specifically with powerful JavaScript libraries (especially with the node framework). R, and particularly the RStudio environment, have taken advantage of this with the new htmlwidgets architecture, which enables exciting interactive visualizations right from the RStudio IDE (as a Viewer pane), rendered as a standalone HTML document (so easy to share with colleagues or on a website), and/or integrated within a Shiny application (for full-featured slice-and-dice capabilities, but dependent on an R backend engine; see next week). Check out the htmlwidgets showcase for a sample of the types of interactive visualizations made easy to render (see the leaflet sketch after this list):
- leaflet: geospatial mapping
- dygraphs: time series charting
- metricsgraphics: scatterplots and line charts with D3
- networkD3: graph data visualization with D3
- d3heatmap: interactive heatmaps with D3
- dataTables: tabular data display
- threejs: 3D scatterplots and globes
- DiagrammeR: Diagrams and flowcharts
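For instance, here's a minimal leaflet sketch (the coordinates are approximate for the UCSB campus and are my own assumption, not from the course materials):

```r
library(leaflet)

# An interactive map, rendered in the RStudio Viewer pane
leaflet() %>%
  addTiles() %>%                      # default OpenStreetMap basemap
  addMarkers(
    lng = -119.8489, lat = 34.4140,   # approximate UCSB coordinates
    popup = "Bren Hall, UCSB")
```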
Developing more complex programs involves breaking data analysis down into key components, and organizing these components so that they can be easily re-used, modified and linked with other programs. We will introduce you to techniques for structured programming. You'll learn how to create your own R package.
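As a sketch of that workflow using devtools (the package name `envtools` is hypothetical):

```r
# install.packages("devtools")  # one-time install, if needed
library(devtools)

# Scaffold a new package skeleton in ./envtools
create("envtools")

# Add functions under envtools/R/, then iterate:
load_all("envtools")  # load the package's functions for interactive use
check("envtools")     # run R CMD check on the package
```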
Two essential components of programming best practices are documentation and testing. Particularly when programming and data analysis involve multiple steps or collaborative programming, good documentation and testing are essential. We will introduce ways to write documentation inline using roxygen2 and ways to automate testing of your programs.
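A minimal sketch of both ideas (the automated test uses the testthat package, a common choice, though no specific testing tool is named above; the function is the Celsius converter from earlier):

```r
# R/c_to_f.R -- roxygen2 comments above a function become its help page
#' Convert Celsius to Fahrenheit
#'
#' @param c temperature in degrees Celsius
#' @return temperature in degrees Fahrenheit
#' @export
c_to_f <- function(c) c * 9 / 5 + 32

# tests/testthat/test-c_to_f.R -- an automated test of the function
library(testthat)
test_that("c_to_f converts known values", {
  expect_equal(c_to_f(0), 32)
  expect_equal(c_to_f(100), 212)
})
```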
Continuing with the online interactive theme, we'll explore the world of making Shiny apps: truly interactive applications in which backend R functions react to user inputs through a clean web interface, rendered with a minimal amount of code. See the shiny cheat sheet.
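A minimal sketch of a reactive Shiny app, where a slider drives a histogram:

```r
library(shiny)

shinyApp(
  ui = fluidPage(
    sliderInput("n", "Sample size:", min = 10, max = 500, value = 100),
    plotOutput("hist")),
  server = function(input, output) {
    # Re-runs automatically whenever input$n changes
    output$hist <- renderPlot(
      hist(rnorm(input$n), main = "Random normal sample"))
  })
```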
Because of Bren Group Project presentations, this class will be skipped. We've added an extra hour instead to the classes before and after.
You'll share your final project presentations in class, describing the scientific question asked, the methodological steps taken to gather and clean data, and the analytical steps and visualizations. This will be done with an Rmarkdown presentation with an embedded Shiny app, with all code made available in a Github repository (i.e. at your group's <org>.github.io site).