Add all remainder blogs from google site
Showing 36 changed files with 524 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
More online computational training for life scientists! | ||
08/06/22 Emma Rand | ||
|
||
The Cloud-SPAN team would like to let you know about more opportunities for online training! Ed-DaSH, from the University of Edinburgh, is a Data Science training programme for Health and Biosciences funded under the same scheme as Cloud-SPAN (UKRI innovation scholars award). Like Cloud-SPAN, Ed-DaSH is partnered with the Software Sustainability Institute, and you will find similarities in our approach to teaching computational topics to life scientists. Their upcoming workshops are: | ||
|
||
13:00-17:00 14-17 June – FAIR in (Biological) Practice | ||
|
||
10:00-13:00 5-8 July – Introduction to Statistics with R (this is a course written by Cloud-SPAN's Emma Rand and University of York Biology PhD student Ezra Herman) | ||
|
||
09:30-13:00 26-29 July – High dimensional statistics with R | ||
|
||
13:00-16:00 23-26 August – Machine learning | ||
|
||
You can register via the University of Edinburgh ePay system. The courses are free but require a refundable deposit. Contact them at [email protected] or on Twitter @EdDaSH_Training with any questions. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
Cloud-SPAN April Code Retreat | ||
14/06/22 Evelyn Greeves | ||
|
||
In April we held our first code retreat event: a chance for course alumni to come together and work on their own data problems with the support of Cloud-SPAN instructors. The event took place at the University of York, with several participants travelling from other institutions for the day. | ||
|
||
Some people took the opportunity to revisit course materials and ask questions about the topics they didn’t understand. There were plenty of helpers on hand to answer questions and test understanding. | ||
|
||
Others chose to apply the workflows and analyses taught in the Genomics course to their own datasets. Again, helpers were on hand to discuss topics such as: | ||
|
||
how to organise bioinformatics projects; | ||
|
||
which tools are most effective; | ||
|
||
how to approach a problem; | ||
|
||
what analysis is best for a certain type of data; | ||
|
||
as well as many others. | ||
|
||
Some participants knew what help they needed and had specific questions to address during the day. Those with a less clear understanding of their problem benefited from talking through their data and getting guidance from our experienced instructors. Some even tried out new software tools not discussed during the course. | ||
|
||
Finally, some participants decided to trial our new self-study course on creating your own Amazon Web Services cloud instance. They were able to ask questions about the content and provided valuable feedback on which parts of the course needed improvement. | ||
|
||
Everyone enjoyed the chance to meet new people and find out about each other’s research. It was a great chance to network and build a sense of community amongst course alumni. | ||
|
||
Our next code retreat for Cloud-SPAN course alumni will be on July 6th at the University of York. We cover travel expenses and lunch is provided. We’re looking forward to meeting more of our community members and providing valuable one-to-one support! | ||
|
||
Contact us at [email protected] to sign up. | ||
|
||
|
||
Image: Instructors and course alumni at April's code retreat. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
Bioinformatics Meeting on ‘Career pathways into bioinformatics’ | ||
22/06/22 Sarah Dowsland | ||
|
||
Join the University of York's Bioinformatics Meeting on ‘Career pathways into bioinformatics’. | ||
|
||
Hosted by Sarah Forrester, we have a jam-packed hour and a half. | ||
|
||
Evelyn Greeves from the Cloud-SPAN team will be leading a short session on "Introduction to FAIR and metadata" with an opportunity for you to ask any questions that you may have. | ||
|
||
Following this, have you pondered what direction to take your career or how to use your data skills in future projects? We have 3 speakers who explain how data analysis has been incorporated into their work. | ||
|
||
🔸 Emma Rand highlights the different paths into academia and using big data skills. | ||
|
||
🔸 James Chong explains how learning bioinformatics was the only way to get past the bottleneck of not being able to analyse the data he was generating. | ||
|
||
🔸 Sarah Forrester explores how bioinformatics opens doors to moving between different research niches. | ||
|
||
We will then have a discussion with all speakers for the remainder of the session, which will include signposting resources. | ||
|
||
Session slides will be available following the event. | ||
|
||
Event details: Wednesday, 6th July, 15:00-16:30 in room B/T/019 University of York. | ||
|
||
Contact us on [email protected] to be added to the Bioinformatics regular mailing list! | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
FAIR at Cloud-SPAN | ||
29/06/22 Evelyn Greeves | ||
|
||
Previously we shared some information about FAIR data, and explained why it’s important to make sure your data is as reusable as possible. | ||
|
||
The FAIR principles aren't just for data. Our aim is to apply the principles to all our training resources to ensure that they can be reused and remixed by others for their own teaching purposes. Here's a look at how we're doing it: | ||
|
||
Findable | ||
Remember, findability is about making it easy to find your data or resource. We’ve added metadata to our resources, which enables us to register our courses with TeSS, a life sciences training repository. The metadata means people can search and filter to find our course based on what they need. You can see the metadata for our Prenomics course at the top of the source page here. | ||
|
||
In addition, we have also registered our training resources on Zenodo, another repository which assigns a DOI to each stored item. This persistent identifier will give our resources a permanent home, even after other links become deprecated. | ||
|
||
Accessible | ||
To be accessible it needs to be easy to retrieve a resource without any special tools. It should also be clear how to do this. We’ve made this really easy for ourselves by hosting our courses online for free on a dedicated set of webpages via GitHub Pages. | ||
|
||
Interoperable | ||
Interoperability means ensuring that computers can understand and open a resource. We do this by providing data for analysis in de facto file standards such as FASTQ and using Markdown (a widely-used and platform-independent text formatting language for writing resources) for course material. | ||
|
||
We also help computers to understand how our resources fit into a bigger picture by using an ‘ontology’ to describe the topics of our courses. This forms part of the metadata and helps people to filter and understand what our resources are about. For example, we use the EDAM ontology of bioscientific data analysis and data management. We've labelled our Prenomics course as falling under topics 3372 (software engineering) and 0622 (genomics). | ||
|
||
Reusable | ||
All of the things just described help promote reusability. In particular, we promote reusability by tagging our resources with rich metadata: we use the Bioschemas TrainingMaterial profile, which suggests a list of metadata properties for biosciences training materials. | ||
|
||
We also help people reuse our materials by applying a Creative Commons Attribution (CC-BY) licence, which means anyone can distribute, remix, adapt or build on our work as long as they credit us. We include details of this licence in our metadata, in our GitHub repositories and at the bottom of our course pages so it’s clear to everyone what the rules are. | ||
|
||
Over to you! | ||
What steps are you taking to make your data and other digital resources as FAIR as possible? There are some great resources available online to help you if you're not sure where to start - try howtofair.dk or the FAIR Cookbook for helpful articles, videos and step-by-step guides! | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
Making the Prenomics summary poster | ||
06/07/22 | ||
|
||
I recently designed a poster that acts as a “cheat sheet” for Cloud-SPAN’s Prenomics course. Here I’ll give a quick overview of my thought process while making the poster and how I chose what to include. | ||
|
||
Command cheatsheet | ||
I started with the command cheatsheet, which is a bit back-to-front given that the command line comes chronologically last in the course. However, it seemed like an easy entry point. I had previously produced a glossary for Prenomics, so I took this and organised the commands into logical groupings based on their function. | ||
|
||
I ended up with four groups: navigating files, viewing files, editing files and searching files. There were two commands which didn’t fit into these groups, which were history and man. I tried to find a way to include them, but ultimately left them out as I decided they were not crucial knowledge. | ||
|
||
I tried to format the commands in a way that made it clear how to use them while still keeping it general and easily readable. I used colour to clarify which parts of the command corresponded to which parts of the explanation. For example: | ||
|
||
Example from poster showing the command `mv file directory` and the explanation 'move file to directory'. | ||
In this case ‘file’ (the file to be moved) is pale blue while ‘directory’ (the location to which the file should be moved) is dark blue. I also used underlining to indicate where the command came from - in this case, mv comes from the m and the v of ‘move’. My aim here was to aid recall of the command in future, as hopefully it will help the reader make a stronger mental connection between the command mv and its function (‘move a file’). | ||
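To make the `mv file directory` pattern concrete, here is a minimal sketch that can be run safely in a scratch directory. The names `notes.txt` and `results` are hypothetical and chosen only for illustration; they do not come from the poster or the course.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout: notes.txt and results/ are illustrative only.
set -eu

workdir=$(mktemp -d)        # scratch directory so nothing real is touched
cd "$workdir"

mkdir results               # the 'directory' in `mv file directory`
echo "example" > notes.txt  # the 'file' in `mv file directory`

mv notes.txt results        # move file to directory

ls results                  # lists notes.txt
```

After the `mv`, the file no longer exists in the starting directory; it lives only inside `results`.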
|
||
Files, paths and file types | ||
The command line makes up a significant portion of the Prenomics course, but it could be summarised in a relatively small amount of space. To work out what else should be included, I looked at the course more holistically. The first lesson of the course is about files and directories, so it was clear to me that there needed to be a section about this. This lesson also covers the file types used in the course, including .FASTQ and .PEM files which are likely to be new to most learners, so I wanted to include this too. | ||
|
||
I found it difficult to summarise the information about file paths and directories into a mostly graphical format. In particular, I found that I couldn’t rely on giving examples as much as we do in the course itself, due to the need to reduce text as much as possible. I ended up just including definitions of absolute and relative paths, along with a diagram of the file system inside the Cloud-SPAN AWS instance. | ||
|
||
If I’d had space, I would have included more information on working directories and examples of how the file system diagram can be represented with file paths. | ||
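As a rough illustration of how a file system diagram maps onto file paths, the sketch below builds a small made-up directory tree and reaches the same file by a relative and an absolute path. The layout (`project/data/reads.fastq`) is an assumption for the example and is not the actual structure of the Cloud-SPAN AWS instance.

```shell
#!/usr/bin/env bash
# The directory layout here is invented for illustration.
set -eu

root=$(mktemp -d)                       # absolute path to a scratch directory
mkdir -p "$root/project/data"
touch "$root/project/data/reads.fastq"

cd "$root/project"                      # working directory is now .../project

# Relative path: interpreted starting from the current working directory.
ls data/reads.fastq

# Absolute path: begins with / and starts from the root of the file system,
# so it resolves to the same file from any working directory.
cd /
ls "$root/project/data/reads.fastq"
```

The relative path only works while the working directory is `project`; the absolute path keeps working after `cd /`.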
|
||
Why use…? | ||
The next sections I designed were the ‘Why learn command line?’ and ‘Why use the cloud?’ sections. | ||
|
||
The question of why the command line is useful is covered in episode three of the first day of Prenomics, as part of the introduction to the shell. I distilled the reasons given here down into four main themes: automation of repetitive tasks, reducing human error (as a result of automation), improving reproducibility and the ability to access new tools (either because the command line offers more functionality or because it opens up use of high performance computing systems (HPC) such as the cloud). | ||
|
||
Extract from poster section 'Why learn command line?'. Icons and text give four reasons: improve reproducibility, reduce human error, access new tools, and automate repetitive tasks. | ||
The other question, why cloud computing is useful, is not technically covered in the Prenomics material. However, it is discussed in our Genomics course, where the three reasons given for using HPC are lack of resources needed to run analyses, analyses taking a long time to run and problems installing software. These reasons align closely with three of the challenges identified by Cloud-SPAN project lead, James Chong, as facing the field of metagenomics (hardware, time and software). | ||
|
||
However, these reasons apply to any kind of HPC resource, not just the cloud. I wanted to include reasons specific to the cloud. The two major reasons I came across were the ability to share software or data containers across different institutions, and use of the cloud when other HPC is inaccessible. | ||
|
||
In an earlier version of the poster, the ‘why use cloud?’ section was framed in terms of challenges that users might face, such as long analysis times or issues installing software. I reworded this section to match the framing of the ‘why learn command line?’ section; that is, a solutions-oriented summary, with ‘shorten analysis times’ replacing ‘long analysis times’ and ‘use pre-installed software’ replacing ‘issues installing software’. | ||
|
||
Extract from poster section 'Why use the cloud?'. Icons and text give five reasons: access more hardware resources, use pre-installed software, shorten analysis time, share software or data containers, and overcome barriers to accessing high performance computing | ||
File types | ||
Lastly I wanted to include a brief summary of the file types used in the Prenomics course, as it is likely that two out of three of these will be new to learners. This part was quite easy - I just wrote a short sentence to describe each file type and paired it with the relevant file extension. | ||
|
||
In the course a significant amount of time is dedicated to introducing the .fastq file structure, which codes sequencing data into a text format with four lines per read. I considered including this information but I didn’t have room. | ||
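For readers who want to see the four-lines-per-read layout, here is a minimal sketch using two invented reads. The read names, bases and quality strings are made up for illustration; real FASTQ files from a sequencer will have much longer records.

```shell
#!/usr/bin/env bash
# Two made-up reads; each read occupies exactly four lines.
set -eu

fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
GATTACA
+
IIIIIII
@read2
ACGTACG
+
#######
EOF

# Line 1: identifier (starts with @)   Line 2: the sequenced bases
# Line 3: separator (starts with +)    Line 4: one quality character per base
head -n 4 "$fastq"

# Because every read takes four lines, the read count is the line count / 4.
lines=$(wc -l < "$fastq")
reads=$((lines / 4))
echo "reads: $reads"
```

Dividing the line count by four is a common quick check, though it assumes the file is well formed with no wrapped sequence lines.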
|
||
The big reveal... | ||
And finally, here's the full poster! You can download a high-res version of the poster for your own enjoyment here. | ||
|
||
An image of the finished poster with sections: why learn command line?, why use the cloud?, file types, files and paths, and command cheatsheet | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
Challenges in Environmental 'Omics | ||
13/07/22 Evelyn Greeves | ||
Hardware | ||
The size and nature of ‘omics data means it is often necessary to employ high performance computing (HPC) resources for analysis. This presents an inherent challenge as use of such resources requires a specific skill set that not all researchers will have (see ‘skills’ below for more details). | ||
|
||
A secondary challenge relating to hardware is the rapidly changing HPC landscape. Between institutions HPC architectures can vary wildly. Although the basic skills needed to access them remain the same, the setup and execution of jobs may look quite different. Even within institutions, HPC setups mature and are replaced regularly as new technologies develop and the demand for resources grows. For example, the Biology department at the University of York has had access to three different setups (c2d2, YARCC and Viking) in the last nine years, with a new iteration (Viking2) currently in the works. This frequent turnover requires users to continually adjust and adapt their workflows to the new system. | ||
|
||
Software | ||
There are several issues surrounding the software involved in analysis of environmental omics datasets. Firstly, software tends to have a steep learning curve, requiring a substantial time investment for researchers. This investment will not necessarily always pay off, if the end result is not what is required. | ||
|
||
Secondly, even if a piece of software does do what is needed, it is not guaranteed that it will be usable on the HPC architecture available. Installation of software is not always straightforward, if it is allowed in the first place. The rapid turnover and replacement of HPC architectures only serves to compound this problem, and the heterogeneity of HPC setups between institutions makes it difficult to find bespoke instructions for software installation. | ||
|
||
The final, broader issue is around access to learning resources and tutorials. Some popular, non-field-specific tools such as R or Python have countless online tutorials and instructions dedicated to their use, aimed at all different levels of understanding. Others, especially more niche software programs, have very few resources. Those that do exist may be out of date, or assume a level of knowledge beyond that of most novices (for example, many documentation pages are entirely inaccessible to a newcomer). As new software emerges and supersedes previously popular programs, the lack of help available only worsens. | ||
|
||
Skills | ||
As previously mentioned, environmental omics analysis has a steep learning curve. A major challenge for many new researchers is grappling with previously unencountered skills such as using the UNIX command line, navigating file systems, writing shell scripts, managing dependencies and specifying resources for HPC. This is all before any specific pieces of software are involved, each of which will require its own set of skills and understanding. | ||
|
||
These skills are required on top of the experimental design and data collection skills needed to generate datasets in the first place. Often those collecting data are the ones best placed to know how to interrogate it, as all experiments are different and bespoke analysis is crucial. This requires researchers to learn and juggle a large collection of skills, not all of which are immediately relevant to their chosen area of study. | ||
|
||
Time | ||
Finally, there are time investments involved in all of the above challenges. There is the ‘brain time’ involved in learning new skills, problem-solving and working with new software. Then, once an analysis is ready to run, it will take time to run. HPC resources are usually shared across many users, with jobs being added to a queue to run when resources are available. Analyses requiring large amounts of compute may be queued for days or weeks waiting for the required resources to become available. In addition, some analyses take a long time to run given the size of the datasets involved and the complexity of the analysis. | ||
|
||
Once analysis is completed, time must be invested in interpreting and visualising the results. If parameters need to be adjusted following this, then the whole process must begin again. This makes optimisation of analysis difficult and time-consuming to the point that it may not even happen at all. | ||
|
||
At Cloud-SPAN our goal is to help you overcome these challenges. Read more about the courses we offer or take a look at our introductory 'Prenomics' course materials or specialised Genomics course to see how our training can equip you better! | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
💸Scholarships to attend the Northern Bioinformatics User Group meeting 9th September | ||
18/07/22 Emma Rand | ||
|
||
Hello Cloud-SPAN community! | ||
We have scholarships to cover expenses to attend the Northern Bioinformatics User Group (Northern BUG) meeting to be held on September 9th 2022 at the University of Bradford. | ||
|
||
NorthernBUG is a network of bioinformaticians and users of bioinformatics services in the north of England which holds quarterly meetings to build a community of researchers and others using big data in biology. NorthernBUG meetings are open to anyone interested in bioinformatics or its application in life science research and beyond. Meetings are free and early career researchers are especially encouraged to attend and present their work. It’s a great forum to practise your talks, float new ideas and approaches, and present early work. | ||
|
||
One of my project students, Chloe Brook, presented her work there: | ||
|
||
|
||
She is now at the Edinburgh Parallel Computing Centre. | ||
|
||
We have scholarships to cover travel expenses for five of our Cloud-SPAN 'graduates' to attend NBUG. Register for NBUG here and complete a scholarship application with us here by Monday 22nd August. | ||
|