This genome annotation workflow was created by Idaho EPSCoR GEM3 participants. The purpose of this workflow module is to share resources and processes for genome annotation using publicly available data.
Over the last decade, affordable access to sequencing platforms has expanded the availability of quality genomic datasets in publicly available repositories. The challenge is no longer sequencing, but the manual and computational effort to sift through these repositories to mine and analyze available genomes.
This workflow aims to provide a template for individuals to manually annotate genes of interest from non-model species using publicly available genomic data. To do this, we aim to walk readers through the process with a specific example: the manual annotation of toll-like receptors (TLRs) in Falco peregrinus and Falco rusticolus.
Disclaimer: This workflow was developed in August 2023 with bioinformatic and genomic research recommendations at that time. Please note that program versions and recommendations will likely change with time due to the nature of this quickly evolving field.
You will need access to a High Performance Computing (HPC) machine, Geneious software, and R for downstream analyses. Knowledge of job script format, how to submit jobs, and how to compute in an interactive node specific to your institution’s HPC will also be necessary.
Use the upper navigation tabs in the following order:
Slurm Basics
Review a quick breakdown of slurm command line for submission of SBATCH files.
If your HPC uses a different job managing system, review alternative criteria specific to your institution.
HPC Cluster Workspace
Learn or refresh your command line basics with a tutorial.
Organize your cluster work space.
Confirm required mining modules are available on your institution’s HPC.
Identify functional genes of interest in model species
Select a model organism that closely relates to your non-model species.
Use the KEGG database to search and find function genes of interest in the selected model organism.
Transfer Data to the HPC cluster
Mine the Genome Using UNIX/LINUX Command line
Geneious Project Preparation
Manual Annotation
This project was funded with support from National Science Foundation grant No. OIA-1757324: RII Track-1: Linking Genome to Phenome to Predict Adaptive Responses of Organisms to Changing Landscapes.
Mentorship was kindly provided by Stephanie Galla, with support from the Conservation Genetics Laboratory at Boise State University.
Workflow development was adapted from a collaborative genomics effort to understand CYP450 gene evolution led by Matthew Holding. The collaborative project was funded with support from the National Science Foundation grant No. OIA-1826801: RII Track-2 FEC: Genomics Underlying Toxin Tolerance (GUTT): Identifying Molecular Innovations that Predict Phenotypes of Toxin Tolerance in Wild Vertebrate Herbivores.
Access to Geneious software for the project described was supported by an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences of the National Institutes of Health under Grant #P20GM103408.
Many thanks to Jenn Foreby, Denise Pfiefer, Tami Noble, Stephanie Sevigny, Rick Shumaker, Elise Conner and the GEM3 collaborators for continued efforts to support undergraduate research experiences at Primary Undergraduate Institutions.
This work is licensed under a Creative Commons Attribution 4.0 International License.