Hypothetical Protein Annotation Homework 12

The following analysis uses a gene expression dataset of nitrogen-deficient and nitrogen-sufficient (control) root and shoot tissues from Oryza sativa grown under phosphorus-deficient conditions, acquired from NCBI’s Gene Expression Omnibus (GEO accession: GSE73775).

Sort the genes by logFC in each of the 2 GSE73775 “FDR & fold-changes for all genes” files downloaded during Homework 10, using “Sort by smallest to largest” to rank them. Generate a .gmt file containing query sets of 500 genes each, drawn from the most positive and the most negative logFC values in each file. Finally, convert the root and shoot .txt files provided in Homework 10 into .gct files and prepare the matching .cls files for Gene Set Enrichment Analysis (GSEA).
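For anyone who prefers to script this step rather than do it in Excel, the ranking and file preparation above can be sketched in Python as follows. This is only an illustration under stated assumptions: the column names `ID` and `logFC`, the set names `ROOT_UP` and `ROOT_DOWN`, the class labels `NoNitrogen` and `Normal`, and all gene and sample names in the demo are hypothetical placeholders, so check them against the actual Homework 10 files. The sketch also assumes one "up" set and one "down" set of 500 genes per file.

```python
import csv
import io


def top_and_bottom_genes(rows, n=500, id_key="ID", fc_key="logFC"):
    """Return (most_positive, most_negative) gene IDs ranked by logFC."""
    if n <= 0:
        return [], []
    scored = []
    for row in rows:
        try:
            scored.append((float(row[fc_key]), row[id_key]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows whose logFC is missing or not a number
    scored.sort(key=lambda pair: pair[0])  # smallest to largest
    down = [gene for fc, gene in scored[:n] if fc < 0]
    up = [gene for fc, gene in reversed(scored[-n:]) if fc > 0]
    return up, down


def gmt_line(set_name, genes, description="na"):
    """One .gmt row: set name, description, then the genes, tab-separated."""
    return "\t".join([set_name, description, *genes])


def gct_text(sample_names, expression):
    """Build .gct text from (gene_id, description, values) tuples."""
    lines = [
        "#1.2",
        f"{len(expression)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description", *sample_names]),
    ]
    for gene_id, description, values in expression:
        lines.append("\t".join([gene_id, description, *map(str, values)]))
    return "\n".join(lines) + "\n"


def cls_text(labels):
    """Build categorical .cls text from one class label per sample."""
    classes = list(dict.fromkeys(labels))  # unique, in order of appearance
    header = f"{len(labels)} {len(classes)} 1"
    return "\n".join([header, "# " + " ".join(classes), " ".join(labels)]) + "\n"


# Tiny made-up table standing in for one "FDR & fold-changes" file.
table = "ID\tlogFC\ngeneA\t2.5\ngeneB\t-1.2\ngeneC\t0.4\ngeneD\t-3.0\ngeneE\tNA\n"
rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
up, down = top_and_bottom_genes(rows, n=2)

root_gmt = gmt_line("ROOT_UP", up) + "\n" + gmt_line("ROOT_DOWN", down) + "\n"
root_cls = cls_text(["NoNitrogen", "NoNitrogen", "Normal", "Normal"])
root_gct = gct_text(["s1", "s2", "s3", "s4"],
                    [("geneA", "na", [1.0, 2.0, 3.0, 4.0])])
print(root_gmt, root_cls, root_gct, sep="\n")
```

With real data, the same functions would be called once per tissue with `n=500`, and the resulting text written to `.gmt`, `.gct`, and `.cls` files.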

To save time, the relevant files are attached. Run GSEA on these files using the following parameters instead of the defaults:

  • NoNitrogenvsNormal for phenotype labels
  • No_Collapse
  • Gene_set for Permutation type
  • tTest for Metric for ranking genes
  • Save results in desired location



Use the appropriate .tsv files to collect the genes that GSEA identified as part of the leading edge for each analysis into a single Excel file. Use “Conditional Formatting” to identify duplicates between root and shoot, separately for over-expressed and under-expressed genes. Generate 2 lists containing these duplicates, one for over-expressed genes and one for under-expressed genes. Also open the platform file from Homework 10 in Excel and compare the 2 generated lists against it, again using “Conditional Formatting” to identify duplicates. Next, use “Custom Sort” by cell color to separate duplicates from non-duplicates so that the non-duplicates can be removed. Finally, apply “Sort by smallest to largest” to the Description column of the remaining duplicates to make hypothetical proteins easier to find. Use these results to answer the following questions:
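As an optional cross-check on the Excel steps above, the same comparison can be sketched in Python. Treat this as a rough illustration rather than the required method: the column names `SYMBOL` and `CORE ENRICHMENT` are assumptions about the GSEA .tsv layout and may differ between GSEA versions, the platform file is assumed to map a gene ID to its Description, and every gene name in the demo is made up.

```python
import csv
import io


def leading_edge(tsv_text, gene_col="SYMBOL", core_col="CORE ENRICHMENT"):
    """Return the set of genes flagged 'Yes' in the core-enrichment column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row[gene_col]
        for row in reader
        if (row.get(core_col) or "").strip().lower() == "yes"
    }


def shared_hypotheticals(root_genes, shoot_genes, platform,
                         keyword="hypothetical"):
    """Genes in both leading edges whose description contains the keyword."""
    shared = root_genes & shoot_genes  # the "duplicates" between tissues
    hits = [
        (gene, platform[gene])
        for gene in shared
        if gene in platform and keyword in platform[gene].lower()
    ]
    # Sort by description, then gene ID, mirroring the Description sort.
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Made-up stand-ins for one root and one shoot gene-set report.
root_tsv = ("NAME\tSYMBOL\tCORE ENRICHMENT\n"
            "row_0\tgeneA\tYes\nrow_1\tgeneB\tYes\nrow_2\tgeneC\tNo\n")
shoot_tsv = ("NAME\tSYMBOL\tCORE ENRICHMENT\n"
             "row_0\tgeneA\tYes\nrow_1\tgeneB\tYes\nrow_2\tgeneD\tYes\n")

# Made-up stand-in for the platform file (gene ID -> Description).
platform = {
    "geneA": "hypothetical protein",
    "geneB": "nitrate transporter",
    "geneD": "hypothetical protein",
}

root_le = leading_edge(root_tsv)
shoot_le = leading_edge(shoot_tsv)
hits = shared_hypotheticals(root_le, shoot_le, platform)
print(hits)
```

Running it once on the over-expressed reports and once on the under-expressed reports would give the 2 lists described above.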