Gene Validation and Remodelling Using Proteogenomics of Phytophthora cinnamomi, the Causal Agent of Dieback

Phytophthora cinnamomi is a pathogenic oomycete that causes plant dieback disease across a range of natural ecosystems and in many agriculturally important crops on a global scale. An annotated draft genome sequence is publicly available (JGI Mycocosm) and suggests 26,131 gene models. In this study, soluble mycelial, extracellular (secretome), and zoospore proteins of P. cinnamomi were exploited to refine the genome by correcting gene annotations and discovering novel genes. By implementing the diverse set of sub-proteomes into a generated proteogenomics pipeline, we were able to improve the P. cinnamomi genome annotation. Liquid chromatography mass spectrometry was used to obtain high confidence peptides with spectral matching to both the annotated genome and a generated 6-frame translation. Two thousand seven hundred sixty-four annotations from the draft genome were confirmed by spectral matching. Using a proteogenomic pipeline, mass spectra were used to edit the P. cinnamomi genome and allowed identification of 23 new gene models and 60 edited gene features using high confidence peptides obtained by mass spectrometry, suggesting a rate of incorrect annotations of 3% of the detectable proteome. The novel features were further validated by total peptide support, alongside functional analysis including the use of Gene Ontology and functional domain identification. We demonstrated the use of spectral data in combination with our proteogenomics pipeline can be used to improve the genome annotation of important plant diseases and identify missed genes. This study presents the first use of spectral data to edit and manually annotate an oomycete pathogen.

implementing the diverse set of sub-proteomes into a generated proteogenomics pipeline, we were 48 able to improve the P. cinnamomi genome. Liquid chromatography mass spectrometry was used to 49 obtain high confidence peptides with spectral matching to both the annotated genome and a generated 50 6-frame translation. 2,764 annotations from the draft genome were confirmed by spectral matching. 51 Using a proteogenomic pipeline, mass spectra were used to edit the P. cinnamomi genome and allowed 52 identification of 23 new gene models and 60 edited gene features using high confidence peptides 53 obtained by mass spectrometry, suggesting a rate of incorrect annotations of 3% of the detectable 54 proteome. The novel features were further validated by total peptide support, alongside functional 55 analysis including the use of Gene Ontology and functional domain identification. We demonstrated 56 the use of spectral data in combination with our proteogenomics pipeline can be used to improve the 57 genome of important plant diseases and identify biologically relevant missed genes. This study presents 58 the first use of spectral data to edit and manually annotate an oomycete pathogen. 59 Background 60 The primary role of a genome sequence is to elucidate the entire set of genes expressed by an organism. 61 In silico prediction platforms are the main methods for predicting reliable gene sets. However, they can 62 be problematic as transcriptome data does not always correlate with the protein products and their 63 abundance [1]. Curating genes correctly and accurately is fundamental in defining the biochemical 64 composition of an organism [2]. Sequence transcripts and orthologues from related and similar 65 organisms are the primary methods in accurately predicting such genes and identifying interesting and relevant biological components [3]. Evidence-based curation includes transcript data and associated Proteomics data has proven a useful tool to improve the genome annotation of several phytopathogens 117 where high quality mass spectra complemented transcriptomic data and identified potential annotation 118 inaccuracies of exon boundaries and unsuspected gene models. We aimed to use spectral data from 119 several P. cinnamomi sub-proteomes to assist in gene calling. These sub-proteomes represent a wide 120 coverage of the P. cinnamomi proteome and include a diverse repertoire of soluble proteins. Zoospores 121 characterize the infective life stage and the extracellular proteome is likely to contain proteins related 122 to virulence. These were analysed by 2D LC-MS/MS and resulting spectra were matched to the current 123 gene prediction models. To generate a list of peptides which potentially do not match current models, 124 a 6-frame translation was generated and used for spectral matching. A list of peptides indicating 125 potential altered or novel gene models was generated using the genomic coordinates and peptide open 126 reading frames. These were subsequently used to carefully manually edit current annotations and 127 curate novel features on a homology basis with proteins of similar species. Using this proteomics 128 dataset, we refined the genome for downstream proteomic work which will aid the identification of 129 virulence factors and metabolic targets for chemical control. By working towards completing the P. 130 cinnamomi genome, downstream proteomic work will be more accurate as the gene set is more 131 representative of what is being expressed. There is also the potential for effector virulence gene 132 discovery and improved biochemical characterisation which can lead to development of resistance 133 gene inclusion in hosts and more targeted methods of chemical control. 134

Protein extraction 148
Mycelia and zoospores were ground using mortar and pestle in liquid nitrogen and extraction buffer 149 used to extract and solubilise proteins as previously described [28]. Samples were kept on ice for 30 150 minutes with regular gentle mixing and centrifuged at 20,000g at 4°C for 30 minutes. The protein 151 solutions were subsequently desalted and protein amount was estimated using Direct Detect cards 152 (Merck Millipore, Darmstadt). All samples were freeze dried before further processing. SDS-PAGE was 153 performed for all samples to ensure proteolysis was minimal. 154

Sample preparation 155
To visualise each sub-proteome, 20 μg of each sample was loaded onto a 1D SDS-PAGE. To determine 156 the amount of intracellular contamination in the extracellular proteome, the activity of an intracellular 157 enzyme marker glyceraldehyde phosphate dehydrogenase (GAPDH) was assayed on each sub-158 proteome as per the manufacturer's instructions (Sigma, St Louis). 500 ug of each sample was 159 resuspended in 250 uL 0.5 M triethylammonium bicarbonate (pH 8.5) before reduction and alkylation 160 with 25 uL of 50 mM tris(2-carboxyethyl)phosphine (Thermo Scientific, Waltham) and 12.5 uL 200 mM 161 methyl methanethiosulfonate (Sigma, St Louis) respectively. Samples were digested overnight at 37°C 162 with trypsin (Sigma, St Louis) at a ratio of 1:10, subsequently desalted on a Strata-X 33 um polymeric 163 reverse phase column (Phenomenex, Torrance, CA, USA) and dried in a vacuum centrifuge. 164 High pH reverse phase chromatography 165 Dried peptides were separated by high pH reverse phased liquid chromatography on an Agilent 1100 166 HPLC system using a Zorbax Eclipse column (2.1 X 150 mm, 5 um) (Agilent Technologies, Palo Alto). 167 Peptides were eluted with a linear gradient of 20 mM ammonium formate pH 10, 90 % acetonitrile over 168 80 minutes. A total of 98 fractions were collected, concatenated into 12 fractions based on collection 169 order and dried in a vacuum centrifuge. The UV trace was also used to visualise the total peptide 170 content and depth of each sub-proteome.  Peptides that did not return significant results were not used for this analysis. All significant BLAST hits 213 were transferred onto the Phytophthora draft genome and manually edited to comply with sequence 214 features such as start/stop codons and non-sequenced regions. The novel annotated genes were further analysed for total number of supporting peptides (as per Protein Pilot methods described 216 above). Genes that had only one high confidence peptide were included for the purposes of gene 217

discovery [31]. Protein Family domains (PFAM), Gene Ontology (GO) terms and Kyoto Encyclopedia of 218
Genes and Genomes (KEGG) were also assigned using Interpro scan (version 5.44-79.0, 2020) using and 219 EGGNOG-mapper (version 2, 2019) using default parameters. To determine whether any pathogenesis 220 related proteins were present within the novel set, annotations were analysed for presence of any 221 potential virulence factors using PHI-BASE (version 4.9, 2020) using default parameters [32]. The Codon 222 Adaptation Indexes (CAIs) of each novel gene were also calculated using Emboss CAI (version 6.6, 223 default parameters), which indicated gene annotation with anomalous usage of codons [33]. 224 Results 225

Sub-proteome enrichment 226
To obtain a representative proteome of P. cinnamomi, vegetative mycelia and transient short-lived 227 zoospores of P. cinnamomi were used as these are the dominant cell types that grow and initiate 228 infection in hosts. In addition, we extracted soluble secreted proteins (secretome) from the mycelia, 229 which are widely studied due to their implications on pathogen-host interactions. The purity of the 230 mycelia and zoospores was observed under a stereoscope ( Figure 1). Figure 1A shows no evidence of 231 intercellular contamination and demonstrated the purity of these cell types. The large mass of mycelia 232 had not produced zoospores or their precursor (the sporangia) in this method of in vitro cell culture. 233 Similarly, vegetative mycelia was not observed in the zoospore preparation ( Figure 1B). 234 1D SDS-PAGE was run to visualize the sub-proteomes of each cell type ( Figure 2A). The banding patterns 235 of each sub-proteome show differences in total protein content. The extracellular proteome showed 236 enrichment in lower molecular weight proteins whereas the mycelia and zoospores had proteins that 237 spanned over the whole mass range. To test the purity of the secretome, an enzyme activity assay of 238 the cytoplasmic marker GAPDH was measured, which should only be present in small amounts ( Figure  239 3b) [34]. Both the mycelia and zoospores had similar detected amounts of GAPDH detected, 240 approximately 4.7 and 4.8 mU/mg protein, respectively. GAPDH was also detected in the secretome, 241 however at lower amounts (1.6 mU/mg protein). The RP-HPLC UV total ion count traces indicated 242 differing protein content between the three sub-proteomes, as majority of the peaks do not match in 243 intensity and retention time (Figure 3c). The majority of proteins detected in the mycelial and zoospore 244 were localised intracellularly at 45% and 41%, respectively as predicted by WolfPSORT ( Table 1). The 245 secretome was enriched in extracellular localisation proteins with a predicted 18% compared to 5% in 246 both the mycelia and zoospores. 247 Validation of V1 gene models using sub-proteome spectra 248 The mass spectra were used to validate the draft annotation of the P. cinnamomi genome. The 249 annotations acquired from JGI Mycocosm (assembly annotation version 1.0) were designated in this 250 study as 'V1' and the annotation set containing subsequently manually edited loci was designated 'V2'. 251 Non-redundant peptide matches (at least two 95% confident peptides) resulted in 2,554, 1,362, and 252 2,304 proteins from the mycelia, secretome and zoospores respectively. From this data, 2,764 unique 253 proteins from the V1 predicted gene set were identified (Figure 4). 526, 215 and 432 proteins were 254 unique to the mycelia, secretome and zoospores respectively, which implies a wide range of the whole 255 proteome detected. The mycelia and zoospores had more unique protein identifications than the 256 secretome, which may be a result of an expected lower mass range of an extracellular proteome that 257 were below the acquisition detection limits. 258 When matched to 4,874,027 generated open reading frames (ORF) of the 6-frame translation, 2,752, 259 1,355, and 2,334 ORFs from the mycelia, secretome and zoospores were identified (Table 2). Although 260 this does allow us to match more peptides to the genome than the V1 annotation, some level of 261 redundancy is expected from matching to reading frames that do not form genes. The false discovery 262 rate for all mass spectra analysis was <0.1% using the Protein Pilot decoy database method, which is 263 within the limits of the general consensus for large scale proteomic data [35,36]. Of the V1 detected 264 by mass spectrometry, 2,398 had additional support by assigned GO terms and/or PFAM domain.

Annotating new gene models by homology criteria 266
Although there is peptide support for a large number of the V1 genes, it is expected that there are some 267 forms of incorrect intron and exon boundary annotations that can be detected using spectral data. In These were considered as candidates for new gene models. 277 To select peptide candidates that would likely result in alteration of V1 genes and curation of new 278 genes, Blastp was used. Peptides that returned significant hits to other Phytophthora species were used 279 to manually edit and curate new genes (Table 4). This largely reduced the number of potential edited 280 and new genes due to both the redundancies of 6-frame peptides and rigorous Blastp parameters used 281 for peptide matches. Of those with conflicting boundaries, 70 peptides showed significant homology to 282 other Phytophthora species. Of the peptides that were further than 200 bp from any gene, a total of 283 118 peptides returned significant BLASTp hits, suggesting the presence of previously unannotated 284 genes on the P. cinnamomi genome. The homologous sequences were transferred onto the P. 285 cinnamomi genome and the annotations were manually integrated, taking into consideration 286 differences in the genome and features such as introns. 287 Using these criteria, a total of 60 genes were edited, which equates to an error rate of approximately 288 2% of the detected proteome. The CDS coordinates of the edited genes are shown in Additional file 1. 289 Of these, 44 were modified by extending the exon boundaries and there were 16 instances of merged 290 genes. Additionally, 23 new previously undefined genes were annotated (Table 5). These annotations 291 were uploaded to the GenBank under accessions MT820663-MT820655. The edited annotations will 292 be referred to by original annotation identification with 'V2' suffixed, as listed in the Additional files 2 293 and 3, respectively. In summary, we identified errors in 60 V1 genes which were manually altered and 294 added a further 23 annotations to the gene set of P. cinnamomi. 295 Validating edited and new genes 296 The edited genes were subsequently analysed for total peptide support and differences in functional 297 assignment compared to the original annotation. Peptides within the edited regions were manually 298 counted (Table 6). Of the extended genes, only one had no other supporting information other than 299 the support of one 95% confident peptide in the extended portion of the gene (e_gw1.28.366.1_V2). 300 All other extended genes had support from more than two high confidence peptides and/or 301 homologous functional assignment. Similarly, only one merged gene had a single peptide supporting 302 the merged region of the annotation (gw1.160.19.1_V2). All others were supported by two or more 303 high confidence peptides, which is the general requirement for protein identification in proteomics 304 [37]. Genes were analysed for GO terms, PFAM domains and KEGG orthologues (KO) to determine 305 whether the altered boundaries change their functional annotation assignment ( Table 6). Details of 306 each functional annotation are shown in Additional files 2 and 3. 307 The original mass spectra were matched to the set of new genes (using Protein Pilot-see methods) to 308 determine how many peptides supported each gene (i.e. determine if any genes were a product of 309 single peptide matches) ( Table 7). Of the 23 new genes identified, one new gene had support from only 310 one high confidence peptide (MT820633). All new genes were detected in the mycelia and most were 311 also identified in the secretome and zoospore (Additional file 4). The remaining 22 genes had at least 312 two or more supporting peptides. 313 To further support this new gene set, protein sequences were analysed for protein function by 314 assignment of PFAM domains, GO terms and KO assignment ( Table 7). Details of these annotations for 315 each entry are shown in Additional file 4. The new annotations were analysed for virulence factors using 316 PHI-BASE. None of these annotations returned a significant hit to any known virulence factors. 317

Codon adaptation Index 318
The codon adaptation indices were calculated for the set of new features and compared to the V1 gene 319 set to identify significant differences in codon usage and distribution that could indicate possible causes 320 for errors and missed genes (Figure 8). The distribution of the CAIs of the new set were significantly 321 different (t-test, p value <0.05) than those of the predicted gene set suggesting a higher proportion of 322 less common codon usage in the new set. These were also significantly lower to the CAIs of all original 323 annotations that had high confidence supporting peptides. Each new gene was also analysed for 324 unusual codon usage, primarily the use of start codons other than methionine and not terminated by a 325 stop codon (Table 8) The aim of using these three sub-proteomes was firstly to validate as much as the P. cinnamomi draft 340 gene set as possible. Through spectral matching, we verified 10.6% of the predicted gene set. The 341 differing sub-proteomes, as shown above, are reflected in the validation of V1 genes by spectral 342 matching. Unique protein identifications of approximately 19%, 8% and 16% of the mycelia, secretome 343 and zoospores respectively, accounted for the differences in observed SDS-PAGE banding patterns and 344 RP-HPLC traces. Proteomic studies of other Phytophthora species indicated variable numbers of unique 345 proteins to these sub-proteomes. A 2-dimensional proteomic study of the oomycete P. palmivora 346 indicated 1% unique proteins for mycelia and zoospores [41]. However, a profile of the P. infestans 347 secretome indicated similar coverage of extracellular proteins to this study [42]. 348 Using the mass spectra matched to the 6-frame translation of the draft genome, we refined the draft 349 genome of P. cinnamomi. Total peptide support and functional assignment were used to validate and 350 to obtain the most accurate representation of edited and newly curated genes. We compared the 351 functional assignments of V1 and V2 edited genes to identify differences inferred by the changes in Poly(ADP-ribose polymerase domains and one WGR domain. This edit also resulted in one gene 368 ontology difference, the presence of an NAD+ ADP-ribosyltransferase. There were no other differences 369 in the GO and KEGG ontologies between V1 and V2 merged genes. Although these functional 370 differences do not indicate major functional differences, they can impact the way in which we classify 371 these proteins when trying to understand their role in a system. 372 The newly curated genes were validated using total peptide support, functional assignment and also 373 examined for their codon usage to gain a better understanding of why they were missed in the V1 The data generated by shotgun LC-MS/MS confirmed 2,764 previously annotated gene models from 410 the P. cinnamomi draft genome using high quality mass spectra from a diverse range of sub-proteome 411 fractions. The spectral data suggested potential errors in gene calling, and using the spectral data, we 412 were able to alter 60 genes by extending and merging exons, and identify 23 previously undescribed 413 annotations in the P. cinnamomi genome. This demonstrates that the correlation between genes called by methods in silico are not always correlated to protein products, with evidence of annotation error uncover evolutionary origins and mechanisms of pathogenesis. Science. 2006;313:1261-6.        Table 3. Confirmation of genes supported by peptides within or crossing exon boundaries. Total number of genes 26478 Genes confirmed by spectral matching (Protein Pilot) Genes with 6-frame peptide support-with boundary conflicts Genes with no 6-frame peptide support (inc. with boundary conflicts) Genes with only mismatched 6-frame peptides Genes with matched and mismatched 6-frame peptides 3468 52 19724 69 24