Methods
Building the Reference Database: Our Approach and Process
To identify Arabica coffee varieties, we used publicly available molecular markers drawn from
a set of 96 single nucleotide polymorphism (SNP) markers published by Zhang et al. (2021).
From this set, we selected a subset of 41 markers that best differentiated between coffee
varieties and performed efficiently in high-throughput genotyping. The most commonly
cultivated Arabica coffee varieties were sampled from various sources in Latin America and
Africa, and the samples were sent to Intertek/Agritech
(https://www.intertek.com/agriculture/agritech/) for genotyping. Once genotyping was
completed, we established the SNP fingerprints of the samples and developed the reference
database.
Because most Arabica coffee varieties were not fully inbred at the time of their release, some
show considerable intra-variety variation. Additionally, certain varieties with the same genetic
background and parentage were given different names in different countries or regions, which
makes discriminating between them impossible. This is further complicated by the fact that
closely related varieties often exhibit very similar marker fingerprints. We therefore collected
samples from reliable sources and genotyped multiple samples of each variety to establish the
most precise SNP fingerprint profile. The reference database thus contains multiple samples per
variety, so that this variation is captured and can support the identification and verification
process.
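As a rough illustration of why several samples per variety are kept, the following R sketch tabulates genotype frequencies at one marker within each variety. The data frame, the column names Variety and SNP1, and the genotypes are made-up placeholders, not values from the WCR reference database.

```r
# Toy reference data: several samples per variety (hypothetical values)
ref <- data.frame(
  id      = paste0("s", 1:6),
  Variety = c("VarA", "VarA", "VarA", "VarB", "VarB", "VarB"),
  SNP1    = c("A:A", "A:A", "A:C", "C:C", "C:C", "C:C"),
  stringsAsFactors = FALSE
)

# Proportion of each genotype within each variety (each row sums to 1)
geno_freq <- prop.table(table(ref$Variety, ref$SNP1), margin = 1)
print(round(geno_freq, 2))

# A variety is variable at this marker if more than one genotype is observed
variable <- rowSums(geno_freq > 0) > 1
print(variable)
```

In this toy example VarA carries two genotypes at the marker, so a single sample would not describe it fully.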
After the reference database was completed, we used the R package assignPOP, developed by
Chen et al. (2018), to identify varieties or variety groups from the SNP marker database. This
package assigns individuals to populations from genetic datasets using a machine-learning
approach. Key features of this package include principal component analysis (PCA) for
dimensionality reduction, Monte-Carlo cross-validation for assignment accuracy estimation, and
K-fold cross-validation for membership probability estimation.
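To make the idea of Monte-Carlo cross-validation concrete, here is a minimal base-R sketch of the resampling step only: reference individuals are repeatedly split at random into training and test sets, keeping the same proportion of each variety in training. This is an illustration under assumed toy labels and a hypothetical helper named mc_split; it is not the internal code of assignPOP.

```r
set.seed(1)  # reproducible resampling for this illustration

# Hypothetical reference labels: 10 samples from each of 3 varieties
pop <- rep(c("VarA", "VarB", "VarC"), each = 10)

# Draw one stratified split: a fixed proportion of each variety goes to training
mc_split <- function(pop, train_prop) {
  idx <- seq_along(pop)
  train <- unlist(lapply(split(idx, pop), function(i) {
    i[sample.int(length(i), size = floor(length(i) * train_prop))]
  }))
  train <- sort(unname(train))
  list(train = train, test = setdiff(idx, train))
}

# Repeat the random split several times (the Monte-Carlo part)
splits <- replicate(5, mc_split(pop, train_prop = 0.7), simplify = FALSE)
table(pop[splits[[1]]$train])
```

In a full analysis, a classifier would be trained on each training set and scored on the matching test set, and the accuracies would be averaged across the repeats.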
assignPOP is not the only tool that can be used for variety assignment. Others, such as
STRUCTURE, Discriminant Analysis of Principal Components (DAPC), and LEA (Landscape and
Ecological Association Studies), also enable the assignment of populations based on marker
data. We use assignPOP because of its built-in machine-learning validation approaches, its
effective discrimination power in predicting source populations, and its ability to handle
low-SNP data and relatively large datasets.
Conducting the Variety ID Assessment with Genotypic Data
Once the samples are genotyped, these data are used to conduct the variety ID assessment
with the assignPOP R package. For more details regarding the assignPOP R package, please
visit: https://alexkychen.github.io/assignPOP/
In the following section, we provide a detailed guide on how to carry out the variety assessment
using the RStudio software (RStudio Team, 2020).
R-Script considerations
First work on the reference database:
To access the Excel reference database (low-density SNP marker panel), please download it
from the WCR resources page. Save the file in a designated working folder on your computer.
The table below provides an overview of the reference database's structure. Please note that in
the R-script you will need an identification column giving the individual names of the samples
and a population column giving the population of each individual.
Example reference database
Reference database to compare variety samples
Depending on the variety assignment R package, better results are achieved by using a subset
of the reference varieties, because including many varieties fills the genetic space and
decreases the power of the analyses. This is especially true for varieties that are genetically
similar but were given different names in different countries. In general, it is important to
include varieties that represent the broad diversity, along with varieties that are known
candidates for possible confounders (neighboring varieties at the seed production level or at
the nursery). To address this concern, we developed a table that lists each variety in the
reference panel for comparison. This enables the creation of appropriate reference subsets and
reduces the risk of confounded results in the assignment analysis. Please note that this table
is based on prior knowledge of the genetic background of the varieties and relies on WCR
internal expertise, which involved the analysis of multiple samples from the varieties listed in
this reference dataset.
Make sure you use the subset of the reference varieties accordingly.
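One way to build such a subset is to keep only the rows whose variety appears in a chosen list. The sketch below uses a toy table; the ids, genotypes, and the choice of varieties in keep are placeholders for illustration only.

```r
# Toy reference table (hypothetical ids and genotypes)
snp <- data.frame(
  id      = c("r1", "r2", "r3", "r4"),
  Variety = c("Anacafe14", "Marsellesa", "Pacamara", "SanPacho"),
  SNP1    = c("A:A", "A:C", "C:C", "A:A"),
  stringsAsFactors = FALSE
)

# Varieties chosen for this comparison (placeholder choice)
keep <- c("Anacafe14", "Marsellesa")

snp_subset <- subset(snp, Variety %in% keep)
# Drop unused levels so that excluded varieties do not linger as empty populations
snp_subset$Variety <- droplevels(factor(snp_subset$Variety))
table(snp_subset$Variety)
```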
R-Script
[Link to the R script]
Install required packages if not already installed
# poppr provides missingno() and rstudioapi is needed for the setwd() line used later
required_packages <- c("readxl", "adegenet", "poppr", "graph4lg", "assignPOP", "openxlsx",
"rstudioapi")
new_packages <- required_packages[!(required_packages %in%
installed.packages()[,"Package"])]
if (length(new_packages) > 0) {
install.packages(new_packages)
}
Load required libraries and sources
library(readxl)
library(adegenet)
library(poppr)
library(graph4lg)
library(assignPOP)
library(openxlsx)
Set the working directory
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
Read the reference file
Make sure you use the subset of the reference varieties accordingly. You have the flexibility to
subset the reference database according to your requirements. For example, you may want to
include all reference varieties in a single database and then create subsets based on preferred
varieties for assessment. To subset the 'snp' data frame, we will use the subset() function
with specific criteria.
snp <- read.csv("FinalWCRFullReferencesLDPA-07-03-23.csv")
# Keep only the references flagged "Y" in the testAna14 column
snp <- subset(snp, testAna14 == "Y")
# Only use the first 45 SNPs that we routinely use.
# Make sure you use only the columns where the markers are in the database. In the example
# code line, the markers are located between columns 10 and 54.
snp2 <- snp[, 10:54]
Read the file in adegenet and convert the reference file to genind object
?df2genind # The function df2genind converts a data.frame (or a matrix) into a genind object.
# This code is used for the references; ind.names and pop must match the columns of your
# database. In the example database named snp, snp$id is the vector giving the names of the
# individuals and snp$Variety is the vector giving the population of each individual.
obj <- df2genind(snp2, ploidy = 2, NA.char = "", sep = ":", ind.names = snp$id,
pop = snp$Variety)
Convert genind object to genepop format
genind_to_genepop(obj, output = "Referencetogenepop_test.txt") # In this case we named the
# output "Referencetogenepop_test.txt". You can name it as you wish.
Replace "000000" with "0000" in Referencetogenepop_test.txt
IMPORTANT - When the file is converted, NAs are written as 000000 (six zeroes), whereas
0000 (four zeroes) is needed. Open the file 'Referencetogenepop_test.txt' produced in the
working folder and replace 000000 with 0000.
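The replacement can also be scripted instead of done by hand. The helper below is a hypothetical sketch: its name mirrors the commented-out replace_zeros() call that appears later in the script, but the original definition is not shown in this document, so the body here is an assumption. It rewrites the file in place and only touches stand-alone 000000 tokens.

```r
# Hypothetical helper: rewrite a Genepop text file so that missing genotypes
# coded as six zeroes become four zeroes. The file is overwritten in place.
replace_zeros <- function(path) {
  lines <- readLines(path)
  # \\b restricts the match to stand-alone 000000 tokens
  lines <- gsub("\\b000000\\b", "0000", lines, perl = TRUE)
  writeLines(lines, path)
  invisible(path)
}

# Small self-contained demonstration on a temporary file (made-up content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("demo title", "SNP1", "SNP2", "Pop", "ind1 , 0101 000000"), tmp)
replace_zeros(tmp)
readLines(tmp)[5]
```

For the reference file the call would be replace_zeros("Referencetogenepop_test.txt"). It is still worth opening the file afterwards to confirm the result.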
Using assignPOP
https://alexkychen.github.io/assignPOP/. It is important to list the pop.names in alphabetical
order.
Read reference genepop file
# In the script, these references have to be listed in alphabetical order. If you are not sure
# of the order, look at the Referencetogenepop_test.txt file created in the previous step.
YourGenepoprefs <- read.Genepop("Referencetogenepop_test.txt", haploid = FALSE,
pop.names=c("Anacafe14", "Catigua Mg2", "Centroamericano", "Marsellesa", "Pacamara",
"PacasTekisic", "SanPacho"))
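The ordered list of names can be generated rather than typed. The sketch below assumes that the population order in the Genepop file follows R's factor-level order of the variety column; that is an assumption, so the result should still be compared with the file itself.

```r
# Population labels as they might appear in the reference table, in arbitrary order
varieties <- c("SanPacho", "Marsellesa", "Anacafe14", "Pacamara",
               "Catigua Mg2", "PacasTekisic", "Centroamericano")

# Factor levels are sorted alphabetically, one entry per distinct label
pop_names <- levels(factor(varieties))
print(pop_names)
```

With a real reference table, pop_names <- levels(factor(snp$Variety)) could then be passed as pop.names = pop_names.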
Next step is predicting the samples
Example of the samples table
IMPORTANT: Make sure the heterozygous markers in the samples database are in the same
allele order as the reference database.
Het marker in ref. DB    Het marker in sample DB    Het marker in sample DB after modification
A:C                      C:A                        A:C
A:G                      G:A                        A:G
A:T                      T:A                        A:T
C:G                      G:C                        C:G
C:T                      T:C                        C:T
G:T                      T:G                        G:T
If the heterozygous markers are not in the same allele order as in the reference database,
reorder them accordingly (i.e., if the marker in the reference database is A:C and the marker
in the samples database is C:A, replace C:A with A:C in the samples database).
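The reordering can be automated. The sketch below assumes, as in the table above, that every heterozygous genotype in the reference database is written with its alleles in alphabetical order. The function name order_alleles and the sample genotypes are illustrative placeholders.

```r
# Put the two alleles of each genotype in alphabetical order ("C:A" -> "A:C").
# Missing values (NA or "") are returned unchanged.
order_alleles <- function(genotypes, sep = ":") {
  vapply(genotypes, function(g) {
    if (is.na(g) || g == "") return(g)
    paste(sort(strsplit(g, sep, fixed = TRUE)[[1]]), collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical sample genotypes
samples_demo <- data.frame(SNP1 = c("C:A", "A:A", ""),
                           SNP2 = c("T:G", "G:T", NA),
                           stringsAsFactors = FALSE)
samples_demo[] <- lapply(samples_demo, order_alleles)
print(samples_demo)
```

Applied to the marker columns of a real samples table, the same lapply() call would rewrite every genotype in one pass.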
Read sample file
samples <- read.csv("Rawdata 1-16 149.016_report.csv")
# Make sure you use only the columns where the markers are in the database. In the example
# code line, the markers are located between columns 5 and 45.
samples2 <- samples[, 5:45]
# Build the genind object from the samples table (samples2), not from the reference table.
# ind.names and pop must point to the matching columns of the samples file. The column names
# id and Variety used here mirror the reference database and may differ in your file.
samplesg <- df2genind(samples2, ploidy = 2, NA.char = "", sep = ":", ind.names = samples$id,
pop = samples$Variety)
samplesg
Assess loci and genotype missing data
# ?missingno   (missingno() is provided by the poppr package)
# Confirm whether the markers are missing in all genotyped plates and varieties. If any markers
# have missing data across all plates and varieties, consider eliminating the data for those
# markers. In this code, the threshold for removal is set at 30% of missing data.
samplesg <- missingno(samplesg, type = "loci", cutoff = 0.3, quiet = FALSE, freq = FALSE)
# Confirm whether the genotypes are missing in all genotyped plates and varieties. If any
# genotype has missing data across all plates and varieties, consider eliminating the data for
# those genotypes. In this code, the threshold for removal is set at 30% of missing data.
samplesg <- missingno(samplesg, type = "geno", cutoff = 0.3, quiet = FALSE, freq = FALSE)
samplesg
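To see which markers or samples exceed the 30% threshold before they are removed, the missing-data rates can be computed directly from the genotype table. The following base-R sketch uses a made-up table with rows as samples and columns as markers; it only reports the rates and does not reproduce the filtering itself.

```r
# Hypothetical genotype table: rows are samples, columns are markers
geno <- data.frame(
  SNP1 = c("A:A", "A:C", "",    "A:A", "A:A"),
  SNP2 = c("",    "",    "G:G", "",    "G:T"),
  SNP3 = c("C:C", "C:T", "",    "C:C", "C:C"),
  SNP4 = c("T:T", "T:T", "",    "T:T", "T:T"),
  stringsAsFactors = FALSE
)

# A call counts as missing when it is NA or an empty string
is_missing <- is.na(geno) | geno == ""

# Share of missing calls per marker and per sample
loci_missing <- colMeans(is_missing)
geno_missing <- rowMeans(is_missing)

cutoff <- 0.3
names(loci_missing)[loci_missing > cutoff]   # markers above the cutoff
which(geno_missing > cutoff)                 # samples above the cutoff
```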
Convert genind object to genepop format
IMPORTANT - When the file is converted, NAs are written as 000000 (six zeroes), whereas
0000 (four zeroes) is needed.
genind_to_genepop(samplesg, output = "WCR_AssigPop_test.txt")
# Replace "000000" with "0000" in WCR_AssigPop_test.txt
#replace_zeros("WCR_AssigPop_test.txt")
Read unknown individuals in the genepop file
YourGenepopunknown <- read.Genepop("WCR_AssigPop_test.txt", haploid = FALSE)
Perform assignment test
The assign.X function utilizes markers data, along with known individuals (in this case the
references), to assign the unknown individuals to possible source populations. The results are
then saved as an ‘AssignmentResult.txt’ file.
?assign.X
# The files are saved in the dir="ResultTest/" folder. Please make sure you create the folder
# before running this code line. If the folder already exists, make sure it is empty before
# you run this line.
assign.X(x1 = YourGenepoprefs, x2 = YourGenepopunknown, dir = "ResultTest/",
model = "naiveBayes")
Read assignment result from AssignmentResult.txt
assignment_data <- read.table("ResultTest/AssignmentResult.txt", header = FALSE)
After executing the script, open the resulting 'AssignmentResult.txt' file and save it as
either a .csv or .xls file. The output table includes the predicted variety and the
corresponding membership probabilities, similar to the sample table provided below. When the
expected variety does not match the predicted variety, it should not be assumed that the
predicted variety accurately represents the sample. For instance, in the table, the expected
variety of the samples was "Anacafe14", and all the trees except tree 2 had predicted
varieties matching the expected variety. For tree 2, the predicted variety was "Marsellesa".
Because Arabica coffee varieties are closely related and have similar SNP profiles, the
assignment to "Marsellesa" is not certain. This assignment test should therefore be read only
as a "yes" or "no" answer to whether a sample matches its expected variety.
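The yes/no reading can be added to the result table as an extra column. In the sketch below, the table contents and the column names Ind.ID and pred.pop are assumptions made for illustration, and the output file name is a placeholder.

```r
# Hypothetical assignment output; column names and values are placeholders
result <- data.frame(
  Ind.ID   = c("tree1", "tree2", "tree3"),
  pred.pop = c("Anacafe14", "Marsellesa", "Anacafe14"),
  stringsAsFactors = FALSE
)

expected_variety <- "Anacafe14"

# "yes" when the prediction agrees with the expected variety, otherwise "no"
result$matches_expected <- ifelse(result$pred.pop == expected_variety, "yes", "no")
print(result)

# Save as .csv so the table opens directly in a spreadsheet program
out_file <- file.path(tempdir(), "AssignmentResult_checked.csv")
write.csv(result, out_file, row.names = FALSE)
```

A "no" indicates only that the sample does not match the expected variety; it does not confirm the predicted one.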
References
Raymond, M., & Rousset, F. (1995). GENEPOP (version 1.2): population genetics software for
exact tests and ecumenicism. Journal of Heredity, 86, 248-249.
Chen, K. Y., Marschall, E. A., Sovic, M. G., Fries, A. C., Gibbs, H. L., & Ludsin, S. A. (2018).
assignPOP: An R package for population assignment using genetic, non-genetic, or integrated
data in a machine-learning framework. Methods in Ecology and Evolution, 9(2), 439-446.
https://doi.org/10.1111/2041-210X.12897
Zhang, D., Vega, F. E., Solano, W., et al. (2021). Selecting a core set of nuclear SNP markers
for molecular characterization of Arabica coffee (Coffea arabica L.) genetic resources.
Conservation Genetics Resources, 13, 329–335.
RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. URL
http://www.rstudio.com/.