LDP Webpage Release Methods

Uploaded by

Riki Okta

Methods

Building the Reference Database: Our Approach and Process

We identified Arabica coffee varieties with publicly available molecular markers, drawn from a
set of 96 single nucleotide polymorphism (SNP) markers published by Zhang et al. (2021). From
this set, we selected a subset of 41 markers that best differentiated among coffee varieties and
performed efficiently in high-throughput genotyping. The most commonly cultivated Arabica
coffee varieties were sampled from various sources in Latin America and Africa, and the samples
were sent to Intertek/Agritech (https://www.intertek.com/agriculture/agritech/) for genotyping.
Once genotyping was complete, we established the SNP fingerprints of the samples and
developed the reference database.

Because most Arabica coffee varieties were not fully inbred at the time of their release, some
show considerable intra-variety variation. In addition, certain varieties share the same genetic
background and parentage but were given different names in different countries or regions,
which makes it impossible to discriminate between them. Closely related varieties also often
exhibit very similar marker fingerprints. For these reasons, we collected samples from reliable
sources and genotyped multiple samples of each variety, so that the reference database
captures the variation within a variety and uses it to support identification and verification.

After the reference database was completed, we used the R package assignPOP, developed by
Chen et al. (2018), to identify varieties or variety groups from the SNP marker database. This
package assigns individuals to populations from genetic datasets using a machine-learning
approach. Its key features include principal component analysis (PCA) for dimensionality
reduction, Monte-Carlo cross-validation for estimating assignment accuracy, and K-fold
cross-validation for estimating membership probabilities.
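
To make the idea of Monte-Carlo cross-validation concrete, the sketch below repeatedly splits a toy dataset into random training and test sets and records the assignment accuracy of each split. It is a base-R illustration only and is not how assignPOP is implemented: the genotypes are simulated, the variety names "VarA" and "VarB" are made up, and a simple nearest-centroid classifier stands in for the machine-learning models that assignPOP offers.

```r
# Toy illustration of Monte-Carlo cross-validation (NOT assignPOP's implementation).
set.seed(1)

# Hypothetical genotypes coded as allele counts (0, 1, 2) at 10 SNPs for two
# made-up varieties with different allele frequencies.
n_per_pop <- 20
n_loci <- 10
geno <- rbind(
  matrix(rbinom(n_per_pop * n_loci, 2, 0.1), nrow = n_per_pop),
  matrix(rbinom(n_per_pop * n_loci, 2, 0.9), nrow = n_per_pop)
)
pop <- rep(c("VarA", "VarB"), each = n_per_pop)

# Nearest-centroid classifier: assign each test individual to the variety
# whose mean genotype profile is closest (squared Euclidean distance).
predict_centroid <- function(train_x, train_y, test_x) {
  groups <- sort(unique(train_y))
  # One column of per-locus means for each variety
  centroids <- sapply(groups, function(g) {
    colMeans(train_x[train_y == g, , drop = FALSE])
  })
  apply(test_x, 1, function(ind) {
    groups[which.min(colSums((centroids - ind)^2))]
  })
}

# Repeat a random, per-variety train/test split and record the accuracy each time.
monte_carlo_cv <- function(x, y, train_fraction = 0.7, iterations = 20) {
  vapply(seq_len(iterations), function(i) {
    train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
      sample(idx, size = floor(length(idx) * train_fraction))
    }))
    pred <- predict_centroid(x[train_idx, , drop = FALSE], y[train_idx],
                             x[-train_idx, , drop = FALSE])
    mean(pred == y[-train_idx])
  }, numeric(1))
}

accuracy <- monte_carlo_cv(geno, pop)
mean(accuracy)  # average assignment accuracy over the random splits
```

Averaging over many random splits gives a more stable accuracy estimate than a single split, which is the reason this kind of validation is useful when each variety is represented by only a handful of reference samples.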

assignPOP is not the only tool that can be used for variety assignment. Others, such as
STRUCTURE, Discriminant Analysis of Principal Components (DAPC), and LEA (Landscape and
Ecological Association Studies), also assign individuals to populations based on marker data.
We use assignPOP because of its built-in machine-learning validation approaches, its effective
discrimination power in predicting source populations, and its ability to handle data with few
SNPs and relatively large datasets.

Conducting the Variety ID Assessment with Genotypic Data

Once the samples are genotyped, the data are used to conduct the variety ID assessment with
the assignPOP R package. For more details on assignPOP, please visit:
https://alexkychen.github.io/assignPOP/

The following section is a detailed guide to carrying out the variety assessment in RStudio
(RStudio Team, 2020).

R-Script considerations

First work on the reference database:

To access the Excel reference database (low-density SNP marker panel), please download it
from the WCR resources page. Save the file in a designated working folder on your computer.
The table below provides an overview of the reference database's structure. Please note that in
the R-script you will need an identification column giving the individual names of the samples
and a population column giving the population of each individual.
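
As a quick check of the structure described above, the sketch below confirms that a reference table has an identification column and a population column before it is used. The column names `id` and `Variety` follow the example script further down; the helper name `check_reference` and the toy data are hypothetical.

```r
# Hypothetical helper: check that the reference table has the two columns the
# script relies on, and count the reference samples per variety.
check_reference <- function(ref, id_col = "id", pop_col = "Variety") {
  missing_cols <- setdiff(c(id_col, pop_col), names(ref))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Duplicated sample names are likely to cause problems downstream
  if (anyDuplicated(ref[[id_col]]) > 0) {
    stop("Sample names in '", id_col, "' must be unique.")
  }
  table(ref[[pop_col]])
}

# Toy reference table (made-up samples and genotypes)
toy_ref <- data.frame(
  id = c("ref1", "ref2", "ref3"),
  Variety = c("Pacamara", "Pacamara", "Marsellesa"),
  SNP1 = c("A:A", "A:C", "C:C"),
  stringsAsFactors = FALSE
)
counts <- check_reference(toy_ref)
counts
```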

Example reference database

Reference database to compare variety samples

Variety assignment generally gives better results with a subset of the reference varieties than
with the full panel, because a large number of references fills the genetic space and decreases
the power of the analysis. This is especially true for varieties that are genetically similar but
were given different names in different countries. In general, the subset should include varieties
that represent the broad diversity as well as varieties that are likely confounders for the sample
(for example, neighboring varieties at the seed production level or at the nursery). To help with
this, we developed a table that lists, for each variety in the reference panel, the varieties to
compare it with. The table makes it possible to build appropriate reference subsets and reduces
confounded results in the assignment analysis. Please note that this table is based on prior
knowledge of the genetic background of the varieties and on WCR internal expertise, which
involved the analysis of multiple samples of the varieties listed in this reference dataset.

Make sure you use the subset of the reference varieties accordingly.
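
One way to build such a subset is to keep only the rows whose variety appears in a chosen list. The sketch below assumes a population column called `Variety`, as in the example script; the toy table and the list of varieties to keep are made up for illustration.

```r
# Toy reference table (made-up samples and genotypes)
ref <- data.frame(
  id = paste0("ref", 1:5),
  Variety = c("Anacafe14", "Marsellesa", "Pacamara", "SanPacho", "Anacafe14"),
  SNP1 = c("A:A", "A:C", "C:C", "A:A", "A:A"),
  stringsAsFactors = FALSE
)

# Varieties to keep: the expected variety plus its likely confounders
keep <- c("Anacafe14", "Marsellesa", "Pacamara")

# Keep only the reference samples that belong to the chosen varieties
ref_subset <- subset(ref, Variety %in% keep)
table(ref_subset$Variety)
```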

R-Script
[Link to the R script]

Install required packages if not already installed

# poppr provides missingno(); rstudioapi is used to set the working directory
required_packages <- c("readxl", "adegenet", "graph4lg", "assignPOP", "openxlsx",
                       "poppr", "rstudioapi")
new_packages <- required_packages[!(required_packages %in% installed.packages()[, "Package"])]
if (length(new_packages) > 0) {
  install.packages(new_packages)
}

Load required libraries and sources

library(readxl)
library(adegenet)
library(graph4lg)
library(assignPOP)
library(openxlsx)
library(poppr) # provides missingno(), used below

Set the working directory

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

Read the reference file

Make sure you use the subset of the reference varieties accordingly. You have the flexibility to
subset the reference database according to your requirements. For example, you may want to
include all reference varieties in a single database and then create subsets based on preferred
varieties for assessment. To subset the 'snp' data frame, we will use the subset() function
with specific criteria.

snp <- read.csv("FinalWCRFullReferencesLDPA-07-03-23.csv")

# Keep only the references flagged "Y" in the testAna14 column
snp <- subset(snp, testAna14 == "Y")

# Only use the first 45 SNPs that we routinely use. Make sure you select only the columns
# that hold the markers; in this example they are columns 10 to 54.
snp2 <- snp[, 10:54]

Read the file in adegenet and convert the reference file to a genind object

?df2genind # df2genind converts a data.frame (or a matrix) into a genind object

# ind.names and pop must match the columns of your database. In the example database,
# snp$id is the vector of individual names and snp$Variety is the vector giving the
# population of each individual.
obj <- df2genind(snp2, ploidy = 2, NA.char = "", sep = ":", ind.names = snp$id, pop = snp$Variety)

Convert genind object to genepop format

# The output file is named "Referencetogenepop_test.txt" here; you can name it as you wish.
genind_to_genepop(obj, output = "Referencetogenepop_test.txt")

Replace "000000" with "0000" in Referencetogenepop_test.txt

IMPORTANT - The conversion writes missing data (NAs) as 000000 (six zeroes), but the next
step needs 0000 (four zeroes). Open the file 'Referencetogenepop_test.txt' in the working folder
and replace every 000000 with 0000.
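
The replacement can also be scripted rather than done by hand. The sketch below is one possible helper, not part of the original script: the function name `replace_zeros` and the demonstration file are hypothetical, and it assumes that genotypes in the file are separated by whitespace so that a missing genotype appears as a standalone `000000` token.

```r
# Hypothetical helper: rewrite a genepop text file in place, replacing the
# standalone token 000000 (six zeroes) with 0000 (four zeroes).
replace_zeros <- function(path) {
  lines <- readLines(path)
  # \\b restricts the match to whole tokens, so longer strings are left alone
  fixed <- gsub("\\b000000\\b", "0000", lines, perl = TRUE)
  writeLines(fixed, path)
  invisible(sum(lines != fixed))  # number of lines that were changed
}

# Demonstration on a small made-up file written to a temporary location
demo_file <- tempfile(fileext = ".txt")
writeLines(c("Toy genepop file", "SNP1", "SNP2", "Pop",
             "ind1 , 0102 000000",
             "ind2 , 000000 0303"), demo_file)
changed <- replace_zeros(demo_file)
readLines(demo_file)
```

Because the file is overwritten, it may be worth keeping a copy of the original until the result has been checked.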

Using assignPOP

See https://alexkychen.github.io/assignPOP/ for the package documentation. It is important to
list the pop.names in alphabetical order.

Read reference genepop file

# List the references in pop.names in alphabetical order. If you are not sure of the order,
# check the Referencetogenepop_test.txt file created in the previous step.
YourGenepoprefs <- read.Genepop("Referencetogenepop_test.txt", haploid = FALSE,
  pop.names = c("Anacafe14", "Catigua Mg2", "Centroamericano", "Marsellesa", "Pacamara",
                "PacasTekisic", "SanPacho"))
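
Instead of typing the variety names by hand, the alphabetical list can be derived from the reference table. This is a minimal sketch that assumes the population column is called `Variety`, as in the example script, and uses made-up rows. Sorting in R depends on the locale, so the result should still be compared with the order of the populations in the genepop file.

```r
# Toy reference table with the varieties in arbitrary order
ref <- data.frame(
  Variety = c("Pacamara", "Anacafe14", "Marsellesa", "Anacafe14"),
  stringsAsFactors = FALSE
)

# Unique variety names in alphabetical order, ready to pass as pop.names
pop_names <- sort(unique(ref$Variety))
pop_names
```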

Next step is predicting the samples

Example of the samples table

IMPORTANT: Make sure the heterozygous markers in the samples database are in the same
allele order as the reference database.
Het marker in ref. DB    Het marker in sample DB    Het marker in sample DB after modification
A:C                      C:A                        A:C
A:G                      G:A                        A:G
A:T                      T:A                        A:T
C:G                      G:C                        C:G
C:T                      T:C                        C:T
G:T                      T:G                        G:T

If the heterozygous markers are not in the same allele order as the reference database, reorder
them accordingly (i.e., if the marker in the reference database is A:C and the marker in the
samples database is C:A, replace C:A with A:C in the samples database).
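
The reordering can be scripted when there are many samples. The sketch below assumes that the reference database writes heterozygous genotypes with the alleles in alphabetical order, as in the table above, and that alleles are separated by ":". The function name `normalize_genotype` is hypothetical.

```r
# Hypothetical helper: put the two alleles of each genotype in alphabetical
# order (C:A becomes A:C). Missing values (NA or "") are returned unchanged.
normalize_genotype <- function(g, sep = ":") {
  g <- as.character(g)
  vapply(g, function(x) {
    if (is.na(x) || x == "") return(x)
    paste(sort(strsplit(x, sep, fixed = TRUE)[[1]]), collapse = sep)
  }, character(1), USE.NAMES = FALSE)
}

# Toy marker columns from a made-up samples table
toy_samples <- data.frame(
  SNP1 = c("C:A", "A:C", "A:A"),
  SNP2 = c("T:G", "", "G:G"),
  stringsAsFactors = FALSE
)

# Apply the helper to every marker column
toy_samples[] <- lapply(toy_samples, normalize_genotype)
toy_samples
```

If the reference database uses a different allele order for some markers, those markers need to be matched to the reference explicitly instead.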

Read sample file

samples <- read.csv("Rawdata 1-16 149.016_report.csv")

# Make sure you select only the columns that hold the markers; in this example they are
# columns 5 to 45. They should be the same markers, under the same names, as in the reference.
samples2 <- samples[, 5:45]

# Build the genind object from the samples (not from the references). ind.names must point to
# the column that holds the sample names; samples$id assumes that column is called "id", as in
# the reference database, so adjust it to your file. The true variety is unknown, so a single
# placeholder population ("Unknown") is assumed here.
samplesg <- df2genind(samples2, ploidy = 2, NA.char = "", sep = ":", ind.names = samples$id,
                      pop = rep("Unknown", nrow(samples2)))

samplesg

Assess loci and genotype missing data

# ?missingno

# Remove markers (loci) with more than 30% missing data. First confirm whether a marker is
# missing across all genotyped plates and varieties; if it is, consider removing its data.
samplesg <- missingno(samplesg, type = "loci", cutoff = 0.3, quiet = FALSE, freq = FALSE)

# Remove genotypes (samples) with more than 30% missing data. First confirm whether a
# genotype is missing across all genotyped plates and varieties; if it is, consider removing
# its data.
samplesg <- missingno(samplesg, type = "geno", cutoff = 0.3, quiet = FALSE, freq = FALSE)

samplesg
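
Before anything is removed, it can help to see how much data is missing per marker and per sample. The base-R sketch below computes those fractions directly from a table of marker columns and is independent of the filtering step above. The function name `missing_fraction` and the toy table are hypothetical, and missing genotypes are assumed to be stored as empty strings or NA.

```r
# Hypothetical helper: fraction of missing genotypes per marker (column) and
# per sample (row). Empty strings and NA both count as missing.
missing_fraction <- function(markers) {
  is_missing <- is.na(markers) | markers == ""
  list(per_locus = colMeans(is_missing), per_sample = rowMeans(is_missing))
}

# Toy marker table (made-up genotypes)
toy <- data.frame(
  SNP1 = c("A:A", "", "A:C", "A:A"),
  SNP2 = c("G:G", "", "", "G:T"),
  stringsAsFactors = FALSE
)
miss <- missing_fraction(toy)

# Markers and samples above the 30% threshold used in this guide
names(which(miss$per_locus > 0.3))
which(miss$per_sample > 0.3)
```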

Convert genind object to genepop format

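IMPORTANT - The conversion writes missing data (NAs) as 000000 (six zeroes), but the next
step needs 0000 (four zeroes).

genind_to_genepop(samplesg, output = "WCR_AssigPop_test.txt")

# Replace "000000" with "0000" in WCR_AssigPop_test.txt before reading the file back in.
# replace_zeros() is not defined in this script; do the replacement by hand or with your own helper.
# replace_zeros("WCR_AssigPop_test.txt")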

Read unknown individuals in the genepop file

YourGenepopunknown <- read.Genepop("WCR_AssigPop_test.txt", haploid = FALSE)

Perform assignment test

The assign.X function uses the marker data of known individuals (in this case the references)
to assign the unknown individuals to their possible source populations. The results are saved
in an 'AssignmentResult.txt' file.

?assign.X

# The results are saved in the dir = "ResultTest/" folder. Create the folder before running this
# line; if the folder already exists, make sure it is empty.
assign.X(x1 = YourGenepoprefs, x2 = YourGenepopunknown, dir = "ResultTest/", model = "naiveBayes")

Read assignment result from AssignmentResult.txt

assignment_data <- read.table("ResultTest/AssignmentResult.txt", header = FALSE)

After running the script, open the resulting 'AssignmentResult.txt' file and save it as a .csv or
.xls file. The output table includes the predicted variety and the corresponding membership
probabilities, as in the sample table below. When the expected variety does not match the
predicted variety, do not assume that the predicted variety is the true identity of the sample.
For instance, in the table, the expected variety of the samples was "Anacafe14", and the
predicted variety matched it for every tree except tree 2, which was predicted as "Marsellesa".
Because Arabica coffee varieties are closely related and have similar SNP profiles, the
assignment to "Marsellesa" is not certain. The assignment test should therefore be read only as
a "yes" or "no" answer to whether a sample matches its expected variety.
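
The yes/no reading can be added to the result table before it is saved as a .csv file. The sketch below works on a small made-up table; the column names `Ind.ID` and `pred.pop` are assumptions about the layout of the result file and should be adjusted to the columns actually present. The function name `flag_matches` is hypothetical.

```r
# Hypothetical helper: add the expected variety and a yes/no match column.
flag_matches <- function(result, expected, pred_col = "pred.pop") {
  result$expected <- expected
  result$match <- ifelse(result[[pred_col]] == expected, "yes", "no")
  result
}

# Toy result table mirroring the example in the text (tree 2 does not match)
toy_result <- data.frame(
  Ind.ID = paste0("tree", 1:3),
  pred.pop = c("Anacafe14", "Marsellesa", "Anacafe14"),
  stringsAsFactors = FALSE
)
checked <- flag_matches(toy_result, expected = "Anacafe14")

# Save the checked table as a .csv file (written to a temporary folder here)
out_file <- file.path(tempdir(), "AssignmentResult_checked.csv")
write.csv(checked, out_file, row.names = FALSE)
checked
```

A "no" only says that the sample does not match its expected variety; it does not confirm that the sample belongs to the predicted one.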

References

Raymond, M., & Rousset, F. (1995). GENEPOP (version 1.2): population genetics software for
exact tests and ecumenicism. Journal of Heredity, 86, 248-249.

Chen, K. Y., Marschall, E. A., Sovic, M. G., Fries, A. C., Gibbs, H. L., & Ludsin, S. A. (2018).
assignPOP: An R package for population assignment using genetic, non-genetic, or integrated
data in a machine-learning framework. Methods in Ecology and Evolution, 9(2), 439-446.
https://doi.org/10.1111/2041-210X.12897

Zhang, D., Vega, F. E., Solano, W., et al. (2021). Selecting a core set of nuclear SNP markers
for molecular characterization of Arabica coffee (Coffea arabica L.) genetic resources.
Conservation Genetics Resources, 13, 329-335.

RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. URL
http://www.rstudio.com/.
