NIOSH TECHNICAL REPORT
THE DEVELOPMENT AND APPLICATION OF ALGORITHMS
FOR GENERATING ESTIMATES OF TOXICITY
FOR THE NOHS DATA BASE
HERBERT L. VENABLE
U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES
Public Health Service
Centers for Disease Control
National Institute for Occupational Safety and Health
Division of Surveillance, Hazard Evaluations and Field Studies
Cincinnati, Ohio 45226
July, 1986
DISCLAIMER
Mention of company names or products does not constitute endorsement by the
National Institute for Occupational Safety and Health.
ACKNOWLEDGEMENT
The development and application of the algorithms presented in this technical
report were accomplished under NIOSH contracts 210-78-0077, 210-78-0066, and
210-80-0044 (Genesee Computer Center, Inc., with Health Designs, Inc. and the
Franklin Research Institute under subcontract).
I would like to thank the following people for their critical review of this
document:
Ms. Alice Griefe
Occupational Toxicologist
Industrial Hygiene Section
Industrywide Studies Branch
DSHEFS, NIOSH

Dr. Harold Resnick
Science Advisor
Office of the Director
DROS, NIOSH

Dr. Sanford Leffingwell
Chief, Research Analysis Section
Priorities Research Analysis Branch
DBBS, NIOSH

Dr. Curtis Travis
Office of Risk Analysis
Health and Safety Research Division
Oak Ridge National Laboratory

Dr. Robert W. Mason
Technical Advisor for Science
DBBS, NIOSH

Dr. Joseph Kelaghan
Epidemic Intelligence Service Officer
National Cancer Institute

I would also like to express my thanks to Dr. Wm. Karl Sieber, Jr. of NIOSH
for his review of the statistical methodologies, Mr. David H. Pedersen of
NIOSH for his suggestions and comments on organizing and writing this
document, and to Ms. Kathy Mitchell for manuscript preparation.
DHHS (NIOSH) Publication No. 87-101
TABLE OF CONTENTS
I. Introduction
II. Development of Algorithms
    A. General Background
    B. Modeling the Algorithms
    C. Statistical Methodologies
    D. Development of Individual Estimation Algorithms
III. Estimation and Ranking of NOHS Compounds
IV. Discussion
V. References
VI. Appendices
TABLES
Table No.
1. A Sampling of Molecular Descriptors Used in Structural
   Activity Relationship Studies
2. Potential Problem Keys
3. LD50 Algorithm: Regression Statistics for Subset
   Models of 1,000, 1,500, and 2,000 Compounds
4. Distribution of Log 1/C for 1,968 Compound Model
5. LD50 Algorithm Equation
6. Test Compounds - Characteristics of Residuals
7. Mutagen Algorithm Equation
8. Mutagen Algorithm - Design Compounds
9. Mutagen Algorithm - Misclassification in Ranges
10. Mutagen Algorithm - Test Compounds
11. Carcinogen Algorithm Equation
12. Carcinogen Algorithm - Classification by Discriminant
    Equation
13. Carcinogen Algorithm - Misclassification in Ranges
14. Criteria for Evaluation of Teratogenicity
15. Teratogen Algorithm Equation
16. Distribution of Teratogenicity Scores
17. Teratogen Algorithm - Discriminant Equation Evaluation
18. Teratogen Algorithm - Misclassification in Ranges
19. Number of Chemical Compounds by Selected Ranges
    (0.750 or Greater) of Estimated Toxicity Endpoint Values
20. Some Predictive Toxicology Oriented Models for the
    Correlation of Chemical Structure with a Biologic Endpoint
FIGURES
Figure No.
1. Procedures for Developing an Algorithm
2. Translation Process for Obtaining a Quantifiable
   (Numerical) Representation of Molecular Structure
3. LD50 Estimating Equation
4. Equations for Calculating Mutagen and Nonmutagen Scores
5. Mutagenicity Estimating Equation
6. Equations for Calculating Definite (Carcinogen)
   and Indefinite (Noncarcinogen) Scores
7. Carcinogenicity Estimating Equation
8. Equations for Calculating Teratogen and Nonteratogen
   Scores
9. Teratogenicity Estimating Equation
APPENDICES
Appendix
A. Wiswesser Line-Formula Notation Symbols and Definitions
B. WLN Example for an Acyclic Compound
C. WLN Example for a Cyclic Compound
D. Molecular Substructure Keys and Their Definitions
E. Example of Generating an LD50 Estimate
F. List of Compounds Used in the LD50 Algorithm Modeling
   Data Base *
G. List of Compounds Used in the Mutagen Algorithm Modeling
   Data Base *
H. Example of Generating an Estimate of Mutagenicity
I. List of Compounds Used in the Carcinogen Algorithm Modeling
   Data Base *
J. Example of Generating an Estimate of Carcinogenicity
K. List of Compounds Used in the Teratogen Algorithm Modeling
   Data Base *
L. Example of Generating an Estimate of Teratogenicity
M. List of NOHS Compounds Receiving an LD50 Estimate *
N. List of NOHS Compounds Receiving an Estimate of Mutagenicity *
O. List of NOHS Compounds Receiving an Estimate of Carcinogenicity *
P. List of NOHS Compounds Receiving an Estimate of Teratogenicity *
* These appendices have been placed on microfiche and attached to the back
cover of the printed report.
Special Note from the Author
The research and final products of the work presented in this report were
accomplished under NIOSH contracts 210-78-0077, 210-78-0066, and 210-80-0044,
with the author serving as the NIOSH Project Officer. However, time and funds
allocated for this project expired before an approved final report was
submitted by the contractor.
Since the author believes that the products of this project have significant
value in the field of occupational safety and health, results of the project
are reported here despite the lack of a final report from the original
contractor.
The author wishes to extend his appreciation to and acknowledge the following
individuals for their contribution, through the draft report, in the
compilation of this report:
Mr. Kurt Enslein, President
Health Designs, Inc.
Rochester, New York
Dr. Paul Craig
National Library of Medicine
Bethesda, Maryland
Dr. John Strange
Franklin Research Institute
Philadelphia, Pennsylvania
Mr. Tom Lander
Health Designs, Inc.
Rochester, New York
Mr. Michael Tomb
Health Designs, Inc.
Rochester, New York
The text of this report is extracted largely from the contractor's incomplete
draft report, which is cited extensively throughout, as are several
publications by Enslein et al. that were written and published as the models
were developed. These publications should be consulted in conjunction with
this report to obtain a more comprehensive understanding of the project and
its intent. Copies of the contractor's incomplete final report to NIOSH are
available upon request from the author.
ABSTRACT
This project developed computer-based algorithms designed to provide estimates
of toxicity for four toxicologic endpoints: LD50 (oral, rat), mutagenicity,
carcinogenicity, and teratogenicity. These algorithms are the end result of a
series of models tested against available toxicity data for each of the four
toxic endpoints. The modeling data base for each endpoint contained a listing
of chemical compounds determined to be toxic or non-toxic for each endpoint
based on a subjective analysis of the bioassay data available.
Once the algorithms had been developed and tested, they were applied to the
chemicals in the National Occupational Hazard Survey (NOHS) data base to
generate estimates of toxicity for those chemical compounds known to be in the
workplace. These estimates of toxicity are particularly useful in assessing
the toxicity of those chemical compounds for which little or no toxicity data
has been reported.
The algorithms produce estimates of toxic effect based on statistical
computation and are therefore known to incorporate a certain degree of
unavoidable statistical error. This and other limitations discussed in the
report preclude the use of such theoretical toxicity data as a substitute for
reported animal bioassay data or as the sole basis in making regulatory or
other decisions of similar magnitude regarding the use of and exposure to
chemical compounds. Instead, these toxicity data are intended only for
rank-ordering a list of compounds according to relative toxicity or as a part
of an overall process of selecting, testing, and evaluating chemical compounds
for toxicity.
I. Introduction
A. Purpose
This project developed and applied computer-based algorithms to the
chemical compounds (hereafter referred to as compounds) listed in
the NIOSH National Occupational Hazard Survey (NOHS) data base in
order to generate estimates of toxicity for these compounds for the
following toxic endpoints:
LD50 (oral, rat)
Mutagenicity
Carcinogenicity
Teratogenicity
The theoretical toxicity data thus generated is intended for use
only as an additional tool in assessing the toxicity of those
compounds found in the workplace.
The compounds listed in the NOHS data base are a result of the
National Occupational Hazard Survey, which was a two-year study
(1972-74) "intended to describe the health and safety conditions in
the American work environment and, more specifically, to determine
the extent of worker exposure to chemical and physical agents" (1).
Observational data were gathered by surveying approximately 5,000
facilities encompassing all types of industrial activity covered by
the Occupational Safety and Health Act (OSHA) of 1970.
Approximately 8,000 separate chemical substances were identified as
present in the workplace during the course of the survey. These
8,000-plus chemical substances are included in the NOHS data base.
The application of these four algorithms to these compounds known
to be in the work environment extends the utility of the data base
by providing NIOSH with a unique toxicology information resource.
Such a resource can be effectively utilized in a number of areas.
For NIOSH, a major application could be for risk assessment and
prioritization of research on chemical hazards in the workplace.
B. Structural Activity Relationships (SARs)
All four algorithms were developed on the assumption that a
structure-activity relationship (SAR) exists among groups of
compounds that exhibit similar chemical characteristics. For
example, a SAR may exist among a group of compounds that possess a
certain degree of ionic charge per molecule and may therefore have a
similar degree of water solubility. SARs may be based on one or
more of a number of molecular structure descriptors. Some of the
more commonly used structural parameters are listed in Table 1.
The concept of SARs has been applied in several areas. For example,
the primary use of the SAR concept in pharmaceutical chemistry has
been for the evaluation of therapeutic effects of potential new drug
compounds.

TABLE 1. A SAMPLING OF MOLECULAR DESCRIPTORS USED
IN STRUCTURAL ACTIVITY RELATIONSHIP STUDIES

Physicochemical descriptors
    Molecular weight
    Density
    Melting point
    Boiling point
    Logarithm of n-octyl alcohol/water partition coefficient
    Molecular refractivity*
Topological descriptors
    Atom and bond fragments
    Substructures (atom groups)
    Substructure environment
    Number of carbon atoms
    Number of rings (in polycyclic compounds)
    Molecular connectivity (extent of branching)
Geometrical descriptors
    Molecular volume
    Molecular shape
    Molecular surface area
    Substructure shape
    Taft steric parameter*
    Verloop sterimol constants*
Electronic descriptors
    Hammett-Taft sigma constants*
    Electron density - bond reactivity
    Dielectric constant
    Dipole and higher moments
    Ionization potential
    Electron affinity

* These "complex descriptors" could be placed in other categories as well.
Reprinted with permission from Chemical and Engineering News, March 9,
1981 (2).

Several approaches have been used in the application of SAR research.
Craig and Enslein (3) divided these methods of approach into four
categories.
1. Intuitive Approach - which applies the organic chemists' skill,
knowledge, and intuition. More recently this approach has
focused on creating an additive model SAR which is based on the
hypothesis that each structural feature of a molecule plays a
consistent role in contributing to the overall activity of the
molecule.
2. Multiple Parameter Approach - which combines known
physical-organic chemical relationships into a novel
mathematical expression to relate the biological activities of a
closely related series of compounds to one or more physical
properties (e.g., the octanol-water solubility ratio, more
commonly referred to as the partition coefficient).
3. Quantum Chemical Approach - which employs the principles of
quantum mechanics and calculations. For example, one approach
obtains electronic indices for a series of structurally related
chemicals.
4. Substructural Analysis Approach - which is based on the analysis
of type and, in some cases, frequency of occurrence of
substructural or molecular fragments (e.g., -NO2).
Adamson et al. state that, unlike the multiple parameter, additive
model, or intuitive approach methods, the substructural analysis
method may be used for a large number of structurally
well-diversified compounds (4). Statistical analysis may then be
applied to the type and frequency of substructural fragments to
provide a quantitative value (i.e., coefficient value) for specific
fragments that represents the amount of influence that each fragment
exerts in the overall statistical variation of a group of compounds.
II. Development of Algorithms
A. General Background
Prior to 1975, the concept of SARs was generally applied to groups
of structurally similar compounds, usually for the purpose of
evaluating potential therapeutic effects in new drug research.
Beginning about 1975, SAR concepts were applied to structurally
similar groups of compounds for evaluating toxicity (5-10). Papers
presented at the Symposium on Structural Correlates of
Carcinogenesis and Mutagenesis, held at the U. S. Naval Academy,
Annapolis, Maryland, 1977, reflect some of the areas of interest,
endeavor, and success in application of SAR concepts for the
evaluation of toxicity (11).
The application of quantitative structure-activity relationships
(QSARs) to structurally diverse compounds for the evaluation of
toxicity was first reported by Craig and Waite (12) and Enslein and
Craig (13). This project is an extension of this application of SAR
concepts and employs the substructural analysis approach described
by Enslein et al. (3).
B. Modeling the Algorithms
A number of molecular descriptors were considered for use in
modeling the algorithms (e.g., octanol-water partition coefficients
and molecular connectivity indices). In this project, regression
analysis was used to select those molecular descriptor parameters
most useful in modeling the algorithms. Ultimately, the occurrence
of substructural fragments (and, in the carcinogen and LD50
models, molecular weight) were selected and used as the chemical
descriptor variables in these algorithms.
All four algorithms were developed in a similar fashion. However,
there were some differences and these will be pointed out in the
presentation of the individual models. Basically, the procedure was
as shown in Figure 1. A data base was created for use in developing
each model. These data bases listed compounds selected on the basis
of evidence indicating their ability to induce or not induce the
effect of the selected toxicologic endpoint (e.g., carcinogen or
noncarcinogen). Once the modeling data base was established, the
resulting algorithm was designed and tested and then applied to the
compounds listed in the NOHS data base for which the required
information (molecular formula, molecular weight, and a Wiswesser
Line-Formula Notation) was available or could be generated.
Molecular structure plays a key role in all four algorithms in that
a multi-step process is used to translate molecular data from a
three-dimensional concept to a quantifiable value useful in
generating toxicity estimates. These steps are summarized in
Figure 2.
Wiswesser Line-Formula Notation (WLN) is used as the initial step in
this translation process. The use of WLN is summarized by Smith and
Baker as "...a precise and concise means of expressing the
structural formulas of chemical compounds. Its basic idea is to use
letter symbols to denote functional groups (chemical) and to use
numbers to express the lengths of alkyl chains and sizes of rings.
These symbols then are cited in connecting order from one end of the
molecule to the other" (14).
The symbols employed by the WLN are the ten numerals (0-9), the 26
capital letters, the four punctuation marks &, -, /, and *, and a
blank space (see Appendix A). According to Smith and Baker (14),
with these symbols and approximately "a dozen new chemical symbols
to supplement the old familiar ones, plus half a dozen operating
symbols and the fundamental rules for manipulating them", a chemist
should be able to write a WLN or read one as one would read a
conventional structural formula.
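
As a rough illustration of the symbol set just described, the following Python sketch checks whether a candidate notation uses only permitted characters. The function name is hypothetical, the numerals are assumed to be the ten digits 0-9, and the check says nothing about whether a notation obeys the WLN ordering rules.

```python
import string

# Symbol set as described in the text: numerals, the 26 capital letters,
# four punctuation marks, and the blank space. Treating the numerals as
# the digits 0-9 is an assumption of this sketch.
WLN_SYMBOLS = set(string.digits + string.ascii_uppercase + "&-/* ")


def uses_only_wln_symbols(notation):
    """Return True if the notation is non-empty and uses only permitted symbols."""
    return len(notation) > 0 and all(ch in WLN_SYMBOLS for ch in notation)


print(uses_only_wln_symbols("Q2"))  # hydroxyl (Q) on a two-carbon chain
print(uses_only_wln_symbols("q2"))  # lowercase letters are not WLN symbols
```

A character-level check of this kind can catch keying errors early, but it is no substitute for review by someone who knows the notation.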
FIGURE 1. PROCEDURES FOR DEVELOPING AN ALGORITHM

    Create Data Base for Modeling of Algorithm
        |
        v
    Generation of WLNs for Compounds in Data Base
        |
        v
    Generation of Chemical Descriptor Keys based on WLNs
        |
        v
    Analysis of Variance Applied to Keys; Retain Keys with a
    Value of F > 1.7
        |
        v
    Statistical Calculation of Coefficient Values for Keys
        |
        v
    Create Subset of Keys for each Algorithm

FIGURE 2. TRANSLATION PROCESS FOR OBTAINING A QUANTIFIABLE (NUMERICAL)
REPRESENTATION OF MOLECULAR STRUCTURE

    2-dimensional drawing of molecular structure
        |
        v
    Generate a WLN based on rules and guidelines prescribed
        |
        v
    Application of Gen Key #1 Computer Program to WLN
    for generation of relevant keys*
        |
        v
    Execute Gen Key #2 Computer Program to obtain estimation

* Note that keys listed as problem keys (Table 2) must be manually checked as
being relevant or not to the assigned WLN.

As might be expected, the accuracy and usefulness of a toxic
endpoint prediction, as estimated by these four algorithms, depends
largely on an accurate description of the molecular structure.
The accuracy of the generated WLN is subsequently expressed in the
generation of the relevant chemical descriptor keys which are
numerical representations of substructural molecular fragments. For
example, the -OH (hydroxyl) substructure is the letter Q in WLN and
Key 38 in chemical descriptor key terminology.
Assigning an accurate WLN to a compound requires a complete
knowledge of Wiswesser Line-Formula Notation in conjunction with a
considerable background in organic chemistry structure and
nomenclature. However, techniques have been developed for
generating WLNs by drawing the structures on an electronic graphics
pad linked to an appropriately programmed computer which then
generates the WLN (15).
A WLN cannot be accurately generated for certain compounds. Most
notable of these are polymers or compounds for which the molecular
structure may vary or is not known. It was also determined that
inorganic compounds do not perform well in any of the four models
(3). This is due largely to the inadequate WLN representation of
the relatively simple structures of inorganic compounds because too
few keys are generated. Conversely, the more complex the molecule,
the more involved the WLN, and inaccuracies or alternate
representations may occur, possibly resulting in the erroneous
generation of keys or failure to generate valid keys.
To demonstrate the use of WLN in molecular description, examples of
assignment of WLNs to compounds are presented in Appendix B for an
acyclic compound and in Appendix C for a cyclic compound. The need
for accuracy in the generation of the WLN warrants emphasis, since
WLN notation is the major factor in the equations for all four
algorithms.
The next step in the translation procedure is to generate chemical
descriptor keys for a compound based on the assigned WLN. This is
accomplished by submitting the WLN to a computer program called
Genkey 1/Genkey 2, developed by Enslein et al. as a part of this
project.
Chemical descriptor keys provide an expression of molecular
structure in terms of substructural fragments and lead to the
development of quantifiable (key coefficient) values for use in the
algorithms. Obtained from several sources, a total of 309
descriptor keys (with an additional 50 keys assigned based on the
presence of certain combinations of keys 1-309) were used in the
development of the four algorithms. None of the models employ all
359 keys in describing molecular structure. A subset of keys is
generated by using statistical procedures that are described later
in this report. Essentially, keys are selected by determining their
contribution to the toxicity endpoint in question. This is
determined by the frequency of the occurrence of a key (representing
a specific molecular substructure) in the compounds listed in the
modeling data base. In effect, the greater the frequency of
occurrence, the greater the probability that the key contributes
significantly to the toxicity of that endpoint. Statistical methods
are then used to calculate a coefficient value for each key in a
selected subset of the 359 possible keys. It is this quantifiable
(numerical) representation of a molecular substructure that is used
in the modeling equations to generate estimates of toxicity.
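
The frequency screen described above can be sketched in Python as follows. The compound labels and key lists are invented for illustration, counting a key once per compound is an assumption, and the sketch stops at counting: the actual selection in this project relied on the statistical procedures described later in the report.

```python
from collections import Counter


def key_frequencies(compound_keys):
    """Count in how many compounds of a modeling data base each key occurs."""
    counts = Counter()
    for keys in compound_keys.values():
        counts.update(set(keys))  # assumption: a key counts once per compound
    return counts


# Hypothetical modeling data base: compound label -> generated key numbers.
modeling_data_base = {
    "compound_1": [38, 150],
    "compound_2": [38],
    "compound_3": [38, 167, 167],
}
frequencies = key_frequencies(modeling_data_base)
print(frequencies.most_common())
```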
The number of keys selected from the 359 possible keys for use in each
model is as follows:

    Endpoint            Number of Keys Selected
    LD50                82
    Carcinogenicity     18
    Mutagenicity        57
    Teratogenicity      61

A list of all 359 keys and a description of the structure each
represents is provided in Appendix D. A list of the keys in each
model, their descriptions, and their coefficient values are provided
as that model is described in this report.
Unfortunately, the key generation programs are not error free. The
contractor was unable to "de-bug" these computer programs within the
time and funds allocated for this project. Three types of potential
key generation problems are known to occur:
1. Keys not generated when they should be.
2. Keys generated when they should not be.
3. Keys erroneously generated (Keys 310-350 represent certain
combinations of keys 1-309, as defined in the description of each
key presented in Appendix D). This is a particular problem with
keys 311, 337, 342, and 349.
As a consequence, key files must be manually reviewed and compared
against the WLN files for specific compounds to ensure that all of
the keys generated are correct on the basis of the assigned WLN.
Corrections are made if necessary, and the data is resubmitted to
the estimating program. Potential problem keys are listed in
Table 2.
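
The manual review step lends itself to a simple pre-screen. The Python sketch below flags generated keys that appear on a list of potential problem keys so that a reviewer can compare them against the WLN. The set shown is deliberately partial (a handful of key numbers from Table 2 plus the combination keys 310-350), and the function name is hypothetical. A screen of this kind cannot detect keys that should have been generated but were not.

```python
# Partial list of potential problem keys, for illustration only; a full
# review would include every key number listed in Table 2.
PROBLEM_KEYS = {150, 151, 152, 181, 182, 183, 269} | set(range(310, 351))


def keys_needing_review(generated_keys):
    """Return the generated keys that should be checked by hand against the WLN."""
    return sorted(set(generated_keys) & PROBLEM_KEYS)


print(keys_needing_review([38, 150, 311, 200]))  # [150, 311]
```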
TABLE 2. POTENTIAL PROBLEM KEYS

Problem  Key
Type     No.           Description                    WLN Symbol(s)
1        2             Positive charge
         150           Chain primary amide            ZV or VZ
         151           Chain secondary amide          VM or MV
         152           Chain tertiary amide           NV or VN
         181           Substituent primary amide      ZV or VZ
         182           Substituent secondary amide    VM or MV
         183           Substituent tertiary amide     NV or VN
         162/193       Sulfonamide                    (N)-SW or SW(N)
         163           Chain guanidine                (N)-Y-U(N) or
                                                      (N)-Y-U(N)-(N) or
                                                      (N)UY-(N)-(N)
         165/196       Thioamide                      SUYZ or YZUS
         166/197/304   Dialkylamino                   (N)
         167/198       Methoxy                        O1 or 1O
         [illegible]   Chain phenylethyl              2R or R-(*)2
         172/203       Phenoxy                        OR or R-(*)O
         178/209       Urea                           (N)-V(N)  Note: (N)
                                                      can be in ring
         180           Biphenyl                       R-(*)R
         189           Lactam                         (N)V or V(N) within
                                                      ring
         269           Potassium                      -KA-
         306/309       Carbamate                      OV(N) or (N)-VO
2        158           Chain N-substituted            Does not apply
                       acylhydrazide
         162           Chain sulfonamide              Does not apply
         163           Chain guanidine                Does not apply
         166           Chain dialkylamine             Does not apply
                       (bonded to carbon)
3        310-350       Refer to Appendix D            Does not apply
                       for description

Note: (*) represents any locant; (N) represents any nitrogen.
From Enslein et al. (3).

C. Statistical Methodologies
1. Selection of Variables.
For each modeling data base, variables to be included in
defining the algorithm were determined using regression
techniques (16). Stepwise regression or stepwise discriminant
analysis was used, depending on whether the endpoint of the
algorithm was considered continuous or discriminant (3). If the
endpoint was continuous, stepwise regression analysis was used.
If the endpoint was discriminant (teratogen algorithm),
discriminant analysis was used. As discussed later, it was
necessary to use discriminant instead of regression analysis in
developing the teratogen model because of the scoring process
used in determining teratogenicity of compounds in the modeling
data base.
Stepwise regression procedures used to select variables may not
always produce the best set of variables. The variables
selected may be correlated and, as a result, produce a biased
model (3, 17). To avoid such bias, candidate variables were
selected from a larger set of variables, similar to those listed
in Table 1, all of which were thought to make a possible
contribution to the explanation of the statistical variance of
the modeling data base (3). Ridge regression and a second
stepwise regression were done using the candidate variables
following the preliminary regression analysis.
The initial regression used a backward elimination procedure.
All variables were included in the model and were removed if
their F-values were not significant at P = 0.05 (3, 17). In
effect, the candidate variables whose F-values contributed least
to the analysis of variance were removed from the putative
equation until the F-value reached 1.7 (18). Ridge regression
was then performed on the remaining variables, and the ridge
trace for each variable was examined to see whether any
singularities existed which might suggest that the variable be
omitted from the algorithm (3, 19). This check was needed
because the least-squares estimates used in the backward
elimination procedure may give results far removed from the true
variable values if the variables are correlated (17). Finally,
stepdown regression was repeated using only those variables
retained following the ridge regression analysis.
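
The backward elimination step can be illustrated with a small Python sketch. It is not the contractor's program: the data are invented, the variable names are hypothetical, the partial F statistic is computed in the textbook way from residual sums of squares, and retaining variables whose partial F is at least 1.7 is an assumed reading of the stopping rule. The ridge regression check is not shown.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit that includes an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def backward_eliminate(variables, y, f_min=1.7):
    """Drop the weakest variable until every partial F is at least f_min."""
    kept = dict(variables)
    while kept:
        names = list(kept)
        full = rss([kept[n] for n in names], y)
        dof = len(y) - (len(names) + 1)  # observations minus fitted parameters
        partial_f = {}
        for name in names:
            reduced = rss([kept[n] for n in names if n != name], y)
            partial_f[name] = (reduced - full) / (full / dof)
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_min:
            break
        del kept[weakest]
    return list(kept)


# Invented data: key_a drives the response, key_b is unrelated to it.
key_a = [0, 0, 0, 0, 1, 1, 1, 1]
key_b = [0, 1, 0, 1, 0, 1, 0, 1]
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
y = [2.0 * a + e for a, e in zip(key_a, noise)]
retained = backward_eliminate({"key_a": key_a, "key_b": key_b}, y)
print(retained)
```

The sketch assumes the full model does not fit the data exactly; a zero residual would make the partial F undefined.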
In performing the regressions, outlier compounds (i.e.,
compounds that are not statistically characteristic of the main
group of compounds) were identified and removed. The effect of
removing a few outlier compounds from a large data set of
several hundred compounds was felt to be minimal (3).
2. Statistical Evaluations of the Algorithms
Several statistical tests were used to evaluate the accuracy of
classification by the algorithms. The principal one was the
subset verification test, which probably provides the only
practical evaluation of performance currently available (3).
Using this test, a randomly selected subset of compounds is
withheld from the data base that was created for the purpose of
modeling the algorithm. The algorithm is then designed on the
remaining compounds in the data base and tested with the subset
of compounds set aside for that purpose. Residual plots,
misclassification rates, and the Kolmogorov-Smirnov two-sample
test (18) were also used to test the model by comparing
estimated values for endpoints with the values assigned on the
basis of reported bioassay testing for the compounds in the
verification test subset.
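
The subset verification test amounts to a hold-out split followed by a comparison of estimated and assigned values. The Python sketch below shows the idea under stated assumptions: the function names, the 20% test fraction, the random seed, and the 0.5 classification cutoff are all illustrative and do not come from the report.

```python
import random


def split_design_and_test(compounds, test_fraction=0.2, seed=0):
    """Randomly withhold a test subset; return (design set, test set)."""
    shuffled = list(compounds)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def misclassification_rate(estimates, observed, cutoff=0.5):
    """Fraction of test compounds whose estimate falls on the wrong side of the cutoff."""
    wrong = sum((est >= cutoff) != bool(obs)
                for est, obs in zip(estimates, observed))
    return wrong / len(observed)


design, test = split_design_and_test(range(10), test_fraction=0.2, seed=1)
print(len(design), len(test))

# Hypothetical estimates for four withheld compounds versus assigned classes.
rate = misclassification_rate([0.9, 0.2, 0.6, 0.1], [1, 0, 0, 0])
print(rate)  # 0.25
```

Because the withheld compounds play no part in fitting, the misclassification rate measured on them is a fairer indication of performance on untested compounds than the rate on the design set.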
The results of the various statistical evaluation methods are
presented following the description of the respective models.
Statistical references cited should be consulted for a more
detailed description of the statistical methods mentioned in
this report.
D. Development of Individual Estimation Algorithms
1. General Modeling Considerations.
In developing all four algorithms, calculated data were easier to
use if converted to equivalent logarithm values. Such
conversion produces a normally distributed data base (i.e.,
log-linear) and also eliminates the problem of dealing with a
wide range of values, such as the 1:1000 ratio that might occur
in the dose range of 1 milligram to 1 gram seen in the LD50
algorithm. In the LD50 algorithm, the use of the reciprocal
(1/C) of the reported or estimated LD50 concentration value
creates a normal distribution of the data to facilitate the use
of the logarithms. Consequently, to obtain the final estimated
LD50 values in mg/kg or probability values between 0 and 1 for
the other endpoints, it is necessary to take reciprocal values
and convert back from logarithmic to actual values.
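
The effect of the logarithmic transformation can be seen in a short Python sketch. It assumes base-10 logarithms and takes C to be the LD50 expressed in moles per kilogram, consistent with the conversion given in Figure 3; the function names and the molecular weight of 100 are hypothetical.

```python
import math


def ld50_to_log_inv_c(ld50_mg_per_kg, mol_wt):
    """Convert an LD50 in mg/kg to log10(1/C), with C in moles per kg."""
    moles_per_kg = ld50_mg_per_kg / (1000.0 * mol_wt)
    return math.log10(1.0 / moles_per_kg)


def log_inv_c_to_ld50(log_inv_c, mol_wt):
    """Convert log10(1/C) back to an LD50 in mg/kg."""
    return 1000.0 * mol_wt / (10.0 ** log_inv_c)


# A 1:1000 dose range (1 mg/kg to 1 g/kg) spans only 3 log units.
low_dose = ld50_to_log_inv_c(1.0, 100.0)
high_dose = ld50_to_log_inv_c(1000.0, 100.0)
print(low_dose - high_dose)
print(log_inv_c_to_ld50(high_dose, 100.0))
```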
There are several steps (equations) necessary to obtain an
LD50 estimate or probability value. Each algorithm, shown in
its respective table, lists all of the descriptor keys (and, in
the case of the LD50 and carcinogen algorithms, molecular
weight) that have been found to be statistically significant to
their toxic endpoints. The compound for which predictive
toxicity estimates are to be generated is translated into the
equivalent WLN. From the WLN, all keys that are represented in
the WLN are selected from the total set of 359 keys. However,
only those keys that also appear in the model subset of keys are
used in calculating the positive and negative scores (e.g.,
carcinogen and noncarcinogen scores) for the carcinogen,
mutagen, and teratogen algorithms or to calculate the estimated
log (1/C) value in the LD50 algorithm (see Figure 3). These
values are then used in the final estimating equation for each
endpoint.
These equations are presented in a step-wise manner for each
algorithm as it is discussed. An example use of each algorithm
is presented in the appendices as indicated in the discussion of
the models.
FIGURE 3. LD50 ESTIMATING EQUATION
The pertinent coefficient values (c) for each of the keys are summed
($\sum c$) and added to the regression constant (0.552) and to the
product of 0.681 x log10 (Mol. Wt.). The resulting value is the estimated
log 1/C, where C is the number of moles of the compound which represents
the LD50.
$$\log(1/C) = 0.552 + 0.681\,\log_{10}(\mathrm{M.Wt.}) + \sum c$$
To convert log 1/C to the estimated LD50, expressed as mg/kg, use the
following equation:
$$\mathrm{LD}_{50}\ (\mathrm{mg/kg}) = \frac{1000 \times \mathrm{M.Wt.}}{\operatorname{antilog}\,[\log(1/C)]}$$

The LD50 algorithm expresses the estimated endpoint value as
the dose of a compound, in mg/kg, necessary to kill one-half of
the test animal population (i.e., lethal dose for 50% mortality,
hence LD50). The other three models express a predicted
endpoint value within a range of 0.000 to 1.000, with 1.000
being the highest probability of the toxicologic endpoint
occurring as a result of exposure to that compound (e.g., 0.989
probability of the compound being carcinogenic). For the
purpose of this report, the terms probability and potential are
considered interchangeable. The final equations of the
algorithms developed are unusual in that they are expressed in
tabular form because they are quite long and are not in the
usually perceived algebraic form.
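
The two steps of the LD50 estimating equation in Figure 3 can be sketched in Python as follows. The regression constant (0.552) and the molecular weight coefficient (0.681) are taken from the figure; the key coefficient values, the key list, and the molecular weight are hypothetical, the logarithm is assumed to be base 10, and counting each key once regardless of how often it occurs in the molecule is also an assumption.

```python
import math


def estimate_ld50(mol_wt, compound_keys, model_coefficients):
    """Return (estimated log 1/C, estimated LD50 in mg/kg) for one compound."""
    # Only keys that also appear in the model subset contribute.
    used = [k for k in set(compound_keys) if k in model_coefficients]
    log_inv_c = (0.552
                 + 0.681 * math.log10(mol_wt)
                 + sum(model_coefficients[k] for k in used))
    ld50_mg_per_kg = 1000.0 * mol_wt / (10.0 ** log_inv_c)
    return log_inv_c, ld50_mg_per_kg


# Hypothetical model subset: key number -> coefficient value.
model = {38: 0.250, 150: -0.164}
log_inv_c, ld50 = estimate_ld50(100.0, [38, 150, 200], model)
print(round(log_inv_c, 3), round(ld50, 1))
```

Key 200 is generated for the compound but is not in the model subset, so it is ignored. The actual coefficient values are those tabulated with the LD50 algorithm equation, and Appendix E works through a real compound.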
2. LD50 (oral, rat) Algorithm.
Data used in the LD50 algorithm originated from The Toxic
Substances List (20), which is now called the NIOSH Registry of
Toxic Effects of Chemical Substances (RTECS). The results of
the LD50 algorithm are derived from a continuous (as opposed
to a discriminant) endpoint, and the procedures for generating
an estimated LD50 value are different from those of the other
three models. These procedures are illustrated in Appendix E
using an example compound.
There were two LOsq models developed in this project. An
earlier model was based on 475 compounds selected from the
letters A through M of the 1974 Toxic Substances List and 148
molecular substructure keys then available from the CROSSBOW
program (3). The statistics for the equation of this algorithm
are as follows:
Multiple correlation coefficient, R2                .457
Standard error of estimated log (log 1/C + 1)       .089
Mean log 1/C                                        2.35
Standard deviation of log (log 1/C + 1)             0.68
With this equation, it was possible to predict the LD50 (oral,
rat) of an untested compound so that approximately 63% of the
compounds could be estimated within a factor of approximately
2.5, and virtually all compounds within a factor of 10 (in mg/kg
units) (3). This, for example, means that an estimated oral rat
LD50 dose of 1 mg/kg (with a factor of 2.5), when checked
against actual reported data, will correspond to a dose in the
0.4 mg/kg to 2.5 mg/kg range approximately 63% of the time.
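The "within a factor of" statement can be made concrete with a short sketch. It assumes that "within a factor of f" means the interval from estimate / f to estimate x f; the function names are hypothetical.

```python
def factor_bounds(estimate_mg_per_kg, factor):
    """Return the (low, high) dose range implied by 'within a factor of `factor`'."""
    return estimate_mg_per_kg / factor, estimate_mg_per_kg * factor


def within_factor(estimate_mg_per_kg, reported_mg_per_kg, factor):
    """True if the reported LD50 lies inside the factor range around the estimate."""
    low, high = factor_bounds(estimate_mg_per_kg, factor)
    return low <= reported_mg_per_kg <= high


# Range around an estimated LD50 of 1 mg/kg for the two factors quoted in the text.
print(factor_bounds(1.0, 2.5))
print(factor_bounds(1.0, 10.0))
```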
In the second algorithm, 3,600 compounds were collected from the
RTECS. This was essentially the entire population of compounds
with oral rat LD50 data. This second algorithm was used to
determine how many compounds would be needed in order to achieve
stability of the structure-activity equation. Separate
regression models were developed for three subsets of compounds
of 1,000, 1,500, and 2,000 compounds as shown in Table 3. It
was determined that there was very little change in the
statistics associated with the model subsets of 1,500 and 2,000
compounds (3). Enslein et al. assumed that the major difference
between these two models is due to the difference in the number
of variables considered in these two models (77 for the earlier
model and 103 for the later model) (3).
These results suggest "that at least for the available data,
2,000 compounds result in an essentially asymptotic equation"
(3), (i.e., adding more compounds to the data set would not
increase the strength of the equation).
TABLE 3. LD50 ALGORITHM: REGRESSION STATISTICS FOR SUBSET MODELS
OF 1,000, 1,500, AND 2,000 COMPOUNDS

                          log 1/C                            Residual                    SE of
N       Mean    S.D.   S.E.    Skew   Kurtosis      Range      Mean Square   D.F.    R2    Estimate
1,000   2.540   .860   .0272   .72    [illegible]   .45-5.95   .36           892     .56   .60
1,500   2.540   .875   .0226   .71    .51           .45-5.90   .38           1,396   .52   .62
2,000   2.530   .880   .0197   .69    .52           .34-5.99   .39           1,864   .52   .62

From Enslein et al. (3)
The 2,000 compound model was therefore used as the basis for
refining statistical procedures in the oral rat LD50 model
(3). A complete list of these compounds is provided in
Appendix F.
As a number of variables were removed from the equation, ridge
regression analysis was performed. As shown in Table 4, the
residuals (log 1/C actual - log 1/C predicted) produced
from this regression analysis are poorly fitted at both the top
and bottom ends of the range (3). Note that the number of compounds
dropped to 1,968 as a result of removing those found to be duplicates.
Because the residual values were poorly fitted at the extremes of
their distribution, it was necessary to compromise between range
and fit in establishing a range of values with which to work
(3). The range of values was limited to encompass log 1/C
values between 1.25 and 4.75 in the final LD50 algorithm.
This is considerably narrower than that in the first algorithm,
which encompassed log 1/C values of approximately 1.0 to 6.2.
The LD50 algorithm presented in Table 5 includes all of the
variables and their respective coefficient values as calculated
by the statistical procedures described previously. The
resulting equation for generating LD50 values based on this
model is as shown in Figure 3.
A subset of 600 compounds was withheld for performance testing
of the algorithm. Of these 600 compounds, 8 could not be
properly processed by the WLN key generation program and 24
compounds were assigned none of the 82 keys present in the
LD50 algorithm, leaving a test subset of 568 compounds. Log
1/C data for these 568 compounds were evaluated based on the
equation presented in Table 5. Using a plot of the residual
values (log 1/C actual - log 1/C predicted) as a function of the
predicted values, it was found that the prediction inaccuracies
were greatest at the extremes of the range. This was not an
unexpected finding, and because of this the predicted values
were tabulated into ranges and statistics calculated for the
compounds within each range. The results, presented in Table 6,
show that there are no meaningful statistics available below log
1/C of 1.5 or above 4.0 (perhaps 3.5) (3). The standard
deviation of the residuals for predicted log 1/C values from
1.5 to 3.5 varies between .58 and .81.
In examining the quantiles shown in Table 6, it is found that
below mid-range there is a larger residual error for low values
and above mid-range for the higher values. As an example of the
accuracy of the resulting estimates, in the range of log 1/C of 2
to 2.5 the middle 50% of the values (between the 25th and 75th
quantiles) have an error between -.45 and +.34. As these are log
values, they translate into actual LD50 values (i.e., mg/kg)
lying between .355 and 2.19 times the estimated value.
Similarly, 90% of the values are found between the 5th and 95th
quantiles with an error range of -.87 to +1.03, which translates
to the equivalent of .135 and 10.72 times the estimated values.
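The multipliers quoted in this paragraph are the antilogs (base 10) of the residual bounds read from Table 6. A minimal sketch of that conversion follows; the function name is hypothetical and the four residual bounds are the ones given in the text.

```python
def residual_to_multiplier(residual_log_units):
    """Convert a residual expressed in log10 units to a multiplicative factor."""
    return 10.0 ** residual_log_units


# Residual bounds quoted in the text for predicted log 1/C between 2.0 and 2.5.
middle_50_percent = (residual_to_multiplier(-0.45), residual_to_multiplier(0.34))
middle_90_percent = (residual_to_multiplier(-0.87), residual_to_multiplier(1.03))

print(middle_50_percent)
print(middle_90_percent)
```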
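TABLE 4. DISTRIBUTION OF LOG 1/C FOR 1,968 COMPOUND ALGORITHM

log 1/C range
0.25 - 0.50
0.75 - 1.25
1.25 - 1.75
1.75 - 2.25
2.25 - 2.75
2.75 - 3.25
3.25 - 3.75
3.75 - 4.25
4.25 - 4.75
4.75 - 5.25
5.25 - 5.75
5.75 - 6.25

Number of compounds (legible values only, in source order; assignment to
ranges is not recoverable): 67, 295, 448, 462, 307, 200, [illegible], 30

From Enslein et al. (3)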
TABLE 5. LD50 ALGORITHM EQUATION

KEY    FREQUENCY     DESCRIPTION                                COEFFICIENT

NON-CYCLIC PARTS OF MOLECULE
K5     158           Terminal oxygen (not carbonyl)               .458
6      386           One 3-branch carbon atom                    -.096 ?
8      [illegible]   Greater than 3-branch nitrogen atom          .196 ?
10     185           1 sulphur atom                               .362 ?
11 ?   14            More than 1 sulphur atom                     .821 ?
14     168           1 double bond, excluding -C=S or -C=O        .141
16     24            Triple bond                                  .189

CHAIN FRAGMENTS
K17    338           1 methyl/methylene group                     .089
20     223           Alkyl chain (CH2)n or CH3(CH2)n-1
                     where n=3-9                                 -.250
25     19            Bromine                                      .256 ?
26     30            Fluorine                                     .435 ?
28     153           One -NH- group                               .334 ?
30     102           One -NH2 group                               .236 ?
31     20            More than one -NH2 group                     .258 ?
34     54            Unusual carbon atom                         -.278 ?
36     205           More than one -O- group                      .211 ?
37     195           One -OH group                               -.163
43     123           One -C(=O)-O- (ester) group                 -.156
44     54            More than one -C(=O)-O- (ester) group       -.205

? = the sign or a character of this value is illegible in the source copy.
TABLE 5. LD50 ALGORITHM EQUATION (Cont.)

KEY    FREQUENCY     DESCRIPTION                                COEFFICIENT

SUBSTITUENT FRAGMENTS (coefficients illegible in the source copy)
K47    90            Ethyl/ethylene group
50     280           Generic halogen
51     145           One chlorine
54     13            Fluorine
58     1             One -NH2 group
59     7             More than one -NH2 group
60     25            One -N= or HN= group
66     24            More than one -CH group

RING HETEROATOMS
K75    20            Single occurrence of oxygen in
                     more than one ring                          -.461 ?
78     150           Multiple occurrence of nitrogen              .086
81     74            Single occurrence of sulphur                 .140 ?
82     7             Multiple occurrence of sulphur               .525 ?
85     82            Single occurrence of carbonyl               -.122
90 ?   4             Multiple occurrence of exocyclic
                     double bond                                  .479

RING TYPES
K99    100           Carbocyclic 6-membered ring                 -.257
100    27            Carbocyclic ring other than 5
                     and 6-membered                              -.198
104    233           1 heteroatom in one ring                     .202
107    56            1 heteroatom in more than one ring           .477

? = the sign or a character of this value is illegible in the source copy.
TABLE 5. LD50 ALGORITHM EQUATION (Cont.)

KEY     FREQUENCY     DESCRIPTION

RING FUSIONS (frequencies illegible in the source copy)
K111 ?                More than 1 single heterocyclic ring
112 ?                 1 single carbocyclic ring
113 ?                 More than 1 single carbocyclic ring
114 ?                 1 carbo/carbo fusion
115 ?                 More than 1 carbo/carbo fusion
120                   1 carbo/hetero fusion in more than 1 ring system
123                   More than 1 hetero/hetero fusion

RING LINKAGE
K130    99 ?          Bilinkage

EXTENSIONS
K149    294 ?         Presence of suffix

ADDITIONAL CHAIN FRAGMENTS
K151    7             Chain secondary amide
154     4             Chain N-substituted acylhydrazides
156     4             Chain amidine
161     34            Chain N-nitroso
162     3             Chain sulfonamide
165     35            Chain thioamide
166     106           Chain dialkylamino
167     35            Chain methoxy
171 ?   [illegible]   Chain phenethyl
174     1             Chain phenylureido

COEFFICIENT column (18 of 19 values legible, listed in source order;
assignment to rows is not recoverable):
-.574, .321 ?, -.409 ?, .143 ?, .373 ?, -.935, .743 ?, .271, .098 ?,
-.380, -.649, .637, .194 ?, .304 ?, .255, -.272, .368 ?, -.980 ?

? = the sign or a character of this value is illegible in the source copy.
TABLE 5. LD50 ALGORITHM EQUATION (Cont.)

KEY     FREQUENCY     DESCRIPTION

ADDITIONAL SUBSTITUENT FRAGMENTS
K180    13            Biphenyl
182     37            Substituent secondary amide
188     12            Barbiturate
193     14            Substituent sulfonamide
194     3             Substituent guanidine
196     18            Substituent thioamide
197     22            Substituent dialkylamino
201     4             Substituent N-nitro
203     10            Substituent phenoxy
209 ?   6             Substituent carbamate

ADDITIONAL METAL FRAGMENTS (frequencies and descriptions illegible in the
source copy)
K246, 250, 256, 269, 282, 284, 293

COEFFICIENT column (legible values only, in source order; assignment to
rows is not recoverable): 1.435, 1.635, 1.114, 1.175

? = a character of this value is illegible in the source copy.
TABLE 5. LD50 ALGORITHM EQUATION (Cont.)

KEY     FREQUENCY     DESCRIPTION

CARCINOGENESIS KEYS (coefficients illegible in the source copy)
K312    4             Organohalogen mustards
315     161           Haloalkane
322     4             Aziridine
327     5             5-membered ring anhydrides
330     16            Fused aromatic - unsaturated lactone
341     79            Aromatic nitro
343     24            alpha,beta-dihaloalkane
344     83            Geminal-dihaloalkane
348     3             Fused polychlorinated alicyclic
350     18            Hydrazo/hydrazine

LOG MOLECULAR WEIGHT     .681 (value as given in Figure 3)
CONSTANT                 .552 (value as given in Figure 3)

From Enslein et al. (3)
TABLE 6. TEST COMPOUNDS - CHARACTERISTICS OF RESIDUALS*

Predicted                               Quantiles
log 1/C    N      1      5      10     25      75     90     95     99     Mean    Median  S.D.    Min.    Max.
1.0-1.5    3     -.12   -.12   -.12   [ill.]  1.74   1.74   1.74   1.74   [ill.]  [ill.]  [ill.]  [ill.]  1.74
1.5-2.0    78    -1.37  -1.05  -.71   -.42    .26    .54    .79    1.00   -.059   -.12    .66     [ill.]  1.00
2.0-2.5    224   -1.34  -.87   -.67   -.45    .34    .76    1.03   1.74   -.015   -.059   .58     [ill.]  1.94
2.5-3.0    143   -1.99  -1.25  -.78   -.45    .36    .86    1.45   2.71    .017   -.007   [ill.]  [ill.]  2.76
3.0-3.5    97    -1.80  -1.20  -.81   -.46    .52    1.17   1.56   2.51    .081    .067   .81     -1.80   2.81
3.5-4.0    18    -1.63  -1.63  -1.60  -1.07   .50    1.48   1.64   1.64   -.21    -.35    .98     -1.63   1.64
4.0-4.5    4     -1.02  -1.02  -1.02  -1.02   .85    .85    .85    .85    -.082   -.077   1.07    -1.02   .85
Total      567

* Residual values = log 1/C actual - log 1/C predicted
[ill.] = illegible in the source copy
From Enslein et al. (3)

It is difficult to know which fraction of the residuals reflects
inadequacy of the model itself and which results from other
factors, such as inadequately measured LD50 values or
discrepancies between data from replicate studies in
different laboratories (3). Enslein et al. found that
one of the compounds in RTECS was incorrectly reported with an LD50
value of 70 ug/kg instead of 70 mg/kg. This compound was
dropped from the data shown in Table 6. Despite such
limitations, it would seem that this model can generate LD50
estimates at least as well as those reported thus far in
the literature. However, too few compounds have been studied
with extensive replication to make such a statement with a
great deal of confidence (3).
Mutagenicity Algorithm
Compounds incorporated in the data base for developing this
model were obtained by screening the files of the Environmental
Mutagen Information Center (EMIC), Oak Ridge National
Laboratory, Oak Ridge, Tennessee, and reports from the National
Toxicology Program (NTP) for all compounds for which Ames test
mutagenicity data had been reported. Essentially all
publications from the EMIC files relating to the Ames test for
mutagenicity (encompassing over 1,200 compounds) were reviewed
and the test results recorded. Judgments as to the quality of
the reported data, e.g., in terms of dose response reported,
were made by contractor chemists and toxicologists. In general,
the Gene-Tox Criteria (21) were applied in this subjective
evaluation of the data. Using these criteria, a compound was
classified as a nonmutagen if it had been tested with negative
results in at least three of the five strains of Salmonella
typhimurium (TA98, TA100, TA1535, TA1537, and TA1538) used in
the Ames test. For a compound to be classified as mutagenic, it had
to be tested with positive results in at least two of the five
strains. It should be noted that two of the five strains (TA98
and TA100) are considered less sensitive than the other three in
assessing mutagenicity and, therefore, carried less weight in the
decision to classify a compound as mutagenic (22).
The chemical selection committee of the National Toxicology
Program (NTP) uses the Gene-Tox criteria but requires at least
four strains instead of three to be tested and a negative result
in all four strains for a compound to be classified as a
nonmutagen. Additionally, NTP also requires that each of the
tests be repeated in at least one other laboratory. In the case
of compounds with conflicting data, decisions regarding positive
or negative mutagenic classification were made only if test
results among at least two different laboratories were mutually
reinforcing. When conflicting results could not be so resolved,
the compound was discarded from the data base and subsequent
modeling and testing procedures. Because of the more stringent
requirements, an NTP judgment was held to supersede those
obtained from EMIC.
After applying these criteria to over a thousand compounds, a
total of 301 were judged to be positive mutagens and 23[digit illegible] to be
nonmutagens. From these two groups a subset of 37 positive and
23 negative compounds was randomly selected and set aside for
subset verification testing. A list of all the compounds used
in the mutagen modeling data base is presented in Appendix G.
The equations for the mutagen algorithm were derived by
discriminant analysis and ridge regression procedures. Based on
the mutagenicity model equation presented in Table 7, a mutagen
and nonmutagen score is calculated using the equations shown in
Figure 4. The regression constant values (of -5.078 and -3.183)
result from the regression analysis applied to the mutagen and
nonmutagen groups of compounds in calculating the coefficient
values for the keys selected for the mutagen algorithm. The
equation for generating an estimate of mutagenicity incorporates
the resulting exponential values of these equations. Natural
log values are determined for both score values and then used in
the probability equation shown in Figure 5. A step-by-step
illustration of this procedure is presented in Appendix H using
an example compound.
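The general shape of this calculation can be sketched as follows. The sketch rests on several assumptions that should be checked against Figures 4 and 5 and Appendix H: that each score is its regression constant plus the sum of the coefficients of the keys present, that -5.078 belongs to the mutagen score and -3.183 to the nonmutagen score (the order in which the text lists them), and that the probability is the usual two-group discriminant form, the exponential of the mutagen score divided by the sum of both exponentials. All function names and the example coefficients are hypothetical.

```python
import math

# Regression constants quoted in the text; their assignment to the two
# groups is an assumption based on the order in which they are listed.
MUTAGEN_CONSTANT = -5.078
NONMUTAGEN_CONSTANT = -3.183


def discriminant_score(constant, key_coefficients):
    """Assumed score form: regression constant plus the coefficients of the keys present."""
    return constant + sum(key_coefficients)


def mutagen_probability(mutagen_score, nonmutagen_score):
    """Assumed probability form: exp(S_m) / (exp(S_m) + exp(S_n))."""
    # Subtract the larger score before exponentiating to avoid overflow.
    shift = max(mutagen_score, nonmutagen_score)
    e_mut = math.exp(mutagen_score - shift)
    e_non = math.exp(nonmutagen_score - shift)
    return e_mut / (e_mut + e_non)


# Hypothetical compound with two keys present (illustrative coefficients only).
s_mut = discriminant_score(MUTAGEN_CONSTANT, [3.1, 1.4])
s_non = discriminant_score(NONMUTAGEN_CONSTANT, [1.0, 0.6])
print(mutagen_probability(s_mut, s_non))
```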
Several methods exist to evaluate the accuracy of the
mutagenesis model. The simplest method is to indicate the
number of compounds correctly classified by the model. Table 8
shows the effect of varying the indeterminate zone (i.e., the
range in which the model cannot sufficiently discriminate between
mutagen and nonmutagen) between p = .4 to .599 and p = .3 to .699:
the percent of false positives and false negatives increases as
the indeterminate range decreases. The wider the indeterminate
range, the larger the number of compounds which cannot be
classified, and the somewhat smaller the number of misclassified
compounds.
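The trade-off between the two indeterminate zones can be illustrated with a small sketch. The function names and the example probabilities are hypothetical; only the two zone boundaries (.4 to .599 and .3 to .699) come from the text, and treating the boundaries as inclusive is an assumption.

```python
def classify(probability, zone_low, zone_high):
    """Label a compound from its estimated probability of mutagenicity."""
    if probability < zone_low:
        return "nonmutagen"
    if probability > zone_high:
        return "mutagen"
    return "indeterminate"  # the model cannot discriminate in this zone


def count_indeterminate(probabilities, zone_low, zone_high):
    """Number of compounds left unclassified for a given indeterminate zone."""
    return sum(1 for p in probabilities
               if classify(p, zone_low, zone_high) == "indeterminate")


# Hypothetical probabilities for six compounds.
example_probabilities = [0.05, 0.35, 0.45, 0.62, 0.68, 0.97]
print(count_indeterminate(example_probabilities, 0.4, 0.599))  # narrow zone
print(count_indeterminate(example_probabilities, 0.3, 0.699))  # wide zone
```

Widening the zone can only leave the same number of compounds or more unclassified, which is the cost paid for fewer misclassifications.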
A second method for estimating accuracy is to compare the actual
error rate in specified probability ranges to the expected error
rate (based on the binomial distribution). These data are shown
in Table 9. Using a two-sample Kolmogorov-Smirnov test (18), it
was found that there was no statistically distinguishable
difference between the actual and expected cumulative error
distributions (3). This would indicate that the probability
values derived from this model have a high degree of precision
and can be used with confidence for the ranking of compounds
(3). The results of this statistical test were not made
available to the author in the draft report provided by the
contractors, and thus are not presented here.
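Although the contractors' test results are not available, the quantity such a test examines can be sketched. The Kolmogorov-Smirnov statistic is the largest absolute difference between two cumulative distributions. The sketch below computes it from error counts per probability range; the function names and the counts are hypothetical, and no significance level is computed.

```python
def cumulative_fractions(counts):
    """Running cumulative share of the total for a list of per-range counts."""
    total = sum(counts)
    running = 0
    fractions = []
    for count in counts:
        running += count
        fractions.append(running / total)
    return fractions


def ks_statistic_from_bins(actual_counts, expected_counts):
    """Largest absolute gap between the two cumulative distributions."""
    actual = cumulative_fractions(actual_counts)
    expected = cumulative_fractions(expected_counts)
    return max(abs(a - e) for a, e in zip(actual, expected))


# Hypothetical error counts in five probability ranges (not data from Table 9).
actual_errors = [2, 5, 9, 6, 3]
expected_errors = [3, 6, 8, 5, 3]
print(ks_statistic_from_bins(actual_errors, expected_errors))
```

A small statistic means the two cumulative error distributions track each other closely, which is the condition the text describes as "not statistically distinguishable."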
The subset test provides a third way to assess the accuracy of
this model. As described previously for the LD50 model, a
number of compounds for which mutagenesis data had been obtained
were held back from the modeling set by a random selection
process. These compounds were then evaluated by means of the
discriminant equation of the model and the probability values of
mutagenicity were compared to the reported values. As seen in
Table 10, the results of the test parallel those results shown
in Table 8 with the exception that a larger number of compounds