
NIOSH-The Development and Application of Algorithms For Generating Estimates of Toxicity For The NOHS Data Base

NIOSH TECHNICAL REPORT

THE DEVELOPMENT AND APPLICATION OF ALGORITHMS FOR GENERATING ESTIMATES OF TOXICITY FOR THE NOHS DATA BASE

HERBERT L. VENABLE

U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES
Public Health Service
Centers for Disease Control
National Institute for Occupational Safety and Health
Division of Surveillance, Hazard Evaluations and Field Studies
Cincinnati, Ohio 45226

July 1986

DISCLAIMER

Mention of company names or products does not constitute endorsement by the National Institute for Occupational Safety and Health.

ACKNOWLEDGEMENT

The development and application of the algorithms presented in this technical report were accomplished under NIOSH contracts 210-78-0077, 210-78-0066, and 210-80-0044 (Genesee Computer Center, Inc., with Health Designs, Inc. and the Franklin Research Institute under subcontract).

I would like to thank the following people for their critical review of this document:

Ms. Alice Griefe, Occupational Toxicologist, Industrial Hygiene Section, Industrywide Studies Branch, DSHEFS, NIOSH
Dr. Harold Resnick, Science Advisor, Office of the Director, DRDS, NIOSH
Dr. Sanford Leffingwell, Chief, Research Analysis Section, Priorities Research Analysis Branch, DBBS, NIOSH
Dr. Curtis Travis, Office of Risk Analysis, Health and Safety Research Division, Oak Ridge National Laboratory
Dr. Robert W. Mason, Technical Advisor for Science, DBBS, NIOSH
Dr. Joseph Kelaghan, Epidemic Intelligence Service Officer, National Cancer Institute

I would also like to express my thanks to Dr. Wm. Karl Sieber, Jr. of NIOSH for his review of the statistical methodologies, Mr. David H. Pedersen of NIOSH for his suggestions and comments on organizing and writing this document, and to Ms. Kathy Mitchell for manuscript preparation.

DHHS (NIOSH) Publication No. 87-101

TABLE OF CONTENTS

I. Introduction
II. Development of Algorithms
   A. General Background
   B. Modeling the Algorithms
   C. Statistical Methodologies
   D. Development of Individual Estimation Algorithms
III. Estimation and Ranking of NOHS Compounds
IV. Discussion
V. References
VI. Appendices

TABLES

1. A Sampling of Molecular Descriptors Used in Structural Activity Relationship Studies
2. Potential Problem Keys
3. LD50 Algorithm: Regression Statistics for Subset Models of 1,000, 1,500, and 2,000 Compounds
4. Distribution of Log 1/C for 1,968 Compound Model
5. LD50 Algorithm Equation
6. Test Compounds - Characteristics of Residuals
7. Mutagen Algorithm Equation
8. Mutagen Algorithm - Design Compounds
9. Mutagen Algorithm - Misclassification in Ranges
10. Mutagen Algorithm - Test Compounds
11. Carcinogen Algorithm Equation
12. Carcinogen Algorithm - Classification by Discriminant Equation
13. Carcinogen Algorithm - Misclassification in Ranges
14. Criteria for Evaluation of Teratogenicity
15. Teratogen Algorithm Equation
16. Distribution of Teratogenicity Scores
17. Teratogen Algorithm - Discriminant Equation Evaluation
18. Teratogen Algorithm - Misclassification in Ranges
19. Number of Chemical Compounds by Selected Ranges (0.750 or Greater) of Estimated Toxicity Endpoint Values
20. Some Predictive Toxicology Oriented Models for the Correlation of Chemical Structure with a Biologic Endpoint

FIGURES

1. Procedures for Developing an Algorithm
2. Translation Process for Obtaining a Quantifiable (Numerical) Representation of Molecular Structure
3. LD50 Estimating Equation
4. Equations for Calculating Mutagen and Nonmutagen Scores
5. Mutagenicity Estimating Equation
6. Equations for Calculating Definite (Carcinogen) and Indefinite (Noncarcinogen) Scores
7. Carcinogenicity Estimating Equation
8. Equations for Calculating Teratogen and Nonteratogen Scores
9. Teratogenicity Estimating Equation

APPENDICES

A. Wiswesser Line-Formula Notation Symbols and Definitions
B. WLN Example for an Acyclic Compound
C. WLN Example for a Cyclic Compound
D. Molecular Substructure Keys and Their Definitions
E. Example of Generating an LD50 Estimate
F. List of Compounds Used in the LD50 Algorithm Modeling Data Base *
G. List of Compounds Used in the Mutagen Algorithm Modeling Data Base *
H. Example of Generating an Estimate of Mutagenicity
I. List of Compounds Used in the Carcinogen Algorithm Modeling Data Base *
J. Example of Generating an Estimate of Carcinogenicity
K. List of Compounds Used in the Teratogen Algorithm Modeling Data Base *
L. Example of Generating an Estimate of Teratogenicity
M. List of NOHS Compounds Receiving an LD50 Estimate *
N. List of NOHS Compounds Receiving an Estimate of Mutagenicity *
O. List of NOHS Compounds Receiving an Estimate of Carcinogenicity *
P. List of NOHS Compounds Receiving an Estimate of Teratogenicity *

* These appendices have been placed on microfiche and attached to the back cover of the printed report.

Special Note from the Author

The research and final products of the work presented in this report were accomplished under NIOSH contracts 210-78-0077, 210-78-0066, and 210-80-0044, with the author serving as the NIOSH Project Officer. However, time and funds allocated for this project expired before an approved final report was submitted by the contractor. Since the author believes that the products of this project have significant value in the field of occupational safety and health, results of the project are reported here despite the lack of a final report from the original contractor.
The author wishes to extend his appreciation to and acknowledge the following individuals for their contribution, through the draft report, in the compilation of this report:

Mr. Kurt Enslein, President, Health Designs, Inc., Rochester, New York
Dr. Paul Craig, National Library of Medicine, Bethesda, Maryland
Dr. John Strange, Franklin Research Institute, Philadelphia, Pennsylvania
Mr. Tom Lander, Health Designs, Inc., Rochester, New York
Mr. Michael Tomb, Health Designs, Inc., Rochester, New York

The text of this report is extracted largely from the contractor's incomplete draft report, which is cited extensively throughout this report, as are several publications by Enslein et al. that were written and published as the models were developed. These publications should be consulted in conjunction with this report to obtain a more comprehensive understanding of the project and its intent. Copies of the contractor's incomplete final report to NIOSH are available upon request from the author.

ABSTRACT

This project developed computer-based algorithms designed to provide estimates of toxicity for four toxicologic endpoints: LD50 (oral, rat), mutagenicity, carcinogenicity, and teratogenicity. These algorithms are the end result of a series of models tested against available toxicity data for each of the four toxic endpoints. The modeling data base for each endpoint contained a listing of chemical compounds determined to be toxic or non-toxic for that endpoint based on a subjective analysis of the bioassay data available. Once the algorithms had been developed and tested, they were applied to the chemicals in the National Occupational Hazard Survey (NOHS) data base to generate estimates of toxicity for those chemical compounds known to be in the workplace. These estimates of toxicity are particularly useful in assessing the toxicity of those chemical compounds for which little or no toxicity data have been reported.
The algorithms produce estimates of toxic effect based on statistical computation and are therefore known to incorporate a certain degree of unavoidable statistical error. This and other limitations discussed in the report preclude the use of such theoretical toxicity data as a substitute for reported animal bioassay data or as the sole basis in making regulatory or other decisions of similar magnitude regarding the use of and exposure to chemical compounds. Instead, these toxicity data are intended only for rank-ordering a list of compounds according to relative toxicity or as a part of an overall process of selecting, testing, and evaluating chemical compounds for toxicity.

I. Introduction

A. Purpose

This project developed and applied computer-based algorithms to the chemical compounds (hereafter referred to as compounds) listed in the NIOSH National Occupational Hazard Survey (NOHS) data base in order to generate estimates of toxicity for these compounds for the following toxic endpoints:

LD50 (oral, rat)
Mutagenicity
Carcinogenicity
Teratogenicity

The theoretical toxicity data thus generated are intended for use only as an additional tool in assessing the toxicity of those compounds found in the workplace.

The compounds listed in the NOHS data base are a result of the National Occupational Hazard Survey, which was a two-year study (1971-74) "intended to describe the health and safety conditions in the American work environment and, more specifically, to determine the extent of worker exposure to chemical and physical agents" (1). Observational data were gathered by surveying approximately 5,000 facilities encompassing all types of industrial activity covered by the Occupational Safety and Health Act (OSHA) of 1970. Approximately 8,000 separate chemical substances were identified as present in the workplace during the course of the survey. These 8,000 plus chemical substances are included in the NOHS data base.
The application of these four algorithms to these compounds known to be in the work environment extends the utility of the data base by providing NIOSH with a unique toxicology information resource. Such a resource can be effectively utilized in a number of areas. For NIOSH, a major application could be for risk assessment and prioritization of research on chemical hazards in the workplace.

B. Structural Activity Relationships (SARs)

All four algorithms were developed on the assumption that a structure-activity relationship (SAR) exists among groups of compounds that exhibit similar chemical characteristics. For example, a SAR may exist among a group of compounds that possess a certain degree of ionic charge per molecule and may therefore have a similar degree of water solubility. SARs may be based on one or more of a number of molecular structure descriptors. Some of the more commonly used structural parameters are listed in Table 1. The concept of SARs has been applied in several areas. For example, the primary use of the SAR concept in pharmaceutical chemistry has been for the evaluation of therapeutic effects of potential new drug compounds.

TABLE 1. A SAMPLING OF MOLECULAR DESCRIPTORS USED IN STRUCTURAL ACTIVITY RELATIONSHIP STUDIES

Physicochemical descriptors
   Molecular weight
   Density
   Melting point
   Boiling point
   Logarithm of n-octyl alcohol/water partition coefficient
   Molecular refractivity*

Topological descriptors
   Atom and bond fragments
   Substructures (atom groups)
   Substructure environment
   Number of carbon atoms
   Number of rings (in polycyclic compounds)
   Molecular connectivity (extent of branching)

Geometrical descriptors
   Molecular volume
   Molecular shape
   Molecular surface area
   Substructure shape
   Taft steric parameter*
   Verloop sterimol constants*

Electronic descriptors
   Hammett-Taft sigma constants*
   Electron density - bond reactivity
   Dielectric constant
   Dipole and higher moments
   Ionization potential
   Electron affinity

* These "complex descriptors" could be placed in other categories as well.
Reprinted with permission from Chemical and Engineering News, March 9, 1981 (2).

Several approaches have been used in the application of SAR research. Craig and Enslein (3) divided these methods of approach into four categories.

1. Intuitive Approach - which applies the organic chemists' skill, knowledge, and intuition. More recently this approach has focused on creating an additive model SAR which is based on the hypothesis that each structural feature of a molecule plays a consistent role in contributing to the overall activity of the molecule.

2. Multiple Parameter Approach - which combines known physical-organic chemical relationships into a novel mathematical expression to relate the biological activities of a closely related series of compounds to one or more physical properties (e.g., the water-octanol solubility ratio, more commonly referred to as the partition coefficient).

3. Quantum Chemical Approach - which employs the principles of quantum mechanics and calculations. For example, one approach obtains electronic indices for a series of structurally related chemicals.

4. Substructural Analysis Approach - which is based on the analysis of type and, in some cases, frequency of occurrence of substructural or molecular fragments of molecular substructures (e.g., -NO2).
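As a rough illustration of the bookkeeping behind the substructural analysis approach in item 4, the sketch below tallies how many compounds in a small set contain each fragment. The compound names, the fragment labels, and the function name fragment_frequencies are hypothetical and are not drawn from the modeling data bases of this project.

```python
from collections import Counter


def fragment_frequencies(compounds):
    """Count, for each fragment, the number of compounds that contain it."""
    counts = Counter()
    for fragments in compounds.values():
        # A fragment repeated within one compound is counted once here.
        counts.update(set(fragments))
    return counts


# Hypothetical compounds described by illustrative fragment labels.
compounds = {
    "compound_a": ["-NO2", "-OH"],
    "compound_b": ["-NO2", "-NO2", "-Cl"],
    "compound_c": ["-OH"],
}
freq = fragment_frequencies(compounds)
```

Counting each fragment once per compound is only one possible convention; a variant that keeps within-compound repeats would drop the set() call.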
Adamson et al. state that, unlike the multiple parameter, additive model, or intuitive approaches, the substructural analysis method may be used for a large number of structurally well-diversified compounds (4). Statistical analysis may then be applied to the type and frequency of substructural fragments to provide a quantitative value (i.e., coefficient value) for each specific fragment that represents the amount of influence the fragment exerts in the overall statistical variation of a group of compounds.

II. Development of Algorithms

A. General Background

Prior to 1975, the concept of SARs was generally applied to groups of structurally similar compounds, usually for the purpose of evaluating potential therapeutic effects in new drug research. Beginning about 1975, SAR concepts were applied to structurally similar groups of compounds for evaluating toxicity (5-10). Papers presented at the Symposium on Structural Correlates of Carcinogenesis and Mutagenesis, held at the U.S. Naval Academy, Annapolis, Maryland, 1977, reflect some of the areas of interest, endeavor, and success in application of SAR concepts for the evaluation of toxicity (11).

The application of quantitative structure-activity relationships (QSARs) to structurally diverse compounds for the evaluation of toxicity was first reported by Craig and Waite (12) and Enslein and Craig (13). This project is an extension of this application of SAR concepts and employs the substructural analysis approach described by Enslein et al. (3).

B. Modeling the Algorithms

A number of molecular descriptors were considered for use in modeling the algorithms (e.g., octanol-water partition coefficients and molecular connectivity indices). In this project, regression analysis was used to select those molecular descriptor parameters most useful in modeling the algorithms.
Ultimately, the occurrence of substructural fragments (and, in the carcinogen and LD50 models, molecular weight) were selected and used as the chemical descriptor variables in these algorithms.

All four algorithms were developed in a similar fashion. However, there were some differences and these will be pointed out in the presentation of the individual models. Basically, the procedure was as shown in Figure 1. A data base was created for use in developing each model. These data bases listed compounds selected on the basis of evidence indicating their ability to induce or not induce the effect of the selected toxicologic endpoint (e.g., carcinogen or noncarcinogen). Once the modeling data base was established, the resulting algorithm was designed and tested and then applied to the compounds listed in the NOHS data base for which the required information (molecular formula, molecular weight, and a Wiswesser Line-Formula Notation) was available or could be generated.

Molecular structure plays a key role in all four algorithms in that a multi-step process is used to translate molecular data from a three-dimensional concept to a quantifiable value useful in generating toxicity estimates. These steps are summarized in Figure 2.

Wiswesser Line-Formula Notation (WLN) is used as the initial step in this translation process. The use of WLN is summarized by Smith and Baker as "...a precise and concise means of expressing the structural formulas of chemical compounds. Its basic idea is to use letter symbols to denote functional groups (chemical) and to use numbers to express the lengths of alkyl chains and sizes of rings. These symbols then are cited in connecting order from one end of the molecule to the other" (14). The symbols employed by the WLN are the numerals 1-10, the 26 capital letters, the four punctuation marks &, -, /, and *, and a blank space (see Appendix A).
According to Smith and Baker (14), with these symbols and approximately "a dozen new chemical symbols to supplement the old familiar ones, plus half a dozen operating symbols and the fundamental rules for manipulating them", a chemist should be able to write a WLN or read one as one would read a conventional structural formula.

FIGURE 1. PROCEDURES FOR DEVELOPING AN ALGORITHM

Create Data Base for Modeling of Algorithm
   ↓
Generation of WLNs for Compounds in Data Base
   ↓
Generation of Chemical Descriptor Keys based on WLNs
   ↓
Analysis of Variance Applied to Keys; Retain Keys with a Value of F > 1.7
   ↓
Statistical Calculation of Coefficient Values for Keys
   ↓
Create Subset of Keys for each Algorithm

FIGURE 2. TRANSLATION PROCESS FOR OBTAINING A QUANTIFIABLE (NUMERICAL) REPRESENTATION OF MOLECULAR STRUCTURE

2-dimensional drawing of molecular structure
   ↓
Generate a WLN based on rules and guidelines prescribed
   ↓
Application of Gen Key #1 Computer Program to WLN for generation of relevant keys*
   ↓
Execute Gen Key #2 Computer Program to obtain estimation

* Note that keys listed as problem keys (Table 2) must be manually checked as being relevant or not to the assigned WLN.

As might be expected, the accuracy and usefulness of a toxic endpoint prediction, as estimated by these four algorithms, depends largely on an accurate description of the molecular structure. The accuracy of the generated WLN is subsequently expressed in the generation of the relevant chemical descriptor keys, which are numerical representations of substructural molecular fragments. For example, the -OH (hydroxyl) substructure is the letter Q in WLN and Key 38 in chemical descriptor key terminology. Assigning an accurate WLN to a compound requires a complete knowledge of Wiswesser Line-Formula Notation in conjunction with a considerable background in organic chemistry structure and nomenclature.
However, techniques have been developed for generating WLNs by drawing the structures on an electronic graphics pad linked to an appropriately programmed computer which then generates the WLN (15).

A WLN cannot be accurately generated for certain compounds. Most notable of these are polymers or compounds for which the molecular structure may vary or is not known. It was also determined that inorganic compounds do not perform well in any of the four models (3). This is due largely to the inadequate WLN representation of the relatively simple structures of inorganic compounds, for which too few keys are generated. Conversely, the more complex the molecule, the more involved the WLN, and inaccuracies or alternate representations may occur, possibly resulting in the erroneous generation of keys or failure to generate valid keys.

To demonstrate the use of WLN in molecular description, examples of assignment of WLNs to compounds are presented in Appendix B for an acyclic compound and in Appendix C for a cyclic compound. The need for accuracy in the generation of the WLN warrants emphasis, since the WLN is the major factor in the equations for all four algorithms.

The next step in the translation procedure is to generate chemical descriptor keys for a compound based on the assigned WLN. This is accomplished by submitting the WLN to a computer program called Genkey 1/Genkey 2, developed by Enslein et al. as a part of this project. Chemical descriptor keys provide an expression of molecular structure in terms of substructural fragments and lead to the development of quantifiable (key coefficient) values for use in the algorithms. Obtained from several sources, a total of 309 descriptor keys (with an additional 50 keys assigned based on the presence of certain combinations of keys 1-309) were used in the development of the four algorithms. None of the models employs all 359 keys in describing molecular structure.
A subset of keys is generated by using statistical procedures that are described later in this report. Essentially, keys are selected by determining their contribution to the toxicity endpoint in question. This is determined by the frequency of the occurrence of a key (representing a specific molecular substructure) in the compounds listed in the modeling data base. In effect, the greater the frequency of occurrence, the greater the probability that the key contributes significantly to the toxicity of that endpoint. Statistical methods are then used to calculate a coefficient value for each key in a selected subset of the 359 possible keys. It is this quantifiable (numerical) representation of a molecular substructure that is used in the modeling equations to generate estimates of toxicity. The number of keys selected from the 359 possible for use in each model is as follows:

Endpoint            Number of Keys Selected
LD50                82
Carcinogenicity     18
Mutagenicity        57
Teratogenicity      61

A list of all 359 keys and a description of the structure each represents is provided in Appendix D. A list of the keys in each model, their descriptions, and their coefficient values are provided as that model is described in this report.

Unfortunately, the key generation programs are not error free. The contractor was unable to "de-bug" these computer programs within the time and funds allocated for this project. Three types of potential key generation problems are known to occur:

1. Keys not generated when they should be.

2. Keys generated when they should not be.

3. Keys erroneously generated. (Keys 310-350 represent certain combinations of keys 1-309 as defined in the description of each key presented in Appendix D.) This is a particular problem with keys 311, 337, 342, and 349.

As a consequence, key files must be manually reviewed and compared against the WLN files for specific compounds to ensure that all of the keys generated are correct on the basis of the assigned WLN.
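One way to support the manual review for the third problem type is to flag, in a generated key file, every key that falls in the combination-key range named above. The following is a minimal sketch under that assumption; the function name and the input list are hypothetical, and it does not reproduce any part of the contractor's Genkey programs. The first two problem types are not covered, since they require comparing the keys against the WLN itself.

```python
# Combination keys named in the text as prone to erroneous generation.
COMBINATION_KEYS = set(range(310, 351))      # keys 310-350
FREQUENT_OFFENDERS = {311, 337, 342, 349}    # singled out as a particular problem


def keys_needing_review(generated_keys):
    """Return (key, is_frequent_offender) pairs for keys to check by hand."""
    flagged = sorted(k for k in set(generated_keys) if k in COMBINATION_KEYS)
    return [(k, k in FREQUENT_OFFENDERS) for k in flagged]


# Hypothetical key file for one compound.
review_list = keys_needing_review([38, 150, 311, 320])
```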
Corrections are made if necessary, and the data are resubmitted to the estimating program. Potential problem keys are listed in Table 2.

TABLE 2. POTENTIAL PROBLEM KEYS

Problem   Key                                                   WLN
Type      No.           Key Description                         Symbol(s)
1         2             Positive charge
          150           Chain primary amide                     ZV or VZ
          151           Chain secondary amide                   VM or MV
          152           Chain tertiary amide                    NV or VN
          181           Substituent primary amide               ZV or VZ
          182           Substituent secondary amide             VM or MV
          183           Substituent tertiary amide              NV or VN
          162/193       Sulfonamide                             (N)-SW or SW(N)
          163           Chain guanidine                         (N)-Y-U(N) or (N)-Y-U(N)-(N) or (N)UY-(N)=(N)
          165/196       Thioamide                               SUYZ or YZUS
          166/197/304   Dialkylamino                            [illegible]
          167/198       Methoxy                                 O1 or 1O
          [illegible]   Chain phenylethyl                       2R or R-(*)2
          172/203       Phenoxy                                 OR or R-(*)O
          178/209       Urea                                    (N)=V(N)  Note: (N) can be in ring
          180           Biphenyl                                R-(*)R
          189           Lactam                                  (N)V or V(N) within ring
          269           Potassium                               -KA-
          306/309       Carbamate                               OV(N) or (N)-VO
2         158           Chain N-substituted acylhydrazide       Does not apply
          162           Chain sulfonamide                       Does not apply
          163           Chain guanidine                         Does not apply
          166           Chain dialkylamine (bonded to carbon)   Does not apply
3         310-350       Refer to Appendix D for description     Does not apply

Note: (*) represents any locant; (N) represents any nitrogen.
From Enslein et al. (3).

C. Statistical Methodologies

1. Selection of Variables. For each modeling data base, variables to be included in defining the algorithm were determined using regression techniques (16). Stepwise regression or stepwise discriminant analysis was used, depending on whether the endpoint of the algorithm was considered continuous or discriminant (3). If the endpoint was continuous, stepwise regression analysis was used. If the endpoint was discriminant (teratogen algorithm), discriminant analysis was used. As discussed later, it was necessary to use discriminant analysis instead of regression analysis in developing the teratogen model because of the scoring process used in determining teratogenicity of compounds in the modeling data base.
Stepwise regression procedures used to select variables may not always produce the best set of variables. The variables selected may be correlated and, as a result, produce a biased model (3, 17). To avoid such bias, candidate variables were selected from a larger set of variables, similar to those listed in Table 1, all of which were thought to make a possible contribution to the explanation of the statistical variance of the modeling data base (3). Ridge regression and a second stepwise regression were done using the candidate variables following the preliminary regression analysis.

The initial regression used a backward elimination procedure. All variables were included in the model and were removed if their F-values were not significant at P = .05 (3, 17). In effect, the candidate variables whose F-values contributed least to the variance analysis equation were removed from the putative equation until the F-value reached was 1.7 (18). Ridge regression was performed on the remaining variables, and ridge traces for each variable were examined to see whether any singularities existed which might suggest that the variable be omitted from the algorithm (3, 19). Least squares estimates used in the backward elimination procedure might give results far removed from true variable values if the variables are correlated (17); the ridge regression was therefore used to check the results of the stepwise regression. Finally, stepdown regression was repeated using only those variables retained following the ridge regression analysis.

In performing the regressions, outlier compounds (i.e., compounds that are not statistically characteristic of the main group of compounds) were identified and removed. The effect of removing a few outlier compounds from a large data set of several hundred compounds was felt to be minimal (3).

2. Statistical Evaluations of the Algorithms. Several statistical tests were used to evaluate the accuracy of classification by the algorithms.
Chief among these was the subset verification test, which probably provides the only practical evaluation of performance currently available (3). Using this test, a randomly selected subset of compounds is withheld from the data base that was created for the purpose of modeling the algorithm. The algorithm is then designed on the remaining compounds in the data base and tested with the subset of compounds set aside for that purpose. Residual plots, misclassification rates, and the Kolmogorov-Smirnov two-sample tests (18) were also used to test the model by comparing estimated values for endpoints with those values assigned based on actual values (i.e., reported bioassay testing) for the compounds in the verification test subset.

The results of the various statistical evaluation methods are presented following the description of the respective models. Statistical references cited should be consulted for a more detailed description of the statistical methods mentioned in this report.

D. Development of Individual Estimation Algorithms

1. General Modeling Considerations. In developing all four algorithms, calculated data were easier to use if converted to equivalent logarithm values. Such conversion produces a normally distributed data base (i.e., log-linear) and also eliminates the problem of dealing with a wide range of values such as 1:1000, which might occur in the dose ranges of 1 milligram to 1 gram seen in the LD50 algorithm. In the LD50 algorithm, the use of the reciprocal (1/C) of the reported or estimated LD50 concentration value creates a normal distribution of the data to facilitate the use of the logarithms. Consequently, to obtain the final estimated LD50 values in mg/kg or probability values between 0 and 1 for the other endpoints, it is necessary to take reciprocal values and convert back from logarithmic to actual values.
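The reciprocal-logarithm transformation described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that C is expressed in moles of compound per kilogram of body weight, which is what the conversion back to mg/kg implies; it is an illustration, not the contractor's code.

```python
import math


def ld50_to_log_inv_c(ld50_mg_per_kg, mol_wt):
    """Convert an LD50 in mg/kg to log10(1/C), with C in mol/kg (assumed unit)."""
    c_mol_per_kg = ld50_mg_per_kg / 1000.0 / mol_wt
    return math.log10(1.0 / c_mol_per_kg)


# A 1:1000 spread in dose (1 mg/kg to 1 g/kg) for a hypothetical compound
# of molecular weight 100 collapses to a spread of 3 log units.
low_dose = ld50_to_log_inv_c(1.0, 100.0)
high_dose = ld50_to_log_inv_c(1000.0, 100.0)
spread = low_dose - high_dose
```

Because the reciprocal is taken, more toxic compounds (smaller LD50) receive larger log 1/C values.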
There are several steps (equations) necessary to obtain an LD50 estimate or probability value. Each algorithm, shown in its respective table, lists all of the descriptor keys (and, in the case of the LD50 and carcinogen algorithms, molecular weight) that have been found to be statistically significant to their toxic endpoints. The compound for which predictive toxicity estimates are to be generated is translated into the equivalent WLN. From the WLN, all keys that are represented in the WLN are selected from the total set of 359 keys. However, only those keys that also appear in the model subset of keys are used in calculating the positive and negative scores (e.g., carcinogen and noncarcinogen scores) for the carcinogen, mutagen, and teratogen algorithms or to calculate the estimated log (1/C) value in the LD50 algorithm (see Figure 3). These values are then used in the final estimating equation for each endpoint. These equations are presented in a step-wise manner for each algorithm as it is discussed. An example use of each algorithm is presented in the appendices as indicated in the discussion of the models.

The LD50 algorithm expresses the estimated endpoint value as the dose of a compound, in mg/kg, necessary to kill one-half of the test animal population (i.e., lethal dose for 50% mortality, hence LD50). The other three models express a predicted endpoint value within a range of 0.000 to 1.000, with 1.000 being the highest probability of the toxicologic endpoint occurring as a result of exposure to that compound (e.g., 0.989 probability of the compound being carcinogenic). For the purpose of this report, the terms probability and potential are considered interchangeable. The final equations of the algorithms developed are unusual in that they are expressed in tabular form because they are quite long and are not in the usually perceived algebraic form.

FIGURE 3. LD50 ESTIMATING EQUATION

The pertinent coefficient values (c) for each of the keys are summed (Σc) and added to the regression constant (0.552) and to the product of 0.681 x log10 (Mol. Wt.). The resulting value is the estimated log 1/C, where C is the number of moles of the compound (per kilogram of body weight) which represents the LD50.

\[ \log_{10}(1/C) = 0.552 + 0.681\,\log_{10}(\text{M.Wt.}) + \sum c \]

To convert log 1/C to the estimated LD50, expressed as mg/kg, use the following equation:

\[ \text{LD}_{50}\ (\text{mg/kg}) = \frac{1000 \times \text{M.Wt.}}{\operatorname{antilog}\left[\log_{10}(1/C)\right]} \]

2. LD50 (oral, rat) Algorithm. Data used in the LD50 algorithms originated from The Toxic Substances List (20), which is now called the NIOSH Registry of Toxic Effects of Chemical Substances (RTECS). The results of the LD50 algorithm are derived from a continuous (as opposed to a discriminant) endpoint, and the procedures for generating an estimated LD50 value are different from those of the other three models. These procedures are illustrated in Appendix E using an example compound.

There were two LD50 models developed in this project. An earlier model was based on 475 compounds selected from the letters A through M of the 1974 Toxic Substances List and 148 molecular substructure keys then available from the CROSSBOW program (3). The statistics for the equation of this algorithm are as follows:

Multiple correlation coefficient, R²                 .457
Standard error of estimated log (log 1/C + 1)        .089
Mean log 1/C                                         2.35
Standard deviation of log (log 1/C + 1)              0.68

With this equation, it was possible to predict the LD50 (oral, rat) of an untested compound so that approximately 63% of the compounds could be estimated within a factor of approximately 2.5, and virtually all compounds within a factor of 10 (in mg/kg units) (3). This means, for example, that an estimated oral rat LD50 dose of 1 mg/kg (with a factor of 2.5), when checked against actual reported data, will correspond to a dose in the 0.4 mg/kg to 2.5 mg/kg range approximately 63% of the time.
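The two equations of Figure 3, together with the factor-of-2.5 interpretation above, can be sketched as follows. The constants 0.552 and 0.681 come from Figure 3; the key coefficients, the function names, and the example compound are hypothetical (the actual coefficients are those of Table 5, which is not reproduced here), so the numbers produced are illustrative only.

```python
import math

REGRESSION_CONSTANT = 0.552   # from Figure 3
MOL_WT_COEFFICIENT = 0.681    # from Figure 3


def estimate_ld50(mol_wt, compound_keys, key_coefficients):
    """Return (log 1/C, LD50 in mg/kg) following the Figure 3 equations."""
    # Only keys present in both the compound and the model subset contribute.
    key_sum = sum(key_coefficients[k] for k in compound_keys
                  if k in key_coefficients)
    log_inv_c = (REGRESSION_CONSTANT
                 + MOL_WT_COEFFICIENT * math.log10(mol_wt)
                 + key_sum)
    ld50 = 1000.0 * mol_wt / (10.0 ** log_inv_c)  # antilog, then mol/kg -> mg/kg
    return log_inv_c, ld50


def factor_band(ld50, factor=2.5):
    """Return the (low, high) dose band implied by a multiplicative factor."""
    return ld50 / factor, ld50 * factor


# Hypothetical model coefficients and compound (key 7 is not in the model).
coeffs = {38: -0.10, 150: 0.25}
log_inv_c, ld50 = estimate_ld50(100.0, {38, 150, 7}, coeffs)
band = factor_band(ld50)
```

Treating the factor as multiplicative in both directions means the band is symmetric on the logarithmic scale, which matches the way the regression itself is fitted.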
In the second algorithm, 3,600 compounds were collected from the RTECS. This was essentially the entire population of compounds with oral rat LD50 data. This second algorithm was used to determine how many compounds would be needed in order to achieve stability of the structure-activity equation. Separate regression models were developed for three subsets of 1,000, 1,500, and 2,000 compounds, as shown in Table 3. It was determined that there was very little change in the statistics associated with the model subsets of 1,500 and 2,000 compounds (3). Enslein et al. assumed that the major difference between these two models is due to the difference in the number of variables considered in these two models (77 for the earlier model and 103 for the later model) (3). These results suggest "that at least for the available data, 2,000 compounds result in an essentially asymptotic equation" (3) (i.e., adding more compounds to the data set would not increase the strength of the equation).

TABLE 3. LD50 ALGORITHM: REGRESSION STATISTICS FOR SUBSET MODELS OF 1,000, 1,500, AND 2,000 COMPOUNDS

                            log 1/C                          Residual                      SE of
N        Mean    S.D.   S.E.    Skew   Kurtosis   Range      Mean Square   D.F.    R2     Estimate
1,000    2.540   .860   .0272   .72    [?]        .45-5.95   .36           892     .56    .60
1,500    2.540   .875   .0226   .71    .51        .45-5.90   .38           1,396   .52    .62
2,000    2.530   .880   .0197   .69    .52        .34-5.99   .39           1,864   .52    .62

From Enslein et al. (3)
[?] marks an entry that is not legible in the source copy.

The 2,000 compound model was therefore used as the basis for refining statistical procedures in the oral rat LD50 model (3). A complete list of these compounds is provided in Appendix F. As a number of variables were removed from the equation, ridge regression analysis was performed. As shown in Table 4, residual plots (log 1/C actual - log 1/C predicted) produced from this regression analysis are poorly fitted at both the top and bottom ends (3). Note that the number of compounds dropped to 1,968 as a result of removing those found to be duplicates.
Because the residual values were poorly fitted at the extremes of their distribution, it was necessary to compromise between range and fit in establishing a range of values with which to work (3). The range of values was limited to encompass log 1/C values between 1.25 and 4.75 in the final LD50 algorithm. This is considerably narrower than that in the first algorithm, which encompassed log 1/C values of approximately 1.0 to 6.2. The LD50 algorithm presented in Table 5 includes all of the variables and their respective coefficient values as calculated by the statistical procedures described previously. The resulting equation for generating LD50 values based on this model is as shown in Figure 3.

A subset of 600 compounds was withheld for performance testing of the algorithm. Of these 600 compounds, 8 could not be properly processed by the WLN key generation program and 24 compounds were assigned none of the 82 keys present in the LD50 algorithm, leaving a test subset of 568 compounds. Log 1/C data for these 568 compounds were evaluated based on the equation presented in Table 5. Using a plot of the residual values (log 1/C actual - log 1/C predicted) as a function of the predicted values, it was found that the prediction inaccuracies were greatest at the extremes of the range. This was not an unexpected finding, and because of this the predicted values were tabulated into ranges and statistics were calculated for the compounds within each range. The results, presented in Table 6, show that there are no meaningful statistics available below log 1/C of 1.5 or above 4.0 (perhaps 3.5) (3). The standard deviation of the residuals from predicted log 1/C values from 1.5 to 3.5 varies between .58 and .81. In examining the quantiles shown in Table 6, it is found that below mid-range there is a larger residual error for low values and above mid-range for the higher values.
An example of the accuracy of the resulting estimates in the range of log 1/C of 2 to 2.5 is that 50% of the values (those between the 25th and 75th quantiles) would have an error between -.45 and +.34. As these are log values, they translate into actual LD50 values (i.e., mg/kg) lying between .355 and 2.19 times the estimated value. Similarly, 90% of the values are found between the 5th and 95th quantiles with an error range of -.87 to +1.03, which translates to the equivalent of .135 and 10.72 times the estimated values.

TABLE 4. DISTRIBUTION OF LOG 1/C FOR 1,968 COMPOUND ALGORITHM

log 1/C ranges: 0.25-0.75, 0.75-1.25, 1.25-1.75, 1.75-2.25, 2.25-2.75, 2.75-3.25, 3.25-3.75, 3.75-4.25, 4.25-4.75, 4.75-5.25, 5.25-5.75, 5.75-6.25
Number of compounds, in printed order: 67, 295, 448, 462, 307, 200, [?], 30, 16, [?]
(Only ten of the twelve counts survive in the source copy, so the counts are not paired with ranges here.)

From Enslein et al. (3)

TABLE 5. LD50 ALGORITHM EQUATION

KEY    FREQUENCY   DESCRIPTION                                        COEFFICIENT

NON-CYCLIC PARTS OF MOLECULE
K5     158         Terminal oxygen (not carbonyl)                      .458
K6     386         One 3-branch carbon atom                            .096 [?]
K8     [?]         Greater than 3-branch nitrogen atom                 .196
K10    185         1 sulphur atom                                      .362
K[?]   14          More than 1 sulphur atom                            .821
K14    168         1 double bond, excluding -C=S or -C=O               .141
K16    24          Triple bond                                         .189

CHAIN FRAGMENTS
K17    338         1 methyl/methylene group                            .089
K20    223         Alkyl chain (CH2)n or CH3(CH2)n-1 where n=3-9      -.250
K25    19          Bromine                                             .256
K26    30          Fluorine                                            .435
K28    153         One -NH- group                                      .334
K30    102         One -NH2 group                                      .236
K31    20          More than one -NH2 group                            .258
K34    54          Unusual carbon atom                                 .278 [?]
K36    205         More than one -O- group                             .211
K37    195         One -OH group                                      -.163
K43    123         One -C(=O)-O- (ester) group                        -.156
K44    54          More than one -C(=O)-O- (ester) group              -.205

[?] marks an entry, or the sign of the entry it follows, that is not legible in the source copy.
TABLE 5. LD50 ALGORITHM EQUATION (Cont.)

KEY    FREQUENCY   DESCRIPTION                                        COEFFICIENT

SUBSTITUENT FRAGMENTS
K47    90          Ethyl/ethylene group                                [?]
K50    280         Generic halogen                                     [?]
K51    145         One chlorine                                        [?]
K54    13          Fluorine                                            [?]
K58    1           One -NH2 group                                      [?]
K59    7           More than one -NH2 group                            [?]
K60    25          One -N= or HN= group                                [?]
K66    24          More than one -CH group                             [?]

RING HETEROATOMS
K75    20          Single occurrence of oxygen in more than one ring   .461 [?]
K78    150         Multiple occurrence of nitrogen                     .086
K81    74          Single occurrence of sulphur                        .140
K82    7           Multiple occurrence of sulphur                      .525
K85    82          Single occurrence of carbonyl                      -.122
K[?]   4           Multiple occurrence of exocyclic double bond        .479

RING TYPES
K99    100         Carbocyclic 6-membered ring                        -.257
K100   27          Carbocyclic ring other than 5- and 6-membered      -.198
K104   233         1 heteroatom in one ring                            .202
K107   56          1 heteroatom in more than one ring                  .477

RING FUSIONS
K[?]   [?]         More than 1 single heterocyclic ring               -.574
K[?]   [?]         1 single carbocyclic ring                           .321
K[?]   [?]         More than 1 single carbocyclic ring                 .409 [?]
K[?]   [?]         1 carbo/carbo fusion                                .143
K[?]   [?]         More than 1 carbo/carbo fusion                      .373
K120   [?]         1 carbo/hetero fusion in more than 1 ring system   -.935
K123   [?]         More than 1 hetero/hetero fusion                    .743

RING LINKAGE
K130   [?]         Bilinkage                                           .271

EXTENSIONS
K149   [?]         Presence of suffix                                  .098

ADDITIONAL CHAIN FRAGMENTS
K151   7           Chain secondary amide                               [?]
K154   4           Chain N-substituted acylhydrazides                  .380
K156   4           Chain amidine                                      -.649
K161   34          Chain N-nitroso                                     .637
K162   3           Chain sulfonamide                                   .194
K165   35          Chain thioamide                                     .304
K166   106         Chain dialkylamino                                  .255
K167   35          Chain methoxy                                      -.272
K[?]   [?]         Chain phenethyl                                     .368
K174   1           Chain phenylureido                                  .980 [?]

[?] marks an entry, or the sign of the entry it follows, that is not legible in the source copy. Two frequency values in the ring fusion, ring linkage, and extension rows (99 and 294) are legible but cannot be assigned to specific keys.
TABLE 5. LD50 ALGORITHM EQUATION (Cont.)

KEY    FREQUENCY   DESCRIPTION                                        COEFFICIENT

ADDITIONAL SUBSTITUENT FRAGMENTS
K180   13          Biphenyl                                            [?]
K182   37          Substituent secondary amide                         [?]
K188   12          Barbiturate                                         [?]
K193   14          Substituent sulfonamide                             [?]
K194   3           Substituent guanidine                               [?]
K196   18          Substituent thioamide                               [?]
K197   22          Substituent dialkylamino                            [?]
K201   4           Substituent N-nitro                                 [?]
K203   10          Substituent phenoxy                                 [?]
K209   6           Substituent carbamate                               [?]

ADDITIONAL METAL FRAGMENTS
Keys K246, 250, 256, 269, 282, 284, and 293 are listed; their frequencies, descriptions, and coefficients are not legible in the source copy.

CARCINOGENESIS KEYS
K312   4           Organohalogen mustards                              [?]
K315   161         Haloalkane                                          [?]
K322   4           Aziridine                                           [?]
K327   5           5-membered ring anhydrides                          [?]
K330   16          Fused aromatic - unsaturated lactone                [?]
K341   79          Aromatic nitro                                      [?]
K343   24          α,β-dihaloalkane                                    [?]
K344   83          Geminal-dihaloalkane                                [?]
K348   3           Fused polychlorinated alicyclic                     [?]
K350   18          Hydrazo/hydrazine                                   [?]

LOG MOLECULAR WEIGHT                                                   .681
CONSTANT                                                               .552

From Enslein et al. (3)
[?] marks an entry that is not legible in the source copy. Four coefficient values on these pages (1.435, 1.635, 1.114, and 1.175) are legible but cannot be assigned to specific keys. The log molecular weight coefficient and the constant are taken from Figure 3.

TABLE 6. TEST COMPOUNDS - CHARACTERISTICS OF RESIDUALS*

Predicted                              Quantiles
log 1/C     N      1       5       10      25      75     90     95     99     Mean    Median   S.D.   Min.    Max.
1.0-1.5     3      [?]     [?]     [?]     [?]     1.74   1.74   1.74   1.74   [?]     [?]      [?]    [?]     1.74
1.5-2.0     78     -1.37   -1.05   -.71    -.42    .26    .54    .79    1.00   -.059   -.12     .66    [?]     1.00
2.0-2.5     224    -1.34   -.87    -.67    -.45    .34    .76    1.03   1.74   -.015   -.059    .58    [?]     1.94
2.5-3.0     143    -1.99   -1.25   -.78    -.45    [?]    [?]    [?]    2.71   .017    -.007    [?]    [?]     2.76
3.0-3.5     97     -1.80   -1.20   -.81    -.46    .52    1.17   1.56   2.51   .081    .067     .81    -1.80   2.81
3.5-4.0     18     -1.63   -1.63   -1.60   -1.07   .50    1.48   1.64   1.64   -.21    -.35     .98    -1.63   1.64
4.0-4.5     4      -1.02   -1.02   -1.02   -1.02   .85    .85    .85    .85    -.082   -.077    1.07   -1.02   .85
Total       567

* Residual values = log 1/C actual - log 1/C predicted
From Enslein et al. (3)

It is difficult to know what fraction of the residuals is due to inadequacy of the model itself and what fraction results from other factors, such as inadequately measured LD50 values or discrepancies between data resulting from replication studies between different laboratories (3).
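The multipliers quoted earlier for the predicted log 1/C range of 2.0 to 2.5 follow from taking the antilog (base 10) of the Table 6 quantile residuals. A minimal Python check of that arithmetic is shown below; the function and variable names are illustrative and do not come from the report.

```python
def residual_to_factor(log_residual):
    """Convert a base-10 log residual into a multiplicative factor."""
    return 10.0 ** log_residual


# Quantile residuals for the 2.0-2.5 range, as read from Table 6.
quantile_residuals = {
    "5th": -0.87,
    "25th": -0.45,
    "75th": 0.34,
    "95th": 1.03,
}

factors = {q: residual_to_factor(r) for q, r in quantile_residuals.items()}

for quantile, factor in factors.items():
    print(f"{quantile} quantile: residual {quantile_residuals[quantile]:+.2f} "
          f"-> factor {factor:.3f}")
```

The results agree with the .355, 2.19, .135, and 10.72 multipliers in the text after rounding. Note that the residuals are defined on log 1/C; because the LD50 in mg/kg is inversely proportional to 1/C, a strict conversion to the mg/kg scale would use the reciprocals of these factors. The sketch simply follows the multipliers as the report states them.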
Enslein et al. found that one of the compounds in RTECS incorrectly reported an LD50 value of 70 ug/kg instead of 70 mg/kg. This compound was dropped from the data shown in Table 6. Despite such limitations, it would seem that this model can generate LD50 estimates at least as well as those reported thus far in the literature. However, there have been insufficient numbers of compounds for which extensive replications have been carried out to be able to make such a statement with a great deal of confidence (3).

Mutagenicity Algorithm

Compounds incorporated in the data base for developing this model were obtained by screening the files of the Environmental Mutagen Information Center (EMIC), Oak Ridge National Laboratory, Oak Ridge, Tennessee, and reports from the National Toxicology Program (NTP) for all compounds for which Ames test mutagenicity data had been reported. Essentially all publications from the EMIC files relating to the Ames test for mutagenicity (encompassing over 1,200 compounds) were reviewed and the test results recorded. Judgments as to the quality of the reported data, e.g., in terms of dose response reported, were made by contractor chemists and toxicologists. In general, the Gene-Tox Criteria (21) were applied in this subjective evaluation of the data. Using these criteria, a compound was classified as a nonmutagen if it had been tested with negative results in at least three of the five strains of Salmonella typhimurium (TA98, TA100, TA1535, TA1537, and TA1538) used in the Ames test. For a compound to be classified as mutagenic, it had to be tested with positive results in at least two of the five strains. It should be noted that two of the five strains (TA98 and TA100) are considered less sensitive than the other three in assessing mutagenicity and, therefore, were weighted less in the decision to classify a compound as mutagenic (22).
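The counting rule in these criteria can be sketched as a small Python function. This is a simplified illustration under stated assumptions, not the contractors' procedure: the function name and the "unclassified" label are hypothetical, the lower weight given to TA98 and TA100 is not modeled because the report does not quantify it, and giving the positive rule precedence when both thresholds are met is an assumption made only for this sketch.

```python
AMES_STRAINS = {"TA98", "TA100", "TA1535", "TA1537", "TA1538"}


def classify_by_strain_counts(results):
    """Classify a compound from Ames test results.

    results maps a strain name to True (positive) or False (negative);
    strains that were not tested are simply left out.
    """
    unknown = set(results) - AMES_STRAINS
    if unknown:
        raise ValueError(f"unexpected strain names: {sorted(unknown)}")

    positives = sum(1 for outcome in results.values() if outcome)
    negatives = sum(1 for outcome in results.values() if not outcome)

    # Assumption: the positive rule is checked first when both rules apply.
    if positives >= 2:
        return "mutagen"
    if negatives >= 3:
        return "nonmutagen"
    return "unclassified"


print(classify_by_strain_counts({"TA98": False, "TA100": False, "TA1535": False}))
print(classify_by_strain_counts({"TA1535": True, "TA1537": True, "TA98": False}))
print(classify_by_strain_counts({"TA98": True, "TA100": False}))
```

A fuller version would also need the expert judgment on data quality that the report describes, which a counting rule cannot capture.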
The chemical selection committee of the National Toxicology Program (NTP) uses the Gene-Tox criteria but requires at least four strains instead of three to be tested and a negative result in all four strains for a compound to be classified as a nonmutagen. Additionally, NTP requires that each of the tests be repeated in at least one other laboratory. In the case of compounds with conflicting data, decisions regarding positive or negative mutagenic classification were made only if test results among at least two different laboratories were mutually reinforcing. When conflicting results could not be so resolved, the compound was discarded from the data base and subsequent modeling and testing procedures. Because of the more stringent requirements, an NTP judgment was held to supersede those obtained from EMIC.

After applying these criteria to over a thousand compounds, a total of 301 were judged to be positive mutagens and 23[last digit illegible] to be nonmutagens. From these two groups a subset of 37 positive and 23 negative compounds was randomly selected and set aside for subset verification testing. A list of all the compounds used in the mutagen modeling data base is presented in Appendix G.

The equations for the mutagen algorithm were derived by discriminant analysis and ridge regression procedures. Based on the mutagenicity model equation presented in Table 7, a mutagen score and a nonmutagen score are calculated using the equations shown in Figure 4. The regression constant values (-5.078 and -3.183) result from the regression analysis applied to the mutagen and nonmutagen groups of compounds in calculating the coefficient values for the keys selected for the mutagen algorithm. The equation for generating an estimate of mutagenicity incorporates the resulting exponential values of these equations. Natural log values are determined for both score values and then used in the probability equation shown in Figure 5.
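Figures 4 and 5 are not reproduced in this section, so the following Python sketch only illustrates the general shape of such a calculation. It assumes the standard two-group discriminant form, in which each score is a constant plus the sum of the matching key coefficients and the probability is the exponential of the mutagen score divided by the sum of both exponentials. That form, the assignment of -5.078 to the mutagen score and -3.183 to the nonmutagen score, the function names, and the example coefficients are all assumptions for illustration, not values or equations confirmed by the report.

```python
import math

# Assumed assignment of the two regression constants quoted in the text.
MUTAGEN_CONSTANT = -5.078
NONMUTAGEN_CONSTANT = -3.183


def group_score(constant, key_coefficients):
    """Score for one group: constant plus the sum of matching key coefficients."""
    return constant + sum(key_coefficients)


def mutagen_probability(mutagen_score, nonmutagen_score):
    """Assumed form: exp(mutagen) / (exp(mutagen) + exp(nonmutagen)).

    The larger score is subtracted first so the exponentials cannot overflow.
    """
    shift = max(mutagen_score, nonmutagen_score)
    exp_mutagen = math.exp(mutagen_score - shift)
    exp_nonmutagen = math.exp(nonmutagen_score - shift)
    return exp_mutagen / (exp_mutagen + exp_nonmutagen)


# Hypothetical key coefficients for one compound (not taken from Table 7).
example_mutagen_score = group_score(MUTAGEN_CONSTANT, [3.0, 1.5])
example_nonmutagen_score = group_score(NONMUTAGEN_CONSTANT, [1.0, 0.5])
example_probability = mutagen_probability(example_mutagen_score,
                                          example_nonmutagen_score)
print(f"estimated probability of mutagenicity: {example_probability:.3f}")
```

Under this assumed form, equal scores give a probability of 0.5, and the result always falls between 0.000 and 1.000, matching the range the report describes for the three discriminant models.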
A step-by-step illustration of this procedure is presented in Appendix H using an example compound.

Several methods exist to evaluate the accuracy of the mutagenesis model. The simplest method is to indicate the number of compounds correctly classified by the model. As shown in Table 8, when the indeterminate zone (i.e., the zone in which the model cannot sufficiently discriminate between mutagen and nonmutagen) is varied between p = .4 to .599 and p = .3 to .699, the percent of false positives and false negatives increases as the indeterminate range decreases. The wider the indeterminate range, the larger the number of compounds which cannot be classified, in addition to some reduction in the number of misclassified compounds.

A second method for estimating accuracy is to compare the actual error rate in specified probability ranges to the expected error rate (based on the binomial distribution). These data are shown in Table 9. Using a two-sample Kolmogorov-Smirnov test (18), it was found that there was not a statistically distinguishable difference in the actual and expected cumulative error distributions (3). This would indicate that the probability values derived from this model have a high degree of precision and can be used with confidence for the ranking of compounds (3). The results of this statistical test were not made available to the author in the draft report provided by the contractors, and thus are not presented here.

The subset test provides a third way to assess the accuracy of this model. As described previously for the LD50 model, a number of compounds for which mutagenesis data had been obtained were held back from the modeling set by a random selection process. These compounds were then evaluated by means of the discriminant equation of the model, and the probability values of mutagenicity were compared to the reported values. As seen in Table 10, the results of the test parallel those results shown in Table 8 with the exception that a larger number of compounds
