Study of CTS DNA Proficiency Tests with Regard to DNA Mixture Interpretation: A NIST Scientific Foundation Review

1. Introduction
In June 2021, a group of authors from the National Institute of Standards and Technology (NIST) released for public comment a document entitled DNA Mixture Interpretation: A NIST Scientific Foundation Review [1]. It has become known as the Draft NIST Foundation Review. The Draft NIST Foundation Review [1] contains a section, starting on page 75, in which the NIST review team reports proficiency test results. This section includes the statements: “Across these 69 data sets, there were 80 false negatives and 18 false positives reported from 110,408 possible responses (27,602 participants × two evidence items × two reference items). In the past five years, the number of participants using PGS has grown”.
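The figures in the quoted statement can be checked with simple arithmetic. The sketch below uses only the numbers quoted from the Draft NIST Foundation Review (the constant and function names are illustrative and not taken from any source). It recomputes the number of possible responses and expresses the reported discordances as a proportion of them, assuming, as the quoted statement does, that every participant made all four comparisons.

```python
# Figures quoted from the Draft NIST Foundation Review [1].
PARTICIPANTS = 27_602
EVIDENCE_ITEMS = 2
REFERENCE_ITEMS = 2
FALSE_NEGATIVES = 80
FALSE_POSITIVES = 18


def possible_responses(participants: int, evidence: int, references: int) -> int:
    """Each participant compares every reference item with every evidence item."""
    return participants * evidence * references


def rate(count: int, total: int) -> float:
    """Share of all possible responses, as a fraction (not a percentage)."""
    return count / total


total = possible_responses(PARTICIPANTS, EVIDENCE_ITEMS, REFERENCE_ITEMS)
discordant = FALSE_NEGATIVES + FALSE_POSITIVES

print(f"possible responses:  {total:,}")
print(f"false negative rate: {rate(FALSE_NEGATIVES, total):.4%}")
print(f"false positive rate: {rate(FALSE_POSITIVES, total):.4%}")
print(f"all discordances:    {rate(discordant, total):.4%}")
```

Run as written, the total reproduces the 110,408 possible responses quoted above, and the 98 reported discordances amount to less than 0.1% of them. The proportion says nothing about why each discordance occurred, which is the question examined here.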

A reader could infer from these statements that probabilistic genotyping software (PGS or PG) contributed to the false positives or false negatives.

The Draft NIST Foundation Review concludes: “KEY TAKEAWAY #4.1: The degree of reliability of a component or a system can be assessed using empirical data (when available) obtained through validation studies, interlaboratory studies, and proficiency tests”.

We examine a set of proficiency test results to determine whether both of these NIST statements are justified.

2. Method
Collaborative Testing Services, Inc. (CTS) publishes the results of its proficiency tests online (accessed on 8 October 2022). The CTS forensic biology proficiency tests provide four items (either as samples or as profiles): Items 1 and 2 serve as references for comparison with questioned Items 3 and 4. A mock case scenario is also provided. Respondents are asked to report the genotyping results for the four items and to answer a question such as: could the Victim (Item 1) and/or the Suspect (Item 2) be a contributor to the questioned samples (Item 3 and Item 4)? Answers are given by ticking one of the boxes yes, no, inconclusive, or no interpretation. There are therefore up to four comparisons per test (two references × two questioned items). No statistical analysis is requested.

The test manufacturer provides the consensus result of the pre-distribution laboratories and at least 10 participating laboratories. We take these consensus results to be the expected answers.

We searched the summary reports for each relevant forensic biology test (Forensic Biology, Semen, and Mixture) for the years 2018–2021 and noted every participant whose answer differed from the consensus result. From the text of the Draft NIST Foundation Review, we infer that the NIST team scored a “yes” where the consensus was “no” as a false positive, and a “no” where the consensus was “yes” as a false negative. We follow the same procedure, but note that the terms false positive and false negative then also cover cases such as a component that fell below the detection threshold.
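The scoring rule we infer from the Draft NIST Foundation Review can be expressed as a small function. This is a sketch of our reading of that rule, not a documented NIST or CTS procedure. The answer labels follow the options on the CTS form; the `Answer` and `score` names are hypothetical, and treating any mismatch that is not a plain yes/no disagreement (for example, “inconclusive” against a consensus “yes”) as neither a false positive nor a false negative is our assumption.

```python
from enum import Enum


class Answer(Enum):
    """Options offered on the CTS form for each reference/evidence comparison."""
    YES = "yes"
    NO = "no"
    INCONCLUSIVE = "inconclusive"
    NO_INTERPRETATION = "no interpretation"


def score(reported: Answer, consensus: Answer) -> str:
    """Classify one comparison against the consensus answer.

    Assumption (not stated by NIST): only yes/no disagreements are scored
    as errors; any other mismatch is returned as "other".
    """
    if reported == consensus:
        return "concordant"
    if reported is Answer.YES and consensus is Answer.NO:
        return "false positive"
    if reported is Answer.NO and consensus is Answer.YES:
        return "false negative"
    return "other"


# A "no" for the victim on a stain whose consensus answer is "yes".
print(score(Answer.NO, Answer.YES))  # false negative
```

Under this rule, a conclusion that is correct for the analysis actually performed, such as a “no” for the victim when her component fell below the detection threshold, is still scored as a false negative.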

CTS provided additional data at our request. CTS data analysts mined their records to provide, for each test, the number of false positive and false negative results and how many of the laboratories reporting them indicated PGS use. For example, “three false negatives, one PGS lab” indicates that two non-PGS laboratories and one PGS laboratory reported a false negative for that test. Data were separated on this basis. We are limited to what participating laboratories actually reported about their PGS use, and these data are imperfect: there is no requirement for a laboratory to indicate PGS use on the CTS test, so disclosure is entirely at the discretion of each laboratory and its internal policies. In addition, some laboratories do not use their PGS to determine the inclusion/exclusion status of a reference.
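The per-test summaries supplied by CTS can be split into PGS and non-PGS counts by subtraction. The sketch below assumes, as the example above does, that each discordant result comes from a different laboratory, and that “PGS lab” means a laboratory that declared PGS use on the form. The `TestSummary` record and its field names are hypothetical; the usage line encodes only the worked example from the text, not actual CTS data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TestSummary:
    """Hypothetical record for one CTS test, as summarised by CTS analysts."""
    test_id: str
    false_negatives: int
    false_positives: int
    pgs_labs_fn: int  # labs declaring PGS use among the false negatives
    pgs_labs_fp: int  # labs declaring PGS use among the false positives


def split_by_pgs(s: TestSummary) -> dict[str, int]:
    """Split discordances into declared-PGS and other laboratories.

    Assumes one discordant result per laboratory. PGS use is self-declared
    on the CTS form, so "non-PGS" really means "did not declare PGS use".
    """
    if s.pgs_labs_fn > s.false_negatives or s.pgs_labs_fp > s.false_positives:
        raise ValueError("more PGS labs than discordant results")
    return {
        "fn_pgs": s.pgs_labs_fn,
        "fn_non_pgs": s.false_negatives - s.pgs_labs_fn,
        "fp_pgs": s.pgs_labs_fp,
        "fp_non_pgs": s.false_positives - s.pgs_labs_fp,
    }


# Worked example from the text: "three false negatives, one PGS lab".
example = TestSummary("example", false_negatives=3, false_positives=0,
                      pgs_labs_fn=1, pgs_labs_fp=0)
print(split_by_pgs(example))
# {'fn_pgs': 1, 'fn_non_pgs': 2, 'fp_pgs': 0, 'fp_non_pgs': 0}
```

Because disclosure is voluntary, a laboratory that uses PGS but did not say so is counted here as non-PGS; the split is only as good as the self-reported field.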

We examined the summary reports to assign a probable cause to each discordance. Sometimes the participant had given a comment that indicated the reason. In other cases, the reported genotypes were consistent with the consensus result but the yes/no conclusion was not. We surmise that these forms were filled in incorrectly, since the inclusion or exclusion in these cases was obvious from the genotypes.

3. Results
Only seven of the discordant results did not have an obvious cause. Five of these appeared to result from ticking the wrong boxes on the CTS report form; all of the reported profiles for these five instances were consistent with the consensus result. The two remaining instances contained DNA profiles that were not consistent with the consensus results.

Fifteen discordant results arose because the laboratory reported only the male fraction/component of a semen-containing stain in which the victim was also detected. As a result, the victim was “excluded” on the CTS form from stains known to be a mixture of the victim and the male (semen) component.

There was one instance of possible low-level contamination of an evidence item (Item 3) with a reference item (Item 2), which led to a false inclusion.

In four instances, the female or victim component of a blood/semen mixture was weakly detected in the epithelial fraction. The participants judged the minor component unsuitable for comparison, which resulted in false negative conclusions with respect to these individuals.

The majority of the discordant results were false negatives due to the type of analysis performed. Nineteen participants performed mtDNA analysis on the evidence samples. When these samples contained a mixture of blood and semen, only the mtDNA from the blood component was detected and reported. This resulted in false exclusions of the semen contributor. While these conclusions are not consistent with the consensus results, they are consistent with expected results of mtDNA analysis of this type of mixture.

The final discordant conclusion appeared to be the result of a sample switch. The reported DNA profile of one of the reference samples (Item 2) was a mixture, whereas the reported DNA profile of one of the evidence samples (Item 3) was single source, although the consensus result for Item 3 was a mixed DNA profile. This pattern is consistent with Items 2 and 3 having been interchanged.

4. Conclusions
The false positives and false negatives attributed to three probable causes are not errors and are not related to PGS in any fashion. These causes are: reporting only the male fraction; judging the minor (female) component of the differential epithelial cell fraction unsuitable for comparison; and mtDNA analysis of a blood/semen mixture. The “only reported male fraction” results were concordant with respect to the male fraction. The results of the mtDNA analyses are what one would expect for a blood/semen mixture. The unsuitable minor (female) component is not a problem with interpretation but with the extraction or the original sample set-up. It may also depend on where on the substrate provided in the test the cutting was taken.

Sample switching, contamination, and incorrect reporting of results are serious errors. Routine casework includes a technical review that would most likely catch such errors. Nonetheless, none of them relates to the mixture interpretation strategy, and certainly not to PGS.

The most serious interpretation error in forensic science is generally considered to be a false positive, or erroneous inclusion. According to the data provided by CTS at our request, there were zero false positives among laboratories that used PGS. This information cannot be obtained from the data as presented in the Draft NIST Foundation Review. We do not claim that using PGS makes a respondent error-proof. We point this out only to show that the CTS data, as presented in the Draft NIST Foundation Review, do not support the kind of discussion NIST builds on them.

In the end, proficiency test data are currently not a good metric for judging the overall reliability of a system. Individual laboratories can use the results to assess how their own participants performed, because they know the conditions and parameters of their analysis and reporting. In addition, there are no restrictions on who can participate in vendor-provided proficiency tests, so these tests can be used for training, research, or academic purposes, and not every response reflects casework practice. Attempting to judge the overall reliability of a discipline or system from proficiency test data without knowing the source and cause of each discordant result is misleading and uninformative.

The degree of reliability of PGS, or of any system, cannot be assessed simply by examining the proficiency testing numbers presented in the Draft NIST Foundation Review. These numbers can be misleading and do not represent the reliability of a system. If proficiency test data are to be used to evaluate reliability, a more in-depth examination, including the source and cause of each discordant result, must be performed.