The International Organization for Standardization (ISO) (2019) notes a metric is reliable inasmuch as it consistently provides the same result when applied to the same phenomena. For the measure score, the phenomenon is the quality construct. For the data element, the phenomena are the demographic, health status, health care activity, or other patient, clinician, or encounter attributes. For an instrument, the phenomenon may be a construct such as ‘satisfaction’ or ‘functional status.’
Accountable Entity Level (Measure Score) Reliability
Measure developers should conduct accountable entity-level reliability testing with computed measure scores for each measured entity. Adams (2009) defines measure score reliability conceptually as the ratio of signal to noise, where the signal is the proportion of the variability in measured performance explained by real differences in performance (i.e., differences in the quality construct). Measure score reliability matters because it reflects the ability to distinguish differences between measured entities due to true differences in performance rather than to chance (and therefore to reduce the probability of misclassification in comparative performance). The measure developer should always assess the measure score under development for reliability using data derived from testing.
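For a dichotomous measure, the signal-to-noise idea can be sketched directly. The following is a minimal, hypothetical Python example (the entity counts are invented for illustration): a beta-binomial model fit to per-entity pass counts supplies the between-entity (signal) variance, and each entity's binomial sampling variance supplies the noise, in the spirit of the approach described by Adams (2009).

```python
# Minimal sketch of signal-to-noise reliability for a dichotomous (pass/fail)
# measure. Entity counts are hypothetical; this is an illustration, not a
# prescribed implementation.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

# Hypothetical per-entity numerator passes and denominator sizes.
passes = np.array([45, 60, 30, 80, 55])
denoms = np.array([50, 75, 40, 100, 70])

def neg_log_likelihood(params):
    """Negative log-likelihood of a beta-binomial model with shape a, b."""
    a, b = np.exp(params)          # exponentiate to keep a, b positive
    return -betabinom.logpmf(passes, denoms, a, b).sum()

# Fit the beta distribution of true entity performance by maximum likelihood.
res = minimize(neg_log_likelihood, x0=np.log([2.0, 2.0]), method="Nelder-Mead")
a, b = np.exp(res.x)

# Between-entity (signal) variance from the fitted beta distribution.
var_between = (a * b) / ((a + b) ** 2 * (a + b + 1))

# Within-entity (noise) variance for each entity; p(1 - p)/n is a common
# simplification of the entity's sampling variance.
p_hat = passes / denoms
var_within = p_hat * (1 - p_hat) / denoms

# Signal-to-noise reliability for each entity: signal / (signal + noise).
reliability = var_between / (var_between + var_within)
for n, r in zip(denoms, reliability):
    print(f"denominator={n:4d}  reliability={r:.3f}")
```

Note that for a fixed between-entity variance, entities with larger denominators have smaller sampling variance and therefore higher reliability.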
The measure developer may assess accountable entity level reliability using an accepted method. See the table for examples.
| Measure of Reliability | Description | Recommended Uses | Measure Reliability Tests |
|---|---|---|---|
| Signal-to-noise | Estimates the proportion of overall variability explained by the differences between entities | Entity-level scores aggregated from dichotomous data at the patient level | Beta-binomial model (Adams, 2009) |
| Temporal correlation | Similar to random split-half correlation; assesses the correlation of data from adjacent time periods for each entity | To compare to a dataset from the same source separated by a minimal time period | Intraclass correlation coefficient (ICC); Pearson's correlation coefficient; Spearman's rho (ρ), a non-parametric correlation of ranked data; Kendall's tau (τ), a non-parametric correlation of ranked data that is less sensitive to small sample sizes |
| Random split-half correlation | Randomly splits data from each entity in half and estimates the correlation of the two halves across all entities | When an independent dataset or data from an adjacent time period are not available | ICC; Pearson's correlation coefficient; Spearman's rho (ρ), a non-parametric correlation of ranked data; Kendall's tau (τ), a non-parametric correlation of ranked data that is less sensitive to small sample sizes |
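To illustrate the random split-half row in the table above, the following minimal sketch (hypothetical patient-level data) randomly splits each entity's outcomes in half, scores each half, and correlates the two sets of entity scores. The entity names, sample sizes, and underlying rates are invented for illustration.

```python
# Minimal sketch of random split-half reliability with hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical patient-level pass/fail outcomes for ten measured entities.
entities = {f"entity_{i}": rng.binomial(1, p, size=200)
            for i, p in enumerate(np.linspace(0.60, 0.95, 10))}

half1_scores, half2_scores = [], []
for outcomes in entities.values():
    shuffled = rng.permutation(outcomes)
    half = len(shuffled) // 2
    half1_scores.append(shuffled[:half].mean())   # entity score on first half
    half2_scores.append(shuffled[half:].mean())   # entity score on second half

# Correlate the two halves across entities (Pearson for interval-scaled scores;
# Spearman's rho would be used for ranked data).
pearson_r, _ = stats.pearsonr(half1_scores, half2_scores)
spearman_rho, _ = stats.spearmanr(half1_scores, half2_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```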
The measure developer should also determine whether the reliability extends to the repeatability of significant differences between group means and/or the stability of rankings within groups. In other words, does the measure reliably detect differences in scores between groups expected to be different, or does the measure allocate the same proportion of participants into ranks on different administrations? The measure developer can accomplish this in part by testing the significance of differences in measure scores between groups. The measure developer needs to select the tests carefully to account for the data distribution and whether the scores are from the same respondents (dependent or paired samples) or different respondents (independent or unpaired samples). With respect to the data distribution, use parametric tests if the data are normally distributed (e.g., the mean best represents the center and the data are ratio or interval). Use non-parametric tests if the data are not normally distributed (e.g., the median better represents the center and the data are categorical or ordinal) or if the sample size is small (Trochim, 2002; Sullivan, n.d.). For example, measure developers often report medians for reliability scores because they do not tend to be normally distributed. Use a special case of non-parametric testing for dichotomous categorical data (e.g., Yes/No); see the Tests of Measure Score Differences table and the sketch that follows it.
Tests of Measure Score Differences (Trochim, 2002)
| Comparisons | Parametric (Ratio or Interval Data) | Non-Parametric (Ordinal, Nominal) | Non-Parametric (Dichotomous) |
|---|---|---|---|
| One group score to reference value | 1-sample t-test | Wilcoxon test | Chi-square |
| Two scores from two groups | Unpaired t-test | Mann-Whitney U | Fisher's exact test |
| Two scores from the same group | Paired t-test | Wilcoxon test | McNemar's test |
| More than two scores from two groups | Analysis of variance (ANOVA) | Kruskal-Wallis | Chi-square |
| More than two scores from the same group | Repeated measures ANOVA | Friedman test | Cochran's Q |
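As referenced above, the following minimal sketch (hypothetical scores) illustrates choosing between a parametric and a non-parametric test when comparing measure scores from two independent groups. The normality check and the 0.05 threshold are illustrative choices, not requirements of the table.

```python
# Minimal sketch of test selection for two independent groups of scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(0.82, 0.05, size=40)   # hypothetical entity scores
group_b = rng.normal(0.78, 0.05, size=40)

# Check approximate normality (Shapiro-Wilk) before selecting the test.
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    # Parametric: unpaired t-test on interval data.
    stat, p = stats.ttest_ind(group_a, group_b)
    test = "unpaired t-test"
else:
    # Non-parametric: Mann-Whitney U on ranks.
    stat, p = stats.mannwhitneyu(group_a, group_b)
    test = "Mann-Whitney U"

print(f"{test}: statistic={stat:.3f}, p={p:.4f}")
```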
Patient/Encounter Level (Data Element) Reliability
Measure developers conduct data element reliability testing with patient- or encounter-level data elements (numerator, denominator, and exclusions, at a minimum). Patient/encounter level reliability refers to the repeatability of the testing findings, that is, whether the measure specifications extract the correct data elements consistently across measured entities. Per the CMS consensus-based entity (CBE), if the measure developer assesses patient/encounter level validity, they do not need to test data element reliability. The CMS CBE does not require data element reliability testing for electronic clinical quality measures (eCQMs) based on data from structured fields. However, the CMS CBE requires demonstration of reliability at both the patient/encounter level and the computed performance score for instrument-based measures, including patient-reported outcome-based performance measures. Measure developers may exclude from reliability testing data elements already established as reliable (e.g., age). Measure developers should critically review all data elements before deciding which to include in reliability testing.
Testing patient/encounter level reliability is less common with digital measures than with manually abstracted measures because electronic systems standardize much of the data capture and extraction process. However, when using natural language processing to extract data elements for digital measures, the measure developer should conduct patient/encounter level reliability testing in addition to patient/encounter level validity testing.
Types of Reliability
Depending on the complexity of the measure specifications, the measure developer may assess one or more types of reliability. Some general types of reliability, their uses, and associated tests appear in the following table.
Summary of Measure Data Element Reliability Types, Uses, and Tests (Soobiah et al., 2019)
| Measure of Reliability | Description | Recommended Uses | Measure Reliability Tests |
|---|---|---|---|
| Internal consistency | Assesses the extent to which items designed to measure a given construct (e.g., in a multiple-item survey) are inter-correlated, i.e., the extent to which data elements within a measure score are measuring the same construct. | Evaluating data elements or items for a single construct (i.e., questionnaire design) | Cronbach's alpha (α) (Cronbach, 1951) assesses how continuous data elements within a measure are correlated with each other; Kuder-Richardson Formula 20 (KR-20) assesses reliability of dichotomous data; McDonald's omega (ω) |
| Test-retest reliability (temporal reliability) | The consistency of scores from the same respondent across two administrations of a measurement (Bland, 2000). The measure developer should use the coefficient of stability to quantify the association between the two measurement occasions, or when assessing information not expected to change over a short or medium interval of time. Test-retest reliability is not appropriate for repeated measurement of disease symptoms, nor for measuring intermediate outcomes that follow an expected trajectory of improvement or deterioration. The measure developer assesses test-retest reliability when there is a rationale for expecting stability, rather than change, over the time period. | Assessing whether participant performance on the same test is repeatable, or assessing the consistency of scores across time. Used if the measure developer expects the measured construct to be stable over time and measured objectively. | Correlation coefficient quantifies the association between continuous (Pearson, r_p) or ordinal (Spearman, r_s) scores; ICC reflects both correlation and agreement between measurements of continuous data; Kendall's tau correlation coefficient (τ_b) is a non-parametric test of association |
| Intra-rater/abstractor reliability | The consistency of scores assigned by one rater across two or more measurements (Bland, 2000). | Assessing consistency of assignment or abstraction by one rater (e.g., rater scores at two timepoints). | ICC; Kendall's tau; Gwet's AC1; percent agreement (the number of ratings that agree divided by the total ratings); Cohen's kappa |
| Inter-rater (inter-abstractor) reliability | The consistency of ratings from two or more observers (often using the same method or instrumentation) when rating the same information (Bland, 2000). Frequently employed to assess reliability of data elements used in exclusion specifications, as well as in the calculation of measure scores when the measure requires review or abstraction. Quantitatively summarize the extent of inter-rater/abstractor reliability; concordance rates with confidence intervals are acceptable statistics for describing inter-rater/abstractor reliability. eCQMs implemented as direct queries to electronic health record databases may not use abstraction, so inter-rater reliability may not be needed for eCQMs. | Assessing consistency of assignment by multiple raters. | ICC; Kendall's tau; Gwet's AC1; percent agreement; Cohen's kappa |
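To illustrate the agreement statistics listed for intra- and inter-rater reliability, the following minimal sketch uses hypothetical ratings of a dichotomous data element from two abstractors and computes percent agreement and Cohen's kappa directly from the ratings.

```python
# Minimal sketch of inter-abstractor agreement on a dichotomous data element.
# Ratings are hypothetical (1 = data element present, 0 = absent).
import numpy as np

abstractor_a = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0])
abstractor_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0])

# Percent agreement: ratings that agree divided by total ratings.
percent_agreement = np.mean(abstractor_a == abstractor_b)

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
p_o = percent_agreement
p_yes = abstractor_a.mean() * abstractor_b.mean()
p_no = (1 - abstractor_a.mean()) * (1 - abstractor_b.mean())
p_e = p_yes + p_no
kappa = (p_o - p_e) / (1 - p_e)

print(f"Percent agreement = {percent_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```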
- Intraclass Correlation Coefficient
ICC is one of the most commonly used indices of test-retest, intra-rater, and inter-rater reliability, reflecting both the degree of correlation and agreement between measurements of continuous data (Koo & Li, 2016). There are six different variations of ICC. The variation selected depends on the number of raters (model and form) and the type of reliability (agreement or consistency in ratings) assessed. The result is an expected value of the true ICC, estimated with a 95 percent confidence interval. The measure developer may consider a test to have moderate reliability if the ICC is between 0.5 and 0.75 and excellent reliability if the ICC is greater than 0.75 (Gwet, 2008). See the Variations of Intraclass Correlation table.
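As a hedged illustration, the sketch below computes one common variation, ICC(2,1) (two-way random effects, absolute agreement, single measurement), from ANOVA mean squares. The ratings of six subjects by four raters are hypothetical and chosen only to make the arithmetic concrete.

```python
# Minimal sketch of ICC(2,1) computed from two-way ANOVA mean squares.
# Rows are subjects (or entities); columns are raters. Ratings are hypothetical.
import numpy as np

ratings = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
], dtype=float)
n, k = ratings.shape                       # n subjects, k raters

grand_mean = ratings.mean()
row_means = ratings.mean(axis=1)           # per-subject means
col_means = ratings.mean(axis=0)           # per-rater means

# Two-way ANOVA decomposition of the total sum of squares.
ss_rows = k * np.sum((row_means - grand_mean) ** 2)
ss_cols = n * np.sum((col_means - grand_mean) ** 2)
ss_total = np.sum((ratings - grand_mean) ** 2)
ss_error = ss_total - ss_rows - ss_cols

ms_rows = ss_rows / (n - 1)                # between-subject mean square
ms_cols = ss_cols / (k - 1)                # between-rater mean square
ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

# ICC(2,1): agreement of single ratings, raters treated as a random sample.
icc_2_1 = (ms_rows - ms_error) / (
    ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
)
print(f"ICC(2,1) = {icc_2_1:.2f}")         # about 0.29 for these hypothetical ratings
```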