The International Organization for Standardization (ISO, 1995) notes that "a metric is reliable inasmuch as it constantly provides the same result when applied to the same phenomena." For the measure score, the phenomenon is the quality construct. For the data element, the phenomena are the demographic, health status, health care activity, or other patient, clinician, or encounter attributes. For an instrument, the phenomenon may be a construct such as 'satisfaction' or 'functional status.'
Accountable Entity Level (Measure Score) Reliability
Measure developers should conduct accountable entity-level reliability testing with computed measure scores for each measured entity. Adams (2009) defines measure score reliability conceptually as the ratio of signal to noise, where the signal is the proportion of the variability in measured performance explained by real differences in performance (differences in the quality construct). Measure score reliability matters because it reflects the ability to distinguish differences between measured entities due to true differences in performance rather than to chance (and therefore to reduce the probability of misclassification in comparative performance). The measure developer should always assess the measure score under development for reliability using data derived from testing.
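The signal-to-noise idea can be sketched numerically: reliability is between-entity variance divided by total (between plus within) variance, with within-entity noise for a dichotomous measure approximated by the binomial term p(1 − p)/n. The function name, pass rate, caseload, and variance value below are illustrative assumptions, not figures from Adams (2009).

```python
# Minimal sketch of signal-to-noise reliability for a dichotomous measure,
# assuming the variance-decomposition view described by Adams (2009):
# reliability = between-entity variance / (between + within variance).
# All numbers are hypothetical.

def snr_reliability(pass_rate: float, n_patients: int, between_var: float) -> float:
    """Signal-to-noise reliability estimate for one measured entity."""
    within_var = pass_rate * (1.0 - pass_rate) / n_patients  # binomial sampling noise
    return between_var / (between_var + within_var)

# Hypothetical entity: 0.80 pass rate on 200 patients, with
# between-entity variance of 0.004 across all measured entities.
r = snr_reliability(0.80, 200, 0.004)
print(round(r, 3))  # → 0.833
```

Note that the estimate rises with the entity's caseload, which is one reason minimum case-count thresholds improve measure score reliability.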
The measure developer may assess accountable entity-level reliability using an accepted method. See the table for examples.
| Measure of Reliability | Description | Recommended Uses | Measure Reliability Tests |
| --- | --- | --- | --- |
| Signal-to-noise | Estimates the proportion of overall variability explained by the differences between entities | Entity-level scores aggregated from dichotomous data at the patient level | Beta-binomial model (Adams, 2009) |
| Temporal correlation | Similar to random split-half correlation; assesses the correlation of data from adjacent time periods for each entity | To compare to a dataset from the same source separated by a minimal time period | ICC (intraclass correlation coefficient); Pearson's correlation coefficient; Spearman's ρ, a nonparametric correlation of ranked data; Kendall's tau, a nonparametric correlation of ranked data that is less sensitive to small sample sizes |
| Random split-half correlation | Randomly splits data from each entity in half and estimates the correlation of the two halves across all entities | When an independent dataset or data from an adjacent time period are not available | ICC; Pearson's correlation coefficient; Spearman's ρ; Kendall's tau |


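The random split-half approach can be sketched as follows. All entity data below are simulated and the helper names are hypothetical. Note that a raw split-half correlation reflects reliability at half the sample size, so it is commonly stepped up with the Spearman-Brown formula.

```python
import random

# Illustrative sketch of random split-half reliability, assuming dichotomous
# patient-level outcomes (1 = pass, 0 = fail) for each entity. Data are made up.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_correlation(entity_outcomes, seed=0):
    """Randomly halve each entity's outcomes; correlate half-scores across entities."""
    rng = random.Random(seed)
    half1, half2 = [], []
    for outcomes in entity_outcomes:
        shuffled = outcomes[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half1.append(sum(shuffled[:mid]) / mid)
        half2.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson(half1, half2)

# Five hypothetical entities with 100 simulated patients each:
data_rng = random.Random(42)
entities = [[1 if data_rng.random() < p else 0 for _ in range(100)]
            for p in (0.6, 0.7, 0.8, 0.9, 0.95)]
print(round(split_half_correlation(entities), 3))
```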
The measure developer should also determine whether the reliability extends to the repeatability of significant differences between group means and/or the stability of rankings within groups. In other words, does the measure reliably detect differences in scores between groups expected to be different, or does the measure allocate the same proportion of participants into ranks on different administrations? The measure developer can accomplish this in part by testing the significance of differences in measure scores between groups. The measure developer needs to select the tests carefully to account for the data distribution and for whether the scores are from the same respondents (dependent or paired samples) or different respondents (independent or unpaired samples). With respect to the data distributions, use parametric tests if data are normally distributed (e.g., the mean best represents the center, data are ratio or interval). Use nonparametric tests if data are not normally distributed (e.g., the median better represents the center and the data are categorical or ordinal) or if the sample size is small (Trochim, 2002; Sullivan, n.d.). For example, measure developers often report medians for reliability scores because they do not tend to be normally distributed. Dichotomous categorical data (e.g., Yes/No) are a special case of nonparametric testing; see the Tests of Measure Score Differences table.
Tests of Measure Score Differences (Trochim, 2002)
| Comparisons | Parametric (Ratio or Interval Data) | Non-Parametric (Ordinal, Nominal) | Non-Parametric (Dichotomous) |
| --- | --- | --- | --- |
| One group score to reference value | One-sample t-test | Wilcoxon test | Chi-square |
| Two scores from two groups | Unpaired t-test | Mann-Whitney | Fisher's exact test |
| Two scores from the same group | Paired t-test | Wilcoxon test | McNemar's test |
| More than two scores from two groups | Analysis of variance (ANOVA) | Kruskal-Wallis | Chi-square |
| More than two scores from the same group | Repeated measures ANOVA | Friedman test | Cochran's Q |

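When the distributional assumptions behind the parametric column are in doubt, a permutation test is one generic nonparametric alternative for comparing two independent groups of measure scores. It is not one of the named tests in the table, and the scores below are hypothetical.

```python
import random

# Minimal permutation test for the difference in mean measure scores between
# two independent groups of entities. All data are illustrative.

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Estimate a two-sided p-value by randomly relabeling group membership."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            hits += 1
    return hits / n_perm

scores_a = [0.71, 0.74, 0.69, 0.77, 0.72, 0.75]  # hypothetical group A scores
scores_b = [0.81, 0.85, 0.79, 0.83, 0.86, 0.80]  # hypothetical group B scores
print(permutation_test(scores_a, scores_b))      # small p-value: groups differ
```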
Patient/Encounter Level (Data Element) Reliability
Measure developers conduct data element reliability testing with patient- or encounter-level data elements (numerator, denominator, and exclusions, at a minimum). Patient/encounter level reliability refers to the repeatability of the testing findings; that is, whether the measure specifications can extract the correct data elements consistently across measured entities. Per the CMS consensus-based entity (CBE), if the measure developer assesses patient/encounter level validity, they do not need to test data element reliability. The CMS CBE does not require data element reliability testing for electronic clinical quality measures (eCQMs) based on data from structured fields. However, the CMS CBE requires demonstration of reliability at both the patient/encounter level and the computed performance score for instrument-based measures, including patient-reported outcome-based performance measures. Measure developers may exclude from reliability testing data elements already established as reliable (e.g., age). Measure developers should critically review all data elements prior to deciding which to include in reliability testing.
Testing patient/encounter level reliability is less common with digital measures than with manually abstracted measures because electronic systems are good at standardizing processes. However, when using natural language processing to extract data elements for digital measures, the measure developer should conduct patient/encounter level reliability testing in addition to patient/encounter level validity testing.
Types of Reliability
Depending on the complexity of the measure specifications, the measure developer may assess one or more types of reliability. Some general types of reliability include the following.
Internal Consistency

Internal consistency assesses the extent to which items designed to measure a given construct (e.g., the items of a multiple-item survey) are intercorrelated; that is, the extent to which the data elements within a measure score are measuring the same construct.
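Cronbach's alpha, a common internal consistency statistic, can be computed as in the minimal sketch below; the 4-item, 5-respondent scores are made up for illustration.

```python
# Sketch of Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance
# of total scores). Item data are hypothetical Likert-style scores.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

def cronbach_alpha(item_scores):
    """item_scores: list of items, each a list of per-respondent scores."""
    k = len(item_scores)
    item_var_sum = sum(variance(item) for item in item_scores)
    totals = [sum(scores) for scores in zip(*item_scores)]  # per-respondent totals
    return (k / (k - 1)) * (1 - item_var_sum / variance(totals))

items = [
    [3, 4, 4, 2, 5],  # item 1, respondents 1-5
    [3, 5, 4, 2, 4],
    [4, 4, 5, 3, 5],
    [3, 4, 4, 2, 4],
]
print(round(cronbach_alpha(items), 3))  # → 0.948
```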
Test-Retest (Temporal Reliability)

Test-retest reliability (temporal reliability) is the consistency of scores from the same respondent across two administrations of a measurement (Bland, 2000). The measure developer should use the coefficient of stability to quantify the association between the two measurement occasions, or when assessing information not expected to change over a short or medium interval of time. Test-retest reliability is not appropriate for repeated measurement of disease symptoms, nor for measuring intermediate outcomes that follow an expected trajectory of improvement or deterioration. The measure developer assesses test-retest reliability when there is a rationale for expecting stability, rather than change, over the time period.
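A test-retest check often reduces to correlating the two administrations. The sketch below uses the Pearson correlation for continuous scores; the paired scores are hypothetical.

```python
# Illustrative test-retest (temporal) reliability via Pearson correlation
# between two administrations of the same measurement. Scores are made up.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

time1 = [12, 15, 11, 18, 14, 16, 13]  # administration 1, respondents 1-7
time2 = [13, 14, 12, 17, 15, 16, 12]  # administration 2, same respondents
print(round(pearson_r(time1, time2), 3))  # → 0.916
```

For ordinal scores, the same comparison would use Spearman's ρ on the ranked data instead.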
Intra-rater Reliability

Intra-rater reliability is the consistency of scores assigned by one rater across two or more measurements (Bland, 2000).
Inter-rater (Inter-abstractor) Reliability

Inter-rater (inter-abstractor) reliability is the consistency of ratings from two or more observers (often using the same method or instrumentation) when rating the same information (Bland, 2000). It is frequently used to assess the reliability of data elements used in exclusion specifications, as well as the calculation of measure scores when the measure requires review or abstraction. Concordance rates with confidence intervals are acceptable statistics for quantitatively summarizing the extent of inter-rater/abstractor reliability. eCQMs implemented as direct queries to electronic health record databases may not use abstraction; therefore, inter-rater reliability may not be needed for eCQMs.
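Two common inter-rater statistics, percent agreement and Cohen's kappa, can be sketched as follows; the two raters' dichotomous abstraction calls are made up.

```python
# Sketch of inter-rater (inter-abstractor) reliability on dichotomous
# abstraction results. Kappa corrects percent agreement for chance agreement.

def percent_agreement(r1, r2):
    """Share of cases on which the two raters agree."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    p_obs = percent_agreement(r1, r2)
    categories = set(r1) | set(r2)
    p_chance = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)

rater1 = ["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "N"]  # hypothetical calls
rater2 = ["Y", "Y", "N", "Y", "Y", "Y", "Y", "N", "N", "N"]
print(percent_agreement(rater1, rater2), round(cohens_kappa(rater1, rater2), 3))
# → 0.8 0.583
```

The gap between the two numbers shows why kappa is preferred: 80 percent raw agreement shrinks to a kappa of 0.583 once chance agreement is removed.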
Summary of Measure Data Element Reliability Types, Uses, and Tests (Soobiah et al., 2019)
| Measure of Reliability | Recommended Uses | Measure Reliability Tests |
| --- | --- | --- |
| Internal consistency | Evaluating data elements or items for a single construct (i.e., questionnaire design) | Cronbach's alpha (α) (Cronbach, 1951), which assesses how continuous data elements within a measure are correlated with each other; Kuder-Richardson Formula 20 (KR-20), which assesses the reliability of dichotomous data; McDonald's omega (ω) |
| Test-retest reliability (temporal reliability) | Assessing whether participant performance on the same test is repeatable, or assessing the consistency of scores across time; used if the measure developer expects the measured construct to be stable over time and measured objectively | Correlation coefficient, which quantifies the association between continuous (Pearson, r_p) or ordinal (Spearman, r_s) scores; ICC, which reflects both correlation and agreement between measurements of continuous data; Kendall's tau correlation coefficient (τ_b), a nonparametric test of association |
| Intra-rater/abstractor reliability | Assessing the consistency of assignment or abstraction by one rater (e.g., rater scores at two timepoints) | ICC; Kendall's tau; Gwet's AC1; percent agreement (the number of ratings that agree divided by the total number of ratings); Cohen's kappa |
| Inter-rater (inter-abstractor) reliability | Assessing the consistency of assignment by multiple raters | ICC; Kendall's tau; Gwet's AC1; percent agreement; Cohen's kappa |


Intraclass Correlation Coefficient
The ICC is one of the most commonly used indices of test-retest, intra-rater, and inter-rater reliability, reflecting both the degree of correlation and the agreement between measurements of continuous data (Koo & Li, 2016). There are six variations of the ICC; which variation to select depends on the number of raters (model and form) and the type of reliability assessed (agreement or consistency in ratings). The result is an estimate of the true ICC with a 95 percent confidence interval. The measure developer may consider a test to have moderate reliability if the ICC is between 0.5 and 0.75 and excellent reliability if the ICC is greater than 0.75 (Gwet, 2008). See the Variations of Intraclass Correlation table.
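As a sketch of one of the six variations, the one-way random-effects form, often written ICC(1,1), can be computed from a one-way ANOVA decomposition as below. The ratings are hypothetical, and the other ICC forms use different (though related) formulas.

```python
# Illustrative one-way random-effects ICC, ICC(1,1):
# (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and MSW are the between-
# and within-subject mean squares from a one-way ANOVA. Data are made up.

def icc_oneway(subjects):
    """subjects: list of subjects, each a list of k measurements."""
    k = len(subjects[0])
    n = len(subjects)
    grand = sum(sum(s) for s in subjects) / (n * k)
    subject_means = [sum(s) / k for s in subjects]
    msb = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    msw = sum((x - m) ** 2
              for s, m in zip(subjects, subject_means)
              for x in s) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

ratings = [
    [9, 8],   # subject 1: two repeated measurements
    [6, 7],
    [8, 8],
    [7, 6],
    [10, 9],
]
print(round(icc_oneway(ratings), 3))  # → 0.789
```

By the thresholds cited above, an ICC of 0.789 would indicate excellent reliability (greater than 0.75).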