
Reliability

The International Organization for Standardization (ISO) (2019) notes that a metric is reliable inasmuch as it consistently provides the same result when applied to the same phenomenon. For the measure score, the phenomenon is the quality construct. For the data element, the phenomena are the demographic, health status, health care activity, or other patient, clinician, or encounter attributes. For an instrument, the phenomenon may be a construct such as ‘satisfaction’ or ‘functional status.’

Accountable Entity Level (Measure Score) Reliability

Measure developers should conduct accountable entity-level reliability testing with computed measure scores for each measured entity. Adams (2009) defines measure score reliability conceptually as the ratio of signal to noise, where the signal is the proportion of the variability in measured performance explained by real differences in performance (differences in the quality construct). Measure score reliability matters because it reflects the ability to distinguish differences between measured entities due to true differences in performance rather than to chance (and therefore to reduce the probability of misclassification in comparative performance). The measure developer should always assess the measure score under development for reliability using data derived from testing.
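For illustration, the following Python sketch estimates a per-entity signal-to-noise reliability from hypothetical pass/fail counts. It uses a simple method-of-moments approximation of the between-entity (signal) variance rather than the full beta-binomial model described by Adams (2009); the entity counts and the function name are assumptions for demonstration only.

```python
import numpy as np

def signal_to_noise_reliability(successes, totals):
    """Illustrative signal-to-noise reliability for dichotomous data.

    For each measured entity j, reliability is estimated as
        between-entity variance / (between-entity variance + within-entity variance_j),
    with within-entity (noise) variance approximated as p_j * (1 - p_j) / n_j.
    The between-entity (signal) variance is a simple method-of-moments estimate;
    a full beta-binomial fit (Adams, 2009) is typically used in practice.
    """
    successes = np.asarray(successes, dtype=float)
    totals = np.asarray(totals, dtype=float)
    rates = successes / totals                       # observed entity-level rates
    within_var = rates * (1.0 - rates) / totals      # sampling (noise) variance per entity
    between_var = max(rates.var(ddof=1) - within_var.mean(), 0.0)  # signal variance
    return between_var / (between_var + within_var)

# Example: five hypothetical entities (numerator counts and denominators)
rel = signal_to_noise_reliability([40, 55, 30, 70, 48], [100, 120, 80, 150, 110])
print(rel.round(2))
```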

The measure developer may assess accountable entity-level reliability using an accepted method. See the following table for examples.

Measure of Reliability: Signal-to-noise
Description: Estimates the proportion of overall variability explained by the differences between entities
Recommended Uses: Entity-level scores aggregated from dichotomous data at the patient level
Measure Reliability Tests: Beta-binomial model (Adams, 2009)

Measure of Reliability: Temporal correlation
Description: Similar to random split-half correlation; assesses the correlation of data from adjacent time periods for each entity
Recommended Uses: Comparison to a dataset from the same source separated by a minimal time period
Measure Reliability Tests:
ICC (intraclass correlation coefficient)
Pearson's correlation coefficient
Spearman's rho - a non-parametric correlation of ranked data
Kendall's Tau - a non-parametric correlation of ranked data that is less sensitive to small sample sizes

Measure of Reliability: Random split-half correlation (see the sketch following this table)
Description: Randomly splits data from each entity in half and estimates the correlation of the two halves across all entities
Recommended Uses: When an independent dataset or data from an adjacent time period is not available
Measure Reliability Tests:
ICC
Pearson's correlation coefficient
Spearman's rho - a non-parametric correlation of ranked data
Kendall's Tau - a non-parametric correlation of ranked data that is less sensitive to small sample sizes
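The following Python sketch illustrates the random split-half approach from the table, using made-up patient-level scores for a few hypothetical entities. The Spearman-Brown adjustment applied at the end is a common convention and is an assumption here, not a requirement of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_half_reliability(entity_scores):
    """Random split-half correlation across measured entities.

    `entity_scores` maps each entity to its patient-level scores. Each entity's
    records are randomly split in half, a score is computed for each half, and
    the two half-scores are correlated across all entities.
    """
    half_a, half_b = [], []
    for scores in entity_scores.values():
        scores = rng.permutation(np.asarray(scores, dtype=float))
        mid = len(scores) // 2
        half_a.append(scores[:mid].mean())
        half_b.append(scores[mid:].mean())
    r = np.corrcoef(half_a, half_b)[0, 1]   # correlation of the two halves
    return 2 * r / (1 + r)                  # Spearman-Brown adjustment to full length

# Hypothetical patient-level (pass/fail) results for four entities
data = {
    "entity_a": [1, 0, 1, 1, 0, 1, 1, 0],
    "entity_b": [0, 0, 1, 0, 1, 0, 0, 1],
    "entity_c": [1, 1, 1, 0, 1, 1, 0, 1],
    "entity_d": [0, 1, 0, 0, 1, 0, 1, 0],
}
print(round(split_half_reliability(data), 2))
```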

The measure developer should also determine whether the reliability extends to the repeatability of significant differences between group means and/or the stability of rankings within groups. In other words, does the measure reliably detect differences in scores between groups expected to differ, or does the measure allocate the same proportion of participants into ranks on different administrations? The measure developer can accomplish this in part by testing the significance of differences in measure scores between groups. The measure developer needs to select the tests carefully to account for the data distribution and whether the scores are from the same respondents (dependent or paired samples) or different respondents (independent or unpaired samples). With respect to the data distributions, use parametric tests if the data are normally distributed (e.g., the mean best represents the center and the data are ratio or interval). Use non-parametric tests if the data are not normally distributed (e.g., the median better represents the center and the data are categorical or ordinal) or if the sample size is small (Trochim, 2002; Sullivan, n.d.). For example, measure developers often report medians for reliability scores because they do not tend to be normally distributed. For dichotomous categorical data (e.g., Yes/No), use the special-case non-parametric tests shown in the Tests of Measure Score Differences table.

Tests of Measure Score Differences (Trochim, 2002)

One group score to a reference value: 1-sample t-test (parametric, ratio or interval data); Wilcoxon test (non-parametric, ordinal or nominal data); Chi-square test (non-parametric, dichotomous data)
Two scores from two groups: Unpaired t-test (parametric); Mann-Whitney test (non-parametric); Fisher's exact test (dichotomous)
Two scores from the same group: Paired t-test (parametric); Wilcoxon test (non-parametric); McNemar's test (dichotomous)
More than two scores from two groups: Analysis of variance (ANOVA) (parametric); Kruskal-Wallis test (non-parametric); Chi-square test (dichotomous)
More than two scores from the same group: Repeated measures ANOVA (parametric); Friedman test (non-parametric); Cochran's Q test (dichotomous)
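As a hedged illustration of the choices in the table, the following Python sketch (using scipy.stats) applies a paired t-test when the paired differences appear approximately normal and the Wilcoxon signed-rank test otherwise, plus a Mann-Whitney U test for two independent groups. The scores and the 0.05 normality cut-off are assumptions for demonstration.

```python
import numpy as np
from scipy import stats

# Hypothetical measure scores for the same group at two time points (paired samples)
before = np.array([0.72, 0.65, 0.80, 0.77, 0.69, 0.74, 0.71, 0.78])
after = np.array([0.75, 0.70, 0.82, 0.79, 0.66, 0.78, 0.74, 0.81])

differences = after - before

# Shapiro-Wilk as a rough normality check on the paired differences
_, p_normal = stats.shapiro(differences)

if p_normal > 0.05:
    # Parametric choice for paired, approximately normal data
    stat, p_value = stats.ttest_rel(after, before)
    test_used = "paired t-test"
else:
    # Non-parametric alternative for paired data
    stat, p_value = stats.wilcoxon(after, before)
    test_used = "Wilcoxon signed-rank test"

print(f"{test_used}: statistic={stat:.3f}, p={p_value:.3f}")

# Independent (unpaired) groups: Mann-Whitney U as the non-parametric counterpart
group_1 = [0.61, 0.70, 0.66, 0.73, 0.68]
group_2 = [0.75, 0.71, 0.79, 0.74, 0.82]
u_stat, u_p = stats.mannwhitneyu(group_1, group_2)
print(f"Mann-Whitney U: statistic={u_stat:.3f}, p={u_p:.3f}")
```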

Patient/Encounter Level (Data Element) Reliability

Measure developers conduct data element reliability testing with patient- or encounter-level data elements (numerator, denominator, and exclusions, at a minimum). Patient/encounter level reliability refers to the repeatability of the testing findings, that is, whether the measure specifications extract the correct data elements consistently across measured entities. Per the CMS consensus-based entity (CBE), if the measure developer assesses patient/encounter level validity, they do not need to test data element reliability. The CMS CBE does not require data element reliability for electronic clinical quality measures (eCQMs) that are based on data from structured fields. However, the CMS CBE requires demonstration of reliability at both the patient/encounter level and the computed performance score level for instrument-based measures, including patient-reported outcome-based performance measures. Measure developers may exclude from reliability testing data elements already established as reliable (e.g., age). Measure developers should critically review all data elements before deciding which to include in reliability testing.

Testing patient/encounter level reliability is less common with digital measures than with manually abstracted measures because electronic systems standardize data capture and processing. However, when using natural language processing to extract data elements for digital measures, the measure developer should conduct patient/encounter level reliability testing in addition to patient/encounter level validity testing.

Types of Reliability

Depending on the complexity of the measure specifications, the measure developer may assess one or more types of reliability. Some general types of reliability include:

Internal Consistency

Internal consistency assesses the extent to which items (e.g., in a multiple-item survey) designed to measure a given construct are inter-correlated, that is, the extent to which data elements within a measure score are measuring the same construct.
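A minimal Python sketch of internal consistency, assuming a small matrix of hypothetical item responses; it computes Cronbach's alpha directly from its standard formula.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) matrix of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score)
    """
    X = np.asarray(items, dtype=float)
    k = X.shape[1]
    item_variances = X.var(axis=0, ddof=1)
    total_variance = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 5-item instrument answered by 6 respondents
responses = [
    [4, 5, 4, 4, 5],
    [3, 3, 2, 3, 3],
    [5, 5, 5, 4, 5],
    [2, 3, 2, 2, 3],
    [4, 4, 5, 4, 4],
    [3, 2, 3, 3, 2],
]
print(round(cronbach_alpha(responses), 2))
```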

Test-Retest (Temporal Reliability)

Test-retest reliability (temporal reliability) is the consistency of scores from the same respondent across two administrations of a measurement (Bland, 2000). The measure developer should use the coefficient of stability to quantify the association between the two measurement occasions when assessing information not expected to change over a short or medium interval of time. Test-retest reliability is not appropriate for repeated measurement of disease symptoms nor for measuring intermediate outcomes that follow an expected trajectory of improvement or deterioration. The measure developer assesses test-retest reliability when there is a rationale for expecting stability, rather than change, over the time period.
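A brief sketch, assuming hypothetical scores from the same respondents at two administrations; the coefficient of stability is computed here as a Pearson correlation, with Spearman's rho shown as the ordinal alternative.

```python
from scipy import stats

# Hypothetical scores from the same respondents at two administrations
time_1 = [22, 35, 28, 40, 31, 26, 38, 30]
time_2 = [24, 33, 27, 41, 30, 27, 36, 31]

# Coefficient of stability for continuous data: Pearson correlation
r, p = stats.pearsonr(time_1, time_2)

# Non-parametric alternative for ordinal data: Spearman correlation
rho, p_s = stats.spearmanr(time_1, time_2)

print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```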

Intra-rater Reliability

Intra-rater reliability is the consistency of scores assigned by one rater across two or more measurements (Bland, 2000).

Inter-rater (Inter-abstractor) Reliability

Inter-rater (inter-abstractor) reliability is the consistency of ratings from two or more observers (often using the same method or instrumentation) when rating the same information (Bland, 2000). It is frequently employed to assess the reliability of data elements used in exclusion specifications, as well as the calculation of measure scores when the measure requires review or abstraction. The measure developer should quantitatively summarize the extent of inter-rater/abstractor agreement; concordance rates with confidence intervals are acceptable statistics for describing inter-rater/abstractor reliability. eCQMs implemented as direct queries to electronic health record databases may not use abstraction, so there may be no need for inter-rater reliability testing for eCQMs.
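The sketch below illustrates two of the common inter-rater statistics, percent agreement and Cohen's kappa, for two hypothetical abstractors rating the same ten records; the abstractor labels and data are assumptions.

```python
import numpy as np

def percent_agreement_and_kappa(rater_1, rater_2):
    """Percent agreement and Cohen's kappa for two raters' categorical ratings."""
    r1 = np.asarray(rater_1)
    r2 = np.asarray(rater_2)
    categories = np.union1d(r1, r2)

    observed = np.mean(r1 == r2)                     # percent agreement (p_o)

    # Expected chance agreement (p_e) from the raters' marginal distributions
    expected = sum(
        np.mean(r1 == c) * np.mean(r2 == c) for c in categories
    )
    kappa = (observed - expected) / (1 - expected)   # Cohen's kappa
    return observed, kappa

# Hypothetical abstraction results for 10 records ("Y" = meets criterion)
abstractor_a = ["Y", "Y", "N", "Y", "N", "Y", "Y", "N", "Y", "N"]
abstractor_b = ["Y", "N", "N", "Y", "N", "Y", "Y", "Y", "Y", "N"]
agreement, kappa = percent_agreement_and_kappa(abstractor_a, abstractor_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```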

Summary of Measure Data Element Reliability Types, Uses, and Tests (Soobiah et al., 2019)

Measure of Reliability: Internal consistency
Recommended Uses: Evaluating data elements or items for a single construct (e.g., questionnaire design)
Measure Reliability Tests:
Cronbach's alpha (α) (Cronbach, 1951) assesses how continuous data elements within a measure are correlated with each other
Kuder-Richardson Formula 20 (KR-20) assesses reliability of dichotomous data
McDonald's Omega (ω)

Measure of Reliability: Test-retest reliability (temporal reliability)
Recommended Uses: Assessing whether participant performance on the same test is repeatable or assessing the consistency of scores across time; used if the measure developer expects the measured construct to be stable over time and measured objectively
Measure Reliability Tests:
Correlation coefficient quantifies the association between continuous (Pearson or rp) or ordinal (Spearman or rs) scores
ICC reflects both correlation and agreement between measurements of continuous data
Kendall's Tau correlation coefficient, τb, is a non-parametric test of association

Measure of Reliability: Intra-rater/abstractor reliability
Recommended Uses: Assessing consistency of assignment or abstraction by one rater (e.g., rater scores at two timepoints)
Measure Reliability Tests:
ICC
Kendall's Tau
Gwet's AC1
Percent agreement (the number of ratings that agree divided by the total number of ratings)
Cohen's Kappa

Measure of Reliability: Inter-rater (inter-abstractor) reliability
Recommended Uses: Assessing consistency of assignment by multiple raters
Measure Reliability Tests:
ICC
Kendall's Tau
Gwet's AC1
Percent agreement
Cohen's Kappa

Intraclass Correlation Coefficient

The ICC is one of the most commonly used indices of test-retest, intra-rater, and inter-rater reliability; it reflects both the degree of correlation and the agreement between measurements of continuous data (Koo & Li, 2016). There are six variations of the ICC. The variation selected depends on the number of raters (model and form) and the type of reliability (agreement or consistency in ratings) used. The result is an expected value of the true ICC, estimated with a 95 percent confidence interval. The measure developer may consider a test to have moderate reliability if the ICC is between 0.5 and 0.75 and excellent reliability if the ICC is greater than 0.75 (Gwet, 2008). See the Variations of Intraclass Correlation table.

Variations of Intraclass Correlation

Model definitions:
1-way random: Assessment of each subject by a different set of randomly selected raters
2-way random: Assessment of each subject by each rater, where the selection of raters is random
2-way mixed: Assessment of each subject by each rater, but the raters are the only raters of interest

Form definitions:
1: Calculating reliability from a single measurement
k: Calculating reliability by taking an average of the k raters' measurements
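As one concrete example of the variations above, the following Python sketch computes ICC(2,1) (two-way random effects, absolute agreement, single measurement) from the standard two-way ANOVA mean squares; the ratings matrix is hypothetical.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an n_subjects x k_raters array; every rater rates every subject.
    Mean squares follow the standard two-way ANOVA decomposition.
    """
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand_mean = Y.mean()
    row_means = Y.mean(axis=1)   # subject means
    col_means = Y.mean(axis=0)   # rater means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_total = ((Y - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Example: 5 subjects rated by 3 raters (hypothetical scores)
scores = [[9, 10, 8], [6, 7, 6], [8, 8, 9], [4, 5, 5], [7, 6, 7]]
print(round(icc_2_1(scores), 2))
```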

Across all types of reliability estimation, the shared objective is to ensure replication of measurements or decisions. For comparisons of groups, where possible the measure developer should extend reliability assessment to the stability of the relative positions of different groups or the determination of significant differences between groups. These types of assessments address the proportion of variation in the measure attributable to the group. The measure developer describes this proportion as true differences (signal) relative to variation in the measure due to other factors, including chance variation (noise). Measure developers may consider measures with a relatively high proportion of signal variance reliable because of their power for discriminating among measured entities and the repeatability of group-level differences across samples. However, not all reliable measures have a high proportion of signal variance (e.g., due to shrinkage). Provided the number of observations within groups is sufficiently large, measure developers can partially address these questions using methods such as ANOVA, estimation of variance components within a hierarchical mixed (i.e., random effects) model, or bootstrapping simulations. Changes in group ranking across multiple measurements may also add to an understanding of the stability of group-level measurement (Adams, 2009).
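A minimal sketch of the ANOVA-based variance-component approach, assuming balanced groups of hypothetical patient-level scores; it estimates the between-group (signal) variance from the one-way ANOVA mean squares and expresses the reliability of a group mean as signal variance over signal plus noise.

```python
import numpy as np

def group_level_reliability(groups):
    """Group-level (signal) reliability from one-way ANOVA variance components.

    `groups` is a list of equal-sized arrays of patient-level scores, one per group.
    Between-group (signal) variance is estimated from the ANOVA mean squares and
    compared with the sampling (noise) variance of a group mean.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)        # number of groups
    n = len(groups[0])     # observations per group (balanced design assumed)
    grand_mean = np.mean(np.concatenate(groups))

    ms_between = n * sum((g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (n - 1))

    var_between = max((ms_between - ms_within) / n, 0.0)   # signal variance
    noise_of_group_mean = ms_within / n                    # sampling variance of a group mean
    return var_between / (var_between + noise_of_group_mean)

# Hypothetical patient-level scores for four measured groups
groups = [
    [78, 82, 75, 80, 79, 81],
    [70, 68, 73, 71, 69, 72],
    [85, 88, 83, 86, 84, 87],
    [74, 77, 73, 76, 75, 78],
]
print(round(group_level_reliability(groups), 2))
```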

Instrument Reliability

Measure developers must assess instrument-based measures (from a survey, test, questionnaire, or scale) at both the data element (or item) level and the measure score level. Measure developers may assess item reliability by evaluating test-retest reliability (see the Types of Reliability table), which evaluates the consistency of the item from one time to the next. The measure developer may estimate the precision of the measure score (i.e., measure score reliability) using the methods outlined in the Measure Reliability Types, Uses, and Tests table.

Measure developers may evaluate each item by comparing a statistic, such as Cronbach's alpha or McDonald's Omega, computed with all the items in the measure against the same statistic computed excluding the item in question. If the alpha increases in the latter situation, the item is not strongly related to the other items in the scale (Ritter, 2010). Measure developers should critically review all items in a scale before deciding which to include in reliability testing. Items should always undergo reliability testing if they are new, gathered in a different mode (e.g., telephone vs. self-administered survey), used on a different population, or included as part of a new scale.
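A short sketch of this item analysis, assuming a hypothetical four-item scale; it recomputes Cronbach's alpha with each item removed in turn so that an increase over the full-scale alpha flags a weakly related item.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    return (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

def alpha_if_item_deleted(X):
    """Alpha recomputed with each item removed, for comparison with the full-scale alpha."""
    X = np.asarray(X, dtype=float)
    full = cronbach_alpha(X)
    dropped = {i: cronbach_alpha(np.delete(X, i, axis=1)) for i in range(X.shape[1])}
    return full, dropped

# Hypothetical 4-item scale answered by 6 respondents
responses = [
    [4, 5, 4, 2],
    [3, 3, 2, 5],
    [5, 5, 5, 1],
    [2, 3, 2, 4],
    [4, 4, 5, 3],
    [3, 2, 3, 4],
]
full_alpha, dropped_alphas = alpha_if_item_deleted(responses)
print(f"full-scale alpha: {full_alpha:.2f}")
for item, a in dropped_alphas.items():
    # An increase over the full-scale alpha flags an item weakly related to the rest
    print(f"alpha without item {item}: {a:.2f}")
```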

Form Equivalence

Form equivalence (parallel forms) reliability is the extent to which multiple formats or versions of a test yield the same results (Bland, 2000). Measure developers may use form equivalence reliability when testing comparability of results across more than one mode of data collection, across automated data extraction from different data sources, or when testing agreement between the known values from a simulated data set and the elements obtained when applying the specifications to that data set. If using multiple modes of data collection, the measure developer should document and quantify discrepancies between methods and, as part of the analysis, investigate and document the reasons for those discrepancies (i.e., mode effects; for example, when results from a telephone survey differ from results of the same survey when mailed). Measure developers may measure form equivalence with the same methods used to assess measure reliability according to the data type, including the ICC, Pearson's correlation, Spearman's rho, or Kendall's Tau.
