Validity

In measure development, the term “validity” has a specific application known as test validity, which refers to the degree to which evidence, clinical judgment, and theory support interpretations of a measure score. Stated more simply, test validity is an empirical demonstration of the ability of a measure to record or quantify what it purports to measure.

Types of Validity

Measure developers may test validity of a measure score in many ways. Although some experts view all types of validity as special cases or subsets of construct validity, researchers commonly reference the types of validity separately: construct validity, discriminant validity, predictive validity, convergent validity, criterion validity, and face validity (Messick, 1994).

Face Validity

Face validity is the extent to which a measure appears to measure what it is supposed to measure “at face value.” It is a subjective assessment by experts based on experience about whether the measure reflects its intended assessment, and the research community generally considers the weakest form of validity testing because it is not based on objective observation. However, following best practices for face validity for a quality measure using a systematic and transparent process by a panel of identified experts not involved in the measure development can provide meaningful support for interpretations of the measure score.

The literature notes there are two main techniques for a systematic and transparent process (Davies et al., 2011). Both techniques provide advantages over more unstructured methods (e.g., a survey of the technical expert panel). First, structured evaluations attempt to combat cognitive biases in judgment that are particularly influential in complex tasks. For example, anchoring bias can occur when panelists set their initial responses relative to the opinion of the group. The Delphi Group and nominal group (NG) methods both require an independent initial rating to anchor opinions based on an individual’s own knowledge. Second, structured methods focus the discussion on specific topics pertinent to the underlying validity of the measures and allow all panelists to have access to similar information before evaluation. Third, these methods allow for objectively quantifying the results for direct comparisons among measures to better establish consensual face validity.

In the Delphi technique, typically a panel of experts independently rates indicators and the measure developer compiles the ratings, summarizes, and distributes for review before another round of ratings. The measure developer continues the process until the ratings converge and stabilize. The Delphi process allows for a large panel, minimizing the influence of individual panelists, and maximizing inter-panel reliability. However, because the exchange of opinions and information occurs via written documentation, there is no opportunity for interactive discussion.

The NG technique also utilizes an initial independent rating, followed by the distribution of summarized results. At this point the panel then meets, traditionally in person and in some cases via conference call, to discuss opinions regarding the indicators. Panelists then rerate the indicators independently. This technique is based on the RAND appropriateness method. The NG process allows for efficient information exchange among panelists, which is particularly important when panelists offer unique points of view (e.g., different clinical specialties, types of practice). However, successful facilitation of an in-person or call-based panel limits the size, generally to under 15 individuals. Without effective moderation by the facilitator, one or two individuals can unduly influence the discussion. There are limitations to inter-panel reliability because of the small panel size.

The CMS recommends using a formal consensus process, such as a Delphi or NG approach, for the review of face validity. The CMS CBE allows the use of face validity in lieu of empirical testing for new measures if the measure developer performs a systematic assessment to reflect the accuracy of focused care measured.

For new electronic clinical quality measure (eCQM) data elements in unstructured fields and all maintenance reviews, face validity is not sufficient. Maintenance and eCQM review requires empirical validity testing. Justification is necessary if empirical validity testing is not possible.

Except for criterion validity, measure developers generally use empirical validity testing of the measure score at the accountable entity level. For patient/encounter level validity testing, the measure developers often use criterion validity.

Types of Validity

Measure of Validity	Definition	Recommended Uses	Example Measure Validity Tests
Construct Validity	Construct validity refers to the extent to which the measure quantifies what the theory says it should. Construct validity evidence often involves empirical and theoretical support for the interpretation of the construct. Evidence may include statistical analyses such as confirmatory factor analysis of data elements to ensure they cohere and represent a single construct.	In general, to demonstrate measured entities performing better (or worse) on the quality construct perform better (or worse) on a meaningful outcome.	A process-outcome correlation
Convergent Validity	Convergent validity refers to the degree to which multiple measures/indicators of a single underlying concept are interrelated. Examples include measurement of correlations between a measure score and other indicators of processes related to the target outcome or multiple target outcomes with similar processes.	A form of construct validity where the meaningful outcome occurs at the same time as the quality construct (e.g., inpatient mortality). The measure developer may use a process measure with a pre-established validity to the same meaningful outcome as a proxy.	A process-outcome correlation or process-process (proxy) correlation
Criterion Validity	Criterion validity refers to verification of data elements against some reference criterion determined to be valid (i.e., the gold standard). Examples include verification of data elements obtained through automated search strategies of electronic health records (EHRs) compared with manual review of the same medical records (i.e., the gold standard). Concurrent validity and predictive validity are forms of criterion validity.	Used to compare a data element or a patient/encounter level construct with a gold standard	An electronic clinical quality measure (eCQM) or hospital patient record review vs. expert patient record review
Discriminant Validity	Discriminant/contrasted groups validity examines the variation across multiple comparison groups (e.g., measured entities). The measure developer demonstrates discriminant validity by showing the measure can differentiate between disparate groups that it should theoretically be able to distinguish.	When the quality construct is unobservable, but there is theoretical evidence that performance should (or should not) be better (or worse) for groups based on observable characteristics	A structure-outcome correlation or highest quality of care
Face Validity	Face validity is the extent to which a measure appears to measure what it is supposed to measure “at face value.”	New measures or any circumstance when empirical validity testing of the measure score is not feasible.	Modified Delphi Approach analyzing individual items (ordinal data) using non-parametric tests such as Spearman’s correlation or chi-square test for independence Modified Delphi Approach analyzing all items (interval data) parametric tests such as Pearson’s r correlation or t-tests
Predictive Validity	Predictive validity refers to the ability of measure scores to predict scores of other related measures or outcomes in the future, particularly if the original measure scores predict a subsequent patient-level outcome of undisputed importance (e.g., death, permanent disability). Predictive validity also refers to scores on the same measure for other groups at the same point in time.	A form of construct validity where the meaningful outcome occurs later in time from the quality construct (e.g., 30-day mortality). The measure developer may use a process measure with a pre-established validity to the same meaningful outcome as a proxy.	A process-outcome correlation or process-process (proxy) correlation

Measure Data Element level Versus Measure Score level

Patient/encounter-level data elements are the building blocks for a quality measure and measure developers should assess them for reliability and validity. Although patient/encounter-level data elements are important, measure developers should use computed measure scores to draw conclusions about the targeted aspect of care. eCQMs must demonstrate validity at the patient/encounter level and measure score level. See the CMS definition of a fully developed measure.

Validity testing of data elements typically analyzes agreement with another authoritative source of the same information. Some examples of validity testing using comparative analysis measure data elements include comparisons of

Claims data with codes used to represent primary clinical data (e.g., International Classification of Diseases, 10th Revision-Clinical Modification/Procedure Coding System, Current Procedural Terminology) for manual abstraction from a sample of patient medical records
Standardized patient assessment instrument information (e.g., Long Term Care Minimum Data Set, Outcome and Assessment Information Set, registry data) not abstracted, coded, or transcribed with “expert” assessor evaluation (conducted at approximately the same time) for a sample of patients
EHR information extracted using automated processes based on measure technical specifications to manual abstraction of the entire EHR

Considerations for Different Measure Types for New and Maintenance Measures

Structure, patient-reported outcome-based performance (PRO-PMs), and composite measures may have different validity requirements. For example, structure measures may demonstrate measure score validity using the logic model linking the process to the outcomes and/or may not need any data element validity testing. PRO-PMs should refer to development of the patient-reported outcome measure from which the PRO-PM is based. Measure developers should treat each component of a composite measure as a separate measure and complete the validity testing of the individual component measures.

Sample Size

Prior to data collection, measure developers should perform power calculations to ensure the sample size will be adequate to detect important differences between the measure score and the comparison data. At a minimum, the measure developer needs to report metrics of uncertainty.

Last Updated: Mar 2025

Measure Testing

Validity

Types of Validity

Face Validity

Types of Validity

Measure Data Element level Versus Measure Score level