

In measure development, the term “validity” has a specific application known as test validity, which refers to the degree to which evidence, clinical judgment, and theory support interpretations of a measure score. Stated more simply, test validity is an empirical demonstration of the ability of a measure to record or quantify what it purports to measure. 

Types of Validity

Measure developers may test validity of a measure score in many ways. Although some experts view all types of validity as special cases or subsets of construct validity, researchers commonly reference the types of validity separately: construct validity, discriminant validity, predictive validity, convergent validity, criterion validity, and face validity (Messick, 1994). 

Face Validity

Face validity is the extent to which a measure appears to measure what it is supposed to measure “at face value.” It is a subjective assessment by experts, based on experience, of whether the measure reflects its intended assessment, and the research community generally considers it the weakest form of validity testing because it is not based on objective observation. However, following best practices for assessing the face validity of a quality measure, using a systematic and transparent process carried out by a panel of identified experts not involved in the measure development, can provide meaningful support for interpretations of the measure score.

The literature notes there are two main techniques for a systematic and transparent process: the Delphi technique and the nominal group (NG) technique (Davies et al., 2011). Both techniques provide advantages over more unstructured methods (e.g., a survey of the technical expert panel). First, structured evaluations attempt to combat cognitive biases in judgment that are particularly influential in complex tasks. For example, anchoring bias can occur when panelists set their initial responses relative to the opinion of the group. The Delphi and NG methods both require an independent initial rating to anchor opinions based on an individual’s own knowledge. Second, structured methods focus the discussion on specific topics pertinent to the underlying validity of the measures and allow all panelists to have access to similar information before evaluation. Third, these methods allow for objectively quantifying the results for direct comparisons among measures to better establish consensual face validity.

In the Delphi technique, a panel of experts typically rates indicators independently. The measure developer then compiles and summarizes the ratings and distributes the summary to the panel for review before another round of ratings. The measure developer continues the process until the ratings converge and stabilize. The Delphi process allows for a large panel, which minimizes the influence of individual panelists and maximizes inter-panel reliability. However, because the exchange of opinions and information occurs via written documentation, there is no opportunity for interactive discussion.
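As a rough illustration of what "converge and stabilize" could mean in practice, the sketch below summarizes each round of ratings for one indicator and applies a simple stopping rule. The 1 to 9 rating scale, the thresholds (`max_iqr`, `max_median_shift`), the function names, and the ratings themselves are all assumptions made for this example; they are not prescribed by the guidance, and a measure developer would define the stopping rule in advance.

```python
from statistics import median, quantiles


def summarize_round(ratings):
    """Return (median, interquartile range) for one indicator's panel ratings."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return median(ratings), q3 - q1


def has_converged(previous, current, max_iqr=2.0, max_median_shift=0.5):
    """Hypothetical stopping rule: ratings are tightly clustered in the
    current round and the median barely moved since the previous round."""
    prev_median, _ = summarize_round(previous)
    curr_median, curr_iqr = summarize_round(current)
    return curr_iqr <= max_iqr and abs(curr_median - prev_median) <= max_median_shift


# Made-up ratings of a single indicator from nine panelists over three rounds.
round_1 = [3, 5, 7, 8, 4, 9, 6, 7, 2]
round_2 = [6, 6, 7, 7, 6, 8, 7, 7, 5]
round_3 = [6, 7, 7, 7, 7, 8, 7, 7, 6]

for label, ratings in [("round 1", round_1), ("round 2", round_2), ("round 3", round_3)]:
    print(label, summarize_round(ratings))

print("stop after round 2?", has_converged(round_1, round_2))
print("stop after round 3?", has_converged(round_2, round_3))
```

With these illustrative numbers the spread narrows in round 2 but the median is still moving, so the rule would only stop the process after round 3.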

The NG technique also uses an initial independent rating, followed by the distribution of summarized results. The panel then meets, traditionally in person and in some cases via conference call, to discuss opinions regarding the indicators. Panelists then rerate the indicators independently. This technique is based on the RAND appropriateness method. The NG process allows for efficient information exchange among panelists, which is particularly important when panelists offer unique points of view (e.g., different clinical specialties, types of practice). However, successful facilitation of an in-person or call-based panel requires limiting the panel size, generally to under 15 individuals. Without effective moderation by the facilitator, one or two individuals can unduly influence the discussion. The small panel size also limits inter-panel reliability.

In Guidance for Measure Testing and Evaluating Scientific Acceptability of Measure Properties, the CMS consensus-based entity (CBE) recommends using a formal consensus process, such as a Delphi or NG approach, for the review of face validity. Likewise, in Measure Evaluation Criteria and Guidance for Evaluating Measures for Endorsement, the CMS CBE allows the use of face validity in lieu of empirical testing for new measures if the measure developer performs a systematic assessment that targets how accurately the measure reflects the care being measured. For maintenance review, face validity is not sufficient. Maintenance review requires empirical validity testing. Justification is necessary if empirical validity testing is not possible.

Other Types of Validity

For the other types of validity, except for criterion validity, measure developers generally use empirical validity testing of the measure score at the accountable entity level. Measure developers generally use criterion validity at the patient/encounter level.

Types of Validity

Measure of Validity: Construct Validity
  Definition: Construct validity refers to the extent to which the measure quantifies what the theory says it should. Construct validity evidence often involves empirical and theoretical support for the interpretation of the construct. Evidence may include statistical analyses such as confirmatory factor analysis of data elements to ensure they cohere and represent a single construct.
  Recommended Uses: In general, to demonstrate that measured entities that perform better (or worse) on the quality construct perform better (or worse) on a meaningful outcome.
  Example Measure Validity Tests: A process-outcome correlation

Measure of Validity: Convergent Validity
  Definition: Convergent validity refers to the degree to which multiple measures/indicators of a single underlying concept are interrelated. Examples include measurement of correlations between a measure score and other indicators of processes related to the target outcome or multiple target outcomes with similar processes.
  Recommended Uses: A form of construct validity where the meaningful outcome occurs at the same time as the quality construct (e.g., inpatient mortality). The measure developer may use as a proxy a process measure with a pre-established validity to the same meaningful outcome.
  Example Measure Validity Tests: A process-outcome correlation or process-process (proxy) correlation

Measure of Validity: Criterion Validity
  Definition: Criterion validity refers to verification of data elements against some reference criterion determined to be valid (i.e., the gold standard). Examples include verification of data elements obtained through automated search strategies of electronic health records (EHRs) compared with manual review of the same medical records (i.e., the gold standard). Concurrent validity and predictive validity are forms of criterion validity.
  Recommended Uses: Used to compare a data element or a patient/encounter level construct with a gold standard
  Example Measure Validity Tests: An electronic clinical quality measure (eCQM) or hospital medical record review vs. expert medical record review

Measure of Validity: Discriminant Validity
  Definition: Discriminant/contrasted groups validity examines the variation across multiple comparison groups (e.g., measured entities). The measure developer demonstrates discriminant validity by showing that the measure can differentiate between disparate groups that it should theoretically be able to distinguish.
  Recommended Uses: When the quality construct is unobservable, but there is theoretical evidence that performance should (or should not) be better (or worse) for groups based on observable characteristics
  Example Measure Validity Tests: A structure-outcome correlation or equity

Measure of Validity: Face Validity
  Definition: Face validity is the extent to which a measure appears to measure what it is supposed to measure “at face value.”
  Recommended Uses: New measures or any circumstance when empirical validity testing of the measure score is not feasible.
  Example Measure Validity Tests:
  • Modified Delphi Approach analyzing individual items (ordinal data) using non-parametric tests such as Spearman’s correlation or chi-square test for independence
  • Modified Delphi Approach analyzing all items (interval data) using parametric tests such as Pearson’s r correlation or t-tests

Measure of Validity: Predictive Validity
  Definition: Predictive validity refers to the ability of measure scores to predict scores of other related measures or outcomes in the future, particularly if the original measure scores predict a subsequent patient-level outcome of undisputed importance (e.g., death, permanent disability). Predictive validity also refers to scores on the same measure for other groups at the same point in time.
  Recommended Uses: A form of construct validity where the meaningful outcome occurs later in time from the quality construct (e.g., 30-day mortality). The measure developer may use as a proxy a process measure with a pre-established validity to the same meaningful outcome.
  Example Measure Validity Tests: A process-outcome correlation or process-process (proxy) correlation
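To make the "process-outcome correlation" tests in the table more concrete, here is a minimal sketch of a Spearman rank correlation between accountable-entity-level process measure scores and an outcome rate. The data are invented for illustration, the helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical, and a real analysis would also report uncertainty and typically rely on an established statistics package rather than hand-written helpers.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's r for two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data for six measured entities: process score (%) and 30-day mortality (%).
process_score = [92.0, 85.5, 78.0, 95.0, 70.0, 88.0]
mortality_rate = [4.1, 5.0, 6.2, 3.8, 7.5, 5.2]

rho = spearman(process_score, mortality_rate)
print(round(rho, 3))
```

In this invented example the correlation is strongly negative, which is the direction one would expect if better process performance accompanies lower mortality. Using an outcome measured later in time (as here) corresponds to the predictive validity row; an outcome measured at the same time would correspond to the convergent validity row.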

Measure Data Elements Versus Quality Measure Score

Patient/encounter-level data elements are the building blocks for a quality measure, and measure developers should assess them for reliability and validity. Although patient/encounter-level data elements are important, measure developers should use computed measure scores to draw conclusions about the targeted aspect of care. According to Measure Evaluation Criteria and Guidance for Evaluating Measures for Endorsement, the CMS CBE will accept patient/encounter-level and/or accountable entity-level validity testing. However, instrument-based measures need both, and composite measures need empirical performance score validity testing by the time of endorsement maintenance. eCQMs must demonstrate validity at the patient/encounter level.

Validity testing of data elements typically analyzes agreement with another authoritative source of the same information. Some examples of validity testing of measure data elements using comparative analysis include comparisons of

  • Claims data that have codes used to represent primary clinical data (e.g., International Classification of Diseases, 10th Revision-Clinical Modification/Procedure Coding System, Current Procedural Terminology) with manual abstraction from a sample of patient medical records
  • Standardized patient assessment instrument information (e.g., Long Term Care Minimum Data Set, Outcome and Assessment Information Set, registry data) that is not abstracted, coded, or transcribed, with “expert” assessor evaluation (conducted at approximately the same time) for a sample of patients
  • EHR information extracted using automated processes based on measure technical specifications with manual abstraction of the entire EHR
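Comparisons like those listed above are often summarized with agreement statistics. The sketch below assumes a single binary data element (present/absent) and treats manual abstraction as the gold standard; the function name `agreement_stats`, the choice of statistics, and the 20 sample records are illustrative assumptions rather than requirements from the guidance.

```python
def agreement_stats(automated, gold):
    """Compare a binary data element from an automated source with a gold standard.

    Assumes both lists are aligned record by record and that each source
    contains at least one positive and one negative value.
    """
    pairs = list(zip(automated, gold))
    tp = sum(1 for a, g in pairs if a and g)          # both say present
    tn = sum(1 for a, g in pairs if not a and not g)  # both say absent
    fp = sum(1 for a, g in pairs if a and not g)      # automated only
    fn = sum(1 for a, g in pairs if not a and g)      # gold standard only
    n = len(pairs)

    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


# Made-up sample of 20 records: 1 = data element present, 0 = absent.
gold_standard = [1] * 10 + [0] * 10                 # manual abstraction
automated = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9   # automated EHR extraction

for name, value in agreement_stats(automated, gold_standard).items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe how well the automated source recovers what the gold standard found, while kappa adjusts the raw agreement for the agreement expected by chance.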

Sample Size

Prior to data collection, measure developers should perform power calculations to ensure the sample size will be adequate to detect important differences between the measure score and the comparison data. At a minimum, the measure developer needs to report metrics of uncertainty.
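As one possible illustration, the sketch below shows a normal-approximation sample size calculation for detecting a mean paired difference between measure scores and comparison data, plus a Wilson confidence interval as an example of a metric of uncertainty for an agreement proportion. The function names, the default significance level and power, and the input values are assumptions chosen for this example; the appropriate method depends on the study design and should be confirmed with a statistician.

```python
from math import ceil, sqrt
from statistics import NormalDist


def paired_sample_size(delta, sd_diff, alpha=0.05, power=0.80):
    """Approximate number of pairs needed to detect a mean paired difference
    of `delta`, given the standard deviation of the differences `sd_diff`
    (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a proportion (e.g., percent agreement)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical planning inputs: detect a 3-point difference when the
# differences have a standard deviation of 10 points.
print(paired_sample_size(delta=3, sd_diff=10))

# Hypothetical result: 17 of 20 sampled records agree with the gold standard.
low, high = wilson_interval(17, 20)
print(f"{low:.3f} to {high:.3f}")
```

The wide interval for 17 of 20 records shows why a small validation sample can leave substantial uncertainty about the true agreement rate.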

Last Updated: May 2023