Measure Testing

Scientific Acceptability

Scientific acceptability of a measure refers to the extent to which the measure produces reliable and valid results about the intended area of measurement. These qualities determine whether users of the measure can draw reasonable conclusions about care in a given domain. Because many measure scores are composed of patient-level data elements (e.g., blood pressure, laboratory values, medication, surgical procedures) aggregated at the comparison group level (e.g., hospital, nursing home, physician), the measure developer often needs evidence of reliability and validity for both the measure score and the data elements, and should address both. In many cases, evidence of reliability may not be necessary for electronic clinical quality measures.

Examples of Measure Testing and Reporting Errors

  • Reporting limited to descriptive statistics. Descriptive statistics demonstrate that data are available for analysis, but do not provide evidence of reliability or validity.
  • A lack of testing of a respecified measure. When respecifying a measure (e.g., using similar process criteria for a different population or denominator), the newly respecified measure still requires testing to obtain empirical evidence of reliability and validity.
  • Inadequate evidence of scientific acceptability for commonly used data elements. Data elements (e.g., diagnosis codes, electronic health record fields) that are in common use still require testing or evidence of reliability and validity within the context of the new measure specifications (e.g., new population, new setting).
  • Inadequate analysis or use of clinical guidelines for justifying denominator exclusions and/or numerator exclusions. The measure developer should report analyses and/or clinical guidelines justifying a denominator and/or numerator exclusion or demonstrating reliability for different methods of data collection.
  • Not properly accounting for missing data.
  • Inadequate risk adjustment or stratification.
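
Several of the errors above come down to missing empirical evidence for data elements. As one illustration, an electronically extracted data element can be compared against a reference ("gold standard") source such as manual chart abstraction. The sketch below is a minimal Python example under that assumption. The function name, the 0/1 coding, and the sample values are hypothetical, and real testing would follow a sampling and analysis plan chosen by the measure's methodologists.

```python
def sensitivity_specificity(extracted, gold):
    """Compare a binary data element against a gold-standard source.

    extracted, gold: equal-length sequences of 0/1 flags, one per patient
    (hypothetical coding: 1 = condition recorded, 0 = not recorded).
    """
    if len(extracted) != len(gold):
        raise ValueError("extracted and gold must have the same length")

    tp = sum(1 for e, g in zip(extracted, gold) if e == 1 and g == 1)
    fn = sum(1 for e, g in zip(extracted, gold) if e == 0 and g == 1)
    tn = sum(1 for e, g in zip(extracted, gold) if e == 0 and g == 0)
    fp = sum(1 for e, g in zip(extracted, gold) if e == 1 and g == 0)

    # Guard against empty classes rather than dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Illustrative values only, not real testing data.
gold_flags = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
ehr_flags = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(ehr_flags, gold_flags)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Results like these are evidence about one data element in one context (population, setting, data source). A respecified measure would need the comparison repeated in its own context.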

Because reliability and validity are expressed along a continuum (i.e., they are not all-or-nothing properties), the measure developer may need to address many issues to supply adequate evidence of scientific acceptability. The complexity of different healthcare environments, data sources, and sampling constraints often precludes ideal testing conditions. As such, the level of scientific acceptability requires interpretation and explanation. The measure developer is expected to contract or employ experienced methodologists, statisticians, and subject matter experts to select testing that is appropriate and feasible for the measure(s) under development and to ensure that measure reliability and validity are demonstrated. The measure developer must also engage experts to review testing data and determine the measure’s reliability and validity.

Although not intended to replace expert judgment of the measure development team, the reliability and validity pages describe general factors for a measure developer to consider when evaluating reliability and validity of both a measure score and its component elements. The descriptions should acquaint the measure developer with specialized terminology that testing, evaluation, and statistics experts may use in assessing scientific acceptability.

Scoring Errors

Every score generated by a measure has some error associated with it. In measurement, error is the difference between the true value and the measured value. Error can be either random or systematic. Random errors are fluctuations around the true value that result from the difficulty of taking measurements (i.e., the ‘how’ of measurement). Reducing random error might involve increasing the effective sample size or eliminating ambiguous language in a data collection tool. In general, the larger the random error, the less reliable the measure score. Systematic error leads to predictable and consistent departures from the true value due to problems with the calibration of the measure (i.e., the ‘what’ of measurement). Reducing systematic error might involve increasing the sensitivity and specificity of a data element relative to a gold standard, or adjusting the measure score for factors that are independent of the quality construct (e.g., social determinants of health). Normally, the larger the systematic error, the less valid the measure score.
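
The distinction between random and systematic error can be made concrete with a small simulation. The following Python sketch is illustrative only. It assumes a hypothetical entity with a known true score, patient-level observations with Gaussian noise, and an optional constant bias. The function names, parameter values, and noise model are assumptions for demonstration, not a prescribed testing method.

```python
import random
import statistics


def simulate_scores(true_value, n_patients, bias=0.0, noise_sd=0.1,
                    n_repeats=500, seed=0):
    """Simulate repeated measure scores for one hypothetical entity."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(n_repeats):
        # Each observation = true value + systematic bias + random noise.
        observations = [true_value + bias + rng.gauss(0.0, noise_sd)
                        for _ in range(n_patients)]
        scores.append(statistics.fmean(observations))
    return scores


def summarize(scores, true_value):
    """Return (mean error, spread) across the repeated scores."""
    mean_error = statistics.fmean(scores) - true_value  # systematic part
    spread = statistics.stdev(scores)                    # random part
    return mean_error, spread


TRUE_VALUE = 0.70  # hypothetical true score

small = summarize(simulate_scores(TRUE_VALUE, n_patients=25), TRUE_VALUE)
large = summarize(simulate_scores(TRUE_VALUE, n_patients=400), TRUE_VALUE)
biased = summarize(simulate_scores(TRUE_VALUE, n_patients=400, bias=0.05),
                   TRUE_VALUE)

for label, (mean_error, spread) in [("n=25", small), ("n=400", large),
                                    ("n=400, biased", biased)]:
    print(f"{label}: mean error={mean_error:+.4f}, spread={spread:.4f}")
```

In this sketch, increasing the sample size narrows the spread of scores (less random error) but leaves the constant bias untouched. Only a change to the measurement itself, such as a more accurate data element or an appropriate adjustment, removes the systematic component.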

Reliability and validity are distinct properties of a measure score that vary by context. For example, the reliability of a measure score may be lower when applied to clinicians than to facilities, or lower when based on encounters over six months than over one year. Similarly, a measure score may be less valid when applied to a heterogeneous population than to a homogeneous population, or when applied across populations with different capacities (e.g., urban vs. rural). A measure can be reliable yet inaccurate or not valid. However, an unreliable measure cannot be valid. Therefore, the measure developer must carefully and explicitly describe the context of measure testing to inform potential users of the measure about the context in which reliability and validity were demonstrated.
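
One way to see why reliability depends on context is a signal-to-noise view of the measure score, in which reliability is the share of score variance attributable to real differences between entities. The Python sketch below assumes that framing, which is one common approach and not the only one. The variance values and case counts are hypothetical, chosen only to contrast a low-volume clinician with a high-volume facility.

```python
def signal_to_noise_reliability(between_var, within_var, n):
    """Reliability = signal / (signal + noise).

    between_var: variance of true scores across entities (signal)
    within_var:  patient-level variance within an entity
    n:           number of cases contributing to the entity's score
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Noise in an entity's average score shrinks as n grows.
    return between_var / (between_var + within_var / n)


# Hypothetical variance components, for illustration only.
BETWEEN_VAR = 0.01
WITHIN_VAR = 0.20

clinician = signal_to_noise_reliability(BETWEEN_VAR, WITHIN_VAR, n=30)
facility = signal_to_noise_reliability(BETWEEN_VAR, WITHIN_VAR, n=600)

# Halving the measurement period roughly halves the case count.
facility_six_months = signal_to_noise_reliability(BETWEEN_VAR, WITHIN_VAR,
                                                  n=300)

print(f"clinician (n=30):       {clinician:.3f}")
print(f"facility (n=600):       {facility:.3f}")
print(f"facility, 6 mo (n=300): {facility_six_months:.3f}")
```

Under these assumed values the same measure yields a noticeably lower reliability estimate at the clinician level than at the facility level, and a shorter measurement period lowers it further. This is why testing results should always be reported with the level of analysis and time window used.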

Last Updated: May 2022