Reliability
Variations of Intraclass Correlation
Model | Definition | Form | Definition
---|---|---|---
1-way random | Assessment of each subject by a different set of randomly selected raters | 1 | Calculating reliability from a single measurement
 | | K | Calculating reliability by taking an average of the k raters' measurements
2-way random | Assessment of each subject by each rater and selection of raters is random | |
2-way mixed | Assessment of each subject by each rater, but the raters are the only raters of interest | |
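To make the single-measurement (1) and average (K) forms concrete, the sketch below estimates a one-way random-effects ICC in both forms from the mean squares of a one-way ANOVA. The `icc_one_way` function and the ratings matrix are illustrative assumptions, not part of any measure specification.

```python
import numpy as np

def icc_one_way(ratings: np.ndarray) -> tuple[float, float]:
    """One-way random-effects ICC.

    ratings: n_subjects x k_raters matrix, where each subject is rated
    by a different set of randomly selected raters (1-way random model).
    Returns (ICC(1,1), ICC(1,k)): the reliability of a single rating and
    of the average of the k ratings.
    """
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()

    # Mean squares from the one-way ANOVA decomposition.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average

# Illustrative data: 5 subjects, each rated by 3 raters.
ratings = np.array([
    [9, 2, 5],
    [6, 1, 3],
    [8, 4, 6],
    [7, 1, 2],
    [10, 5, 6],
], dtype=float)
print(icc_one_way(ratings))
```

The average (K) form is always at least as high as the single-measurement form because averaging over raters reduces the influence of rater-level error.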
Across all types of reliability estimation, the shared objective is to ensure that measurements or decisions can be replicated. When comparing groups, the measure developer should, where possible, extend reliability assessment to the stability of the relative positions of different groups or the determination of significant differences between groups. These types of assessments address the proportion of variation in the measure attributable to the group. The measure developer describes this proportion as true differences (signal) relative to variation in the measure due to other factors, including chance variation (noise). Measure developers may consider measures with a relatively high proportion of signal variance reliable because of their power for discriminating among measured entities and the repeatability of group-level differences across samples. However, not all reliable measures have a high proportion of signal variance (e.g., because of shrinkage). Provided the number of observations within groups is sufficiently large, measure developers can partially address these questions using methods such as ANOVA, estimation of variance components within a hierarchical mixed (i.e., random effects) model, or bootstrapping simulations. Changes in group ranking across multiple measurements may also add to an understanding of the stability of group-level measurement (Adams, 2009).
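As a minimal sketch of the signal-versus-noise idea, the example below uses a one-way ANOVA (method-of-moments) decomposition to estimate between-group (signal) and within-group (noise) variance components and the resulting reliability of a group mean. The `group_reliability` function, the equal group sizes, and the simulated scores are assumptions made for illustration.

```python
import numpy as np

def group_reliability(scores: np.ndarray, groups: np.ndarray) -> float:
    """Estimate the signal-to-noise reliability of group-level means.

    Uses a one-way ANOVA (method-of-moments) decomposition, assuming
    equal group sizes for simplicity. Reliability is the share of the
    variance in a group mean attributable to true between-group
    differences (signal) rather than within-group variation (noise).
    """
    labels = np.unique(groups)
    per_group = [scores[groups == g] for g in labels]
    n = len(per_group[0])              # equal-n assumption
    k = len(labels)
    group_means = np.array([g.mean() for g in per_group])
    grand_mean = scores.mean()

    ms_between = n * np.sum((group_means - grand_mean) ** 2) / (k - 1)
    ms_within = sum(np.sum((g - g.mean()) ** 2) for g in per_group) / (k * (n - 1))

    var_between = max((ms_between - ms_within) / n, 0.0)   # signal
    var_within = ms_within                                  # noise
    return var_between / (var_between + var_within / n)

# Illustrative data: 4 groups of 50 observations each, with true group
# means spaced 0.3 apart and unit within-group variance.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(4), 50)
scores = rng.normal(loc=[0.0, 0.3, 0.6, 0.9], scale=1.0, size=(50, 4)).T.ravel()
print(round(group_reliability(scores, groups), 3))
```

The same variance components can also be obtained from a hierarchical random-effects model or approximated by bootstrapping group means; the method-of-moments version above is simply the most compact to show.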
Instrument Reliability
Measure developers must assess instrument-based measures (from a survey, test, questionnaire, or scale) at both the data element (or item) level and the measure score level. Measure developers may assess item reliability with test-retest reliability (see the Types of Reliability table), which examines the consistency of the item from one administration to the next. The measure developer may estimate the precision of the measure score (i.e., measure score reliability) using the methods outlined in the Measure Reliability Types, Uses, and Tests table.
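A small illustration of item-level test-retest reliability, assuming an approximately continuous item: correlate the same respondents' answers across two administrations. The arrays and the use of scipy's `pearsonr` are illustrative; a categorical item would call for an agreement statistic such as kappa instead.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative responses to one item at two time points (same respondents).
time1 = np.array([3, 4, 2, 5, 4, 3, 1, 4, 5, 2], dtype=float)
time2 = np.array([3, 5, 2, 4, 4, 3, 2, 4, 5, 1], dtype=float)

# Test-retest reliability as the correlation between administrations.
r, p_value = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f} (p = {p_value:.3f})")
```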
Measure developers may evaluate each item by comparing a statistic, such as Cronbach's alpha or McDonald's omega, calculated with all the items in the measure against the same statistic calculated excluding the item in question. If the alpha increases when the item is excluded, the item is not strongly related to the other items in the scale (Ritter, 2010). Measure developers should critically review all items in a scale before deciding which to include in reliability testing. Items should always undergo reliability testing if they are new, gathered in a different mode (e.g., telephone vs. self-administered survey), used on a different population, or included as part of a new scale.
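The item-by-item comparison described above might be sketched as follows: compute Cronbach's alpha for the full item set, then recompute it with each item removed and flag items whose removal raises alpha. The `cronbach_alpha` helper, the response matrix, and the flagging rule are illustrative assumptions.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an n_respondents x n_items matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def alpha_if_deleted(items: np.ndarray) -> list[float]:
    """Alpha recomputed with each item excluded, in item order."""
    k = items.shape[1]
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(k)]

# Illustrative responses: 8 respondents, 4 items on a 1-5 scale.
items = np.array([
    [4, 4, 3, 2],
    [5, 5, 4, 1],
    [3, 3, 3, 4],
    [4, 5, 4, 2],
    [2, 2, 1, 5],
    [5, 4, 5, 1],
    [3, 4, 3, 3],
    [4, 4, 4, 2],
], dtype=float)

overall = cronbach_alpha(items)
for j, a in enumerate(alpha_if_deleted(items)):
    flag = "  <- weakly related item" if a > overall else ""
    print(f"item {j}: alpha without item = {a:.2f} (overall = {overall:.2f}){flag}")
```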
Form Equivalence
Form equivalence (parallel forms) reliability is the extent to which multiple formats or versions of a test yield the same results (Bland, 2000). Measure developers may use form equivalence reliability when testing comparability of results across more than one method of data collection, across automated data extraction from different data sources, or when testing agreement between the known values from a simulated data set and the elements obtained when applying the specifications to that data set. If using multiple modes of data collection, the measure developer should document and quantify discrepancies between methods, investigate the reasons for them, and report any mode effects (for example, when results from a telephone survey differ from results of the same survey when mailed). Measure developers may measure form equivalence with the same methods used to assess measure reliability according to the data type, including the ICC, Pearson's correlation, Spearman's rho, or Kendall's tau.
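A brief sketch of quantifying agreement between two modes of data collection with the correlation measures named above, assuming paired scores for the same entities; the telephone and mailed values are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Illustrative measure scores for the same entities collected two ways
# (e.g., telephone survey vs. mailed survey).
telephone = np.array([72, 85, 64, 90, 78, 69, 81, 75], dtype=float)
mailed = np.array([70, 88, 66, 87, 80, 65, 83, 74], dtype=float)

for name, stat in (("Pearson r", pearsonr),
                   ("Spearman rho", spearmanr),
                   ("Kendall tau", kendalltau)):
    estimate, p_value = stat(telephone, mailed)
    print(f"{name}: {estimate:.2f} (p = {p_value:.3f})")
```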