Copyright © 2023 The Authors. Neuropsychopharmacology Reports published by John Wiley & Sons Australia, Ltd on behalf of The Japanese Society of Neuropsychopharmacology.
This is an open access article under the terms of the http://creativecommons.org/licenses/by-nc/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
Data sharing is not applicable to this article as no datasets were generated or analyzed for the current study.
Problems associated with alcohol use are multidimensional with psychiatric, psychological, physical, and social aspects, which makes it challenging to choose appropriate assessment scales. However, there has been no systematic evaluation of existing alcohol scales.
A systematic literature search was conducted on March 19, 2023, using Medline, EMBASE, and PsycINFO, for articles that assessed the psychometric properties of scales for alcohol use disorder. Only scales whose original development papers were cited more than 20 times were included. The methodological quality and psychometric properties of the scales were evaluated using the COnsensus‐based Standards for the selection of health Measurement INstruments (COSMIN). The overall rating of each scale was expressed as a score ranging from 0 to 18.
In total, 314 studies and 40 scales were identified. These scales differed widely in measurement methods, target populations, and psychometric properties. The overall mean score was 6.3, and only the following three scales received more than 9 points, suggesting a moderate level of evidence: Alcohol Use Disorders Identification Test (AUDIT), Alcohol Dependence Scale (ADS), and Short Alcohol Dependence Data Questionnaire (SADD). Measurement error and responsiveness were not evaluated or reported in the included scales.
Although the AUDIT, ADS, and SADD were rated the highest among the 40 scales, they showed, at most, a moderate level of evidence. These findings underscore the need to accumulate further evidence to assure the quality of the scales. It may be advisable to select and combine scales to meet the purpose of the assessment.
Keywords: alcohol, COSMIN, methodological quality, psychometric property, scale, validity
A systematic literature search was conducted for articles that assessed the psychometric properties of scales for alcohol use disorder. Although the AUDIT, ADS, and SADD were rated the highest among the 40 scales, they showed, at most, a moderate level of evidence. These findings underscore the need to accumulate further evidence to assure the quality of the scales.
Alcohol use disorder is associated with various psychiatric, psychological, physical, and social problems, resulting in an increased risk of death and a substantial direct and indirect economic burden. 1 For example, alcohol use is estimated to be responsible for approximately 100 000 deaths annually 2 and a loss of $179.0 billion in the United States. 3 These negative consequences emphasize the need for effective clinical interventions and active research. It is therefore critically important to evaluate and characterize problems associated with the use of alcohol; however, this is challenging because alcohol‐related problems are multidimensional and span various conditions such as occasional drinking, at‐risk drinking, and abuse of and dependence on alcohol. 4 , 5
Several assessment scales have been developed to measure alcohol‐related problems; nonetheless, no gold standard exists. This lack of consensus may be partly due to the absence of a systematic evaluation of existing alcohol scales in terms of their methodological quality and measurement properties. Allen and Wilson's “Assessment of Alcohol Problems” comprehensively evaluated alcohol scales that experts had deemed highly relevant for assessing and treating alcohol problems 6 ; however, the second edition of the book is already 20 years old. Thus, little robust evidence is available to guide researchers and clinicians in choosing assessment scales for problems associated with alcohol use, which may result in an arbitrary or inappropriate selection of scales for the populations and settings of interest. The COnsensus‐based Standards for the selection of health Measurement INstruments (COSMIN) may help overcome this situation. 7 It includes explicit quality assessment criteria that define what constitutes good measurement properties, facilitating the selection of high‐quality patient‐reported outcome measures for research and clinical practice. The COSMIN checklist was developed in 2010 based on a consensus among 43 experts in the fields of psychology, epidemiology, statistics, and clinical medicine through an international Delphi study 8 and provides a comprehensive guideline for the systematic review of patient‐reported outcome measures. 9
There has been no systematic evaluation of the methodological quality and measurement properties of existing alcohol scales using the COSMIN checklist. To fill the gap in the literature, we conducted a systematic evaluation of the methodological quality and measurement properties of published measurement scales for alcohol‐related problems, using the COSMIN checklist, to provide evidence‐based recommendations for measurement scales for this challenging population.
This review adhered to the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta‐analyses statement to ensure transparent and complete reporting. A study protocol was prepared before commencing data collection and registered on the International Prospective Register of Systematic Reviews (registration number CRD42020204055). Two independent authors (YO and HU) performed the search, assessed eligibility, extracted data, and conducted a quality assessment. Any discrepancies were resolved through discussion by three authors (YO, FU, and HU).
A literature search was conducted on March 19, 2023, using three electronic databases (Medline, EMBASE, and PsycINFO) to identify articles that assessed the psychometric properties of scales for alcohol use disorder. The following search terms were used: (“alcohol use disorder” OR “alcohol dependence” OR “alcohol misuse” OR “alcohol abuse”), (scale OR instrument OR questionnaire OR tool OR assess*) and (psychometric* OR reliability OR validity). Limits for “English” and “humans” were employed. The reference lists of the identified articles were manually searched for additional articles.
Articles were included if (1) they were peer‐reviewed articles in which scales for alcohol use disorder were used; (2) the population of interest was human; (3) they were written in English; and (4) the original articles regarding the development of the scales had been cited more than 20 times according to the Web of Science. Studies that claimed to measure alcohol‐related problems but did not use any relevant scale were excluded.
Two independent authors (YO and HU) screened the titles and abstracts of all articles identified by the search strategies and assessed the full texts of the relevant articles. The following information was extracted from the eligible articles: characteristics of included studies such as names of the first authors, year of publication, number and age of populations assessed, the country where the study was conducted, diagnosis of the included participants, names and purpose of the scales used, number of items, number of dimensions measured, assessment format (i.e., self‐report or in‐person interview), and psychometric properties.
Measurement property ratings for each scale were determined according to the COSMIN criteria for good measurement properties (Table 1 ) and rated as “sufficient,” “insufficient,” or “indeterminate,” based on the criteria published elsewhere. 10 , 11 , 12 The methodological quality of the scales was assessed using the COSMIN risk of bias checklist, which consists of 116 items across ten criteria boxes. The first box relates to the development of scales and is considered when evaluating content validity. The second box concerns content validity itself. The remaining eight boxes evaluate structural validity, internal consistency, cross‐cultural validity, reliability, measurement error, criterion validity, hypothesis testing for construct validity, and responsiveness. Each item was scored on a 4‐grade rating scale (very good, adequate, doubtful, and inadequate), and the lowest item rating was used as the overall rating of each criteria box. The quality of evidence for each scale was categorized as “high,” “moderate,” “low,” or “very low” based on the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach. 13 In the GRADE approach, it is recommended to start from the assumption that the evidence for each scale is of “high” quality. The quality of evidence is subsequently downgraded by one or two levels for each of the following four factors: (1) risk of bias (i.e., the methodological quality of the studies), (2) inconsistency (i.e., unexplained inconsistency of results across the studies), (3) imprecision (i.e., a small sample size of the studies), and (4) indirectness (i.e., different population from the population of interest in the review). Although the COSMIN initiative also recommends reporting interpretability and feasibility, these were not included in this review because their assessment is not always objective.
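The two rules described above (the lowest item rating determines the rating of a criteria box, and the quality of evidence starts at “high” and is downgraded) can be sketched in a few lines of Python. This is only an illustrative sketch, not the review's actual procedure: the function names `box_rating` and `grade_quality` are hypothetical, and capping the result at “very low” when the downgrades add up to more than three levels is an assumption.

```python
# Ordered from best to worst, as in the 4-grade rating scale described above.
ITEM_RATINGS = ["very good", "adequate", "doubtful", "inadequate"]
# Ordered from best to worst, as in the GRADE approach.
EVIDENCE_LEVELS = ["high", "moderate", "low", "very low"]


def box_rating(item_ratings):
    """Return the overall rating of one criteria box: the lowest item rating."""
    if not item_ratings:
        raise ValueError("a criteria box needs at least one rated item")
    # The worst rating has the largest index in ITEM_RATINGS.
    return max(item_ratings, key=ITEM_RATINGS.index)


def grade_quality(downgrades):
    """Start at 'high' and move down by the total number of downgraded levels.

    `downgrades` maps a factor (risk of bias, inconsistency, imprecision,
    indirectness) to 0, 1, or 2 levels. Capping at 'very low' is an
    assumption of this sketch.
    """
    for factor, levels in downgrades.items():
        if levels not in (0, 1, 2):
            raise ValueError(f"{factor}: downgrade must be 0, 1, or 2 levels")
    total = sum(downgrades.values())
    return EVIDENCE_LEVELS[min(total, len(EVIDENCE_LEVELS) - 1)]


# Hypothetical example: one doubtful item pulls the whole box down.
print(box_rating(["very good", "adequate", "doubtful"]))
# Hypothetical example: one level each for risk of bias and imprecision.
print(grade_quality({"risk of bias": 1, "imprecision": 1}))
```

Taking the worst item rating makes a box only as strong as its weakest methodological aspect, which is why a single “doubtful” item is enough to lower the rating of the whole box.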
Table 1. Definitions of measurement properties and criteria for good measurement properties.
No. | Property | Definition | Rating | Criteria |
---|---|---|---|---|
1 | Content validity | The extent to which the domain of interest is comprehensively sampled by the items in the questionnaire (the extent to which the measure represents all facets of the construct under question) | + | A clear description of the measurement purpose, target population, and constructs being measured is provided, and the target population was involved in item selection |
1 | Content validity | | ? | A description of the measurement purpose, target population, and constructs being measured is provided, but the target population was not involved in item selection |
1 | Content validity | | − | No description of the measurement purpose, target population, and constructs being measured |
2 | Structural validity | The degree to which the scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured | + | CTT: CFA: CFI or TLI or comparable measure >0.95 OR RMSEA <0.06 OR SRMR <0.08. IRT/Rasch: No violation of unidimensionality: CFI or TLI or comparable measure >0.95 OR RMSEA <0.06 OR SRMR <0.08 AND no violation of local independence: residual correlations among the items after controlling for the dominant factor <0.20 OR Q3's <0.37 AND no violation of monotonicity: adequate looking graphs OR item scalability >0.30 AND adequate model fit: IRT: χ² >0.01; Rasch: infit and outfit mean squares ≥0.5 and ≤1.5 OR Z‐standardized values >−2 and <2 |
2 | Structural validity | | ? | CTT: Not all information for “+” reported. IRT/Rasch: Model fit not reported |
2 | Structural validity | | − | Criteria for “+” not met |
3 | Internal consistency | The extent to which items in a (sub)scale are intercorrelated, thus measuring the same construct | + | At least low evidence for sufficient structural validity AND Cronbach's alpha(s) ≥0.70 for each unidimensional scale or subscale |
3 | Internal consistency | | ? | Criteria for “at least low evidence for sufficient structural validity” not met |
3 | Internal consistency | | − | At least low evidence for sufficient structural validity AND Cronbach's alpha(s) <0.70 for each unidimensional scale or subscale |
4 | Cross‐cultural validity | The degree to which the performance of the items on a translated or culturally adapted PROM is an adequate reflection of the performance of the items of the original version of the PROM | + | No important differences found between group factors (such as age, gender, language) in multiple group factor analysis OR no important DIF for group factors (McFadden's R² < 0.02) |
4 | Cross‐cultural validity | | ? | No multiple group factor analysis OR DIF analysis performed |
4 | Cross‐cultural validity | | − | Important differences between group factors OR DIF was found |
5 | Reliability | The extent to which patients can be distinguished from each other, despite measurement errors (relative measurement error) | + | ICC or weighted Kappa ≥0.70 |
5 | Reliability | | ? | ICC or weighted Kappa not reported |
5 | Reliability | | − | ICC or weighted Kappa <0.70 |
6 | Measurement error | The extent to which the scores on repeated measures are close to each other (absolute measurement error) | + | SDC or LoA < MIC |
6 | Measurement error | | ? | MIC not defined |
6 | Measurement error | | − | SDC or LoA > MIC |
7 | Criterion validity | The extent to which scores on a particular questionnaire relate to a gold standard | + | Correlation with gold standard ≥0.70 OR AUC ≥0.70 |
7 | Criterion validity | | ? | Not all information for “+” reported |
7 | Criterion validity | | − | Correlation with gold standard <0.70 OR AUC <0.70 |
8 | Hypothesis testing for construct validity | The extent to which scores on a particular questionnaire relate to other measures in a manner that is consistent with theoretically derived hypotheses concerning the concepts that are being measured | + | ≥75% of the results are in accordance with the hypotheses |
8 | Hypothesis testing for construct validity | | ? | No hypothesis defined (by the review team) |
8 | Hypothesis testing for construct validity | | − | Criteria for “+” not met |
9 | Responsiveness | The ability of a questionnaire to detect clinically important changes over time | + | The result is in accordance with the hypothesis OR AUC ≥0.70 |
9 | Responsiveness | | ? | No hypothesis defined (by the review team) |
9 | Responsiveness | | − | The result is not in accordance with the hypothesis OR AUC <0.70 |
Note: The criteria are based on Terwee et al. 11 and Prinsen et al. 10
Rating: “+” = sufficient, “−” = insufficient, “?” = indeterminate.
Abbreviations: AUC, area under the curve; CFA, confirmatory factor analysis; CFI, comparative fit index; CTT, classical test theory; DIF, differential item functioning; ICC, intraclass correlation coefficient; IRT, item response theory; LoA, limits of agreement; MIC, minimal important change; PROM, patient‐reported outcome measure; RMSEA, root mean square error of approximation; SEM, standard error of measurement; SDC, smallest detectable change; SRMR, standardized root mean residuals; TLI, Tucker‐Lewis index.
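For the properties in Table 1 whose criteria reduce to a single cut‐off, the rating logic can be written down compactly. The following Python sketch is illustrative only: the function names are hypothetical, treating a missing Cronbach's alpha as “indeterminate” is an assumption, and the more involved structural validity and measurement error criteria are deliberately left out.

```python
from typing import Optional

SUFFICIENT, INSUFFICIENT, INDETERMINATE = "+", "-", "?"


def rate_by_cutoff(value: Optional[float], cutoff: float = 0.70) -> str:
    """Rate reliability (ICC or weighted kappa) or criterion validity
    (correlation with the gold standard or AUC) against the 0.70 cut-off."""
    if value is None:  # statistic not reported
        return INDETERMINATE
    return SUFFICIENT if value >= cutoff else INSUFFICIENT


def rate_internal_consistency(alpha: Optional[float],
                              structural_validity_ok: bool) -> str:
    """Cronbach's alpha only counts once there is at least low evidence
    for sufficient structural validity."""
    if not structural_validity_ok or alpha is None:
        return INDETERMINATE  # a missing alpha is treated as '?' by assumption
    return SUFFICIENT if alpha >= 0.70 else INSUFFICIENT


def rate_hypothesis_testing(confirmed: int, tested: int) -> str:
    """Sufficient when at least 75% of the results match the hypotheses."""
    if tested == 0:  # no hypothesis defined
        return INDETERMINATE
    return SUFFICIENT if confirmed / tested >= 0.75 else INSUFFICIENT


# Hypothetical values, for illustration only.
print(rate_by_cutoff(0.82))                    # ICC of 0.82
print(rate_internal_consistency(0.65, True))   # alpha below the cut-off
print(rate_hypothesis_testing(3, 4))           # 3 of 4 hypotheses confirmed
```

Note that the code uses an ASCII hyphen for the “insufficient” rating, whereas Table 1 prints a minus sign.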
A validated scoring system from previous reviews was employed to provide an overall rating of the psychometric properties of each scale. 14 , 15 In this scoring system, the total score ranged from 0 to 18; a score of 0 to 2 points was given to each of the following nine psychometric properties: internal consistency, reliability, measurement error, content validity, structural validity, hypothesis testing, cross‐cultural validity, criterion validity, and responsiveness. Two points were given if the scale had a “sufficient” measurement property supported by “high” quality of evidence; one point if the scale had a “sufficient” measurement property supported by “moderate” quality of evidence; and 0 points if the scale had an “insufficient” or “indeterminate” measurement property and/or the quality of evidence was “low” or “very low.”
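The scoring rule above is simple enough to express directly in code. The sketch below is a minimal illustration under stated assumptions: the names `property_points` and `total_score` are hypothetical, properties that are missing from the input are treated as “indeterminate,” and the example scale is invented rather than taken from the review.

```python
PROPERTIES = [
    "internal consistency", "reliability", "measurement error",
    "content validity", "structural validity", "hypothesis testing",
    "cross-cultural validity", "criterion validity", "responsiveness",
]


def property_points(rating: str, evidence: str) -> int:
    """2 points: sufficient + high; 1 point: sufficient + moderate; else 0."""
    if rating == "sufficient" and evidence == "high":
        return 2
    if rating == "sufficient" and evidence == "moderate":
        return 1
    return 0


def total_score(assessments: dict) -> int:
    """Sum the points over the nine properties (range 0 to 18).

    `assessments` maps a property name to (rating, evidence quality).
    Properties that are absent count as indeterminate (an assumption).
    """
    unknown = set(assessments) - set(PROPERTIES)
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    default = ("indeterminate", "very low")
    return sum(property_points(*assessments.get(p, default)) for p in PROPERTIES)


# Hypothetical scale, for illustration only.
example = {
    "internal consistency": ("sufficient", "high"),      # 2 points
    "reliability": ("sufficient", "moderate"),           # 1 point
    "content validity": ("sufficient", "high"),          # 2 points
    "criterion validity": ("insufficient", "high"),      # 0 points
    "hypothesis testing": ("sufficient", "low"),         # 0 points
}
print(total_score(example))
```

Because a “sufficient” rating backed only by “low” evidence scores 0, the total reflects both the measurement property itself and the strength of the evidence behind it.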
A systematic literature search yielded 3825 articles. Figure 1 summarizes the literature search process. The figure was created with the R‐based web app “PRISMA2020” version 0.0.2. 16 Through the screening and assessment of eligibility, 206 scales were initially identified (see Table S1), and 522 articles used those scales. Of these scales, the following 40 scales that had been cited more than 20 times according to the Web of Science were identified: Alcohol Dependence Scale (ADS), 17 Alcohol Expectancy Questionnaire (AEQ), 18 Alcohol Expectancy Questionnaire‐Adolescent Form (AEQ‐A), 19 Addiction Severity Index (ASI), 20 Alcohol, Smoking, and Substance Involvement Screening Test (ASSIST), 21 Alcohol Use Disorders Identification Test (AUDIT), 22 Alcohol Use Disorders Identification Test‐Consumption (AUDIT‐C), 23 Alcohol Urge Questionnaire (AUQ), 24 Cut down, Annoyed, Guilty, and Eye‐opener (CAGE), 25 Comprehensive Effects of Alcohol (CEOA) questionnaire, 26 Clinical Institute Withdrawal Assessment (CIWA‐A), 27 Car, Relax, Alone, Forget, Friends, Trouble (CRAFFT), 28 Drinker Inventory of Consequences (DrInC), 29 Fast Alcohol Screening Test (FAST), 30 Form 90, 31 Lifetime Drinking History (LDH), 32 Leeds Dependence Questionnaire (LDQ), 33 MacAndrew Alcoholism Scale (Mac), 34 Maudsley Addiction Profile (MAP), 35 Michigan Alcoholism Screening Test (MAST), 36 Obsessive Compulsive Drinking Scale (OCDS), 37 Penn Alcohol Craving Scale (PACS), 38 Processes of Change (POC), 39 Readiness To Change Questionnaire (RTCQ), 40 Rutgers Alcohol Problem Index (RAPI), 41 Rapid Alcohol Problems Screen (RAPS4), 42 Self‐Administered Alcoholism Screening Test (SAAST), 43 Short Alcohol Dependence Data Questionnaire (SADD), 44 Severity of Alcohol Dependence Questionnaire (SADQ), 45 Substance Abuse Subtle Screening Inventory‐3 (SASSI‐3), 46 Severity of Dependence Scale (SDS), 47 Short Michigan Alcoholism Screening Test (SMAST), 48 Stages of Change Readiness and Treatment Eagerness Scale Alcohol version (SOCRATES‐A), 49 Self‐Rating the Effects of Alcohol Scale (SRE), 50 Semi‐Structured Assessment for the Genetics of Alcoholism (SSAGA), 51 T‐ACE, 52 Alcohol Timeline Follow‐back (TLFB), 53 Tolerance, Worry about drinking, Eye‐opener, Amnesia, and Cut down on drinking (TWEAK), 54 University of Rhode Island Change Assessment Scale (URICA), 55 and Young Adult Alcohol Consequences Questionnaire (YAACQ). 56 Thus, 314 articles that used these 40 alcohol scales were included in this review.
Figure 1. PRISMA flow diagram. The flow diagram depicts the search and selection process applied throughout the systematic review. It maps out the number of records identified, included, and excluded, and the reasons for exclusions.
Table S2 describes the scales that were included. All scales were developed in English. The scales were categorized into three groups: 15 screening tests, 22 symptom assessment scales, and three assessment scales for drinking behaviors. They were also classified by purpose into the following seven groups: (1) to screen for alcohol use disorders (n = 10), (2) to assess the severity of alcohol‐related problems, including alcohol use disorders (n = 7), (3) to assess specific symptoms or aspects of alcohol use disorders, such as craving and withdrawal (n = 8), (4) to assess motivation or illness stages during the treatment process (n = 4), (5) to assess alcohol‐related problems in special populations, including adolescents and pregnant women (n = 6), (6) to quantify alcohol use (n = 3), and (7) to assess the treatment outcome in patients (n = 2). Revised versions were available for three scales (7.5%); however, the original versions were predominantly used, with a few exceptions such as AEQ‐A, CIWA‐A, and SOCRATES‐A. The AEQ‐A is a modified version of the AEQ for adolescents. CIWA‐A and SOCRATES‐A are modified versions of CIWA and SOCRATES explicitly designed for patients with alcohol use disorder.
Table 2 presents the overall quality scores and scores for each quality criterion. The measurement properties and methodological quality of the alcohol scales are summarized in Table S3. The mean score was 6.3 points, and only 7.5% (three of 40) of the scales, namely AUDIT, ADS, and SADD, scored higher than half of the full mark (i.e., nine points).
Table 2. Summary of the overall rating scores for the alcohol scales included.
Scale name | Internal consistency | Reliability | Measurement error | Content validity | Structural validity | Criterion validity | Hypothesis testing | Cross‐cultural validity | Responsiveness | Total score |
---|---|---|---|---|---|---|---|---|---|---|
ADS | 2 | 1 | 0 | 1 | 2 | 2 | 2 | 0 | 0 | 10 |
AEQ | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 4 |
AEQ‐A | 2 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 6 |
ASI | 0 | 2 | 0 | 1 | 1 | 2 | 2 | 0 | 0 | 8 |
ASSIST | 2 | 0 | 0 | 2 | 0 | 2 | 2 | 0 | 0 | 8 |
AUDIT | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 1 | 0 | 13 |
AUDIT‐C | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 0 | 0 | 5 |
AUQ | 2 | 0 | 0 | 2 | 2 | 0 | 2 | 0 | 0 | 8 |
CAGE | 0 | 0 | 0 | 1 | 2 | 2 | 2 | 0 | 0 | 7 |
CEOA | 1 | 0 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 6 |
CIWA‐A | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
CRAFFT | 0 | 0 | 0 | 1 | 2 | 2 | 2 | 0 | 0 | 7 |
DrInC | 2 | 0 | 0 | 1 | 1 | 0 | 2 | 1 | 0 | 7 |
FAST | 1 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | 0 | 7 |
Form 90 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 5 |
LDH | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 4 |
LDQ | 2 | 0 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 7 |
Mac | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 3 |
MAP | 2 | 2 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 7 |
MAST | 1 | 0 | 0 | 1 | 0 | 2 | 2 | 0 | 0 | 6 |
OCDS | 2 | 1 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 8 |
PACS | 2 | 0 | 0 | 1 | 2 | 0 | 2 | 0 | 0 | 7 |
POC | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 3 |
RTCQ | 2 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 7 |
RAPI | 2 | 0 | 0 | 2 | 2 | 0 | 2 | 0 | 0 | 8 |
RAPS4 | 2 | 0 | 0 | 2 | 0 | 2 | 2 | 0 | 0 | 8 |
SAAST | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
SADD | 2 | 2 | 0 | 2 | 1 | 2 | 2 | 0 | 0 | 11 |
SADQ | 2 | 1 | 0 | 2 | 1 | 0 | 2 | 0 | 0 | 8 |
SASSI‐3 | 0 | 1 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 5 |
SDS | 2 | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 7 |
SMAST | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 4 |
SOCRATES | 2 | 2 | 0 | 2 | 1 | 0 | 2 | 0 | 0 | 9 |
SRE | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 6 |
SSAGA | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
T‐ACE | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 3 |
TLFB | 0 | 2 | 0 | 1 | 2 | 0 | 2 | 0 | 0 | 7 |
TWEAK | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 4 |
URICA | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 6 |
YAACQ | 2 | 0 | 0 | 2 | 2 | 0 | 2 | 0 | 0 | 8 |
Abbreviations: ADS, Alcohol Dependence Scale; AEQ, Alcohol Expectancy Questionnaire; AEQ‐A, Alcohol Expectancy Questionnaire‐Adolescent Form; ASI, Addiction Severity Index; ASSIST, Alcohol, Smoking, and Substance Involvement Screening Test; AUDIT, Alcohol Use Disorders Identification Test; AUDIT‐C, Alcohol Use Disorders Identification Test‐Consumption; AUQ, Alcohol Urge Questionnaire; CAGE, Cut down, Annoyed, Guilty, and Eye‐opener; CEOA, Comprehensive Effects of Alcohol questionnaire; CIWA‐A, Clinical Institute Withdrawal Assessment; CRAFFT, Car, Relax, Alone, Forget, Friends, Trouble; DrInC, Drinker Inventory of Consequences; FAST, Fast Alcohol Screening Test; LDH, Lifetime Drinking History; LDQ, Leeds Dependence Questionnaire; Mac, MacAndrew Alcoholism Scale; MAP, Maudsley Addiction Profile; MAST, Michigan Alcoholism Screening Test; OCDS, Obsessive Compulsive Drinking Scale; PACS, Penn Alcohol Craving Scale; POC, Processes of Change; RTCQ, Readiness To Change Questionnaire; RAPI, Rutgers Alcohol Problem Index; RAPS4, Rapid Alcohol Problems Screen; SAAST, Self‐Administered Alcoholism Screening Test; SADD, Short Alcohol Dependence Data Questionnaire; SADQ, Severity of Alcohol Dependence Questionnaire; SASSI‐3, Substance Abuse Subtle Screening Inventory‐3; SDS, Severity of Dependence Scale; SMAST, Short Michigan Alcoholism Screening Test; SOCRATES‐A, Stages of Change Readiness and Treatment Eagerness Scale Alcohol version; SRE, Self‐Rating the Effects of Alcohol Scale; SSAGA, Semi‐Structured Assessment for the Genetics of Alcoholism; TLFB, Alcohol Timeline Follow‐back; TWEAK, Tolerance, Worry about drinking, Eye‐opener, Amnesia, and Cut down on drinking; URICA, University of Rhode Island Change Assessment Scale; YAACQ, Young Adult Alcohol Consequences Questionnaire.
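The totals and summary statistics in Table 2 can be recomputed from the per‐property scores. The Python sketch below transcribes the nine property columns of Table 2 (in the table's column order) and derives the total for each scale, the mean, and the scales scoring above nine points. The variable names are hypothetical, and the snippet is meant as a reproducibility aid rather than as part of the review's method.

```python
# One row per scale: name followed by the nine property scores, in the
# column order of Table 2 (internal consistency ... responsiveness).
TABLE_2 = """
ADS 2 1 0 1 2 2 2 0 0
AEQ 0 0 0 2 2 0 0 0 0
AEQ-A 2 0 0 2 1 0 1 0 0
ASI 0 2 0 1 1 2 2 0 0
ASSIST 2 0 0 2 0 2 2 0 0
AUDIT 2 2 0 2 2 2 2 1 0
AUDIT-C 0 0 0 1 0 2 2 0 0
AUQ 2 0 0 2 2 0 2 0 0
CAGE 0 0 0 1 2 2 2 0 0
CEOA 1 0 0 2 2 0 1 0 0
CIWA-A 0 0 0 1 0 0 0 0 0
CRAFFT 0 0 0 1 2 2 2 0 0
DrInC 2 0 0 1 1 0 2 1 0
FAST 1 1 0 1 0 2 2 0 0
Form 90 0 2 0 1 0 0 2 0 0
LDH 0 1 0 1 0 0 2 0 0
LDQ 2 0 0 2 2 0 1 0 0
Mac 0 0 0 2 0 0 1 0 0
MAP 2 2 0 2 0 0 1 0 0
MAST 1 0 0 1 0 2 2 0 0
OCDS 2 1 0 1 1 1 2 0 0
PACS 2 0 0 1 2 0 2 0 0
POC 0 0 0 1 2 0 0 0 0
RTCQ 2 1 0 2 2 0 0 0 0
RAPI 2 0 0 2 2 0 2 0 0
RAPS4 2 0 0 2 0 2 2 0 0
SAAST 0 1 0 1 0 0 0 0 0
SADD 2 2 0 2 1 2 2 0 0
SADQ 2 1 0 2 1 0 2 0 0
SASSI-3 0 1 0 2 0 0 2 0 0
SDS 2 0 0 1 2 1 1 0 0
SMAST 0 0 0 2 0 2 0 0 0
SOCRATES 2 2 0 2 1 0 2 0 0
SRE 2 1 0 1 0 0 2 0 0
SSAGA 0 1 0 1 0 0 0 0 0
T-ACE 0 0 0 1 0 0 2 0 0
TLFB 0 2 0 1 2 0 2 0 0
TWEAK 0 0 0 1 0 1 2 0 0
URICA 2 0 0 2 2 0 0 0 0
YAACQ 2 0 0 2 2 0 2 0 0
"""

totals = {}
for line in TABLE_2.strip().splitlines():
    # Split from the right so that names containing a space ("Form 90") survive.
    name, *scores = line.rsplit(maxsplit=9)
    totals[name] = sum(int(s) for s in scores)

mean_score = sum(totals.values()) / len(totals)
above_nine = sorted(
    (name for name, total in totals.items() if total > 9),
    key=lambda name: -totals[name],
)

print(len(totals), round(mean_score, 1))
print([(name, totals[name]) for name in above_nine])
```

Run as written, this reproduces the figures reported in the text: 40 scales, a mean of 6.3 points, and AUDIT (13), SADD (11), and ADS (10) as the only scales above nine points.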
Most of the scales (37 of 40) received a full score for at least one quality criterion, while CIWA‐A, SAAST, and SSAGA did not. Thirty‐two (80.0%) and 19 (47.5%) scales received full scores for at least two and at least three criteria, respectively. Overall, the highest overall quality scores were given to the AUDIT (13 points), SADD (11 points), and ADS (10 points), in rank order. Regarding the quality of evidence, no scale had a moderate to high level of evidence for all nine psychometric properties. In particular, 40 scales (100.0%) and 39 scales (97.5%) did not have any information on measurement error and responsiveness, respectively.
All scales had acceptable content validity and were given at least one point. In particular, 19 scales (AEQ, AEQ‐A, ASSIST, AUDIT, AUQ, CEOA, LDQ, Mac, MAP, RTCQ, RAPI, RAPS4, SADD, SADQ, SASSI‐3, SMAST, SOCRATES, URICA, and YAACQ) achieved the full score for content validity as the selection of items was based on a preliminary assessment of the target populations. Other scales were given one point since the item selection was not preliminarily examined in the target populations during the development process.
Sixteen scales (ADS, AEQ, AUDIT, AUQ, CAGE, CEOA, CRAFFT, LDQ, PACS, POC, RTCQ, RAPI, SDS, TLFB, URICA, and YAACQ) achieved the full score for structural validity. Confirmatory factor analyses (CFA) were performed for all these scales, and the results met the criteria for sufficient structural validity proposed by the COSMIN, measured with comparative fit index (CFI), Tucker‐Lewis index (TLI), root mean square error of approximation (RMSEA), and standardized root mean residuals (SRMR). Structural validity was also assessed with CFA for DrInC, OCDS, SADD, SADQ, and SOCRATES, but the quality of evidence was considered low due to inconsistencies in the results. Thus, these five scales were given one point. AEQ‐A and ASI were analyzed using principal component analysis. Since this analysis method was considered less statistically strict than CFA according to the COSMIN, these scales were given one point. Other scales were given 0 points as there was no information on the structural validity.
Twenty scales (ADS, AEQ‐A, ASSIST, AUDIT, AUQ, DrInC, LDQ, MAP, OCDS, PACS, RTCQ, RAPI, RAPS4, SADD, SADQ, SDS, SOCRATES, SRE, URICA, and YAACQ) achieved the full score for internal consistency. Factor analyses confirmed that the items from the scales formed unidimensional scales or subscales. Cronbach's alpha(s) were calculated per unidimensional scale or subscale and were found to be between 0.70 and 0.95, indicating good internal consistency for these scales. The alphas calculated for the CEOA, FAST, and MAST were greater than 0.70, but the quality of evidence was considered low due to the risk of bias; thus, they were given one point. Since alphas were not reported for the AEQ, CIWA‐A, Form 90, LDH, Mac, SSAGA, and TLFB, they were given 0 points. All other scales were given 0 points as alphas were reported to be less than 0.70, or there were inconsistencies in the reported alphas between studies.
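For readers unfamiliar with the statistic, Cronbach's alpha for a scale with k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The Python sketch below computes it from raw item scores. The function name `cronbach_alpha` and the response data are hypothetical, and the use of sample variances is a choice of this sketch, not something specified in the review.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one unidimensional scale or subscale.

    `item_scores` holds one list per item; each list contains that item's
    scores for the same respondents in the same order.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(item_scores[0])
    if n < 2 or any(len(item) != n for item in item_scores):
        raise ValueError("every item needs scores from the same >= 2 respondents")
    # Total score per respondent, summed across items.
    totals = [sum(responses) for responses in zip(*item_scores)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses of five people to a three-item subscale.
items = [
    [0, 1, 2, 3, 4],
    [1, 1, 2, 4, 4],
    [0, 2, 2, 3, 3],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2), "meets the 0.70 criterion" if alpha >= 0.70 else "below 0.70")
```

As the paragraph above notes, the coefficient is only meaningful per unidimensional scale or subscale, so a multidimensional instrument would need one call per subscale.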
Six scales (ADS, ASSIST, AUDIT, DrInC, URICA, and YAACQ) were assessed with a focus on differences in average scores between languages or sexes. AUDIT and DrInC showed a moderate quality of evidence and were given one point for cross‐cultural validity. The other four scales were given 0 points because neither multigroup factor analysis nor differential item functioning analysis was performed, leaving insufficient information to assess cross‐cultural validity. The remaining scales were also given 0 points, as there was no information on cross‐cultural validity.
Seven scales (ASI, AUDIT, Form 90, MAP, SADD, SOCRATES, and TLFB) achieved the full score on the assessment of reliability. Weighted kappa or intraclass correlation coefficient (ICC) was calculated with adequate sample sizes (more than 100) and was found to be more than 0.70 for these scales. ADS, FAST, OCDS, RTCQ, SAAST, SADQ, SASSI‐3, SRE, and SSAGA were given one point as their weighted kappa or ICCs were greater than 0.70, but the quality of evidence was considered low due to imprecision or the risk of bias. Zero points were given to the other scales since weighted kappa or ICC was not reported, or sample sizes were as small as
No information on measurement errors for any of the scales included in this review was available.
Regarding the assessment of criterion validity, two diagnostic criteria, that is, the Diagnostic and Statistical Manual of Mental Disorders, fourth or fifth edition (DSM‐IV 5 or DSM‐5 57 ) and the International Statistical Classification of Diseases, Tenth Revision (ICD‐10), 58 were considered to be the gold standard for comparison with new scales. Apart from the diagnostic criteria, only the AUDIT was considered a gold standard, as it was widely used. Areas under the receiver operating characteristic curves (AUROCs) were calculated to examine the extent to which the new scales could detect alcohol use disorders based on the diagnostic criteria. When AUDIT was used as the gold standard, correlation coefficients were calculated to examine the relationship between the scores of the new scales and AUDIT. Twelve scales (ADS, ASI, ASSIST, AUDIT, AUDIT‐C, CAGE, CRAFFT, FAST, MAST, RAPS4, SADD, and SMAST) achieved the full score for criterion validity as either AUROCs or correlation coefficients were greater than 0.70. The OCDS, SDS, and TWEAK were given one point since either AUROCs or correlation coefficients reported were >0.70, but the quality of evidence was considered low due to indirectness and imprecision. The remaining scales were given 0 points as neither AUROCs nor coefficients were reported, or there was no information on the criterion validity.
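To make the AUROC criterion concrete, the area under the ROC curve equals the probability that a randomly chosen person with the diagnosis scores higher on the scale than a randomly chosen person without it, with ties counted as one half. The Python sketch below computes it by direct pairwise comparison. The function name `auroc` and the scores are hypothetical and serve only to illustrate the 0.70 threshold; they are not data from any included study.

```python
def auroc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison.

    `case_scores`: scale scores of participants meeting the diagnostic
    criteria (the gold standard); `control_scores`: scores of those who do not.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups need at least one participant")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical screening scores, for illustration only.
with_diagnosis = [12, 15, 9, 20]
without_diagnosis = [3, 8, 9, 5]

area = auroc(with_diagnosis, without_diagnosis)
print(area, "sufficient" if area >= 0.70 else "insufficient")
```

The pairwise form is quadratic in the number of participants, which is acceptable for an illustration; a rank‐based computation would be preferable for large samples.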
Twenty‐seven scales (ADS, ASI, ASSIST, AUDIT, AUDIT‐C, AUQ, CAGE, CRAFFT, DrInC, FAST, Form 90, LDH, Mac, MAST, OCDS, PACS, RAPI, RAPS4, SADD, SADQ, SASSI‐3, SOCRATES, SRE, T‐ACE, TLFB, TWEAK, and YAACQ) were compared with various existing scales, and more than 75% of the results were in accordance with the hypotheses, resulting in a full score on the hypothesis testing for construct validity. The AEQ‐A, CEOA, LDQ, MAP, and SDS were given one point because more than 75% of the results were in accordance with the hypotheses, but the quality of evidence was considered low due to the risk of bias and inconsistencies among studies. Other scales were given 0 points since the results were not in accordance with the hypotheses or no hypotheses were defined.
Changes over time were examined only for DrInC. Pre‐ and post‐treatment DrInC scores were compared among participants with alcohol use disorder. However, this scale was given 0 points because the minimal important change was not calculated, and the evidence was therefore judged to be limited. Other scales were given 0 points as there was no information on responsiveness.
This systematic review comprehensively appraised the measurement properties of existing scales of alcohol‐related problems, including alcohol use disorder. Following the COSMIN, we rigorously quantified the sufficiency of measurement properties and methodological quality and presented evidence‐based recommendations for scales of alcohol use disorder or alcohol‐related problems. Our findings were as follows: (1) the AUDIT, SADD, and ADS received the highest ratings for evidence of psychometric properties; (2) the 40 scales included in this review were considerably heterogeneous in terms of their scope, purpose, and target symptoms.
Although the AUDIT, SADD, and ADS were rated the highest for evidence of psychometric properties, their level of evidence was only moderate according to the COSMIN. The mean overall quality score, which reflects both measurement properties and methodological quality, was 6.3, well below the maximum of 18. These results indicate that the available evidence is insufficient to establish the measurement properties of the existing scales. None of the 40 scales had moderate to high evidence for all nine psychometric properties. In particular, measurement error, responsiveness, and cross‐cultural validity have rarely been examined. Lack of evidence on psychometric properties is a challenge not only for scales for alcohol‐related problems but also for scales for psychiatric disorders in general. Recently, a few systematic reviews of scales for other psychiatric conditions, such as anxiety symptoms in schizophrenia 59 and postpartum depression 60, have been published using the COSMIN framework; however, they did not report on measurement error, cross‐cultural validity, or responsiveness.
Measurement error refers to the systematic and random error of an individual's score that is not due to true changes in the construct measured; quantifying it allows us to judge whether a change in an individual's score is statistically meaningful. Responsiveness is the ability of a scale to detect changes over time in the construct measured, and it plays an important role in evaluating clinical changes longitudinally. Despite their importance, these properties have been understudied. Cross‐cultural validity refers to whether a scale originally developed in one culture performs adequately in another. Nonetheless, most of the major scales have not been validated in non‐English languages. These findings highlight the need for validation studies in various populations.
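To illustrate why measurement error matters for interpreting change in an individual's score, the following minimal sketch computes the standard error of measurement (SEM) and the smallest detectable change (SDC) using the conventional formulas SEM = SD × √(1 − reliability) and SDC = 1.96 × √2 × SEM. The function names and the numerical inputs are hypothetical and chosen for illustration only; they are not drawn from any study included in this review.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability); reliability is a coefficient in [0, 1], e.g. an ICC."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def smallest_detectable_change(sem: float, z: float = 1.96) -> float:
    """SDC for an individual = z * sqrt(2) * SEM (z = 1.96 for 95% confidence)."""
    return z * math.sqrt(2.0) * sem


# Hypothetical inputs (assumptions, not data from the reviewed scales):
# a total-score standard deviation of 6.0 and a test-retest reliability of 0.84.
sem = standard_error_of_measurement(sd=6.0, reliability=0.84)
sdc = smallest_detectable_change(sem)
print(f"SEM = {sem:.2f}, SDC = {sdc:.2f}")
```

Under these assumed inputs, a change in an individual's score smaller than the SDC could not be distinguished from measurement error with 95% confidence, which is why the absence of such estimates limits the longitudinal use of a scale.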
Because the problems associated with alcohol use are multidimensional, with psychiatric, psychological, physical, and social aspects, assessment scales have been developed from various perspectives. The 40 scales differed widely in their scope, purpose, target symptoms, and target populations. They were developed for a variety of purposes (e.g., to screen for alcohol use disorders, to assess the severity of alcohol‐related problems, including alcohol use disorders, to evaluate specific symptoms or aspects of alcohol use disorders, to assess illness stages during the treatment process, to examine alcohol‐related problems in children and adolescents, and to quantify alcohol use). Given this multidimensionality, physicians and researchers should choose scales carefully and, where appropriate, combine different scales according to the purpose, study population, or setting. For example, the AUDIT can identify problematic alcohol use and related problems and assess their severity in a wide variety of populations and settings, from the general public to patients with alcohol problems. The ADS and SADD evaluate the severity of alcohol dependence and allow us to identify individuals who require further assessment and treatment in clinical settings. While the ADS is comprehensive and can be used for research purposes, the SADD focuses on specific symptoms related to alcohol dependence, such as impaired control over drinking and withdrawal symptoms.
This study has some limitations. First, we cannot rule out the possibility that relevant scales were missed, for the following reasons: the three electronic databases used in this study may not cover all relevant articles, scales written in languages other than English were not searched, and the inclusion criterion of being cited more than 20 times was difficult for recently established scales to meet. Second, this review included only scales written in English; thus, our results cannot necessarily be extrapolated to non‐English‐speaking populations. Third, the possibility of publication bias should be acknowledged, given the lack of registries for this type of study. Finally, as diagnostic criteria for alcohol‐related disorders, such as those in the DSM and ICD, have changed over the past few decades, older scales that have not been revised recently might no longer be useful.
Among the 40 identified scales of alcohol‐related problems, the AUDIT, ADS, and SADD were rated the highest for evidence of psychometric properties. However, even these scales showed only a moderate level of evidence as assessed with the COSMIN. Moreover, measurement error, responsiveness, and cross‐cultural validity were not evaluated or reported for any of the 40 scales. These findings underscore the need for further evidence to ensure scale quality. In addition, these scales differ widely in terms of measurement methods, target populations, and psychometric properties. In light of the complex clinical manifestations and resultant outcomes of alcohol‐related problems, it may be advisable to select and combine scales according to the purpose of the assessment.
YO was involved in conceptualization, data curation, formal analysis, and writing. FU was involved in conceptualization, data curation, supervision, and writing. MK, SM, and MM were involved in conceptualization and supervision. HU was involved in conceptualization, methodology, project administration, supervision, and writing. All authors have approved the final version of the manuscript.
No funding support was obtained for this report.
YO has received manuscript fees from Dainippon Sumitomo Pharma within the past 3 years. FU has received fellowship grants from Discovery Fund, Nakatani Foundation, and the Canadian Institutes of Health Research (CIHR); manuscript fees from Dainippon Sumitomo Pharma; and consultant fees from VeraSci and Uchiyama Underwriting within the past 3 years. MK has received speaker's honoraria from Otsuka Pharmaceutical and Sumitomo Pharma within the past 3 years. SM has received a research grant from Asahi Breweries Ltd. and speaker's honoraria from EA Pharma, Ono Yakuhin, Otsuka Pharmaceutical, and Yoshitomi Yakuhin within the past 3 years. MM has received grants and/or speaker honoraria from Asahi Kasei Pharma, Astellas Pharmaceutical, Daiichi Sankyo, Dainippon‐Sumitomo Pharma, Eisai, Eli Lilly, Fuji Film RI Pharma, Janssen Pharmaceutical, Kracie, Meiji‐Seika Pharma, Mochida Pharmaceutical, MSD, Novartis Pharma, Ono Yakuhin, Otsuka Pharmaceutical, Pfizer, Shionogi, Takeda Yakuhin, Tanabe Mitsubishi Pharma, and Yoshitomi Yakuhin within the past 3 years. HU has received grants from Eisai, Otsuka Pharmaceutical, Dainippon‐Sumitomo Pharma, Daiichi Sankyo Company, and Mochida Pharmaceutical; speaker's honoraria from Otsuka Pharmaceutical, Dainippon‐Sumitomo Pharma, Eisai, Janssen Pharmaceuticals, Lundbeck Japan, and Meiji‐Seika Pharma; and advisory panel payments from Dainippon‐Sumitomo Pharma and Lundbeck Japan within the past 3 years.
Approval of the Research Protocol by an Institutional Review Board: Not applicable.
Informed Consent: Not applicable.
Registry and Registration No. of the Study/Trial: International Prospective Register of Systematic Reviews (CRD42020204055).
Animal Studies: Not applicable.