I'm using likelihood ratio testing to assess whether a behavioral model is a better description of my data than a simpler (so called *restricted*) model.

How should results of such statistical tests be reported?

General reporting recommendations such as those of the APA Publication Manual apply. One should report the exact p-value and an effect size along with its confidence interval. In the case of a likelihood ratio test, one should report the test's p-value and how much more likely the data is under model A than under model B.

Example: The data is 7.3, 95% CI [6.8, 8.1], times more likely under Model A than under Model B. The hypothesis that the data is equally likely under the two models was rejected with p = 0.006.

The above statements already indicate that the likelihood ratio test does not tell you which

> model is a better description of my data

as the likelihood is $p(\mathrm{Data} \mid \mathrm{Model})$, and to learn which model is a better description of the data you need to compute $p(\mathrm{Model} \mid \mathrm{Data})$.

The likelihood ratio test statistic is asymptotically distributed as χ² with degrees of freedom equal to the difference in the number of free parameters between the two models. So, to give an example, when dropping one parameter from a model you would report it like this:

χ² (1) = 3.4, p = 0.065
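In code, the statistic and p-value come straight from the two models' maximized log-likelihoods. A minimal sketch (the log-likelihood values are invented to reproduce the example numbers above):

```python
from scipy import stats

# Hypothetical maximized log-likelihoods of the fuller and restricted models
ll_full = -1240.3
ll_restricted = -1242.0
df_diff = 1  # one parameter dropped

# Likelihood ratio statistic: 2 * (logL_full - logL_restricted)
lr_stat = 2 * (ll_full - ll_restricted)
# Upper-tail chi-squared probability with df_diff degrees of freedom
p_value = stats.chi2.sf(lr_stat, df_diff)

print(f"chi2({df_diff}) = {lr_stat:.1f}, p = {p_value:.3f}")
```

Here `lr_stat` works out to 3.4 and p ≈ 0.065, matching the report above.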

## P < .05

I’m not great at everything, but I do understand statistics pretty well. I took five graduate-level stats courses in a Ph.D. program at the University of New Hampshire and I have been teaching statistics, at both the undergraduate and graduate levels, since 1996. And I’ve written (along with Sara Hall) a textbook on the topic.

In teaching statistics for so long, I always find myself disappointed in the attitudes that many people hold about this area of applied mathematics. People often have themselves convinced that they “are not good at math” and that they “will never really understand stats.” This is a problem for two reasons: First, these attitudinal barriers get in the way of otherwise bright and confident people fully reaching their educational potential. Second, with so many people out there being stat-phobic, the job of researchers becomes almost too easy! If people out there are afraid of statistics, then researchers who have even just a few tools in their stats toolbox can really just say whatever they want—with the statistically phobic given little choice but to believe what they are presented with.

This post is designed to help address one of the major statistical concepts that often gets people to throw their hands up—*statistical significance*.

**What is Statistical Significance? What does p < .05 mean?**

*Statistical significance*, often represented by the term *p < .05*, has a very straightforward meaning. If a finding is said to be “statistically significant,” that simply means that the pattern of findings in a study is likely to generalize to the broader population of interest. That is it.

For instance, suppose you did a study with 100 cats and 100 dogs. And you found that, in your sample, 80 of the dogs were able to be trained to go through a hoop and only one cat was able to be trained to go through a hoop. And suppose you ran some statistical test and found that *p < .05*. That would simply mean that the pattern you found, with dogs being better at jumping through hoops, is likely to be a pattern that holds across the entire population of dogs and the entire population of cats. Further, this statistical language implies that the probability of the pattern of findings from the study *not* generalizing to the broader populations of interest is very small—less than 5% (thus, p < .05)—with *p* meaning *probability* and *.05* simply meaning *5%*.
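For a concrete sense of where such a p-value would come from, the hypothetical counts above can be run through a standard test (a sketch; the choice of test is mine, not the author's):

```python
from scipy.stats import fisher_exact

# Hypothetical study counts: 80/100 dogs vs 1/100 cats successfully trained
table = [[80, 20],   # dogs: trained, not trained
         [1, 99]]    # cats: trained, not trained
odds_ratio, p = fisher_exact(table)

print(p < 0.05)  # prints True: the difference is statistically significant
```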

What is magical about 5%? Well, nothing really! It’s kind of a practical benchmark that statisticians have come to use as a standard over years and years and across lots of different disciplines. It’s a worthy question, but also a question for a different post as it raises a whole bunch of other, more complex issues.

**Bottom Line**

Statistics are tools used by psychologists and behavioral scientists. They are designed neither to be scary nor mysterious. They are straightforward mathematical tools designed to help us better understand the world. *Statistical significance* and its related term *p < .05* are simple concepts—simply meaning that the pattern found in a sample likely generalizes to the broader population of interest that is being studied. There’s no abracadabra there!

**References and Acknowledgment**

Thanks to my graduate student, Vania Rolon, whose speech at SUNY New Paltz, given as part of the ribbon-cutting for the newly renovated Wooster Hall and focused on her passion for teaching statistics, partly inspired this post.

## What does effect size tell you?

Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just whether a treatment affects people, but how much it affects them.

###### What is effect size?

Effect size is a quantitative measure of the magnitude of the experimental effect. The larger the effect size the stronger the relationship between two variables.

You can look at the effect size when comparing any two groups to see how substantially different they are. Typically, research studies will comprise an experimental group and a control group. The experimental group may receive an intervention or treatment which is expected to affect a specific outcome.

For example, we might want to know the effect of a therapy on treating depression. The effect size value will show us whether the therapy has had a small, medium or large effect on depression.

###### How to calculate and interpret effect sizes

Effect sizes either measure the sizes of associations between variables or the sizes of differences between group means.

### Cohen's d

Cohen's d is an appropriate effect size for the comparison between two means. It can be used, for example, to accompany the reporting of t-test and ANOVA results. It is also widely used in meta-analysis.

To calculate the standardized mean difference between two groups, subtract the mean of one group from the other (M1 – M2) and divide the result by the standard deviation (SD) of the population from which the groups were sampled.

A *d* of 1 indicates the two groups differ by 1 standard deviation, a *d* of 2 indicates they differ by 2 standard deviations, and so on. In other words, *d* expresses the mean difference in standard deviation (z-score) units.

Cohen suggested that *d* = 0.2 be considered a 'small' effect size, 0.5 represents a 'medium' effect size and 0.8 a 'large' effect size. This means that if the difference between two groups' means is less than 0.2 standard deviations, the difference is negligible, even if it is statistically significant.
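As a sketch, Cohen's d can be computed directly. This version divides by the pooled sample standard deviation, a common stand-in for the usually unknown population SD mentioned above, and the data values are made up:

```python
import math

def cohens_d(group1, group2):
    """Standardized mean difference: (M1 - M2) / pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    # Sample variances (denominator n - 1)
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

treatment = [5, 6, 7, 8, 9]  # hypothetical depression scores
control = [3, 4, 5, 6, 7]
print(round(cohens_d(treatment, control), 2))  # 1.26 -> a 'large' effect
```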

### Pearson r correlation

This effect size summarises the strength of a bivariate relationship. The value of the Pearson r correlation ranges from -1 (a perfect negative correlation) to +1 (a perfect positive correlation).

According to Cohen (1988, 1992), the effect size is small if the value of r is around 0.1, medium if r is around 0.3, and large if r is around 0.5 or more.
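A minimal sketch of computing r from paired data (the values are illustrative only):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 1.0  (perfect positive)
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 4))  # -1.0 (perfect negative)
```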

###### Why report effect sizes?

### The p-value is not enough

A lower *p*-value is sometimes interpreted as meaning there is a stronger relationship between two variables. However, statistical significance only means that the observed result would be unlikely (less than 5% probable) if the null hypothesis were true.

Therefore, a significant *p*-value tells us that an intervention has an effect, whereas an effect size tells us how large that effect is.

It can be argued that emphasizing the size of effect promotes a more scientific approach, as unlike significance tests, effect size is independent of sample size.

### To compare the results of studies done in different settings

Unlike a *p*-value, effect sizes can be used to quantitatively compare the results of studies done in different settings. They are widely used in meta-analysis.

## Three Popular Psychology Studies That Didn't Hold Up

Researchers re-did 100 published psychology studies, and many did not check out. These are three of the studies, and some possible explanations for why they couldn’t be replicated.

The project began in 2011, when a University of Virginia psychologist decided to find out whether suspect science was a widespread problem. He and his team recruited more than 250 researchers, identified 100 studies published in 2008, and rigorously redid the experiments in close collaboration with the original authors.

The new analysis, called the Reproducibility Project, found no evidence of fraud or that any original study was definitively false. Rather, it concluded that the evidence for most published findings was not nearly as strong as originally claimed.

Dr. John Ioannidis, a director of Stanford University’s Meta-Research Innovation Center, who once estimated that about half of published results across medicine were inflated or wrong, noted the proportion in psychology was even larger than he had thought. He said the problem could be even worse in other fields, including cell biology, economics, neuroscience, clinical medicine, and animal research.

The report appears at a time when the number of retractions of published papers is rising sharply in a wide variety of disciplines. Scientists have pointed to a hypercompetitive culture across science that favors novel, sexy results and provides little incentive for researchers to replicate the findings of others, or for journals to publish studies that fail to find a splashy result.

“We see this is a call to action, both to the research community to do more replication, and to funders and journals to address the dysfunctional incentives,” said Brian Nosek, a psychology professor at the University of Virginia and executive director of the Center for Open Science, the nonprofit data-sharing service that coordinated the project published Thursday, in part with $250,000 from the Laura and John Arnold Foundation. The center has begun an effort to evaluate widely cited results in cancer biology, and experts said that the project could be adapted to check findings in many sciences.

In a conference call with reporters, Marcia McNutt, the editor in chief of Science, said, “I caution that this study should not be regarded as the last word on reproducibility but rather a beginning.” In May, after two graduate students raised questions about the data in a widely reported study on how political canvassing affects opinions of same-sex marriage, Science retracted the paper.

The new analysis focused on studies published in three of psychology’s top journals: Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition.

The act of double-checking another scientist’s work has been divisive. Many senior researchers resent the idea that an outsider, typically a younger scientist with less expertise, would critique work that often has taken years of study to pull off.

“There’s no doubt replication is important, but it’s often just an attack, a vigilante exercise,” said Norbert Schwarz, a professor of psychology at the University of Southern California.

## Statistical significance and generalizability of effect size estimates

Consider two sets of observations with *M*_{1} = 7.7, *SD*_{1} = 0.95, and *M*_{2} = 8.7, *SD*_{2} = 0.82. Depending on whether the data were collected in a between- or within-subjects design, the effect size partial eta squared (η_{p}²) for the difference between these two observations (for details, see the illustrative example below) is either 0.26 or 0.71, respectively. Given that the mean difference is the same (i.e., 1) regardless of the design, which of these two effect sizes is the “true” effect size? There are two diverging answers to this question. One viewpoint focuses on the generalizability of the effect size estimate across designs, while the other viewpoint focuses on the statistical significance of the difference between the means. I will briefly discuss these two viewpoints.

As Maxwell and Delaney (2004, p. 548) remark: “A major goal of developing effect size measures is to provide a standard metric that meta-analysts and others can interpret across studies that vary in their dependent variables as well as types of designs.” This first viewpoint, which I will refer to as the *generalizable effect size estimate* viewpoint, assumes that it does not matter whether you use a within-subjects design or a between-subjects design. Although you can exclude individual variation from the statistical test if you use a pre- and post-measure, and the statistical power of a test will often substantially increase, the effect size (e.g., η_{p}²) should not differ depending on the design that was used. Therefore, many researchers regard effect sizes in within-subjects designs as an overestimation of the “true” effect size (e.g., Dunlap et al., 1996; Olejnik and Algina, 2003; Maxwell and Delaney, 2004).

A second perspective, which I will refer to as the *statistical significance* viewpoint, focuses on the statistical test of a predicted effect, and regards individual differences as irrelevant for the hypothesis that is examined. The goal is to provide statistical support for the hypothesis, and being able to differentiate between variance that is due to individual differences and variance that is due to the manipulation increases the power of the study. Researchers advocating the statistical significance viewpoint regard the different effect sizes (e.g., η_{p}²) in a within- compared to between-subjects design as a benefit of a more powerful design. The focus on the outcome of the statistical test in this perspective can be illustrated by the use of confidence intervals. As first discussed by Loftus and Masson (1994), the use of traditional formulas for confidence intervals (developed for between-subjects designs) can result in a marked discrepancy between the statistical summary of the results and the error bars used to visualize the differences between observations. To resolve this inconsistency, Loftus and Masson (1994, p. 481) suggest that: “Given the irrelevance of intersubject variance in a within-subjects design, it can legitimately be ignored for purposes of statistical analysis.”

To summarize, researchers either focus on generalizable effect size estimates, and try to develop effect size measures that are independent of the research design, or researchers focus on statistical significance, and prefer effect sizes (and confidence intervals) to reflect the conclusions drawn by the statistical test. Although these two viewpoints are not mutually exclusive, they do determine some of the practical choices researchers make when reporting their results. Regardless of whether researchers focus on statistical significance or generalizability of measurements, cumulative science will benefit if researchers determine their sample size a priori, and report effect sizes when they share their results. In the following sections, I will discuss how effect sizes that describe the differences between means are calculated, with a special focus on the similarities and differences in within- and between-subjects designs, followed by an illustrative example.
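When only the test statistics are reported, η_{p}² can be recovered from the F-value and its degrees of freedom. A minimal sketch of that standard conversion (the F-value here is invented for illustration):

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Convert an F-test result to partial eta squared:
    eta_p^2 = (F * df_effect) / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# e.g., a hypothetical between-subjects result F(1, 18) = 6.34
print(round(partial_eta_squared(6.34, 1, 18), 2))  # 0.26
```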

## 3. Assessing the value of the likelihood ratio

In practice, the value of the likelihood ratio must be assessed. Nonetheless, it is very important to realize that for two distinct propositions *H*_{p} and *H*_{d} the relationship (3) always holds, thus implying that there is always a likelihood ratio. By distinct we mean that neither of the two propositions is a conglomerate of sub-propositions each with a different likelihood of the forensic findings. Therefore, the use of the likelihood ratio as the measure of evidentiary strength in such cases cannot be questioned on theoretical grounds. In particular, one cannot argue that the LR approach should be inadmissible just because the very value of the likelihood ratio is difficult to assess.

What does it mean to assess the value of the likelihood ratio? Since the likelihood ratio is a ratio between either distinct probabilities or probability density values, it may be expected that assessment of its value is synonymous with computing its true numerical value. However, most people fairly acquainted with statistical methodology would be aware that the true value of an unknown quantity can very seldom be calculated on the basis of the available data. An assessment is almost always some kind of estimation. In addition, the word estimation may be misinterpreted in the sense that we expect a result on the same scale as the true value. This is not so; the scale used for the assessment can be different from the scale associated with the true likelihood ratio. By ‘scale’ we here mean essentially the degree of resolution of the reported result. For instance, a certain quantity can attain its values on a continuous scale by definition, but when reporting the values we may stick to a rougher resolution, such as positioning the value into one of a defined number of intervals, i.e. lowering the resolution. Let us return to Example 1. Observing snow is better explained by a temperature below or equal to zero than a temperature above zero, or, expressed differently, the observed snow supports the proposition that the temperature is at most zero. The analysis may stop there and as a result we have assessed the likelihood ratio to be greater than one, if the proposition forwarded is ‘temperature is at most zero’. Here, the scale used has at most three levels: ‘less than one’, ‘greater than one’ and ‘equal to one’ (the last level may be incorporated into one of the first two). This is a very rough scale and its usefulness may of course be debated, but without more background data on snow and temperatures it is probably the highest scale resolution we could reach.

Let us now consider a forensic case where the question is whether the signature on a will is a forgery. The proposition forwarded is that it is a forgery. A forensic handwriting expert examines samples of spontaneous writing known to originate from the ‘true owner’ of the questioned signature. The expert concludes that there are clear dissimilarities between the questioned signature on the will and the samples with respect to several features, and there are very few similarities. The expert would not expect such clear dissimilarities if it were the true owner that wrote the signature, and they would definitely expect many more similarities. Based on this, it is their opinion that the forwarded proposition (a forgery) is a much better explanation of their findings than is the proposition that the signature was written by the true owner. Taking into account similar cases the expert has worked on during a long career as a handwriting examiner, they estimate that their findings are more than 100 times more probable if the signature were a forgery than if it were genuine. We can now anticipate a likelihood ratio larger than 100, which may be very useful. However, it should be noted that in this case the forensic expert cannot be more detailed than that. In fact, we could hardly expect a handwriting examiner to assess the likelihood ratio with a higher resolution. The question may be asked in court as to whether their findings could be even 200 times more probable if the signature were a forgery than if it were not, and the answer would most likely be negative (or the expert would probably have said so in their statement). Trying to refine the assessment further, estimating for instance that the likelihood ratio is somewhere around 150, is less meaningful. Such an ‘exact’ value would be very uncertain and probably would not have any substantial implication for the case compared with ‘greater than 100’.
A development where the legal system puts pressure on the forensic experts to state ‘exact’ numbers that include false precision, in the belief that this would add value to the case, would be most undesirable.

Now let us consider a different situation where the forensic findings are essentially continuous measurements. For instance, they could be the results of a gas chromatographic analysis of a material suspected to contain heroin. In such a case the question may not be whether the material contains heroin or not (such a statement may very well be given with almost 100% confidence). Instead the issue may be whether the material has the same origin as some other material known to have been produced at a certain illegal laboratory. To evaluate the findings against a pair of propositions (‘same source’ versus ‘different sources’) we can measure the amounts of a number of contaminants or accompanying substances (e.g. caffeine). The likelihood of one proposition would then be the value of a probability density function, and the likelihood ratio is a ratio of two such values. It is now less meaningful to speak in terms of how much more or less probable the findings are under one proposition compared to under the other proposition. Provided there are enough reference data available, we can estimate the two probability density values, and as a result the estimated likelihood ratio will be on the same scale as the true one. When the reference data are too sparse to provide such estimates we must stick to a rougher scale for the assessment of this likelihood ratio and the discriminating power will accordingly be lower, but (and this is important) no less valid.
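A minimal sketch of such a continuous-evidence likelihood ratio, with the two source distributions modeled as normals (every parameter value below is invented for illustration, not taken from a real case):

```python
from scipy.stats import norm

# Hypothetical measured caffeine level in the questioned material
x = 4.2

# Hypothetical reference distributions estimated from background data
same_source = norm(loc=4.0, scale=0.5)        # H_p: same laboratory
different_sources = norm(loc=6.0, scale=1.5)  # H_d: unrelated source

# Likelihood ratio as a ratio of probability density values
lr = same_source.pdf(x) / different_sources.pdf(x)
print(round(lr, 1))  # about 5.7: the findings favor the 'same source' proposition
```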

Common to Examples 2 and 3 is that the statement that will be the outcome of the forensic investigation is built solely on the underlying likelihood ratio, no matter to what resolution this can be estimated. In a case where there is no or very little knowledge about the kind of material that is investigated, the scale on which the likelihood ratio is assessed is by nature rough. Nevertheless, the information may be valuable to the court. For instance, it could be the three levels used in Example 1 about snow and temperatures, i.e. ‘less than one’, ‘equal to one’ and ‘greater than one’. If we have reasons to believe that the findings would be more probable under one proposition than under the other, the first or third level should be used (which one depends on the ordering of the propositions). If we cannot come up with any such reasons we should stick to the second level (‘equal to one’). However, for materials where there is no knowledge at all about the prevalence of the findings, the forensic scientist should rather avoid reporting any value of evidence, and just state what has been observed. Interestingly, however, increased knowledge and experience do not imply that we move away from the second level, i.e. from reporting an assessed value of one. A scale with a higher resolution implies that less evidence will be left without any evaluation and interpretation, but the results may very well also in this case be equally likely given either of the propositions. In cases where we can use comprehensive databases to support the assessment, we can estimate the likelihood ratio on the same scale as its true value and thus also report an ‘exact’ numerical value with a resolution depending on the quality of the database. Nevertheless, for all types of cases between the ones with no background knowledge available and the ones with comprehensive databases, we can build the evidence value on the underlying likelihood ratio.
The lack of comprehensive background data does not make this procedure inadmissible or the results of no use to the court; it just implies a rougher scale of reporting.

## Appraising Study Types

This section provides other questions that may be helpful as you appraise the research. Because study types have different features, you will not use the same validity criteria for all articles. Click below to review questions to appraise various articles.

#### Systematic Review or Meta-Analysis

### Are the results of this article valid?

**1. Did the review explicitly address a sensible question?**

The systematic review should address a specific question that indicates the patient problem, the exposure and one or more outcomes. General reviews, which usually do not address specific questions, may be too broad to provide an answer to the clinical question for which you are seeking information.

**2. Was the search for relevant studies detailed and exhaustive?**

Researchers should conduct a thorough search of appropriate bibliographic databases. The databases and search strategies should be outlined in the methodology section. Researchers should also show evidence of searching for non-published evidence by contacting experts in the field. Cited references at the end of articles should also be checked.

**3. Were the primary studies of high methodological quality?**

Researchers should evaluate the validity of each study included in the systematic review. The same EBP criteria used to critically appraise studies should be used to evaluate studies to be included in the systematic review. Differences in study results may be explained by differences in methodology and study design.

**4. Were selection and assessments of the included studies reproducible?**

More than one researcher should evaluate each study and make decisions about its validity and inclusion. Bias (systematic errors) and mistakes (random errors) can be avoided when judgment is shared. A third reviewer should be available to break a tie vote.

- focused question
- thorough literature search
- include validated studies
- selection of studies reproducible

### What are the results?

**Were the results similar from study to study?**

How similar were the point estimates?

Do confidence intervals overlap between studies?

**What are the overall results of the review?**

Were results weighted both quantitatively and qualitatively in summary estimates?

**How precise were the results?**

What is the confidence interval for the summary or cumulative effect size?

More information on reading forest plots:

Ried K. Interpreting and understanding meta-analysis graphs: a practical guide. Aust Fam Physician. 2006 Aug;35(8):635-8. PubMed PMID: 16894442.

Greenhalgh T. Papers that summarise other papers (systematic reviews and meta-analyses). BMJ. 1997 Sep 13;315(7109):672-5. PubMed PMID: 9310574.

### How can I apply the results to patient care?

**Were all patient-important outcomes considered?**

Did the review omit outcomes that could change decisions?

**Are any postulated subgroup effects credible?**

Were subgroup differences postulated before data analysis?

Were subgroup differences consistent across studies?

**What is the overall quality of the evidence?**

Were prevailing study design, size, and conduct reflected in a summary of the quality of evidence?

**Are the benefits worth the costs and potential risks?**

Does the cumulative effect size cross a test or therapeutic threshold?

Based on: Guyatt G, Rennie D, Meade MO, Cook DJ. *Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice*, 2nd Edition, 2008.

#### Harm Study

### Evaluating the Validity of a Harm Study

### Are the results of this article valid?

**FOR COHORT STUDIES: Aside from the exposure of interest, did the exposed and control groups start and finish with the same risk for the outcome ?**

**1. Were patients similar for prognostic factors that are known to be associated with the outcome (or did statistical adjustment level the playing field)?**

The two groups, those exposed to the harm and those not exposed, must begin with the same prognosis. The characteristics of the exposed and non-exposed patients need to be carefully documented and their similarity (except for the exposure) needs to be demonstrated. The choice of comparison groups has a significant influence on the credibility of the study results. The researchers should identify an appropriate control population before making a strong inference about a harmful agent. The two groups should have the same baseline characteristics. If there are differences investigators should use statistical techniques to adjust or correct for differences.

**2. Were the circumstances and methods for detecting the outcome similar?**

In cohort studies determination of the outcome is critical. It is important to define the outcome and use objective measures to avoid possible bias. Detection bias may be an issue for these studies, as unblinded researchers may look deeper to detect disease or an outcome.

**3. Was follow-up sufficiently complete?**

Patients unavailable for complete follow-up may compromise the validity of the research because often these patients have very different outcomes than those that stayed with the study. This information must be factored into the study results.

**FOR CASE CONTROL STUDIES: Did the cases and control group have the same risk (chance) of being exposed in the past?**

**1. Were cases and controls similar with respect to the indication or circumstances that would lead to exposure?**

The characteristics of the cases and controls need to be carefully documented and their similarity needs to be demonstrated. The choice of comparison groups has a significant influence on the credibility of the study results. The researchers should identify an appropriate control population that would be eligible or likely to have the same exposure as the cases.

**2. Were the circumstances and methods for determining exposure similar for cases and controls?**

In a case control study determination of the exposure is critical. The exposure in the two groups should be identified by the same method. The identification should avoid any kind of bias, such as recall bias. Sometimes using objective data, such as medical records, or blinding the interviewer can help eliminate bias.

- similarity of comparison groups
- outcomes and exposures measured same for both groups
- follow-up sufficiently complete (80% or better)

### What are the results?

**How strong is the association between exposure and outcome?**

* What is the risk ratio or odds ratio?

* Is there a dose-response relationship between exposure and outcome?

**How precise was the estimate of the risk?**

* What is the confidence interval for the relative risk or odds ratio?

**Strength of inference:**

**For RCT or Prospective cohort studies: Relative Risk**

| | Outcome present | Outcome not present |
| --- | --- | --- |
| Exposed | a | b |
| Not exposed | c | d |

**Relative Risk (RR) = [a / (a + b)] / [c / (c + d)]**

is the risk of the outcome in the exposed group divided by the risk of the outcome in the unexposed group:

RR = (exposed outcome yes / all exposed) / (not exposed outcome yes / all not exposed)

Example: “RR of 3.0 means that the outcome occurs 3 times more often in those exposed versus unexposed.”

**For case-control or retrospective studies: Odds Ratio**

| | Outcome present | Outcome not present |
| --- | --- | --- |
| Exposed | a | b |
| Not exposed | c | d |

**Odds Ratio (OR) = (a / c) / (b / d)**

is the odds of previous exposure in a case divided by the odds of exposure in a control patient:

OR = (exposed - outcome yes / not exposed - outcome yes) / (exposed - outcome no / not exposed - outcome no)

Example: “OR of 3.0 means that cases were 3 times more likely to have been exposed than were control patients.”
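The two formulas, plus the standard log-method confidence interval for the odds ratio, can be sketched as follows (the cell counts are made up for illustration):

```python
import math

def relative_risk(a, b, c, d):
    """RR = [a / (a + b)] / [c / (c + d)] for the standard 2x2 table."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = (a / c) / (b / d), equivalently (a * d) / (b * c)."""
    return (a / c) / (b / d)

def odds_ratio_ci(a, b, c, d, z=1.96):
    """95% CI for the OR via the standard error of the log odds ratio."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical cohort: 30/100 exposed and 10/100 unexposed develop the outcome
a, b, c, d = 30, 70, 10, 90
print(round(relative_risk(a, b, c, d), 1))  # 3.0
print(round(odds_ratio(a, b, c, d), 2))     # 3.86
```

Note that the OR (3.86) exceeds the RR (3.0); the two only approximate each other when the outcome is rare.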

**Confidence Intervals** are a measure of the precision of a study's results. For example, in “36 [95% CI 27-51]”, the 95% CI means that if you were to repeat the same clinical trial a hundred times, about 95 of the intervals so constructed would contain the true value. Wider intervals indicate lower precision; narrow intervals show greater precision.

**Confounding Variable** is one whose influence distorts the true relationship between a potential risk factor and the clinical outcome of interest.

Read more on odds ratios: Altman DG, Bland JM. The odds ratio. BMJ 2000;320:1468 (27 May).

Watch more on odds ratios: Understanding odds ratio with Gordon Guyatt. (21 minutes.)

### How can I apply the results to patient care?

**Were the study subjects similar to your patients or population?**

Is your patient so different from those included in the study that the results may not apply?

**Was the follow-up sufficiently long?**

Were study participants followed-up long enough for important harmful effects to be detected?

**Is the exposure similar to what might occur in your patient?**

Are there important differences in exposures (dose, duration, etc) for your patients?

**What is the magnitude of the risk?**

What level of baseline risk for the harm is amplified by the exposure studied?

**Are there any benefits known to be associated with the exposure?**

What is the balance between benefits and harms for patients like yours?

**Source:** Guyatt G, Rennie D, Meade MO, Cook DJ. *Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice*, 2nd Edition, 2008.

#### Diagnostic Test Study

### Evaluating the Validity of a Diagnostic Test Study

### Are the results valid?

**1.** **Did participating patients present a diagnostic dilemma?**

The group of patients in which the test was conducted should include patients with a high, medium, and low probability of having the target disease. The clinical usefulness of a test lies in its ability to distinguish between obvious illness and cases where the diagnosis is not so obvious or might otherwise be confused. The patients in the study should resemble those expected in clinical practice.

**2.** **Did investigators compare the test to an appropriate, independent reference standard?**

The reference (or gold) standard refers to the commonly accepted proof that the target disorder is present or not present. The reference standard might be an autopsy or biopsy. The reference standard provides objective criteria (e.g., laboratory test not requiring subjective interpretation) or a current clinical standard (e.g., a venogram for deep venous thrombosis) for diagnosis. Sometimes there may not be a widely accepted reference standard. The author will then need to clearly justify their selection of the reference test.

**3. Were those interpreting the test and reference standard blind to the other results?**

To avoid potential bias, those conducting the test should not know or be aware of the results of the other test.

**4. Did the investigators perform the same reference standard to all patients regardless of the results of the test under investigation?**

Researchers should conduct *both* tests (the study test and the reference standard) on all patients in the study, regardless of the results of the test in question. Researchers should not be tempted to forego either test based on the results of only one of the tests. Nor should they apply a different reference standard to patients with a negative result on the study test.

**Key issues for Diagnostic Studies:**

- diagnostic uncertainty
- blind comparison to gold standard
- each patient gets both tests

### What are the results?

|               | Reference standard: disease positive | Reference standard: disease negative |
|---------------|--------------------------------------|--------------------------------------|
| Test positive | true positive (TP)                   | false positive (FP)                  |
| Test negative | false negative (FN)                  | true negative (TN)                   |

**Sensitivity = true positives / all disease positives = TP / (TP + FN)**

measures the proportion of patients with the disease who also test positive for the disease in this study. It is the probability that a person with the disease will have a positive test result.

**Specificity = true negatives / all disease negatives = TN / (TN + FP)**

measures the proportion of patients without the disease who also test negative for the disease in this study. It is the probability that a person without the disease will have a negative test result.
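These two proportions follow directly from the table; a minimal sketch with hypothetical counts:

```python
# Sensitivity and specificity from a diagnostic 2x2 table (hypothetical counts):
#                    disease positive   disease negative
#   test positive        TP = 90            FP = 15
#   test negative        FN = 10            TN = 85
TP, FP, FN, TN = 90, 15, 10, 85

sensitivity = TP / (TP + FN)  # P(test positive | disease present)
specificity = TN / (TN + FP)  # P(test negative | disease absent)

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```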

Sensitivity and specificity are characteristics of the test but do not provide enough information for the clinician to act on the test results. Likelihood ratios can be used to help adapt the results of a study to specific patients. They help determine the probability of disease in a patient.

**Likelihood ratios (LR):**

**LR+ = probability of a positive test in patients with disease / probability of a positive test in patients without disease**

**LR− = probability of a negative test in patients with disease / probability of a negative test in patients without disease**

Likelihood ratios indicate the likelihood that a given test result would be expected in a patient with the target disorder compared to the likelihood that the same result would be expected in a patient without that disorder.

Likelihood ratio of a positive test result (LR+) increases the odds of having the disease after a positive test result.

Likelihood ratio of a negative test result (LR-) decreases the odds of having the disease after a negative test result.
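Likelihood ratios can be derived from sensitivity and specificity and applied to a patient through the odds form of Bayes' theorem (posttest odds = pretest odds × LR). A sketch with hypothetical values:

```python
# LR+ and LR- from sensitivity and specificity (hypothetical values):
sensitivity, specificity = 0.90, 0.85

lr_pos = sensitivity / (1 - specificity)  # LR+ = 0.90 / 0.15 = 6.0
lr_neg = (1 - sensitivity) / specificity  # LR- = 0.10 / 0.85, about 0.12

def posttest_prob(pretest_prob: float, lr: float) -> float:
    """Convert a pretest probability to a posttest probability via odds."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * lr
    return posttest_odds / (1 + posttest_odds)

# A patient with a 30% pretest probability who tests positive:
print(round(posttest_prob(0.30, lr_pos), 2))  # 0.72
```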

**How much do LRs change disease likelihood?**

| LR                               | Change in likelihood of disease |
|----------------------------------|---------------------------------|
| Greater than 10 or less than 0.1 | large changes                   |
| 5–10 or 0.1–0.2                  | moderate changes                |
| 2–5 or 0.2–0.5                   | small changes                   |
| Less than 2 or greater than 0.5  | tiny changes                    |
| 1.0                              | no change at all                |

More about likelihood ratios: Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ 2004;329:168-169.

### How can I apply the results to patient care?

**Will the reproducibility of the test result and its interpretation be satisfactory in your clinical setting?**

Does the test yield the same result when reapplied to stable participants?

Do different observers agree about the test results?

**Are the study results applicable to the patients in your practice?**

Does the test perform differently (different LRs) for different severities of disease?

Does the test perform differently for populations with different mixes of competing conditions?

**Will the test results change your management strategy?**

What are the test and treatment thresholds for the health condition to be detected?

Are the test LRs high or low enough to shift posttest probability across a test or treatment threshold?

**Will patients be better off as a result of the test?**

Will patient care differ for different test results?

Will the anticipated changes in care do more good than harm?

Based on: Guyatt G, Rennie D, Meade MO, Cook DJ. *Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice*, 2nd Edition, 2008.

#### Prognosis Study

### Are the results Valid?

**1. Was the sample of patients representative?**

The patient groups should be clearly defined and representative of the spectrum of disease found in most practices. Failure to clearly define the patients who entered the study increases the risk that the sample is unrepresentative. To help you decide about the appropriateness of the sample, look for a clear description of which patients were included in and excluded from the study. The way the sample was selected should be clearly specified, along with the objective criteria used to diagnose the patients with the disorder.

**2. Were the patients sufficiently homogeneous with respect to prognostic factors?**

Prognostic factors are characteristics of a particular patient that can be used to more accurately predict the course of a disease. These factors, which can be demographic (age, gender, race, etc.) or disease specific (e.g., stage of a tumor or disease) or comorbid (other conditions existing in the patient at the same time), can also help predict good or bad outcomes.

In comparing the prognosis of the two study groups, researchers should consider whether or not the patients' clinical characteristics are similar. It may be that adjustments have to be made based on prognostic factors to get a true picture of the clinical outcome. This may require clinical experience or knowledge of the underlying biology to determine whether all relevant factors were considered.

**3. Was the follow-up sufficiently complete?**

Follow-up should be complete and all patients accounted for at the end of the study. Patients who are lost to follow-up may often suffer the adverse outcome of interest and therefore, if not accounted for, may bias the results of the study. Determining if the number of patients lost to follow up affects the validity depends on the proportion of patients lost and the proportion of patients suffering the adverse outcome.

Patients should be followed until they fully recover or one of the disease outcomes occur. The follow-up should be long enough to develop a valid picture of the extent of the outcome of interest. Follow-up should include at least 80% of participants until the occurrence of a major study end point or to the end of the study.

**4. Were objective and unbiased outcome criteria used?**

Some outcomes are clearly defined, such as death or full recovery. Between these extremes lies a wide range of outcomes that may be less clearly defined. Investigators should establish specific criteria that define each possible outcome of the disease and use these same criteria during patient follow-up. Investigators making judgments about the clinical outcomes may have to be "blinded" to the patient characteristics and prognostic factors in order to eliminate possible bias in their observations.

- well-defined sample
- similar prognosis
- follow-up complete
- objective and unbiased outcome criteria

### What are the results?

**How likely are the outcomes over time?**

- What are the event rates at different points in time?
- If event rates vary with time, are the results shown using a survival curve?

**How precise are the estimates of likelihood?**

- What is the confidence interval for the principal event rate?
- How do confidence intervals change over time?

**Prognostic Results** are the numbers of events that occur over time, expressed in:

- **absolute** terms: e.g., 5-year survival rate
- **relative** terms: e.g., risk from a prognostic factor
- **survival curves**: cumulative events over time

#### Therapy Study

### Are the results of the study valid?

**1. Were patients randomized?** The assignment of patients to either group (treatment or control) must be done by random allocation. This might include a coin toss (heads to treatment/tails to control) or the use of randomization tables, often computer generated. Research has shown that random allocation comes closest to ensuring the creation of groups of patients who will be similar in their risk of the events you hope to prevent. Randomization balances the groups for known prognostic factors (such as age, weight, gender, etc.) and unknown prognostic factors (such as compliance, genetics, socioeconomics, etc.). This reduces the chance of over-representation of any one characteristic within the study groups.

**2. Was group allocation concealed?** The randomization sequence should be concealed from the clinicians and researchers of the study to further eliminate conscious or unconscious selection bias. Concealment (part of the enrollment process) ensures that the researchers cannot predict or change the assignments of patients to treatment groups. If allocation is not concealed it may be possible to influence the outcome (consciously or unconsciously) by changing the enrollment order or the order of treatment which has been randomly assigned. Concealed allocation can be done by using a remote call center for enrolling patients or the use of opaque envelopes with assignments. This is different from blinding which happens AFTER randomization.

**3. Were patients in the study groups similar with respect to known prognostic variables?** The treatment and the control group should be similar for all prognostic characteristics except whether or not they received the experimental treatment. This information is usually displayed in Table 1, which outlines the baseline characteristics of both groups. This is a good way to verify that randomization resulted in similar groups.

**4. To what extent was the study blinded?** Blinding means that the people involved in the study do not know which treatments were given to which patients. Patients, researchers, data collectors and others involved in the study should not know which treatment is being administered. This helps eliminate assessment bias and preconceived notions as to how the treatments should be working. When it is difficult or even unethical to blind patients to a treatment, such as a surgical procedure, then a "blinded" clinician or researcher is needed to interpret the results.

**5. Was follow-up complete?** The study should begin and end with the same number of patients in each group. Patients lost to the study must be accounted for; otherwise the conclusions risk being invalid. Patients may drop out because of the adverse effects of the therapy being tested. If not accounted for, this can lead to conclusions that are overly confident in the efficacy of the therapy. Good studies will have better than 80% follow-up for their patients. When there is a large loss to follow-up, the lost patients should be assigned to the "worst-case" outcomes and the results recalculated. If these results still support the original conclusion of the study, then the loss may be acceptable.

**6. Were patients analyzed in the groups to which they were first allocated?** Anything that happens after randomization can affect the chances that a patient in a study has an event. Patients who forget or refuse their treatment should not be eliminated from the study results or allowed to "change groups". Excluding noncompliant patients from a study group may leave only those who are more likely to have a positive outcome, thus compromising the unbiased comparison provided by randomization. Therefore all patients must be analyzed within their assigned group. Randomization must be preserved. This is called "intention to treat" analysis.

**7. Aside from the experimental intervention, were the groups treated equally?** Both groups must be treated the same except for administration of the experimental treatment. If "cointerventions" (interventions other than the study treatment which are applied differently to both groups) exist they must be described in the methods section of the study.

### How can I apply the results to patient care?

**Were the study patients similar to my population of interest?**

Does your population match the study inclusion criteria?

If not, are there compelling reasons why the results should not apply to your population?

**Were all clinically important outcomes considered?**

What were the primary and secondary endpoints studied?

Were surrogate endpoints used?

**Are the likely treatment benefits worth the potential harm and costs?**

What is the number needed to treat (NNT) to prevent one adverse outcome or produce one positive outcome?

Is the reduction in clinical endpoints worth the potential harms and costs of the intervention?
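The NNT asked about above is the reciprocal of the absolute risk reduction; a short sketch with hypothetical event rates:

```python
# Number needed to treat (NNT) from event rates (hypothetical rates):
control_event_rate = 0.20    # CER: event rate in the control group
treatment_event_rate = 0.15  # EER: event rate in the treated group

arr = control_event_rate - treatment_event_rate  # absolute risk reduction
nnt = 1 / arr  # treat this many patients to prevent one additional event

print(round(nnt))  # 20
```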

To determine which variable levels have the most impact, compare the observed and expected counts or examine the contributions to chi-square.

By looking at the differences between the observed cell counts and the expected cell counts, you can see which variables have the largest differences, which may indicate dependence. You can also compare the contributions to the chi-square statistic to see which variables have the largest values that may indicate dependence.

###### Key Results: Count, Expected count, Contribution to Chi-square

In this table, the cell count is the first number in each cell, the expected count is the second number in each cell, and the contribution to the chi-square statistic is the third number in each cell. In these results, the expected count and the observed count are the largest for the 1st shift with Machine 2, and the contribution to the chi-square statistic is also the largest. Investigate your process during the 1st shift with Machine 2 to see if there is a special cause that can explain this difference.
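The expected counts and chi-square contributions described above can be computed from any observed table; a sketch with made-up shift-by-machine counts (not the table from the text):

```python
# Observed counts (rows: shifts, columns: machines) -- made-up numbers:
observed = [[48, 52],   # 1st shift: Machine 1, Machine 2
            [40, 40]]   # 2nd shift: Machine 1, Machine 2

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # expected count under independence: row total * column total / N
        expected = row_totals[i] * col_totals[j] / grand_total
        # this cell's contribution to the chi-square statistic
        contribution = (obs - expected) ** 2 / expected
        print(f"cell[{i}][{j}]: observed={obs}, "
              f"expected={expected:.2f}, contribution={contribution:.4f}")
```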

## The odds ratio: calculation, usage and interpretation

The odds ratio (OR) is one of several statistics that have become increasingly important in clinical research and decision-making. It is particularly useful because, as an effect-size statistic, it gives clear and direct information to clinicians about which treatment approach has the best odds of benefiting the patient. Significance statistics used for the OR include Fisher's Exact Probability statistic, the Maximum-Likelihood Ratio Chi-Square and Pearson's Chi-Square. Typically the data consist of counts for each of a set of conditions and outcomes and are set in table format. The most common construction is a 2 × 2 table, although larger tables are possible. As a simple statistic to calculate, [OR = (a × d)/(b × c)], it can be hand calculated in a clinic if necessary to determine the odds of a particular event for a patient at risk for that event. In addition to assisting health care providers to make treatment decisions, the information provided by the odds ratio is simple enough that patients can also understand the results and can participate in treatment decisions based on their odds of treatment success.

## 1. Does the p value predict the probability of a hypothesis given the evidence?

The *p* value refers to the probability of data at least as extreme as the observed data given the statistical (often the null) hypothesis, p(D|H), and assuming that underlying assumptions are met (Greenland et al., 2016; Wasserstein & Lazar, 2016). In ST, the test statistic (e.g., *z*, *t*, or *F*) represents the data, as it is computed from the central tendency of the observed data and the standard error. We use the terms *p* value and p(D|H) interchangeably. As a probability that refers to the size of an area under a density curve, the *p* value is conceptually distinct from the likelihood of the data, which refers to the value of the density function at a particular point. In our simulation experiments, we find that the log-transforms of *p* values are nearly perfectly correlated with their associated likelihoods. Consider a continuous distribution under the null hypothesis of μ = 0. As sample observations increase in magnitude (for example, from a range of .01 to 2.0 standard units) when moving from the peak of this distribution toward the positive (right) tail, *p* values and likelihoods both decrease monotonically. In this article, we only report the findings obtained with likelihoods.

A key concern about the *p* value is that it does not speak to the strength of the evidence against the tested hypothesis, that is, that it does not predict the posterior probability of the tested hypothesis (Cohen, 1994; Gelman, 2013; Lykken, 1968). The ASA warns that “*p*-values do not measure the probability that the studied hypothesis is true” (Wasserstein & Lazar, 2016, p. 131), although “researchers often wish to turn a *p*-value into a statement about the truth of a null hypothesis” (p. 131). In other words, finding that the data are unlikely under the hypothesis is not the same as finding that the hypothesis is unlikely under the data. The question of whether there is any relationship, and how strong it might be, is the crux of inductive inference. All inductive inference is essentially “reverse inference,” and reverse inference demands vigilance (Krueger, 2017).

We sought to quantify how much p(D|H) reveals about p(H|D). Bayes’ Theorem, which expresses the mathematical relationship between the two inverse conditional probabilities, provides the first clues. The theorem

p(H|D) = p(D|H) × p(H) / p(D)

shows that as p(D|H) decreases, *ceteris paribus*, so does p(H|D). If the tested hypothesis, H, is a null hypothesis, a low *p* value suggests a comparatively high probability that the alternative hypothesis, ∼H, is true. Yet, the association between p(D|H) and p(H|D) is perfect only if the prior probability of the hypothesis, p(H), is the same as the cumulative probability of the data, p(D), that is, the denominator of the ratio in the above formula. This identity may be rare in research practice, so how strongly is p(D|H) related to p(H|D) in practice?
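A worked numeric instance of Bayes’ Theorem (all probabilities hypothetical) makes the dependence on p(H) and p(D) concrete:

```python
# Bayes' theorem: p(H|D) = p(D|H) * p(H) / p(D), where
# p(D) = p(D|H) * p(H) + p(D|~H) * p(~H). All values are hypothetical.
p_H = 0.5              # prior probability of the (null) hypothesis
p_D_given_H = 0.04     # probability of the data under H (p-value-like)
p_D_given_notH = 0.40  # probability of the data under the alternative

p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)
p_H_given_D = p_D_given_H * p_H / p_D

print(round(p_H_given_D, 3))  # 0.02 / 0.22, i.e. about 0.091
```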

We studied the results for a variety of settings in simulation experiments (Krueger & Heck, 2017). We began by sampling the elements of Bayes’ Theorem, p(H), p(D|H), and p(D|∼H), from uniform distributions that were independent of one another. These simple settings produced a correlation of *r* = .38 between p(D|H) and p(H|D) (see also Krueger, 2001; Trafimow & Rice, 2009). The size of this correlation may raise questions about the inductive power of the *p* value. Note, however, that this correlation emerges for a set of minimal, and as we shall see unrealistic, assumptions and thus represents a lower bound of possible results. Consider the relationship between p(D|H) and p(D|∼H) over studies. Inasmuch as the null hypothesis H and the alternative hypothesis ∼H are distinctive, one may expect a negative correlation between p(D|H) and p(D|∼H) over studies. The limiting case is given by a daring ∼H predicting a large effect, δ, and a set of experiments yielding estimated effects *d* that are greater than 0 but smaller than δ (García-Pérez, 2016). Here, the correlation between p(D|H) and p(D|∼H) is perfectly negative.
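A minimal re-implementation sketch of that uniform-sampling setup (not the authors’ code; seed and sample size are arbitrary) reproduces a moderate positive correlation:

```python
import random

# Sample p(H), p(D|H), p(D|~H) independently from U(0,1), compute p(H|D)
# via Bayes' theorem, and correlate p(D|H) with p(H|D).
random.seed(1)
n = 100_000
x, y = [], []  # x: p(D|H), y: p(H|D)
for _ in range(n):
    pH, pDH, pDnotH = (random.random() for _ in range(3))
    pD = pDH * pH + pDnotH * (1 - pH)
    x.append(pDH)
    y.append(pDH * pH / pD)

# Pearson correlation, computed by hand to stay dependency-free
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
r = cov / (sx * sy)
print(f"r = {r:.2f}")  # in the vicinity of the reported r = .38
```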

We sampled values for p(H), p(D|H), and p(D|∼H) and varied the size of the negative correlation between p(D|H) and p(D|∼H), with the result of interest being the correlation between p(D|H) and p(H|D), that is, the correlation indicating the predictive power of *p* for the posterior probability of the null hypothesis. We found that as the correlation between p(D|H) and p(D|∼H) becomes more negative, the correlation between p(D|H) and p(H|D) becomes more positive. For example, when setting the correlation between p(D|H) and p(D|∼H) to *r* = –.9, the outcome correlation between p(D|H) and p(H|D) is *r* = .49, which is moderately greater than the baseline correlation of .38 obtained under the assumption of independence. Nevertheless, when a research program provides bold hypotheses, that is, hypotheses that overestimate empirical effect sizes, the *p* value becomes an incrementally stronger predictor of the posterior probability of H (and thereby of ∼H).

Turning to the effect of researchers’ prior knowledge on the inductive power of *p*, we varied the correlation between p(D|H) and the prior probability of a hypothesis p(H). Here, positive correlations reflect the researchers’ sense of the riskiness of the tested hypothesis. At one end of the spectrum, consider an experiment in parapsychology, where the prior probability of the null hypothesis (e.g., “Psychokinesis cannot occur”) is high – at least among skeptics. A low *p* value is improbable, that is, the (meta-)probability of a low *p* value is low. Thus, both p(∼H) and p(*p* < .05) are low. 1 At the other end of the spectrum, consider a social categorization experiment, for example, on ingroup-favoritism. Ingroup-favoritism is a robust empirical finding (Brewer, 2007), and thus the prior probability of the null hypothesis of no favoritism is low. Now, both p(∼H) and p(*p* < .05) are high. When multiple scenarios across this spectrum are considered, the positive correlation between p(H) and p(D|H) is evident.

When raising the correlation between p(H) and p(D|H) to .5 and to .9, we respectively observe correlations of .628 and .891 between p(D|H) and p(H|D). This result suggests that as a research program matures, the *p* value becomes more closely related to both the prior probability of the tested hypothesis and its updated posterior probability. Interestingly, ST yields diminishing returns within a line of study, as reflected in shrinking differences between p(H) and p(H|D). To review, the distribution of the prior probability of the likelihood of a hypothesis tends to be flat and uncoupled from the obtained *p* value in the early stages of a research program. At this stage, *p* values predict p(H|D) rather poorly. As theory and experience mature, however, the probabilities assigned to hypotheses begin to fall into a bimodal distribution; the researcher’s experience allows more informed guesses as to which hypotheses are true and which are false. When a null hypothesis is tested that has already been rejected several times, its probability prior to the next study is low and so is the expected *p* value.

Consider research on the self-enhancement bias as another example of the use of ST in a mature research domain. After years of confirmatory findings, the researcher can predict that most respondents will regard themselves as above average when rating themselves and the average person on dimensions of personal importance (Krueger, Heck, & Asendorpf, 2017). The prior probability of the null hypothesis of no self-enhancement is low and the meta-probability of a low *p* value is high. When *p* values are closely linked to the priors, their surprise value is low; they do not afford much belief updating. In light of this consideration, a desire for a strong correlation between p(D|H) and p(H|D) must be balanced against the desire to maximize learning from the data, that is, the difference between p(H) and p(H|D). A certain hypothesis requires no additional data to increase this certainty. ST is most valuable when the researcher’s theory and experience call for tests of novel and *somewhat* risky hypotheses. If the hypothesis is neither novel nor risky, little can be learned; if, in contrast, the hypothesis is too risky, the effort of testing it is likely wasted.