1.4 Scientific Research Methods: Data Analysis, Theorizing, and Replication
Learning Objectives
Define the null hypothesis and explain why it is the first assumption in statistical analysis.
Describe the factors that are used to determine whether two groups are statistically different.
Give examples of unethical practices in statistics.
Describe different types of reliability and validity in research and explain why they are important.
Assess causes of the replication crisis in psychology.
Now that we’ve formed a hypothesis, selected our participants, designed our experiment, and collected some data, we use statistics to evaluate the results. In this section, we discuss data evaluation and analysis, looking at ways of determining central values, such as the mean or median, and the level of variation within the data set. We describe some basic statistical tests to determine whether groups in the study differ or whether there is a correlation between variables. We also discuss some common errors in statistical analysis as well as unethical ways of evaluating or presenting data. Then, we discuss reliability and validity, which are used to evaluate the quality of the research design. Finally, we discuss the last two steps of the scientific method, theorizing and replication, including problems with replication in psychology and other sciences.
Data and Statistics
Recall that in an experiment, dependent variables are the measurements that could be influenced by, or depend on, the levels of the independent variable. As one might imagine, there are many ways of taking measurements in psychology. We could have people solve problems and measure the percentage of questions answered correctly. We could measure heart rate, brainwave activity, or the level of a stress hormone in a person’s saliva. Researchers also use well-tested surveys in which participants answer questions. Survey questions often use Likert (pronounced LICK-ert) scales, which measure the degree of a response. For example, how would you answer the following question using a Likert scale ranging from strongly agree to strongly disagree?
I would like to get, on average, more sleep.
Strongly Agree | Agree | Neutral | Disagree | Strongly Disagree
---|---|---|---|---
1 | 2 | 3 | 4 | 5
Statistics is the science of collecting, analyzing, and interpreting data, and it plays a significant role in psychological research. Psychological research often tries to convert dependent variables (behavior, attitudes, physiological responses, answers on a Likert scale, etc.) into numbers; in other words, to quantify the measurements. Quantifying measurements not only helps reduce the problem of subjectivity (personal interpretation) but also allows the data to be analyzed to determine whether the hypothesis can be supported.
If a lot of data are collected, one way to describe the data is to look at values that represent middle points, or what statisticians call central tendencies of the data set. The mean, also known as the average, is a measure of central tendency calculated by summing all the responses and dividing by the number of measurements or participants. Suppose we asked 10 people our Likert scale question about sleep, and we get the results shown in Table 1.1. We can see that the mean score is 2.10, which gives some indication that our group, on average, wants more sleep. The median, the middle number in an ordered list of values, with an equal number of values above and below it, is often used instead of the mean when the data set contains extreme values (outliers). For example, suppose we asked 10 people how many minutes it usually takes them to fall asleep. We find nine out of 10 take between five and 20 minutes, which is typical, but one takes two hours (120 minutes). The mean time would be 24 minutes, which would not be a good representation of the time it takes most people to fall asleep. The median, in contrast, is 13 minutes (refer to Table 1.1).
Table 1.1 Mean versus Median Data Set
  | More Sleep Likert Scale | Minutes to Fall Asleep
---|---|---
Mean | 2.10 | 24 |
Median | 2.00 | 13 |
Subject 1 | 2.00 | 11 |
Subject 2 | 3.00 | 13 |
Subject 3 | 1.00 | 26 |
Subject 4 | 2.00 | 13 |
Subject 5 | 3.00 | 5 |
Subject 6 | 2.00 | 10 |
Subject 7 | 3.00 | 22 |
Subject 8 | 1.00 | 4 |
Subject 9 | 2.00 | 16 |
Subject 10 | 2.00 | 120 |
Responses of ten subjects to a Likert scale question about getting more sleep and the time it takes them to fall asleep. One can see that the mean and median are very similar on the Likert scale but very different for the time to fall asleep.
Source: Martin Shapiro
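As a quick illustration, the following Python sketch reproduces the central-tendency values in Table 1.1. The data lists are copied from the table; the code itself is not part of the original study materials.

```python
# A quick check of the values in Table 1.1 using Python's statistics module.
from statistics import mean, median

likert = [2, 3, 1, 2, 3, 2, 3, 1, 2, 2]            # "more sleep" Likert responses (Subjects 1-10)
minutes = [11, 13, 26, 13, 5, 10, 22, 4, 16, 120]  # minutes to fall asleep (Subjects 1-10)

print(mean(likert), median(likert))    # 2.1 2.0  -> mean and median nearly agree
print(mean(minutes), median(minutes))  # 24 13.0  -> the 120-minute outlier pulls the mean upward
```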
Another critical aspect of a data set is its variability. The standard deviation (SD) is a measure of that variability: the greater the SD, the more spread out the data. This can be visualized with a bell curve, or normal distribution curve (refer to Figure 1.16). In a bell curve, the mean and median are at the peak of the curve, and there are fewer and fewer values in the data set as we move to the right or left. The wider the bell curve, the more variability in the data set.
Figure 1.16 Bell Curve
A bell curve, or normal distribution, of a data set showing increments of standard deviations (σ) on the x-axis and the frequency of the score on the y-axis. One can see that 68.2 percent of all the data collected falls within one standard deviation of the mean (34.1 percent above and 34.1 percent below).

Source: M. W. Toews, https://en.wikipedia.org/wiki/Normal_distribution#/media/File:Standard_deviation_diagram.svg. Available under CC BY 2.5, https://creativecommons.org/licenses/by/2.5/deed.en.
Long Description
The y-axis is labeled “Frequency of score” from 0.0 to 0.4, and the x-axis is labeled “Standard deviations” from −3σ to 3σ. The bell curve begins at −3σ, peaks at 0, and finishes at 3σ. Slices within the graph measure 0.1%, 2.1%, 13.6%, 34.1%, 34.1%, 13.6%, 2.1%, and 0.1%.
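To connect the SD to the 68 percent figure, here is a minimal Python sketch using simulated (hypothetical) scores rather than any real data set:

```python
# A minimal sketch with simulated scores: the standard deviation measures spread,
# and roughly 68 percent of normally distributed scores fall within one SD of the mean.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=50, scale=10, size=10_000)  # simulated scores: mean 50, SD 10

sd = scores.std()
share_within_one_sd = np.mean(np.abs(scores - scores.mean()) <= sd)
print(round(sd, 1), round(share_within_one_sd, 3))  # approximately 10 and 0.68
```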
Statistical Analysis
Statistics can compare the means and variances of two or more normally distributed data sets to determine whether the groups are statistically different. Let’s use the example of the sleep and memory test we talked about earlier. Suppose we have 100 participants. Half of the participants are given a set of facts to remember at night and recall in the morning (Group M), and the other half learn the facts in the morning and are tested at night (Group N). Our dependent variable is the percentage of facts that they remember (0 to 100 percent). Our hypothesis is that the two groups will differ in the percentage they remember. However, in our data analysis, we start with the assumption of the null hypothesis: the assumption that there are no differences between groups or variables; in other words, that the data from the two groups come from the same bell curve. If statistics give us a reason to reject the null hypothesis, we support our hypothesis that the groups are different. Statistics can help us determine whether we can reject the null hypothesis with a certain level of confidence. Let’s consider two possible outcomes.
Figure 1.17 shows a possible result with Group M, which averaged 53 percent remembered correctly, and Group N, which averaged 50 percent remembered correctly. One can see that there is not a great deal of variance within either group. Figure 1.18 shows another possible result, with a similarly small difference between the group means but much greater variance. Statistical tests consider not only mean differences but also the level of variance within each group. Therefore, a statistical test might find that the two groups in Figure 1.17 are statistically different because the variance is low, but that the two groups in Figure 1.18 are not statistically different because the variance is so high. If that is true, the statistics allow us to reject the null hypothesis for the results shown in Figure 1.17 but not for the results shown in Figure 1.18. There are also statistical tests to determine whether two measurements show positive or negative correlations, and these likewise take into account the variance of the data.
Figure 1.17 Low Variance Results
The top figure shows the fictional results of a memory test, with Group M tested in the morning and Group N tested at night. The error bars represent one standard deviation. The bottom figure is an estimation of what bell curves would look like for the data set from the groups. Although the difference in means is small (50 versus 53 percent), the two groups were calculated to be significantly different from one another.

Source: Martin Shapiro
Long Description
The bar graph y-axis is labeled “Percentage remembered” going from 0 to 100. A blue bar labeled “Group N” is on the left, extending up to 50. A purple bar labeled “Group M” is on the right, extending to 53. The line graph y-axis is labeled “Frequency” and the x-axis is labeled “Percentage remembered” extending from 0 to 100. The blue line begins at 44 on the x-axis and goes sharply upward then down, ending at 56. The purple line begins at 46, goes up to the same peak as the blue line, then down and ends at 59.
Figure 1.18 High Variance Results
The top figure shows the fictional results of a memory test, with Group M tested in the morning and Group N tested at night. The error bars represent one standard deviation. The bottom figure is an estimation of what bell curves would look like for the data set from the groups. Although the difference between the group means is similar to that in Figure 1.17, the large variance keeps the groups from being calculated as significantly different from one another.

Source: Martin Shapiro
Long Description
The bar graph y-axis is labeled “Percentage remembered” going from 0 to 100. A blue bar labeled “Group N” is on the left, extending up to 57. A purple bar labeled “Group M” is on the right, extending to 59. The line graph y-axis is labeled “Frequency” and the x-axis is labeled “Percentage remembered” extending from 0 to 100. The blue line begins at 0 on the x-axis and curves gently upward then down, ending at 90. The purple line begins at 10, goes across and below the blue line, peaking at the center, and then down, ending at 100.
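One way to see this numerically is with an independent-samples t-test on simulated data. The values below are made up to mimic the two scenarios (low versus high within-group variance with a similar mean difference); they are not the actual data behind Figures 1.17 and 1.18, and a t-test is only one of several tests a researcher might use.

```python
# A sketch of the comparison in Figures 1.17 and 1.18 using simulated data and an
# independent-samples t-test (scipy). The group means differ by a similar amount in
# both scenarios, but the p-value depends heavily on the variance within each group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Low within-group variance (as in Figure 1.17): small SD around means near 50 and 53.
group_n_low = rng.normal(50, 3, size=50)
group_m_low = rng.normal(53, 3, size=50)

# High within-group variance (as in Figure 1.18): a similar mean difference, much larger SD.
group_n_high = rng.normal(50, 20, size=50)
group_m_high = rng.normal(53, 20, size=50)

print(stats.ttest_ind(group_m_low, group_n_low).pvalue)    # typically very small: reject the null hypothesis
print(stats.ttest_ind(group_m_high, group_n_high).pvalue)  # typically well above .05: fail to reject
```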
Unethical Statistics
There can be ethical problems associated with statistics. Graphs presented in magazines, newspapers, social media, and news programs are sometimes intentionally designed to misrepresent data and mislead the public for the purpose of persuasion (refer to “How to Spot a Misleading Graph—Lea Gaslowitz”). One such misrepresentation is a misleading or uneven y-axis. For example, if Figure 1.18 were presented with a narrowed y-axis and without the error bars, it would appear as if there were a much larger difference between the groups (refer to Figure 1.19).
How to Spot a Misleading Graph—Lea Gaslowitz
This animated video gives several examples of how graphs and figures can be misleading.
Figure 1.19 Misleading Graph
This is a graph of the same data set as Figure 1.18. Narrowing the y-axis value range makes the two groups look very different. Without the error bars, it would be easy to convince someone that these two groups differ.

Source: Martin Shapiro
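For illustration, here is a minimal matplotlib sketch of the same trick: an identical pair of group means plotted once with a full y-axis and once with a narrowed one. The numbers are illustrative only, not the actual study data.

```python
# A sketch of the y-axis trick using matplotlib: the same pair of group means plotted
# with a full y-axis versus a narrowed y-axis. The values are illustrative only.
import matplotlib.pyplot as plt

groups = ["Group N", "Group M"]
means = [57, 59]  # percentage remembered (made-up values)

fig, (honest, misleading) = plt.subplots(1, 2, figsize=(8, 3))

honest.bar(groups, means)
honest.set_ylim(0, 100)      # full scale: the bars look nearly identical
honest.set_ylabel("Percentage remembered")

misleading.bar(groups, means)
misleading.set_ylim(56, 60)  # narrowed scale: the same data look dramatically different
misleading.set_ylabel("Percentage remembered")

plt.tight_layout()
plt.show()
```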
Researchers often feel pressure to publish their research to keep their jobs in universities, receive promotions, or receive grants. Also, scientific journals are often reluctant to publish articles that have non-significant or ambiguous results; this is known as publication bias, the tendency for only research whose results support the hypothesis to be published, which can encourage unethical data manipulation (Song et al., 2010). Unfortunately, the pressure to publish may lead some researchers to manipulate their data and alter the way they present their results to make it appear that they have significant findings worth publishing. Sometimes researchers collect many different data sets (i.e., many dependent variables) and then report only the positive results and disregard the others. Data manipulation and poor research practices can be extremely dangerous: the first study to claim that vaccinations caused autism was based on poor research design and manipulated data. The study was eventually retracted by the journal that first published it, but this one study helped launch an anti-vaccination movement even though numerous well-run and highly controlled studies show no link.
Reliability and Validity
So, what are some ways to determine if you’ve run a good experiment? A mark of a good research design is the extent to which it produces consistent results when variations of the experiment are run again. This is known as the reliability of an experimental design or assessment tool: the degree to which the same or similar results are observed if the measurement is conducted again at a different time (McLeod, 2007). For example, if a personality test had good reliability, a person would expect the same results if they took it several times. There are different ways of evaluating reliability. One could run the experiment again to see if the same or similar results are produced (test-retest reliability). Another is to see whether the same measurements, evaluated by two different people, produce the same results (inter-rater reliability). Suppose a psychologist has two research assistants separately watch a video of people's behavior and follow guidelines for quantifying what they observe. The psychologist could then compare the results of both research assistants to look for consistency in how they rate the videos, as in the sketch below.
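A minimal sketch of that comparison, using made-up ratings for ten videos and a Pearson correlation between the two raters (a common, though not the only, way to quantify inter-rater agreement):

```python
# A minimal sketch of inter-rater reliability using made-up ratings: two assistants
# independently score the same ten videos, and we correlate their scores. A high
# positive correlation suggests they are rating the behavior consistently.
from scipy import stats

rater_a = [4, 7, 6, 9, 5, 3, 8, 6, 7, 2]  # hypothetical ratings by assistant A
rater_b = [5, 7, 6, 8, 5, 4, 9, 6, 6, 2]  # hypothetical ratings by assistant B

r, p = stats.pearsonr(rater_a, rater_b)
print(round(r, 2))  # close to 1.0, indicating strong agreement between the raters
```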
An experiment is considered to have validity if it accurately tests what it was designed to test. One way to assess validity is to look at the results in the context of other similar research. Internal validity refers to whether changes in the dependent variable are due to the levels of the independent variable and not to some other factor. Internal validity is improved by running well-controlled experiments, having accurate and quantifiable measurements, actively reducing subjectivity, having good control groups, and reducing extraneous or confounding variables.
External validity is the extent to which the results of a study generalize to other settings or participants. If someone is conducting research on methods of therapy to help people with depression, then one hopes these methods could be used for people outside of the study. The sampling problems we discussed earlier reduce the external validity of an experiment. For example, suppose a researcher wants to conduct an experiment on a method for reducing stress but only collects data on 18- to 20-year-old college students. In that case, the results may not generalize to all age groups, education and income levels, and people from other countries and cultures.
Theorizing
We now come to the fourth stage in the scientific method, theorizing. A theory is an explanation of some aspect of the natural world that is well supported by evidence: the better the evidence, the better the theory. The word theory is often misrepresented in public discourse as someone’s subjective idea without evidence. The theory of natural selection, which explains the evolution and change of organisms, has tremendous support from over a century of research in biology, geology, chemistry, genetics, physics, mathematics, and psychology. The theory that our climate is changing because of human activity is supported by overwhelming evidence. A theory from our sleep and memory experiment might be that quality sleep helps convert short-term memory to long-term memory. However, a theory is not an absolute fact. Theories are dynamic and adaptable as new evidence is acquired. While theories serve as a consensus for explaining behavior or mental processes, some of the best theories create new questions.
Replication
The fifth stage of the scientific method is replication: conducting all or part of an experiment again to see whether the original results hold. This is one of the most essential and yet often underemployed parts of the scientific method. In the past, replicating one’s own or other people’s findings has not been a high priority, as journals tend to want to publish novel and exciting research rather than repetitions of what others have already done. There are other disincentives for researchers to spend time, effort, and resources replicating other people’s work. There’s not a lot of prestige in saying, “I just spent six months diligently working and discovered what someone else already discovered.” Also, if one fails to replicate a finding, it might simply be that the second attempt was done incorrectly.
The problem is that when researchers make a concerted effort and rerun experiments, results are often not replicated. The lack of replicability is a problem in many sciences. For example, in 2012, researchers at the biotechnology firm Amgen attempted to replicate the research described in highly cited journal articles about cancer treatment. The research team could not replicate the results of 47 out of 53 “landmark” cancer studies (Begley & Ellis, 2012; Baker, 2016). Research in psychology and sociology has had a particular problem with replication, to the point that it has been dubbed a “replication crisis” (for review, refer to Yong, 2012, 2018). Recently, a large group of researchers in psychology attempted to replicate the findings of one hundred published research papers from three psychology journals as part of the Reproducibility Project: Psychology (refer to the Reproducibility Project website). They found that only 39 percent of the newly run experiments replicated the original results (Baker, 2015a, 2015b). Determining whether a finding had been replicated was not always straightforward, however. For example, of the 61 studies that did not replicate, about 24 produced “moderately similar” results to the original experiments (refer to Figure 1.20).
Figure 1.20 Replication of Research
Baker (2015a) shows the evaluation of the replication of one hundred psychology research studies. The shades of blue represent the degree to which the results were replicated—the lighter the shade, the better the replication.

Source: Baker, M. (2015). First results from psychology’s largest reproducibility test. Nature. https://doi.org/10.1038/nature.2015.17433.
Long Description
Labeled “Reliability Test,” it shows a group of squares representing the 61 “no” responses, ranging (left to right, top to bottom) from very similar at the top to not at all similar at the bottom. To the right is another group of squares representing the 39 “yes” responses, ranging from virtually identical to somewhat similar.
Myths, Lies, and Scams: Homeopathy and the Placebo Effect
Homeopathy is considered an alternative medicine that comes in the form of pills, oils, and lotions. Homeopathic products are said to be remedies or cures for all sorts of ailments, including pain, poor mobility, heart problems, cancer, immune system dysfunction, and mental health issues like depression, stress, and anxiety (Hahn, 2013). The idea is that a remedy for a physical or psychological problem can be produced by creating a product made of the essence of a plant, medicine, or chemical. An extract from a healing plant might be diluted, then diluted again and again, until only a microscopic concentration remains. The dilutions are so extreme that there is no detectable trace of the active ingredient. One way to think about it: imagine putting one drop of a chemical in a lake, mixing it around, walking to the other side of the lake, taking a teaspoon of water, and calling that medicine (refer to “Homeopathy Explained—Gentle Healing or Reckless Fraud?”). However, according to some homeopathic claims, the water maintains a “memory” of the substance (Letzter, 2017). Although extreme dilution and “water memory” are the historical basis of homeopathy, today homeopathic remedies cover a wide range of products with varied ingredients that are not regulated in the ways that conventional forms of medication are tested and regulated by the U.S. Food and Drug Administration.
Homeopathy Explained—Gentle Healing or Reckless Fraud?
This is an animated video about homeopathy and how it is a pseudoscience.
Homeopathy is a pseudoscience: a set of statements or beliefs presented as being based on science but that does not follow the scientific method or provide reasonable evidence of how its remedies work on the body and brain, or of their effectiveness. There have been many review articles examining studies of whether homeopathic medicine is an effective treatment for physical or psychological problems, and to date there is no evidence that homeopathic medicine serves any therapeutic function beyond a placebo effect (Ernst, 2002; Hahn, 2013; Stub et al., 2016; Teixeira, 2010). An article in the medical journal The Lancet (2005) argued that scientists should no longer conduct research on homeopathic remedies.
Still, homeopathic products are a multibillion-dollar industry. In 2021, the global homeopathic market value was about $6.2 billion, and it is projected to grow to about $19.7 billion by 2030 (Precedence Research, 2021). That is a tremendous amount of money for something with no medically beneficial active ingredient. Some people, however, swear homeopathic products work for them. This is likely due to the placebo effect and time. If one genuinely believes that a pill reduces anxiety, then to a small degree, they feel less anxious. With a stronger belief comes a better placebo effect, which can be facilitated if the homeopathic product is expensive (Brooks, 2009). If a person pays $58 for a small vial of homeopathic pills, they’ll be confident that their money was well spent. Also, as the saying goes, time heals all wounds. We tend to get better over time and assume the homeopathic pill contributed to it. There is also a whole industry around homeopathic products for pets. This confused me initially because homeopathy is based on a belief that something works, so the pet would have to believe and appreciate that its owner just spent $42.50 for some magic oil (refer to Figure 1.21). However, the placebo effect still works on the owner who believes they see their pet getting better. Or they give their pet more attention, which is all the pet wanted in the first place.
Key Takeaways
Statistical tests take into account mean differences, the number of participants, and the variance of the data.
The null hypothesis is the assumption that there is no difference between groups or measurements, and statistics help to provide confidence in rejecting or failing to reject the null hypothesis.
There are some inherent problems in the system that governs publication in that journals typically only want to publish significant results (publication bias), which encourages researchers to “find” significant results.
The reliability of an experiment is the degree to which the results can be replicated, and the validity is the degree to which an experiment tests what it intended to test.
A theory is a well-supported explanation that accounts for results in the context of other research.
Many sciences, including psychology, have a replication crisis, where researchers have found it difficult to replicate the results of previous experiments.