R Tutorials | Hypothesis Tests in R | #rstats
https://global-fintech.blogspot.com/2016/03/hypothesis-tests-in-r.html
#rstats In this series of our R tutorials we show how probability distributions are used for hypothesis tests in R. In statistics, the researcher faces two challenges: what question to ask and how the research data are structured. The answers to both determine the choice between parametric tests and non-parametric tests.
How Hypothesis Tests in R Work
A test that uses a probability distribution to tell whether a given sample belongs to the investigated population or not is called a probability or hypothesis test. We construct hypothesis tests in R using our knowledge about the population. We take a random sample and see how much its parameters (the mean or the shape of the distribution) deviate from those of the total population, using the area under the distribution curve – for the standard normal distribution, the familiar bell(-shaped) curve – whose total area equals one.
The null-hypothesis (H0) allows us to take a sample and estimate how much it deviates from the hypothesised population – that is, how common it would be to see such a difference from the expected result.
The p-value is the probability of observing a deviation from the population at least as large as the one in our sample. Graphically, it is the area under the curve from the measurement in question outwards, away from the mean. It shows how common it would be to draw such a sample. If the p-value is very low (p-value < 0.05), we reject the null-hypothesis and conclude that our sample belongs to a different distribution.
For more info about the p-value and distributions in R, please read our tutorial Probability Distributions in R.
How to choose between Hypothesis Tests in R for your Research
In statistical research the following criteria are important for the choice of the hypothesis test that you are going to use:
- the question you want to answer
- the type of the variable
- the type of the scale
- the structure of the data – grouped or paired
- the number of measurements.
Based on those criteria, we use parametric or non-parametric tests in R to test the data and get the results we need, that is to say, to see how common or uncommon the observation is.
Hypothesis Tests in R: Parametric Tests in R vs. Non-parametric Tests in R
Parametric tests in R are used when the parameters of the population distribution (the mean, the standard deviation or the shape of the distribution) are known. Parametric tests help to figure out how uncommon an observation is. We can use the following parametric tests in R provided that we know the distribution:
- z-test
- t-test
- ANOVA (analysis of variance).
When little is known about the distribution, we can use non-parametric tests in R. More precisely, non-parametric tests are applicable when your data do not fulfil the criteria for using a parametric test in R, e.g. when some parameters are not known, the sample is too small, the population has a skewed distribution, or the data are on the ordinal scale level (in which case the mean and the standard deviation no longer make sense). The following methods count as non-parametric tests in R (or distribution-free tests):
- Mann-Whitney Test (or Wilcoxon rank-sum test) - an alternative to the groupwise t-test
- Wilcoxon Signed-rank Test - an alternative to the paired t-test
- Kruskal-Wallis Test – an alternative to the parametric ANOVA.
Technically, non-parametric tests can also be used when the distribution is known. Practically, it is still preferable to use parametric tests in that case, for non-parametric tests have lower statistical power: the probability of correctly rejecting a false H0 is usually higher with a parametric test.
Parametric Tests in R
For parametric tests in R such as the z-test or the t-test we need to show that the population or sampling distribution is normally distributed. To test whether the data are normally distributed, we can apply the so-called Shapiro(-Wilk) Test in R – shapiro.test() – which shows whether a variable follows a normal distribution. The tested H0 is that the variable follows a normal distribution. The p-value shows how compatible the data are with H0. If the p-value is very small, H0 is rejected.
We may also want to see whether two variables are independent. For that purpose we can use Pearson's Test in R, also called the Chi-squared test – chisq.test(). The tested H0 is that the variables are independent. If the p-value is very small, H0 is rejected and we conclude that the variables are dependent.
Let’s see some examples:
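As a minimal sketch (mydata is the hypothetical data frame used throughout this tutorial; gender and smoker are invented categorical columns):

# Shapiro-Wilk normality test on a numeric variable:
shapiro.test(mydata$q1)

# Chi-squared test of independence on two categorical variables:
chisq.test(table(mydata$gender, mydata$smoker))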
Now, let's have a closer look at probability (hypothesis) testing. We shall start with parametric tests in R, namely z-tests, t-tests and ANOVA in R. Then we shall have a look at non-parametric tests. Before that, it needs to be mentioned that in statistics we basically transform common values such as kg or cm into special statistical values (e.g. z-values or t-values). We replace, therefore, common measurement units with units of variability. The goal is to see whether a difference is significant or not, given the variability.
Z-Test in R
The z-test in statistics is used to find out whether a sample belongs to a population or not in repeated samplings. To conduct a z-test in R, we need, first of all, to take a sample from a population, e.g. the tumor markers found in the blood. We should distinguish between the case of a single person and that of a group of people.
We shall treat the single-person case first. We need to know the mean and the standard deviation of the tested population. The null-hypothesis (H0) presupposes that the sample was taken from a certain population, i.e. from a population with a certain mean and a certain standard deviation. Remember that the data must be on the interval or ratio scale and the population must be normally distributed. We should also check whether the null-hypothesis matches the question that we want to ask; the latter should be about the mean.
Next, we transform a normally distributed variable to the z-variable, using the mean and the standard deviation of the population (denoted by the Greek letters mu (μ) and sigma (σ) respectively), and obtain the standard normal distribution:

z = (x − μ) / σ
The formula above helps us to discover whether the variable (x) deviates from the population mean. If the probability, i.e. the p-value, is too low, we reject the null-hypothesis, concluding that our sample belongs to a different – alternative – distribution.
Should, however, our sample contain several persons, that is to say a randomly selected group of people (of sample size n), we measure the mean tumor marker level of the whole group. This sample mean varies less from sample to sample than single measurements do. We calculate the standard deviation of the sampling distribution by taking the standard deviation of the population and dividing it by the square root of n:

σx̄ = σ / √n
For that case, we transform the formula in the following way:

z = (x̄ − μ) / (σ / √n)
The variable x-bar (x̄) is the sample mean. We subtract the population mean from it and divide by the standard deviation of the sampling distribution, i.e. the known population standard deviation divided by the square root of n. Thus, we transform our measurement to a z-value. This allows us to find how common the sample mean would be in repeated samplings.
Finally, a small remark: we use the Greek letters to talk about the whole population (e.g. μ and σ), whereas the Latin letters are used to describe the given sample (x̄ and n).
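Base R has no built-in z-test function, so here is a minimal sketch that computes the z-value and the two-sided p-value by hand; the population parameters and the sample values are invented for illustration:

mu <- 100; sigma <- 15          # assumed known population mean and sd
x <- c(108, 112, 97, 105, 110)  # hypothetical sample of n = 5 measurements
n <- length(x)
z <- (mean(x) - mu) / (sigma / sqrt(n))  # transform the sample mean to a z-value
p <- 2 * pnorm(-abs(z))                  # two-sided p-value under the standard normal
z; p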
T-Test in R
Let's spice up this post with some history. The t-test was developed and published in 1908 by William Gosset, a chemist and statistician who worked for the Guinness brewery in Dublin. Gosset was not allowed by Guinness to publish under his own name, so he used the pen name Student. Therefore, his tests are often referred to as Student's tests. In this post we shall refer to them as t-tests.
In his work, Gosset had to solve the problem of drawing inferences from small samples. The context of Gosset's work was process-control sampling in the brewery.
Now, let's switch from history back to the main topic of this tutorial, namely the t-test in R. A small reminder: as mentioned above, if we know the standard deviation of the population (σ), we use the z-test to check our hypothesis. Unfortunately, that is not always the case. Often the standard deviation of the population is unknown. What can we do? We have to estimate it with the sample standard deviation, denoted as s and calculated in the following way:

s = √( Σ(xᵢ − x̄)² / (n − 1) )
From the formula, we can see that the larger the sample size, the better our estimate of the population standard deviation. At large sample sizes, the t-distribution approaches the z-distribution.
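We can watch this convergence directly in R by comparing the 97.5% quantiles of the two distributions:

qnorm(0.975)        # 1.959964 – the z-distribution
qt(0.975, df = 10)  # 2.228139 – heavier tails at small sample sizes
qt(0.975, df = 100) # 1.983972 – already close to the z-value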
If we compare the means of two samples to find out how large the difference between them is, we apply the so-called groupwise t-test:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)
Should we instead ask whether the mean of the within-pair differences is smaller or larger than 0, rather than comparing two group means, we apply the paired t-test:

t = d̄ / (s_d / √n)

where d̄ is the mean of the pairwise differences and s_d is their sample standard deviation.
In our research, we can take advantage of the t-test in R - t.test() - for different purposes.
1. First and foremost, we may need to test whether the mean of a normally distributed variable equals a given number (the tested H0 hypothesis is that the mean equals the given number):
t.test(mydata$q1, mu=1.53)
2. We may also need to know whether two normally distributed variables of equal variance have the same mean (the tested H0 is that the means are equal) – the groupwise t-test:
t.test(mydata$q4, mydata$q5)
3. Alternatively, it may become necessary to test whether the mean of a normally distributed variable is the same in two different subgroups of the sample (the tested H0 is that the means are equal) – again a groupwise t-test, this time with the formula interface:
t.test(mydata$q4 ~ mydata$q1)
Examples 2 and 3 are thus both cases of the groupwise t-test.
4. Last, but not least, we may want to display the confidence interval for the mean of a normal variable – t.test()$conf.int:
t.test(mydata$q5, conf.level=0.95)$conf.int
Now, we shall practise a little bit with t-tests in R.
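Since the original data file is not reproduced in this post, the following sketch practises the calls above on simulated data; all names and numbers are invented for illustration:

set.seed(42)
group <- factor(rep(c("A", "B"), each = 25))  # hypothetical grouping variable
score <- rnorm(50, mean = ifelse(group == "A", 10, 11), sd = 2)
t.test(score, mu = 10)                     # 1. one-sample test against mu = 10
t.test(score ~ group)                      # 2./3. groupwise test by subgroup
t.test(score, conf.level = 0.95)$conf.int  # 4. 95% confidence interval for the mean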
ANOVA in R
ANOVA is an abbreviation for the analysis of variance, which is quite commonly used to find differences between more than two samples. The tested H0 is that all group means are equal (or, in the model-comparison form, that two models are equivalent).
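A minimal sketch of a one-way ANOVA call (the data frame and column names are hypothetical):

fit <- aov(score ~ group, data = mydata)  # one-way ANOVA: score by group
summary(fit)                              # a small p-value suggests the group means differ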
In this post, we shall not go into the details of ANOVA in R. We have, however, already touched upon it in one of our R Tutorials, more precisely in the #rstatistics series about regression analysis with R. We shall dedicate one of our next R Tutorials to it.
Now, let us show you how you can visualise hypothesis tests in R.
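A common way to visualise a test is to plot the distribution under H0 and shade the rejection region. Here is a minimal sketch in base R for a two-sided z-test at the 5% level (all numbers are illustrative):

x <- seq(-4, 4, length.out = 400)
plot(x, dnorm(x), type = "l", xlab = "z", ylab = "density")
crit <- qnorm(0.975)                      # two-sided 5% critical value, about 1.96
right <- seq(crit, 4, length.out = 100)   # right rejection region
left <- seq(-4, -crit, length.out = 100)  # left rejection region
polygon(c(crit, right, 4), c(0, dnorm(right), 0), col = "grey")
polygon(c(-4, left, -crit), c(0, dnorm(left), 0), col = "grey")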
Non-parametric Tests in R
Let's make a lyrical pause in our discussion of parametric and non-parametric tests in R. We can hardly imagine our life today without technology. Indeed, in recent years our life has changed so enormously that we cannot even imagine it without our Blackberries or iPhones, profiles on Facebook, hashtags, emoticons and, for sure, the omnipresent selfies!
Earlier, life was less complicated, people tended to be devoted to what they had and valued their feelings and relations more – in other words, they were more committed, or verbindlich, as the Germans put it. Nevertheless, we are not so old as to grumble about how modernity and technology transform our life, and we cannot but observe the other side of the coin: everything used to be dead slow and expensive. That applies, in particular, to the world of statistics and calculations. It is even difficult to imagine today that the notorious plan Ost, developed in 1940–1941, which consisted in colonising and Germanising Russia and contained lots of economic calculations aimed at optimising the processes on the occupied territory, could have been completed in less than a month had the Germans possessed tools such as R, or at least MS Excel. Instead, it took them almost a year to carry out all the necessary calculations. Luckily for the Soviets…
Anyway, let us not get distracted by history. That example shows us not only that there are pros and cons in all historical times, but also that before computers mathematics and statistics were much more cumbersome (though less advanced) than today. Returning to our topic of hypothesis testing: when challenged with non-parametric problems today, we are much luckier than decades ago, because the lack of knowledge about distributions is compensated by computational power through resampling.
Sometimes we have to deal with data that do not fulfil the criteria for a parametric test. The reasons can be the following:
- some of the parameters may be unknown
- the sample may be too small
- the population may have a skewed (i.e. not normal) distribution
- the data may be ordinal (consequently, the mean and the standard deviation do not help us much here).
In the case of non-parametric tests we make no assumptions about the distribution (hence they are also called distribution-free tests). Instead, we rely on ranks and resampling. Among the non-parametric tests there are alternatives to each parametric test:
1. The Wilcoxon Rank-sum Test (also called the Mann-Whitney Test) with continuity correction in R is an alternative to the groupwise t-test in R. It checks whether two variables, at least one of which is not normally distributed, have the same central value – wilcox.test(). The tested H0 is that the two distributions have the same location. If the p-value is very small, H0 is rejected.
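As a minimal sketch, mirroring the groupwise t.test() example above with the same hypothetical columns:
wilcox.test(mydata$q4, mydata$q5)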
2. The Wilcoxon Signed-rank Test in R is an alternative to the paired t-test.
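A minimal sketch, assuming hypothetical before/after columns in mydata:
wilcox.test(mydata$before, mydata$after, paired = TRUE)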
3. The Kruskal-Wallis Rank-sum Test in R is an alternative to the parametric ANOVA and tests whether the location of a non-normal variable is the same in two or more subgroups of the sample – kruskal.test(). The tested H0 is that all subgroups follow the same distribution. If the p-value is very small, H0 is rejected.
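As a minimal sketch with the formula interface, mirroring the t.test() formula example above:
kruskal.test(mydata$q4 ~ mydata$q1)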
Let's practise a little bit with non-parametric tests in R.
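Again, since the data file itself is not shown in this post, the following sketch practises the three tests on simulated skewed data; all names and numbers are invented:

set.seed(7)
a <- rexp(30, rate = 1)    # skewed sample 1
b <- rexp(30, rate = 0.7)  # skewed sample 2, shifted in location
g <- factor(rep(c("g1", "g2", "g3"), each = 20))
y <- rexp(60, rate = c(1, 0.8, 0.6)[as.integer(g)])
wilcox.test(a, b)                 # groupwise comparison of a and b
wilcox.test(a, b, paired = TRUE)  # paired version, treating a and b as before/after
kruskal.test(y ~ g)               # comparison across the three subgroups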
You can find the full script as well as the data file in my GitHub repository.