Everything you need to know about statistics (but were afraid to ask)

Does the thought of p-values and regressions make you break out in a cold sweat? Never fear – read on for answers to some of those burning statistical questions that keep you up 87.9% of the night.

  • What are my hypotheses?

There are two types of hypothesis you need to get your head around: null and alternative. The null hypothesis always states the status quo: there is no difference between two populations, there is no effect of adding fertiliser, there is no relationship between weather and growth rates.

Basically, nothing interesting is happening. Generally, scientists conduct an experiment seeking to disprove the null hypothesis. We build up evidence, through data collection, against the null, and if the evidence is sufficient we can conclude – with a quantifiable risk of being wrong – that the null hypothesis is not true.

We then accept the alternative hypothesis. This hypothesis states the opposite of the null: there is a difference, there is an effect, there is a relationship.
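
To make that concrete, here is a minimal sketch in Python (using scipy) of how a null and alternative hypothesis pair translates into an actual test. The fertiliser scenario and the growth measurements are invented for illustration, and a two-sample t-test is just one common way to make the comparison.

```python
# A sketch of stating and testing hypotheses, with invented data.
# H0 (null): fertiliser makes no difference to mean plant growth.
# H1 (alternative): fertiliser does make a difference.
from scipy import stats

no_fertiliser = [4.1, 3.8, 5.0, 4.4, 3.9, 4.6]  # growth in cm (made up)
fertilised    = [5.2, 4.9, 5.8, 4.7, 5.5, 5.1]

result = stats.ttest_ind(no_fertiliser, fertilised)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# A small p-value is evidence against H0; a large one means we cannot reject it.
```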

  • What’s so special about 5%?

One of the most common numbers you stumble across in statistics is alpha = 0.05 (or in some fields 0.01 or 0.10). Alpha denotes the fixed significance level for a given hypothesis test. Before starting any statistical analyses, along with stating hypotheses, you choose a significance level you’re testing at.

This is the risk you are prepared to accept of making a Type I Error – otherwise known as a false positive: rejecting a null hypothesis that is actually true.

  • Type what error?

Most often we are concerned primarily with reducing the chance of a Type I Error over its counterpart (Type II Error – accepting a false null hypothesis). It all depends on what the impact of either error will be.

Take a pharmaceutical company testing a new drug; if the drug actually doesn’t work (a true null hypothesis) then rejecting this null and asserting that the drug does work could have huge repercussions – particularly if patients are given this drug over one that actually does work. The pharmaceutical company would be concerned primarily with reducing the likelihood of a Type I Error.

Sometimes, a Type II Error could be more important. Environmental testing is one such example: if the effect of toxins on water quality is examined, and in truth the null hypothesis is false (that is, the toxins do affect water quality), a Type II Error would mean accepting that false null hypothesis and concluding there is no effect of the toxins.

The downstream issues could be dire if toxin levels are allowed to remain high and people using that water suffer health effects.
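
If you want to see both errors in action, here is a rough simulation sketch in Python (using numpy and scipy; the sample size, effect size and number of simulated experiments are arbitrary choices for illustration). It runs many pretend experiments: in the first scenario the null hypothesis is true, so any rejection is a Type I Error; in the second it is false, so any failure to reject is a Type II Error.

```python
# A sketch of Type I and Type II error rates via simulation (illustrative numbers only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05            # significance level
n = 30                  # observations per group
n_experiments = 10_000

false_positives = 0     # Type I errors: null true, but rejected
false_negatives = 0     # Type II errors: null false, but not rejected

for _ in range(n_experiments):
    # Scenario 1: the null is true (both groups come from the same population).
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

    # Scenario 2: the null is false (the second group really is shifted by 0.5).
    c = rng.normal(0.0, 1.0, n)
    d = rng.normal(0.5, 1.0, n)
    if stats.ttest_ind(c, d).pvalue >= alpha:
        false_negatives += 1

print(f"Type I error rate:  {false_positives / n_experiments:.3f} (close to alpha)")
print(f"Type II error rate: {false_negatives / n_experiments:.3f} (depends on effect and sample size)")
```

Notice the trade-off: shrinking alpha makes Type I Errors rarer but, all else being equal, makes Type II Errors more common – which is why the relative cost of each error matters.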

  • What is a p-value, really?

Because p-values are thrown about in science like confetti, it’s important to understand what they do and don’t mean. A p-value is the probability of getting a result at least as extreme as the one you observed, if the null hypothesis were true.

Given we are trying to reject the null hypothesis, what this tells us is how likely our experimental data (or data even more extreme) would be if the null hypothesis were correct. If that probability is sufficiently low, we feel confident in rejecting the null and accepting the alternative hypothesis.

What is sufficiently low? As mentioned above, the typical fixed significance level is 0.05. So if the p-value is less than 0.05 you reject the null hypothesis. But a fixed significance level can be deceiving: if 5% is significant, why is 6% not?

It pays to remember that such probabilities are continuous, and any given significance level is arbitrary. In other words, don’t throw your data away simply because you get a p-value of 6-10%.
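
One way to see the definition at work is a permutation test: shuffling the group labels enforces the null hypothesis (“the labels don’t matter”), and the p-value is simply the proportion of shuffles that produce a difference at least as extreme as the one observed. Here is a minimal Python sketch, reusing the invented fertiliser numbers from the earlier example.

```python
# A sketch of what a p-value means, via a permutation test on made-up data.
import numpy as np

rng = np.random.default_rng(0)
control    = np.array([4.1, 3.8, 5.0, 4.4, 3.9, 4.6])   # growth in cm (invented)
fertilised = np.array([5.2, 4.9, 5.8, 4.7, 5.5, 5.1])

observed_diff = fertilised.mean() - control.mean()
pooled = np.concatenate([control, fertilised])
n_control = len(control)

n_shuffles = 10_000
at_least_as_extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)          # random relabelling: the null in action
    diff = shuffled[n_control:].mean() - shuffled[:n_control].mean()
    if abs(diff) >= abs(observed_diff):         # "as extreme or more extreme" (two-sided)
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_shuffles
print(f"Observed difference: {observed_diff:.2f} cm, p = {p_value:.4f}")
```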

  • How much replication do I have?

This is probably the biggest issue in experimental design, where the focus is on ensuring the right type of data, in large enough quantities, is collected to answer your questions as clearly and efficiently as possible.

Pseudoreplication refers to the artificial inflation of degrees of freedom (a mathematical restriction put in place when we calculate a parameter – e.g. a mean – from a sample). How would this work in practice?

Say you’re researching cholesterol levels by taking blood from 20 male participants. Each participant is tested twice, giving 40 test results. But the level of replication is not 40; it’s actually only 20 – a requirement for replication is that each replicate is independent of all the others. In this case, two blood tests from the same person are intrinsically linked.

If you were to analyse the data with a sample size of 40, you would be committing the sin of pseudoreplication: inflating your degrees of freedom (which, incidentally, makes a significant test result more likely). Thus, if you start an experiment understanding the concept of independent replication, you can avoid this pitfall.
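
Here is a rough sketch of the cholesterol example in Python (numpy and scipy). The cholesterol readings and the reference value of 5.0 are invented; the point is simply that taking one value per person – for instance the mean of their two readings – keeps the replicates independent and the sample size honest.

```python
# A sketch of avoiding pseudoreplication: 20 people, two readings each (all numbers invented).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_people = 20

# Each person has their own true level plus measurement noise, so their two
# readings are linked to each other and are not independent replicates.
true_levels = rng.normal(5.2, 0.6, n_people)
readings = true_levels[:, None] + rng.normal(0.0, 0.2, (n_people, 2))

# Wrong: treat all 40 readings as independent (pseudoreplication, inflated degrees of freedom).
wrong = stats.ttest_1samp(readings.ravel(), popmean=5.0)

# Right: one independent value per person, so n = 20.
per_person = readings.mean(axis=1)
right = stats.ttest_1samp(per_person, popmean=5.0)

print(f"n = 40 (pseudoreplicated): p = {wrong.pvalue:.4f}")
print(f"n = 20 (independent):      p = {right.pvalue:.4f}")
```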

  • How do I know what analysis to do?

There is a key piece of prior knowledge that will help you determine how to analyse your data: what kind of variable are you dealing with? The two most common types of variable are:

  • Continuous variables. These can take any value. Were you to measure the time until a reaction was complete, the results might be 30 seconds, two minutes and 13 seconds, or three minutes and 50 seconds.
  • Categorical variables. These fit into – you guessed it – categories. For instance, you might have three different field sites, or four brands of fertiliser. All continuous variables can be converted into categorical variables.

With the above example we could categorise the results into less than one minute, one to three minutes, and greater than three minutes. Categorical variables cannot be converted back to continuous variables, so it’s generally best to record data as “continuous” where possible to give yourself more options for analysis.
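
As a quick illustration, here is how that binning might look in Python with pandas. The reaction times are the made-up examples above, converted to seconds, and the bin edges are one choice among many.

```python
# A sketch of converting a continuous variable into a categorical one.
import numpy as np
import pandas as pd

times = pd.Series([30, 133, 230], name="reaction_time_s")   # 30 s, 2 min 13 s, 3 min 50 s

categories = pd.cut(
    times,
    bins=[0, 60, 180, np.inf],
    labels=["< 1 minute", "1-3 minutes", "> 3 minutes"],
)
print(pd.DataFrame({"seconds": times, "category": categories}))
# The categories cannot be turned back into the exact times, which is why
# recording the continuous values keeps more analysis options open.
```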

Deciding between the two main types of analysis is easy once you know what variables you have:

ANOVA (Analysis of Variance) is used when the explanatory variable is categorical and the response is continuous – for instance, fertiliser treatment versus plant growth in centimetres.

Linear Regression is used when comparing two continuous variables – for instance, time versus growth in centimetres.

Though there are many analysis tools available, ANOVA and linear regression will get you a long way in looking at your data. So if you can start by working out what variables you have, it’s an easy second step to choose the relevant analysis.
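
To round things off, here is a minimal sketch of both analyses in Python with scipy. The plant-growth data are simulated, and the brands, effect sizes and time span are invented for illustration.

```python
# A sketch of the two workhorse analyses on invented plant-growth data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# ANOVA: categorical explanatory variable (fertiliser brand) vs continuous response (growth, cm).
brand_a = rng.normal(5.0, 1.0, 15)
brand_b = rng.normal(5.5, 1.0, 15)
brand_c = rng.normal(6.2, 1.0, 15)
anova = stats.f_oneway(brand_a, brand_b, brand_c)
print(f"One-way ANOVA: F = {anova.statistic:.2f}, p = {anova.pvalue:.4f}")

# Linear regression: continuous explanatory variable (time in days) vs continuous response (growth, cm).
days = np.arange(1, 31)
growth = 0.3 * days + rng.normal(0.0, 0.8, days.size)
fit = stats.linregress(days, growth)
print(f"Regression: slope = {fit.slope:.2f} cm/day, r^2 = {fit.rvalue**2:.2f}, p = {fit.pvalue:.4f}")
```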

Ok, so perhaps that’s not everything you need to know about statistics, but it’s a start. Go forth and analyse!

Credit: Sarah-Jane O’Connor, University of Canterbury