Statistics, or statistical procedures, refers to a set of mathematical techniques for organising, summarising and interpreting data[1]. Being able to analyse and interpret statistical data is a key skill for researchers and professionals from many different disciplines.
This module introduces basic statistical concepts that underlie statistical analyses, such as types of measurement of data, and useful definitions related to samples and population.
Two families of statistics will also be discussed: descriptive statistics and inferential statistics.
Finally, you will learn about the graphic representations of data: pie charts, bar charts and histograms.
The study of statistics involves the following skills and knowledge:
As a researcher or professional in your area, you will need to make decisions based on statistical data, to interpret statistical data in research articles, and to conduct your own research and interpret its data.
[1] The word ‘data’ is from Latin and is plural; ‘datum’ is the singular. However, ‘data’ as a singular noun is now widely accepted in some disciplines.
Take a look at this example:
Would you think that 80% of dentists recommended Colgate while 20% recommended other brands?
In fact, when dentists were surveyed, they could choose several brands instead of just one. Therefore, other brands could be just as popular as Colgate. The percentage of dentists who chose other brands was not presented in this example.
Example taken from: http://www.statisticshowto.com/misleading-statistics-examples/
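The point about multiple answers can be checked with a small sketch. The survey responses and brand names below are invented for illustration and are not the data from the original advertisement. Because each dentist may name several brands, the percentages do not have to add up to 100%, and another brand can score just as highly as Colgate.

```python
# Hypothetical survey: each dentist may recommend several brands.
# These five responses are invented for illustration only.
responses = [
    {"Colgate", "Brand B"},
    {"Colgate", "Brand B", "Brand C"},
    {"Colgate"},
    {"Brand B", "Brand C"},
    {"Colgate", "Brand B"},
]


def recommendation_rates(responses):
    """Return the percentage of respondents who recommended each brand."""
    brands = sorted(set().union(*responses))
    n = len(responses)
    return {b: 100 * sum(b in r for r in responses) / n for b in brands}


rates = recommendation_rates(responses)
for brand, pct in rates.items():
    print(f"{brand}: {pct:.0f}%")

# The rates add up to more than 100% because answers overlap.
print(f"Sum of all rates: {sum(rates.values()):.0f}%")
```

In this made-up sample, Colgate and Brand B are each recommended by 80% of dentists, so reporting only the Colgate figure would hide half of the picture.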
We can distinguish between two families of statistics:
Descriptive statistics
Descriptive statistics are used to summarise and describe data (information that has been collected).
Examples of descriptive statistics are the average age of university students, or the number of female and male students undertaking a Health Sciences degree.
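As a minimal sketch of these two examples, the code below computes an average age and a count of students by gender using Python's standard library. The six student records are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical records for six students (invented for illustration).
students = [
    {"age": 19, "gender": "female"},
    {"age": 22, "gender": "male"},
    {"age": 20, "gender": "female"},
    {"age": 25, "gender": "female"},
    {"age": 21, "gender": "male"},
    {"age": 19, "gender": "female"},
]

# Descriptive statistic 1: the average (mean) age.
average_age = mean(s["age"] for s in students)

# Descriptive statistic 2: how many students fall into each category.
gender_counts = Counter(s["gender"] for s in students)

print(f"Average age: {average_age}")
print(f"Female: {gender_counts['female']}, male: {gender_counts['male']}")
```

Both results only describe the six students in the list; they say nothing yet about students in general.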
Inferential statistics
With inferential statistics, data are usually collected from a sample; that is, a smaller representative subset of the larger population we wish to investigate. The results are then used to draw conclusions about that population.
Examples of inferential statistics are statistical techniques to explore the relationship between variables (e.g. correlation coefficients). These techniques show us whether two variables are related: for instance, whether there is a relationship between stress levels and academic results.
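To make the idea of a correlation coefficient concrete, here is a minimal sketch that computes Pearson's r by hand. The stress and results values are invented, and the function name `pearson_r` is only an illustrative choice. Note that the coefficient on its own describes the sample; the inferential step, which is not shown here, is a significance test of whether the relationship is likely to hold in the wider population.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of the deviations from each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: stress level (0 to 10) and academic result (0 to 100).
stress = [2, 4, 5, 7, 9]
results = [85, 78, 74, 65, 58]

r = pearson_r(stress, results)
print(f"r = {r:.3f}")
```

A value of r close to -1, as in this invented sample, means that higher stress goes together with lower results; a value close to 0 would mean no linear relationship.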
Question | DESCRIPTIVE STATISTICS | INFERENTIAL STATISTICS |
We would like to know how many university students experience high stress levels. | | |
We would like to know whether female students experience higher stress levels than male students. | | |
We would like to know what study strategies are used by first-year university students. | | |
We would like to know whether there is a relationship between university students' study strategies and their academic results. | | |
DESCRIPTIVE STATISTICS: In these two examples, the objective is to describe study strategies and assess the number of students who experience stress. Thus, you'd employ descriptive statistics.
INFERENTIAL STATISTICS: In these examples, the objective is to assess whether one group experiences more stress than the other, and whether there is a relationship between two variables. Thus, you'd employ inferential statistics.
Population and sample: The population is the entire group that you wish to study (e.g. all Australian university students). A sample is a smaller set of individuals selected from that population, ideally one that is representative of it. Because accessing the whole population is rarely possible, data are usually collected from a sample.
Sampling error: Even though samples should be representative of the population, a sample rarely matches it exactly, and the difference can distort the results. For example, a sample may over-represent individuals with a particular characteristic (e.g. students who are highly stressed may be more likely than students who are not stressed to volunteer for a study on stress at university). Sample size, or the number of participants/observations, will also affect the results of inferential statistics.
Data: Measurements or observations collected from a sample or population.
Statistic: A characteristic of the sample under study (e.g. the mean, or average, study time of La Trobe University students).
Parameter: A characteristic of a population (e.g. the mean, or average, study time of all university students).
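The difference between a parameter and a statistic can be sketched with a simulation. Everything here is an assumption made for illustration: the population of 10,000 study times is randomly generated, the sample size of 100 is arbitrary, and the seed is fixed only so that the run is repeatable. In real research the parameter is usually unknown, which is why it has to be estimated from a sample.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: weekly study hours of 10,000 students.
population = [rng.uniform(0, 40) for _ in range(10_000)]

# Parameter: a characteristic of the whole population.
parameter = mean(population)

# Sample: 100 students drawn at random, without replacement.
sample = rng.sample(population, 100)

# Statistic: the same characteristic, measured on the sample only.
statistic = mean(sample)

# Sampling error: how far the sample statistic is from the parameter.
sampling_error = statistic - parameter

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
print(f"Sampling error:              {sampling_error:+.2f}")
```

Drawing a different sample, or changing the sample size, gives a different statistic while the parameter stays the same.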
Variable: A property or characteristic of a person, event or object that can take on different values or amounts (e.g. study time). Variables can be grouped into two categories: independent and dependent variables.
Independent variable: A variable that is manipulated in order to investigate its effect on another variable.
Dependent variable: A variable that is affected by the independent variable.
Controlled variables: These are the variables that you keep constant (controlled) during the experiment or research study.
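For example, if we want to determine what type of antidepressant (drug A, drug B or drug C) is most effective in the treatment of depression, the type of antidepressant is the independent variable and the level of depression after treatment is the dependent variable.
Another example: if we want to investigate whether students who listen to music when they study obtain higher grades than those who do not, listening to music while studying (yes/no) is the independent variable and grades are the dependent variable.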
Identify the dependent and independent variables in these research questions. Choose the appropriate box.
Q1- We would like to investigate whether students who attend learning skills workshops obtain higher grades than students who do not attend these workshops.
Variable | DEPENDENT VARIABLE | INDEPENDENT VARIABLE |
Workshop attendance (Yes/No) | | |
Grades | | |
Explanation: You are interested in assessing whether attending workshops (independent variable) has an effect on grades (dependent variable).
Q2- Is there a link between hours of television viewing and violent behaviour in children aged 8-14?
Variable | DEPENDENT VARIABLE | INDEPENDENT VARIABLE |
Hours of television viewing | | |
Violent behaviour | | |
Explanation: You would like to investigate whether the number of hours of television viewing (independent variable) has an effect on violent behaviour (dependent variable).
Understanding your data is important because the level of measurement determines how the data can be summarised and presented. There are four levels of measurement of data: nominal, ordinal, interval, and ratio.
Nominal variables, also known as categorical variables, express a qualitative attribute (a category) and do not imply any numerical ordering. Examples of this type of variable are gender (male/female) and marital status (married, single, divorced, …).
Ordinal variables express categories with a natural order; that is, values that can be ranked. However, the precise difference or distance between categories is unknown. Examples of ordinal variables are:
- Educational level, which can be ranked (e.g. from primary education to postgraduate research).
- The level of agreement or disagreement with a statement or question.
For example, individuals can be ranked according to the importance they give to religion, but the precise difference between two responses (e.g. very important and important) cannot be defined.
The diagram below shows the differences between each type of measurement.
Retrieved from http://www.slideshare.net/KarenHarker1/inferential-statistics-34291836
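One practical consequence of the level of measurement is which summary makes sense. The sketch below follows a common rule of thumb that is an assumption here rather than something stated in this module: the mode for nominal data, the median for ordinal data, and the mean for numeric (interval or ratio) data. All values are invented for illustration.

```python
from statistics import mean, median_low, mode

# Nominal: categories with no order, so only the most frequent
# category (the mode) is a meaningful summary.
marital_status = ["single", "married", "single", "divorced", "single"]
most_common_status = mode(marital_status)

# Ordinal: categories with an order but unknown distances, so we can
# rank the answers and report the middle one (the median).
AGREEMENT_ORDER = ["strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree"]
answers = ["agree", "neutral", "strongly agree", "agree", "disagree"]
ranks = sorted(AGREEMENT_ORDER.index(a) for a in answers)
median_answer = AGREEMENT_ORDER[median_low(ranks)]

# Numeric scale with a true zero (ratio): differences and averages
# are meaningful, so the mean can be used.
exam_grades = [55, 70, 85, 90]
average_grade = mean(exam_grades)

print(f"Most common marital status: {most_common_status}")
print(f"Median level of agreement:  {median_answer}")
print(f"Average exam grade:         {average_grade}")
```

Computing a mean of the marital status labels would be meaningless, which is why identifying the level of measurement comes first.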
Variables | NOMINAL | ORDINAL | INTERVAL | RATIO |
Gender | | | | |
Exam grades | | | | |
Gender is a nominal variable - it expresses two categories: males and females. Exam grades are measured using a numeric scale, and a student can obtain a zero score. Thus, this is a ratio variable.
Variables | NOMINAL | ORDINAL | INTERVAL | RATIO |
Year level (first, second, third) | | | | |
Stress levels (measured from 0 to 10) | | | | |
Year level is an ordinal variable as it includes categories (first, second, third) that can be ordered. Stress levels, measured from 0 to 10, represent numeric values with an absolute zero; thus, this is a ratio variable.
Variables | NOMINAL | ORDINAL | INTERVAL | RATIO |
Discipline | | | | |
Dropping out of university | | | | |
In this example, both variables are nominal as they express categories: Arts and Science (Variable: Discipline) and Yes/No (dropping out of university, as the variable is expressed in terms of dropping out or not dropping out).
Statistical information is usually presented and summarised using graphics and tables. The type of graph used to summarise and present data will depend on the level of measurement of the data:
Pie charts
In a pie chart, each category is represented by a slice of the pie. The area of the slice represents the percentage of responses in the category.
Fig 1. Frequencies of previous computer ownership by current Mac users (Retrieved from http://www.onlinestatbook.com/Online_Statistics_Education.pdf)
Pie charts are particularly helpful when displaying the frequencies of a small number of categories. However, they can be confusing if there is a large number of categories, or if data from two different studies or experiments are presented.
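The size of each slice follows directly from the category frequencies. As a minimal sketch, the code below turns a list of answers into a percentage and a slice angle for each category. The ten answers are invented and are not the data behind Fig 1; `pie_slices` is an illustrative name.

```python
from collections import Counter


def pie_slices(values):
    """Map each category to (percentage of responses, slice angle in degrees)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (100 * n / total, 360 * n / total) for cat, n in counts.items()}


# Hypothetical answers to "What computer did you own before?"
answers = ["Mac", "Mac", "Windows", "None", "Mac",
           "Windows", "Mac", "Mac", "None", "Mac"]

slices = pie_slices(answers)
for category, (percent, angle) in slices.items():
    print(f"{category}: {percent:.0f}% of responses, {angle:.0f} degree slice")
```

Because the percentages always add up to 100, the slices always fill the whole circle.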
Bar charts
Bar charts display the frequencies of different categories. Bar charts can also be used to compare the responses of two or more groups. In the example below, the bar chart compares the perceived stress of males and females by marital status (e.g. single, divorced, separated,…).
This bar chart allows us to compare perceived stress by gender and marital status. The X axis shows two groups: males and females; whilst the Y axis represents their scores on perceived stress. The higher the bar, the higher their perceived stress.
Moreover, each bar represents a group of males or females by marital status (the legend on the right of the chart shows which colour is used for each group). A quick look tells us that separated women are the most stressed group, and that widowed women perceive less stress than any other group.
However, it is important to note that we can’t conclude that widowed women in general are less stressed, or that separated women in general will be more stressed than any other group. This is an example of descriptive statistics: the graph summarises perceived stress by gender and marital status. To investigate whether there is a significant difference between these groups, and whether we can generalise these results to the wider population, we will need to employ inferential statistics.
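The height of each bar in a chart like this is a group summary, typically a mean. As a minimal sketch, assuming the bars show mean perceived stress, the code below groups invented records by gender and marital status and computes one mean per group. The scores are made up and only two marital status categories are used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (gender, marital status, perceived stress score).
records = [
    ("female", "single", 26), ("female", "single", 28),
    ("female", "separated", 31), ("female", "separated", 33),
    ("male", "single", 25), ("male", "single", 27),
    ("male", "separated", 28), ("male", "separated", 30),
]

# Collect the scores that belong to each (gender, status) group.
groups = defaultdict(list)
for gender, status, score in records:
    groups[(gender, status)].append(score)

# One mean per group: this is the value a bar would display.
group_means = {group: mean(scores) for group, scores in groups.items()}

for (gender, status), value in sorted(group_means.items()):
    print(f"{gender:<6} {status:<9} mean stress = {value}")

# The group with the tallest bar.
highest = max(group_means, key=group_means.get)
print(f"Highest mean: {highest}")
```

As with the chart itself, these group means are descriptive only; they do not show whether the differences are statistically significant.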
Histograms
Fig 3. Example of histograms. Adapted from Pallant, J. (2013). SPSS survival manual: A step by step guide to data analysis using IBM SPSS (5th ed.). Sydney, Melbourne, Auckland, London: Allen & Unwin.
The x axis represents scores on perceived stress (e.g. a person can score 30, 35, 40,…). The y axis represents the frequency of each score; that is, how many times each score occurs. For instance, a score of 20 has a frequency of 10 for females; that is, 10 females scored 20 on perceived stress. Each histogram represents the frequency of stress scores for males and females and allows you to compare which scores are most frequent for each group.
Again, this only gives you descriptive information about your groups, but does not allow you to conclude that females are significantly more stressed than males, for instance.
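The frequencies that a histogram displays can be obtained by counting how often each score occurs. The sketch below does this for a small set of invented scores and prints a simple text version of the histogram. For simplicity each distinct score is its own bar, whereas statistical software usually groups scores into ranges (bins).

```python
from collections import Counter

# Hypothetical perceived stress scores for nine females.
female_scores = [20, 25, 20, 30, 25, 20, 35, 30, 20]

# Frequency of each score: how many times each score occurs.
frequencies = Counter(female_scores)

# Text histogram: one row per score, one '#' per occurrence.
for score in sorted(frequencies):
    print(f"{score:>3} | {'#' * frequencies[score]}")
```

In this made-up sample a score of 20 has a frequency of 4; that is, four females scored 20 on perceived stress.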
The Khan Academy has produced a series of videos that explain diverse statistical concepts.
The RMIT Learning Lab contains fact sheets on statistical concepts, including useful statistical definitions.