9.02 Scatterplots

Scatterplots

We often analyze bivariate data to determine whether a relationship between the two variables exists. A scatterplot can be used to display bivariate, numerical data once the independent and dependent variables are defined.

The analysis of bivariate data should include:

  • Form, usually described as a linear relationship or a nonlinear relationship

  • Strength, describing how closely the data points match the model line or curve

If the relationship between the variables is linear, the direction of the relationship can be described as positive or negative.

  • Positive relationship: as the independent variable increases, the dependent variable increases

  • Negative relationship: as the independent variable increases, the dependent variable decreases

The dashed lines in the scatterplots will help us visualize possible trends in the data.

[Six scatterplots of y against x, each with a dashed model line, captioned as follows:]

  • Perfect positive relationship since points are exactly along the model with a positive slope

  • Perfect negative relationship since points are exactly along the model with a negative slope

  • Strong negative relationship since points are tightly clustered along the model

  • Moderate negative relationship since points are relatively clustered along the model

  • Weak positive relationship since points are loosely clustered along the model

  • No relationship since there is no evident clustering of the data
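One common way to put a number on direction and strength is Pearson's correlation coefficient r. This lesson does not introduce r, so the sketch below is only an illustration under that assumption. The sign of r matches the direction of a linear relationship, and values of |r| closer to 1 correspond to points clustered more tightly along a line. The function name `pearson_r` and the sample points are hypothetical and are not read from the scatterplots.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired numerical data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points, invented for illustration only.
xs = [1, 2, 3, 4, 5]
perfect_positive = [10, 20, 30, 40, 50]  # exactly on a line with positive slope
perfect_negative = [50, 40, 30, 20, 10]  # exactly on a line with negative slope
scattered = [12, 18, 33, 37, 52]         # near a rising line, but not exactly on it

r_pos = pearson_r(xs, perfect_positive)
r_neg = pearson_r(xs, perfect_negative)
r_scat = pearson_r(xs, scattered)
```

Here `r_pos` and `r_neg` come out at 1 and -1 (up to rounding), while `r_scat` is positive but below 1. Note that r only measures how well a line fits, so a strong nonlinear relationship can still give a value far from 1 or -1.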

When comparing bivariate data, it may be necessary to separate the data into categories. For example, when comparing the weights of dogs during their first year after birth, the data might not show a relationship because large dogs (like Boxers) will grow much more than small dogs (like Yorkies).

We can compare categorical variables in scatterplots by using different colors or symbols.

The weights of small, medium, and large dogs over time are shown in the scatterplot.

A scatterplot titled 'Weight of dogs over time'. The horizontal axis shows the age in months from 0 to 12, and the vertical axis shows the weight in pounds from 0 to 60. The key indicates that the blue dots represent small dogs, the green dots represent medium dogs, and the black dots represent large dogs. The weight increases with age for all dog sizes, but at different rates.

Different colored dots represent the different categories or sizes of dogs.

For each category, there is a strong, positive linear relationship between the dogs' age and weight.
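Separating data into categories can be sketched in code as grouping the points by a label before looking for a trend, which is what the colors do on the plot. The records below are hypothetical values invented for illustration; they are not read from the scatterplot, and the helper name `weight_rises_with_age` is also an assumption.

```python
from collections import defaultdict

# Hypothetical (age in months, weight in pounds, size) records,
# invented for illustration and not taken from the scatterplot.
records = [
    (2, 3, "small"), (6, 5, "small"), (10, 7, "small"),
    (2, 10, "medium"), (6, 20, "medium"), (10, 30, "medium"),
    (2, 20, "large"), (6, 40, "large"), (10, 60, "large"),
]

# Group the (age, weight) points by category.
by_size = defaultdict(list)
for age, weight, size in records:
    by_size[size].append((age, weight))

def weight_rises_with_age(points):
    """True if, after sorting by age, each point is heavier than the one before."""
    ordered = sorted(points)
    return all(w1 < w2 for (_, w1), (_, w2) in zip(ordered, ordered[1:]))

trend_by_size = {size: weight_rises_with_age(pts) for size, pts in by_size.items()}
```

In this made-up data every category shows weight rising with age, even though a 10-month-old small dog weighs less than a 2-month-old large dog. Mixing the categories together would hide that pattern.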

It is important to note that the existence of a relationship between two variables in a scatterplot, regardless of strength, does not necessarily imply that one causes the other. Causation can only be determined from an appropriately designed statistical experiment.

Examples

Example 1

For each scatterplot, determine whether the variables have a linear relationship, a nonlinear relationship, or no relationship. If there is a relationship, describe its strength.

If the relationship is linear, describe the direction as positive or negative.

a
[Scatterplot of y against x; x-axis ticks from 1 to 7, y-axis ticks from 5 to 55]
Worked Solution
Create a strategy

A relationship between two variables exists if the points follow a similar trend. The points will roughly form a line (linear) or a curve (nonlinear) if there is a relationship.

To describe the strength of the relationship, we can analyze how tightly the data points are clustered or grouped together.

Apply the idea

The y-values are decreasing at a slower and slower rate, causing the points to form a curve. This shows there is a nonlinear relationship between the variables.

Because the points are tightly clustered, the relationship is strong.

b
[Scatterplot of y against x; x-axis ticks from 1 to 9, y-axis ticks from 10 to 90]
Worked Solution
Create a strategy

A relationship between two variables exists if the points follow a similar trend. If there is no trend or no shape to the data, then there is no relationship between the variables.

Apply the idea

There is no trend in this data, meaning there is no relationship between the variables.

Reflect and check

We could try to sketch a line of fit for the data, like the one shown, but the points are far from the line. A negative, linear trend would suggest that y decreases as x increases, which we cannot conclude for this data set.

[Scatterplot of the same data with a line of fit drawn; x-axis ticks from 1 to 9, y-axis ticks from 10 to 90]
c
[Scatterplot of y against x; x-axis ticks from 2 to 18, y-axis ticks from 5 to 50]
Worked Solution
Create a strategy

First, we must determine if a relationship between the variables exists. If a relationship exists, we can describe the strength by analyzing how tightly the data points are clustered.

If the relationship between the variables is linear, the direction of the relationship can be described as positive or negative.

  • Positive relationship: as the independent variable increases, the dependent variable increases

  • Negative relationship: as the independent variable increases, the dependent variable decreases

Apply the idea

As the x-values increase, the y-values also increase. This indicates there is a positive, linear relationship between the variables.

However, the points are not tightly clustered, so the relationship between the variables is moderate.

Example 2

Justin recently had surgery for a torn muscle in his leg. He is taking medication for the pain as well as attending regular physical therapy sessions. He learns that not everyone's insurance plan covers physical therapy.

a

Justin wants to investigate whether the post-surgery pain from a torn muscle lasts longer for patients who only take medication compared to those who can attend physical therapy sessions. Which question should he use for his investigation?

A
What are the pain levels of patients that have had surgery for torn muscles?
B
What type of medication do doctors prescribe for pain management in surgery patients?
C
How does the pain level change over time in patients using physical therapy and medication compared to those using medication only?
D
What percentage of people have insurance plans that cover physical therapy costs?
Worked Solution
Create a strategy

Consider the factors that Justin is interested in investigating and whether the questions would lead to data that addresses each of the factors.

Apply the idea

Option A: The answer to this question would only focus on the pain levels of the patient. Since Justin wants to know how long the pain lasts as well, this question would not be suitable for his investigation.

Option B: The answer to this question would lead to categorical data (types of medication), which Justin is not interested in. This question would not be suitable for his investigation.

Option C: This question considers multiple factors: the pain level of patients, the length of time that they feel pain, whether patients take medication only, or whether they take medication and attend physical therapy sessions. This accounts for all the factors that Justin is interested in for his investigation.

Option D: Similar to Option A, this question focuses on how many people have insurance plans that cover physical therapy, rather than considering how long pain lasts or how much pain the patient is in. This question would not be suitable for his investigation.

Justin should use the question in Option C.

Reflect and check

Remember that Justin's statistical question is different from the survey questions that he would use to collect the data. Possible survey questions might be:

  • Do you take medication for your pain?

  • Do you attend physical therapy sessions?

  • On a scale of 1–10, how painful is it to move your repaired muscle?

b

Describe the data that would need to be collected to answer Justin's statistical question.

Worked Solution
Create a strategy

Use the statistical question from part (a) to identify the variables and/or categories of interest.

Apply the idea

The bivariate, numerical data that needs to be collected is:

  • The pain levels in patients that had surgery for a torn muscle (measured on a numerical scale like 1-10)

  • The time (measured in days or weeks) since the surgery

The data should be separated into two categories:

  • Patients that take medication only

  • Patients that take medication and attend physical therapy sessions

Reflect and check

Since the data is bivariate and numerical, it can be represented by a scatterplot. The independent variable is time, and the dependent variable is the patients' pain level. An example scatterplot is shown:

A scatterplot titled 'Pain level changes over time'. The horizontal axis shows the time in weeks from 0 to 8, and the vertical axis shows the pain level from 0 to 8. The key indicates that the blue dots represent group A (physical therapy and medication) and the green dots represent group B (medication only). The pain level decreases as time increases for both groups, but the decrease for group A occurs weekly while the decrease for group B occurs every 2 weeks.

Example 3

A surfing company is located in various coastal states across the U.S. When analyzing their data, they separate the store locations into two regions: the Western region and the Eastern region. The scatterplot shows data collected to answer the question, "How have the sales of our product changed over time in each of the sales regions?"

A scatterplot titled 'Product sales over time'. The horizontal axis shows time in months from 0 to 20, and the vertical axis shows sales in dollars from 0 to 720. The key indicates that the blue dots represent the Western region and the green dots represent the Eastern region. The blue dots show sales increasing as time increases, while the green dots show the opposite.
a

Identify the independent and dependent variables in this context.

Worked Solution
Create a strategy

Recall that the independent variable is not affected by the other variable, while the dependent variable may be affected or changed by the other variable.

On a scatterplot, the independent variable is placed on the horizontal axis, and the dependent variable is placed on the vertical axis.

Apply the idea

The independent variable is time (measured in months), and the dependent variable is the amount of sales (measured in dollars).

b

The owner of the company makes this conclusion: "The sales of the product are improving with time." Which sales region was the owner analyzing?

Worked Solution
Create a strategy

In part (a), we found that the amount of sales is the dependent variable, and time is the independent variable. This means the owner concluded that the dependent variable increases as the independent variable increases.

Apply the idea

According to the owner's statement, both variables are increasing, which indicates a positive relationship. Both sets of data values show a linear relationship, but only the blue dots show a positive relationship.

According to the key (or legend), the blue points represent data from the Western region.

Reflect and check

If the owner had been analyzing the Eastern region (the green points), the conclusion would have been, "The sales of the product are decreasing over time."

Example 4

Adria heard that children who learn to speak at a young age are more likely to be gifted and talented in later stages of life. She decides to investigate this using the data cycle.

a

Formulate a statistical question for Adria that would lead to the collection of data that can be represented in a scatterplot.

Worked Solution
Create a strategy

First, we need to identify the variables of interest. Then, we need to write a question such that the answer to the question addresses both variables.

Apply the idea

From the given information, we gather that Adria is interested in two variables:

  1. The age when a child first spoke

  2. Their intelligence level later in life

The information is not specific about the later stages of life. We can choose any stage of life after birth, such as the teenage years.

One possible statistical question is, "What is the relationship between the age at which a child first spoke and their level of intelligence as teenagers?"

Reflect and check

Other possible questions are:

  • How does the age at which a child first spoke influence their level of intelligence as adults?

  • If a child first spoke at 6 months old, what level of intelligence are they expected to have as a teenager?

  • Which range of ages for when a child first spoke corresponds to the highest levels of intelligence?

This could also be separated into multiple categories: age when a child first spoke versus intelligence level after middle school, after high school, and after university.

b

The table shows the ages of some teenagers when they first spoke and their results in an aptitude test:

Age when first spoke (months) | 14 | 27 |  9 |  16 | 21 | 17 | 10 |   7 | 19 | 24
Aptitude test results         | 96 | 69 | 93 | 101 | 87 | 92 | 99 | 104 | 93 | 97

Create a scatterplot to model the data.

Worked Solution
Create a strategy

Let x represent the age when the child first spoke and y represent the aptitude test results as a teenager.

The minimum value for x is 7 and the maximum is 27, so we can use a scale of 5 to label the x-axis. The minimum value for y is 69 and the maximum is 104, so we can use a scale of 20 to label the y-axis.
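The choice of scale can be checked with a short sketch that finds the minimum and maximum of each variable and lists tick marks for a chosen step. The data values are those in the table from part (b). The helper name `axis_ticks` and the decision to start each axis at 0 are assumptions made for illustration.

```python
import math

# Data from the table in part (b).
ages = [14, 27, 9, 16, 21, 17, 10, 7, 19, 24]
scores = [96, 69, 93, 101, 87, 92, 99, 104, 93, 97]

def axis_ticks(values, step):
    """Tick marks from 0 up to the first multiple of `step` at or above max(values)."""
    top = math.ceil(max(values) / step) * step
    return list(range(0, top + step, step))

x_range = (min(ages), max(ages))      # smallest and largest age
y_range = (min(scores), max(scores))  # smallest and largest score

x_ticks = axis_ticks(ages, 5)     # scale of 5 for the age axis
y_ticks = axis_ticks(scores, 20)  # scale of 20 for the score axis
```

With these steps, each axis needs only a handful of labels while still covering every data point.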

Apply the idea
[Scatterplot of the table data; horizontal axis 'Age (months)' with ticks from 5 to 25, vertical axis 'Aptitude score' with ticks from 20 to 100]
c

Draw a conclusion about the data by answering the statistical question from part (a).

Worked Solution
Create a strategy

To describe the relationship between the age at which a child first spoke and their level of intelligence as teenagers, we can analyze the following features of the data:

  • Form: linear or nonlinear

  • Strength: strong or weak

If the data follows a linear trend, we can describe the direction as positive or negative.

Apply the idea

The points are relatively close together, indicating a strong relationship. As the age increases, the aptitude score decreases slightly, indicating a negative, linear relationship.

There is a strong, negative, linear relationship between the age when a child first spoke and their aptitude test score as a teenager.

This suggests that as the age at which a child first spoke increases, their intelligence level as a teenager tends to decrease.
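The direction can also be checked numerically by fitting a line to the data from part (b) and looking at the sign of its slope. The sketch below uses a least-squares line, which is an assumption: the lesson does not specify a fitting method, and the function name `least_squares_line` is hypothetical.

```python
# Data from the table in part (b).
ages = [14, 27, 9, 16, 21, 17, 10, 7, 19, 24]
scores = [96, 69, 93, 101, 87, 92, 99, 104, 93, 97]

def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(ages, scores)

# A negative slope means the fitted line falls from left to right.
direction = "positive" if slope > 0 else "negative" if slope < 0 else "none"
```

For this sample the slope is roughly -1.04, so each additional month before first speaking is associated with a score about one point lower. This describes the trend in these ten data points only and does not show that speaking later causes lower scores.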

Reflect and check

The closer the points are to forming a line or curve, the stronger their relationship will be. A strong relationship between two quantities suggests that the value of one quantity can be predicted with some accuracy given the other quantity, but is not enough evidence to suggest that changes in one quantity directly cause changes in the other.

Idea summary

The analysis of bivariate data should include:

  • Form, usually described as a linear relationship or a nonlinear relationship

  • Strength, describing how closely the data points match the model line or curve

If the relationship between the variables is linear, the direction of the relationship can be described as positive or negative.

  • Positive relationship: as the independent variable increases, the dependent variable increases

  • Negative relationship: as the independent variable increases, the dependent variable decreases

Outcomes

A.ST.1

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on representing bivariate data in scatterplots and determining the curve of best fit using linear and quadratic functions.

A.ST.1a

Formulate investigative questions that require the collection or acquisition of bivariate data.

A.ST.1b

Determine what variables could be used to explain a given contextual problem or situation or answer investigative questions.

A.ST.1c

Determine an appropriate method to collect a representative sample, which could include a simple random sample, to answer an investigative question.

A.ST.1d

Given a table of ordered pairs or a scatter plot representing no more than 30 data points, use available technology to determine whether a linear or quadratic function would represent the relationship, and if so, determine the equation of the curve of best fit.

A.ST.1h

Analyze relationships between two quantitative variables revealed in a scatterplot.

A.ST.1i

Make conclusions based on the analysis of a set of bivariate data and communicate the results.
