👩🏫 2.2 Displaying and Summarizing a Numeric Variable #1 (Tabbed)
Displaying and Summarizing a Numeric Variable
In this section, we'll explore how to display a numeric variable, which provides a measurement about our observational units. This section is a bit long, so I've split it into two parts. This content should take about 🐢⌚ 40 minutes if you work through all of it (read, watch at 1x speed, attempt problems, then watch/read feedback). Expand the tabs below to explore.
Representing a Numeric Variable with Graphs
⚠ Caution ⚠
Dotplots and stemplots become exceedingly difficult to read as the number of observations increases. This gives us reason for our most useful (IMO) of all displays for a numeric variable: the histogram.
There are many displays for numeric variables. In this section, we’ll explore dotplots, stemplots, and histograms. In the next section, we’ll explore the final data visualization for a single numeric variable in our course, boxplots.
Dotplots
A dotplot represents each observed data value for a numeric variable as a dot along a horizontal axis. The dotplot below shows the number of states visited by students in my Math 119 classes in Spring 2022. What are some interesting things you notice about this graph? Where would you fall?
It's likely you got everything you needed just looking at this graph, but if you want to watch me talk about it, especially elaborating on the advantages and disadvantages of using a dotplot, you can watch the video below.
Video on Dotplots
Watch the first two minutes of this video, if you'd like a brief discussion of the display above.
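If you'd like to see the idea behind a dotplot in code, here is a minimal sketch. It assumes whole-number data and prints the axis down the page instead of across it, purely to keep the text output simple. The function name `text_dotplot` and the values in `states_visited` are made up for illustration; they are not the class data shown in the graph.

```python
from collections import Counter

def text_dotplot(values):
    """Return a text dotplot: one row per integer on the axis, one dot per observation."""
    counts = Counter(values)
    rows = []
    # Walk every integer from the minimum to the maximum so gaps in the data stay visible.
    for x in range(min(values), max(values) + 1):
        rows.append(f"{x:>3} | " + "." * counts[x])
    return "\n".join(rows)

# Hypothetical values, NOT the Spring 2022 class data.
states_visited = [1, 2, 2, 3, 3, 3, 5, 8]
print(text_dotplot(states_visited))
```

Notice that the rows for 4, 6, and 7 still appear even though nobody reported those values. Keeping the empty spots on the axis is what lets a dotplot show gaps and clusters.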
Understanding Check
Below, you'll find a dotplot of weekly exercise hours for my Fall 2021 statistics students. Fill in the blanks to answer questions based on the dotplot.
Stemplots
A stemplot (or stem-and-leaf plot) organizes a numeric variable similarly to a dotplot, but can allow for a better look at the distribution of the variable with the way data are ‘binned’ by their stems. The same data with number of states visited is displayed in a stemplot to the right. What does the “2 : 8” represent? What advantages or disadvantages do you see this having over the dotplot?
It's likely you got everything you needed just looking at this graph, but if you want to watch me talk about it, especially elaborating on the advantages and disadvantages of stemplots, you can watch the video below.
Video on Stemplots
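For those who like to see the mechanics, here is a rough sketch of how a basic stemplot can be built. It assumes non-negative whole numbers, with the tens digit(s) as the stem and the ones digit as the leaf (a leaf unit of 1). The function name `stemplot` and the sample values are hypothetical, not the class data.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: stem is the value without its ones digit, leaf is the ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    low, high = min(leaves), max(leaves)
    rows = []
    # List every stem from lowest to highest, even empty ones, so gaps stay visible.
    for stem in range(low, high + 1):
        rows.append(f"{stem} : " + "".join(str(d) for d in leaves.get(stem, [])))
    return rows

# Hypothetical values, NOT the class data.
for row in stemplot([3, 7, 12, 15, 15, 21, 28, 40]):
    print(row)
```

Because the leaves are sorted within each stem, you can read individual values straight off the display, which a histogram does not allow.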
Understanding Check #1
You've stumbled upon this untitled stemplot as you wandered around a seedy part of town that has been known to be frequented by data viz folks. Based on the stemplot key given, what is the actual value of the unknown variable represented by the highlighted 6 in the fourth row?
Decimal point is at the colon.
Leaf unit = 0.1
0 : 004
0 : 59
1 : 0223444
1 : 555556677788888999
2 : 000001112233333344444
2 : 55566667777799999
3 : 000000001112223334
3 : 55666788889999
Understanding Check #2
This question is from an old AP Statistics exam that is no longer secure. The back-to-back stem-and-leaf plot below gives the percentage of students who dropped out of school at each of the 49 high schools in a large metropolitan school district.
Histograms
A histogram is essentially a bar chart for numeric data. We end up categorizing our numeric variable into bins or ranges of values. Similar to a bar chart, the x-axis represents the values the collected variable can take - only now, they are numeric. The y-axis (height of each bar) still represents the frequency or relative frequency for the interval of data values represented in that bin. Concepts in Statistics has a great visual of how to move from a dot plot to a histogram.
In the histogram below, the values along the x-axis represent actual scores on the first exam for students in my Spring 2018 Math 119 classes. The height of each bar shows the number of students who scored in the range represented by where that bar is positioned. For instance, the arrow on the histogram is pointing at the bar for scores between 70 and 80 on the exam, and the height of the bar at 18 means that 18 students scored between 70 and 80 on Exam 1. Just like in our bar charts in Section 1, the shape of our histogram will not change whether we use frequencies, relative frequencies, or percentages for our y-axis; only the label there will!
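To see why the shape stays the same, here is a small sketch. The only number taken from the histogram above is the 18 for the 70 to 80 bar; the other bar heights in `counts` are hypothetical, and `bar_lengths` is a helper I made up to scale the bars for a text display.

```python
# Bar heights for five bins. Only the 18 (scores from 70 to 80) comes from the
# histogram described above; the other four counts are HYPOTHETICAL.
counts = [4, 9, 18, 22, 12]

total = sum(counts)
relative = [c / total for c in counts]   # relative frequencies (sum to 1)
percent = [100 * r for r in relative]    # percentages (sum to 100)

def bar_lengths(heights, max_len=22):
    """Scale heights so the tallest bar is max_len characters long."""
    tallest = max(heights)
    return [round(max_len * h / tallest) for h in heights]

# All three versions of the y-axis give bars of identical length.
for label, heights in [("frequency", counts), ("relative", relative), ("percent", percent)]:
    print(label)
    for n in bar_lengths(heights):
        print("  " + "#" * n)
```

Dividing every count by the same total (or multiplying by 100) rescales the y-axis without changing how tall the bars are relative to one another.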
⚠ Caution ⚠
Reasonable bin sizes are crucial to see the distribution of the numeric variable. To the right, see what the histogram would have looked like with the bin width set to 1.
While a histogram is similar to a bar chart, a bar chart loses the spacing that is part of a measurement, really distorting what is happening with our numeric variable.
A few notes here:
- Most software includes the lower boundary of a bin in that class. So a student who scored a 70 on the exam would be in our bar with the arrow while a student who scored an 80 would be in the next bar.
- Most software has default bin sizes that are useful for looking at the distribution. Occasionally, we may choose our own bins to make the data easier to interpret, as I did above by breaking the exam scores into bins that match typical letter grades, but the default is usually a good starting point.
- Some software has spaces between bars. This is an abomination, especially if the data is continuous.
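Here is a minimal sketch of the "lower boundary is included" convention from the first note. The function name `bin_for` and the starting point of 50 are assumptions for illustration; your software may choose different bin edges, and some tools treat the very last bin differently.

```python
import math

def bin_for(value, start, width):
    """Return (low, high) for the bin that contains value.

    Each bin includes its lower boundary and excludes its upper boundary.
    """
    index = math.floor((value - start) / width)
    low = start + index * width
    return (low, low + width)

# Assumed bins of width 10 starting at 50 (hypothetical starting point).
print(bin_for(70, start=50, width=10))    # a score of exactly 70
print(bin_for(79.5, start=50, width=10))  # still below 80
print(bin_for(80, start=50, width=10))    # a score of exactly 80 moves to the next bin
```

A score of exactly 70 lands in the 70 to 80 bar, while a score of exactly 80 lands in the 80 to 90 bar, matching the note above.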
Understanding Check
Here is a histogram of the fiber content in 25 child cereals. (Assume left-hand endpoints are included in each bin.)
Use the histogram above to determine if the two statements below are true or false.
Want more than the feedback provided above? Expand the solution below for a video walkthrough with more discussion on how to break down histograms.
💡 Understanding Check Solution
Example
A meteorological station in Hawaii has gathered the following average daily wind speeds (in mph) over 43 days and displayed them in the histogram to the right.
What percentage of days had speeds between 30 and 50 mph?
What percentage of days had speeds above 60 mph or below 20 mph?
Consider these answers and then see if you got them right in the video below.
Example Video Walkthrough
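The arithmetic behind questions like these is always the same: add up the heights of the bars you care about and divide by the total number of observations. Here is a sketch of that calculation. The bar heights in `wind_counts` are HYPOTHETICAL (I only made sure they add to 43 days), so the printed percentages are not the answers to the example above; read the real heights off the histogram.

```python
def percent_of_total(bin_counts, selected_bins):
    """Percentage of all observations that fall in the selected bins."""
    total = sum(bin_counts.values())
    selected = sum(bin_counts[b] for b in selected_bins)
    return 100 * selected / total

# HYPOTHETICAL bar heights, keyed by each bin's lower edge (assumed bin width of 10 mph).
# These are NOT the heights from the wind speed histogram; they only sum to 43.
wind_counts = {0: 2, 10: 5, 20: 9, 30: 12, 40: 8, 50: 4, 60: 3}

# Bins 30-40 and 40-50 cover speeds between 30 and 50 mph.
between_30_and_50 = percent_of_total(wind_counts, [30, 40])
# Bins 0-10, 10-20, and 60-70 cover speeds below 20 mph or at least 60 mph.
outside_20_to_60 = percent_of_total(wind_counts, [0, 10, 60])

print(round(between_30_and_50, 1))
print(round(outside_20_to_60, 1))
```

With a histogram, we can only answer questions whose cutoffs fall on bin boundaries, since the individual values inside each bar are hidden.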
Understanding Check
The following histogram shows the amount spent on a haircut for 103 Math 119 students in my classes this semester.
More on Bin Size
We used the same set of data to construct these three histograms of student scores. Are you surprised by how different the distribution looks in each histogram?
The histogram on the left has a bin width of 20. The first bin starts at 40. To create the middle histogram, we changed the bin width to 10 but kept the first bin starting at 40. To create the last histogram, we kept the bin width at 10 but started the first bin at 45.
These changes affect our description of the shape, center, and spread of this set of data. For example, in the histogram on the left, the distribution looks symmetric with a central peak. In the histogram on the right, the distribution looks slightly skewed to the right. Based on the middle histogram, we might estimate that most students scored between 70 and 80. But the histogram on the right suggests that typical students scored between 65 and 75.
Why does changing the bin size and the starting point of the first bin change the histogram so drastically?
When we change the bins, the data gets grouped differently. The different grouping affects the appearance of the histogram.
To illustrate this point, we highlighted the five students who scored in the 70s in each histogram.
- In the histogram on the left, these five students are grouped in the middle bin with other students who scored between 60 and 80.
- In the histogram in the middle, these five students form a bin of their own, since no other students scored between 70 and 80.
- In the histogram on the right, these five students are in separate bins.
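The regrouping can be sketched in a few lines of code. The scores below are hypothetical (they are not the data behind the three histograms), but they include five students in the 70s so you can follow them through the same three binning choices. The function name `histogram_counts` is also just for illustration.

```python
from collections import Counter

def histogram_counts(values, start, width):
    """Map each non-empty bin's lower edge to its count; bins are [low, low + width)."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))

# HYPOTHETICAL scores; the five students in the 70s are 71, 72, 74, 76, and 78.
scores = [52, 58, 63, 66, 68, 71, 72, 74, 76, 78, 83, 85, 91]

print(histogram_counts(scores, start=40, width=20))  # wide bins starting at 40
print(histogram_counts(scores, start=40, width=10))  # narrower bins starting at 40
print(histogram_counts(scores, start=45, width=10))  # same width, shifted start
```

With width 20, the five students share the 60 to 80 bin with three others. With width 10 starting at 40, they have the 70 to 80 bin to themselves. Shifting the start to 45 splits them between the 65 to 75 and 75 to 85 bins.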
Which histogram gives the most helpful summary of the distribution?
For this situation, the middle histogram is probably the most useful summary because the intervals correspond to letter grades.
Our general advice is as follows:
- Avoid histograms with large bin widths that group data into only a few bins. A histogram constructed with large bin widths will show the distribution as a “skyscraper.” This does not give good information about variability in the distribution.
- Avoid histograms with small bin widths that group data into lots of bins. A histogram constructed with small bin widths will show the distribution as a “pancake.” This does not help us see the pattern in the data.
Use the simulation below to answer the questions in the next Understanding Check.
If the simulation doesn't load, you can open the simulation in its own window.
Understanding Check
Use the simulation above to answer these questions. Investigate the changes in the histogram by changing the bin width. In the simulation you can change the bin width in two ways: (1) move the slider or (2) choose a value for the bin width. To see the dotplot, click “show individual grades.”
No Bar Charts Allowed
Below, you can see a bar chart made using the data for the number of states visited in my Spring 2022 classes. Why is this a problematic display compared to a dotplot, stemplot, or histogram?
Did you see how everything is evenly spaced until 21? And there isn't a visible jump between 21 and 28 because bar charts are for categories - not numbers! We are also limiting our bin size to 1 by making a bar chart, which means we may miss out on some helpful information on the distribution of the variable.
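Here is a quick sketch of what goes wrong. A bar chart places each distinct value in the next slot, no matter how far apart the values are, while a numeric axis keeps the true distance. The values 21 and 28 come from the display above; the rest of `observed` is hypothetical, and `axis_gaps` is a helper I made up for illustration.

```python
def axis_gaps(distinct_values):
    """Compare the spacing between neighboring values on a numeric axis vs. a bar chart."""
    values = sorted(distinct_values)
    # On a numeric axis, the gap is the actual difference between neighbors.
    numeric = [b - a for a, b in zip(values, values[1:])]
    # On a bar chart, every category sits exactly one slot from the next.
    categorical = [1] * len(numeric)
    return numeric, categorical

# 21 and 28 are mentioned above; the other values are HYPOTHETICAL.
observed = [18, 19, 20, 21, 28]
numeric, categorical = axis_gaps(observed)
print(numeric)      # true distances between neighboring values
print(categorical)  # distances as drawn by a bar chart
```

The jump of 7 between 21 and 28 shrinks to a single slot on the bar chart, which is exactly the information about spacing that a dotplot or histogram preserves.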
Some content and questions can be found in Lumen Learning's Concepts in Statistics. The original copyright is provided by: Open Learning Initiative. Located at: http://oli.cmu.edu License: CC BY: Attribution
Before you click "Next," please read through all of the tabbed pages.