Welcome back to class! In this session, we are diving deep into Chapter 2: Data, Reality, and Problem Solving, specifically focusing on Sections 2.2 and 2.3. Before we can perform any heavy statistical lifting, we must first understand the nature of the data we are holding. Just as a chef needs to know the difference between salt and sugar, a statistician must know the difference between nominal and ratio data.
2.2 Data Classification
The first step in any analysis is asking yourself: Is this data credible? Once we establish credibility, we categorize the data into two primary types:
- Qualitative Data (Categorical): This type of data describes attributes or characteristics. It places an individual into a category or group, such as eye color, gender, or zip codes.
- Quantitative Data (Numerical): This type of data measures or counts a specific attribute. It results in numerical values where arithmetic operations (like averaging) make sense.
Within Quantitative data, we have a crucial subdivision:
- Discrete: Data restricted to a set of values, usually integers, with gaps in between (e.g., the number of eggs in a carton).
- Continuous: Data that can take on any value within an interval (e.g., temperature, height, or weight).
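To make the classification above concrete, here is a minimal Python sketch. The sample values and variable names are made up for illustration and do not come from the textbook. It shows that counting makes sense for qualitative data (zip codes are labels even though they look like numbers), while averaging makes sense for quantitative data, whether discrete or continuous.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample values, made up for illustration only.
zip_codes = ["10001", "94105", "10001", "60614"]  # qualitative: labels that happen to use digits
eggs_per_carton = [12, 6, 12, 18]                 # quantitative, discrete: counts with gaps between values
heights_cm = [162.5, 170.2, 158.9, 175.0]         # quantitative, continuous: any value in an interval

# Qualitative data: counting how often each category occurs is meaningful.
zip_counts = Counter(zip_codes)
most_common_zip = zip_counts.most_common(1)[0][0]

# Quantitative data: arithmetic such as averaging is meaningful.
mean_eggs = mean(eggs_per_carton)
mean_height = mean(heights_cm)

print("Most common zip code:", most_common_zip)
print("Mean eggs per carton:", mean_eggs)
print("Mean height (cm):", mean_height)
```

Storing the zip codes as strings is a deliberate choice in this sketch: it prevents anyone from accidentally averaging them, which would produce a number with no meaning.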
The Four Levels of Measurement
One of the most important concepts from these sections is the hierarchy of measurement. As we move down this list, each level keeps the properties of the one before it and supports more detailed mathematical analysis:
- Nominal: Data that consists of names, labels, or categories only. The data cannot be arranged in an ordering scheme (e.g., survey responses of "Red, Blue, Green").
- Ordinal: Data that can be arranged in order, but differences between data entries are not meaningful (e.g., a "Very Good, Good, Bad" rating scale or NYT Book rankings).
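- Interval: Quantitative data where order matters and differences are meaningful, but there is no natural zero. A classic example is temperature in Fahrenheit: $0^\circ F$ does not mean "no temperature," so $40^\circ F$ is not "twice as hot" as $20^\circ F$, even though the $20^\circ$ difference between them is meaningful.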
- Ratio: Similar to interval data, but with a natural zero point (where zero means "none"). Ratios are meaningful here. For example, if you have $\$40$ and your friend has $\$20$, you have exactly twice as much money ($40/20 = 2$).
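The interval versus ratio distinction can be checked with a little arithmetic. The sketch below is illustrative only, and the helper name `fahrenheit_to_kelvin` is our own. It converts the two Fahrenheit readings to Kelvin, a temperature scale that does have a natural zero, and compares the ratios.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert Fahrenheit to Kelvin, a scale whose zero is absolute zero."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15

# Interval level: dividing raw Fahrenheit readings gives 2.0,
# but that number depends on where the scale happens to put its zero.
naive_ratio = 40.0 / 20.0

# The same two temperatures on a scale with a natural zero.
kelvin_ratio = fahrenheit_to_kelvin(40.0) / fahrenheit_to_kelvin(20.0)

# Ratio level: $0 means "no money", so this ratio is meaningful.
money_ratio = 40 / 20

print("Fahrenheit ratio (not meaningful):", naive_ratio)
print("Kelvin ratio:", round(kelvin_ratio, 3))
print("Money ratio (meaningful):", money_ratio)
```

On the Kelvin scale the warmer reading is only about 4% higher than the cooler one, nowhere near "twice as hot." The money ratio needs no such correction because dollars already start from a true zero.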
2.3 Time Series vs. Cross-Sectional Data
Finally, we look at when the data was collected. This distinction changes how we visualize the data.
Time Series Data originates from measurements taken from some process over equally spaced intervals of time. If you are looking at a line graph showing the global average surface temperature from 1880 to 2020, you are looking at time series data.
Cross-Sectional Data are measurements taken at approximately the same point in time. For example, a table listing the Life Expectancy at Birth for 12 different countries in the year 2015 is a snapshot, or cross-section, of that specific moment in time.
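One way to see the difference is to slice the same table two ways. The sketch below uses made-up records (the countries "A", "B", "C" and all values are placeholders, not real statistics) and two hypothetical helper functions: one follows a single subject across years, the other looks at many subjects in a single year.

```python
# Hypothetical records; every value here is a placeholder for illustration.
records = [
    {"country": "A", "year": 2014, "value": 70.0},
    {"country": "A", "year": 2015, "value": 70.5},
    {"country": "A", "year": 2016, "value": 71.0},
    {"country": "B", "year": 2015, "value": 80.0},
    {"country": "C", "year": 2015, "value": 65.0},
]

def time_series(rows, country):
    """One subject followed over time, returned as (year, value) pairs sorted by year."""
    selected = [r for r in rows if r["country"] == country]
    return [(r["year"], r["value"]) for r in sorted(selected, key=lambda r: r["year"])]

def cross_section(rows, year):
    """Many subjects observed in the same period, returned as {country: value}."""
    return {r["country"]: r["value"] for r in rows if r["year"] == year}

series_a = time_series(records, "A")
snapshot_2015 = cross_section(records, 2015)

print("Time series for country A:", series_a)
print("Cross-section for 2015:", snapshot_2015)
```

The time series keeps its ordering by year, which is why a line graph suits it. The cross-section has no time ordering at all, so a bar chart or table is usually the more natural display.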
Variables in Experiments
When designing experiments, we must also identify our variables carefully to avoid skewing results:
- Explanatory Variable: The variable that is thought to cause or explain changes in the response variable (the input).
- Response Variable: The outcome of interest (the output).
- Confounding Variables: "Extra" variables not accounted for during experimentation that can skew results.
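To see how a confounding variable can skew results, consider this small sketch. All of the numbers are invented for illustration, and the `pearson` helper is our own hand-written correlation function. The response is built only from the confounder, never from the explanatory variable, yet the two still appear strongly related because both follow the confounder.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed directly from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical, hand-picked values for illustration only.
confounder = [1, 2, 3, 4, 5, 6]

# The explanatory variable tracks the confounder, plus a small wobble.
explanatory = [2 * z + d for z, d in zip(confounder, [0.5, -0.5, 0.5, -0.5, 0.5, -0.5])]

# The response is computed from the confounder alone; it never uses `explanatory`.
response = [3 * z + d for z, d in zip(confounder, [-1, 1, 1, -1, -1, 1])]

r = pearson(explanatory, response)
print("Correlation between explanatory and response:", round(r, 3))
```

The correlation comes out above 0.9 even though the explanatory variable has no effect on the response in this setup. That is the danger: without accounting for the confounder, we would be tempted to credit the wrong variable.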
Make sure to review the class videos below to see us work through specific examples, such as determining the data type for "The heights of 48 randomly selected female students" versus "The rankings of books on a best-sellers list."