Statistics Explained Simply: An NBA-Themed Introduction

An Introduction to Statistics: No Maths Required

Hey team! Justin Zeltzer from zstatistics.com is tackling a challenge: explaining statistics in under half an hour, without the math! This article summarizes the key concepts from his video, making it perfect for those starting a statistics course or anyone curious about the subject. To keep things interesting, all examples are NBA-themed, focusing on Justin's newfound obsession with American basketball.

Types of Data

When dealing with statistics, you'll encounter two main types of data: categorical data and numerical data.

Categorical Data

Categorical data is further divided into:

  • Nominal Data: There is no order to the categories. For example, what team does Steph Curry play for? (Atlanta Hawks, Boston Celtics, Golden State Warriors, etc.). The order of the teams doesn't matter.
  • Ordinal Data: There is a defined order to the categories. For example, what position does Steph Curry play? (Guard, Forward, Center). There's a general order based on court position and height.

Numerical Data

Numerical data is divided into:

  • Discrete Data: Data that can only take specific, separate values. For example, how many free throws has Steph missed tonight? He can only miss a whole number of free throws (0, 1, 2, 3, etc.); he can't miss 1.5.
  • Continuous Data: Data that can take any value within a range. For example, what's Steph's height? While often recorded as 191 centimeters, his exact height could be 191.3, 191.217, and so on.

Proportions and Percentages

Even though a percentage seems numerical, it's a special case. Consider Steph's three-point percentage this season. Each three-point attempt is nominal data (made or missed). The percentage is a numerical summary of that nominal data.
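As a minimal sketch of that idea, the snippet below builds a list of nominal outcomes and summarizes it as a proportion. The counts (61 makes out of 128 attempts) are the season figures quoted later in this article; the variable names and the ordering of the list are assumptions made purely for illustration.

```python
# Each attempt is a nominal observation: "made" or "missed".
# 61 makes out of 128 attempts are the figures quoted in this article;
# the order of the list is arbitrary, only the counts matter.
attempts = ["made"] * 61 + ["missed"] * 67

# The proportion is a numerical summary of those categorical outcomes.
three_point_pct = attempts.count("made") / len(attempts)
print(f"{three_point_pct:.4f}")  # 0.4766
```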

Distributions

A distribution shows how data is spread out. Consider the heights of NBA players. The shortest player is 5'9", and the tallest is 7'3". A probability density function describes the distribution of heights between those extremes. It shows how likely you are to select a player at or near a given height at random.

Common Distributions

  • Normal Distribution (Bell Curve): The most common distribution, where the bulk of the data is clustered in the middle, with fewer values at the extremes.
  • Uniform Distribution: Every value has an equal probability of occurring.
  • Bimodal Distribution: Two peaks, indicating two common values.
  • Skewed Distribution: Data is bunched up on one side, with a long tail extending in the other direction (left-skewed or right-skewed).

Sampling Distributions

What happens if you take a sample of players (e.g., 10 players) and calculate their average height? The distribution of these average heights is a sampling distribution.

The sampling distribution will have the same mean as the underlying distribution, but it will be narrower. This is because extreme average heights are less likely when you're averaging multiple players.
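A quick simulation can make this concrete. The sketch below uses an invented population of heights (the 450 players, the mean of 200 cm, and the standard deviation of 9 cm are all assumptions, not real NBA data), repeatedly draws samples of 10, and compares the spread of the sample averages with the spread of the population.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# HYPOTHETICAL population: 450 heights in cm drawn from a normal
# distribution with made-up parameters (mean 200, standard deviation 9).
population = [random.gauss(200, 9) for _ in range(450)]

# Draw many samples of 10 players and record each sample's average height.
sample_means = [
    statistics.mean(random.sample(population, 10))
    for _ in range(5000)
]

print("population mean:        ", round(statistics.mean(population), 1))
print("mean of sample means:   ", round(statistics.mean(sample_means), 1))
print("population spread (sd): ", round(statistics.pstdev(population), 1))
print("sample-mean spread (sd):", round(statistics.pstdev(sample_means), 1))
```

The two means should land close together, while the spread of the sample means should be roughly a third of the population spread, since averaging 10 players shrinks it by about the square root of 10.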

This is crucial in statistics! Every study starts with a sample, and inferences are made based on that sample. Understanding the sampling distribution is essential.

Sampling and Estimation

How good is Steph Curry at three-pointers? He's made 61 out of 128 attempts this season (0.4766). This is a sample statistic. But what's his *true* three-point percentage, the percentage he will perform at in the long run? This is the parameter theta (θ), a value we can never truly know.

Our sample statistic (0.4766) is an estimate for theta. There's uncertainty around this estimate. Statistics helps quantify this uncertainty using tools like confidence intervals. A 95% confidence interval provides a range within which we are 95% confident that the true value of theta lies.
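The article doesn't say which formula the video uses for the interval. As one common option, here is a sketch of the normal-approximation (Wald) interval for a proportion, applied to Steph's 61 makes from 128 attempts. The function name `wald_interval` and the choice of method are assumptions for illustration.

```python
import math

def wald_interval(successes, attempts, z=1.96):
    """Normal-approximation interval for a proportion (z = 1.96 gives about 95%)."""
    p_hat = successes / attempts
    standard_error = math.sqrt(p_hat * (1 - p_hat) / attempts)
    return p_hat - z * standard_error, p_hat + z * standard_error

low, high = wald_interval(61, 128)
print(f"estimate: {61 / 128:.4f}, 95% CI: ({low:.3f}, {high:.3f})")
# estimate: 0.4766, 95% CI: (0.390, 0.563)
```

Under this approximation, the data are consistent with a long-run percentage anywhere from about 39% to about 56%.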

Comparing Players: Steph Curry vs. Meyers Leonard

Meyers Leonard has a higher three-point percentage (0.6) than Steph Curry this season, but he's only taken 15 shots. While his best estimate of theta is higher than Steph's, his confidence interval will be much wider because we have less information about his long-term shooting ability.
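To see how much wider, the sketch below compares the half-width (margin of error) of an approximate 95% interval for each player. It assumes the normal approximation for a proportion, which is rough for a sample as small as 15 shots, and the helper name `margin_of_error` is hypothetical.

```python
import math

def margin_of_error(made, attempts, z=1.96):
    """Half-width of a normal-approximation interval for a proportion."""
    p_hat = made / attempts
    return z * math.sqrt(p_hat * (1 - p_hat) / attempts)

# (made, attempts) as quoted in this article.
players = {"Curry": (61, 128), "Leonard": (9, 15)}

for name, (made, attempts) in players.items():
    p_hat = made / attempts
    moe = margin_of_error(made, attempts)
    print(f"{name}: {p_hat:.3f} +/- {moe:.3f}")
# Curry: 0.477 +/- 0.087
# Leonard: 0.600 +/- 0.248
```

Leonard's interval is nearly three times as wide, so his higher percentage tells us far less about his long-run ability.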

Parameters vs. Sample Statistics

Parameters are the unknown, fixed values we're trying to estimate. They're represented by Greek letters:

  • μ (mu): Mean of a numerical variable.
  • σ (sigma): Standard deviation of a numerical variable.
  • π (pi) or θ (theta): Proportion of a categorical variable.
  • ρ (rho): Correlation between two variables.
  • β (beta): Gradient between two variables (used in regression).

Sample statistics are calculated from our sample data and are used to estimate parameters. They're represented by lowercase Roman letters:

  • x̄ (x-bar): Sample mean.
  • s: Sample standard deviation.
  • p: Sample proportion.
  • r: Sample correlation.
  • b: Sample gradient.
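As a rough illustration of how these sample statistics are computed, the sketch below uses a tiny invented dataset. The heights, weights, and shot outcomes are hypothetical values made up for this example, not real player data, and the variable names simply mirror the symbols in the list above.

```python
import math
import statistics

# HYPOTHETICAL sample, invented for illustration only (not real player data).
heights_cm = [188, 193, 198, 203, 211]
weights_kg = [84, 88, 95, 102, 111]
shots = ["made", "missed", "made", "made", "missed"]

x_bar = statistics.mean(heights_cm)        # sample mean
s = statistics.stdev(heights_cm)           # sample standard deviation (n - 1 divisor)
p = shots.count("made") / len(shots)       # sample proportion

# Sums of squares and cross-products for height (x) and weight (y).
y_bar = statistics.mean(weights_kg)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights_cm, weights_kg))
sxx = sum((x - x_bar) ** 2 for x in heights_cm)
syy = sum((y - y_bar) ** 2 for y in weights_kg)

r = sxy / math.sqrt(sxx * syy)             # sample correlation
b = sxy / sxx                              # sample gradient (kg per cm)

print(f"x_bar={x_bar:.1f}, s={s:.2f}, p={p:.2f}, r={r:.3f}, b={b:.3f}")
```

Each value is an estimate of the matching Greek-letter parameter: x̄ for μ, s for σ, p for π (or θ), r for ρ, and b for β.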

Hypothesis Testing

Hypothesis testing assesses whether there's enough evidence to support a claim. Example: Is there enough evidence to suggest that Meyers Leonard is shooting above 50%? (His sample statistic is 0.6).

We start with a null hypothesis (H₀): Meyers Leonard's long-term three-point percentage is less than or equal to 50% (θ ≤ 0.5). Statisticians are conservative; we assume the reverse of what we're trying to show is true unless there's strong evidence to reject it.

The alternate hypothesis (H₁): Meyers Leonard's long-term three-point percentage is greater than 50% (θ > 0.5). This is what we're seeking evidence for.

Probability Distribution and Rejection Region

We create a probability distribution assuming the null hypothesis is true, using its boundary value (θ = 0.5). This shows the likelihood of different outcomes (number of three-pointers made out of 15).

A rejection region is a range of values that are considered too extreme to be consistent with the null hypothesis. It's often set at 5% of the distribution (the level of significance). If our sample statistic falls within the rejection region, we reject the null hypothesis.

In Meyers Leonard's case (9 out of 15), his result is not extreme enough to reject the null hypothesis. Even though his sample is above 50%, it's not significantly above 50%.
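The rejection region for this example can be worked out directly. The sketch below assumes each of the 15 shots is an independent attempt with a 50% chance of going in under the null hypothesis (a binomial model, which the article implies but does not name) and uses a one-sided 5% level of significance.

```python
from math import comb

n, p_null, alpha = 15, 0.5, 0.05

def upper_tail(k):
    """P(X >= k) when X ~ Binomial(n, p_null)."""
    return sum(
        comb(n, i) * p_null**i * (1 - p_null) ** (n - i)
        for i in range(k, n + 1)
    )

# Smallest number of makes whose upper-tail probability is at most alpha.
critical_value = next(k for k in range(n + 1) if upper_tail(k) <= alpha)

observed = 9
print("reject H0 if makes >=", critical_value)                      # 12
print("observed in rejection region:", observed >= critical_value)  # False
```

Under these assumptions, Leonard would need at least 12 makes out of 15 before the result counts as significantly above 50%.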

Important Notes about Hypothesis Testing

  • We never *prove* anything. We can only *infer* based on the evidence.
  • We never *accept* the null hypothesis. We *fail to reject* it. This means we don't have enough evidence to reject it. It doesn't mean we believe it's true.

P-Values

A p-value measures how extreme the sample is. Hypothesis tests assess whether our sample is extreme, while the p-value measures *how* extreme it is.

The p-value is the probability of observing a sample statistic as extreme as, or more extreme than, the one we obtained, assuming the null hypothesis is true.

  • Small p-value: The sample is very extreme, providing evidence to reject the null hypothesis.
  • Large p-value: The sample is less extreme, and we're less likely to reject the null hypothesis.

If the p-value is less than the level of significance (e.g., 0.05), we reject the null hypothesis.
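For the Meyers Leonard example (9 makes out of 15, tested against θ = 0.5), the one-sided p-value can be computed as below. This is a sketch that assumes a binomial model with independent shots; the article itself does not state which calculation the video uses.

```python
from math import comb

n, observed, p_null = 15, 9, 0.5

# Probability of 9 or more makes out of 15 if the true percentage were 50%.
p_value = sum(
    comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
    for k in range(observed, n + 1)
)

print(round(p_value, 4))  # 0.3036
print(p_value < 0.05)     # False, so we fail to reject the null hypothesis
```

A result at least this good would happen about 30% of the time for a true 50% shooter, which is nowhere near extreme enough to reject the null hypothesis.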

P-Hacking: A Misuse of P-Values

P-hacking is a problematic practice where researchers manipulate data or analyses to obtain statistically significant results (p < 0.05) even when no true effect exists.

Good Research vs. Bad Research

  • Good Research: Theorize an effect, collect data, and test *only* that effect.
  • Bad Research (P-Hacking): Collect a large dataset with many variables, test numerous relationships, and selectively report only the statistically significant results. This inflates the chance of finding spurious associations.

When you test many different things, it becomes more likely that at least one of them will be statistically significant (p < 0.05) purely by chance. If you run 20 tests at the 5% level and no true effect exists in any of them, you'd still expect about one to come up significant by chance alone.
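The arithmetic behind that claim is short enough to check. The sketch below assumes 20 independent tests, each run at a 5% level of significance, with no true effect in any of them; independence is an assumption added here to keep the calculation simple.

```python
alpha = 0.05   # level of significance for each test
tests = 20     # number of independent tests, all with no true effect

# Expected number of tests that come up "significant" purely by chance.
expected_false_positives = tests * alpha

# Probability that at least one test comes up significant by chance.
prob_at_least_one = 1 - (1 - alpha) ** tests

print(round(expected_false_positives, 2))  # 1.0
print(round(prob_at_least_one, 2))         # 0.64
```

So even with nothing real to find, there is roughly a 64% chance of getting at least one "significant" result to report.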

P-hacking is a significant concern because it can lead to the publication of false or misleading findings, undermining the reliability of scientific research.

This article provides a basic introduction to statistical concepts, illustrated with NBA examples. For more in-depth discussions, visit zstatistics.com.