Final project in this class Spring 2025

Final Project: U.S. State Public-School Expenditures

Name: Issaiah Jennings

Course: LIS4317.001S25.15856 – Visual Analytics

Semester: Spring 2025

Software Used: RStudio

Step 1: Selecting the Dataset

For this project, I decided to work with the Anscombe U.S. State Public-School Expenditures dataset, which ships with R's carData package, so it was easy to access and start working with. I chose this dataset because it includes a solid mix of variables related to education spending and student outcomes across all 50 U.S. states. It’s the kind of data that can help answer real-world questions about how money might impact education.

Some of the main variables I focused on are:

Expenditure per student – This shows how much each state spends on average for every student enrolled in public school.
Graduation rate – This tells us what percentage of students are finishing high school in each state.
Average income of residents – This gives extra context about the economic situation in each state, which could also affect education outcomes.

I picked this dataset because it’s not too big or complicated, but it still gives enough information to do meaningful analysis. Since I’m interested in finding out if there's a relationship between school funding and graduation rates, this dataset fits perfectly. Plus, with data for all 50 states, it gives me a wide enough sample to build visuals, summaries, and basic statistical models that meet the project requirements.

Step 2: Defining the Research Question

After looking at the dataset, I formulated my research question:

"How does per-student expenditure relate to the graduation rates in U.S. public schools?"

I hypothesize that there is a positive correlation between these two variables. In other words, I expect that states with higher expenditures per student will tend to have higher graduation rates. This makes sense because more funding could improve educational resources, teacher quality, and student support, potentially leading to better outcomes.

Step 3: Data Preparation and Cleaning

Before jumping into any analysis, I first needed to make sure the data was clean and ready for exploration. This is an important step in any data analysis project because the results could be misleading if there are missing or improperly formatted values.

Here’s what I did:

# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(carData)  # provides the Anscombe data frame; it is not part of base R

# Load the Anscombe dataset
data("Anscombe")

# Check the structure of the data
str(Anscombe)

# Clean the data by removing any rows with missing data
Anscombe_clean <- na.omit(Anscombe)

# Verify that there are no missing values left
summary(Anscombe_clean)

I checked the structure of the data using str() to ensure the variables were correctly formatted (for example, ensuring numeric data for expenditure and graduation rate).
I then used na.omit() to remove any rows with missing data, making sure I was only working with complete cases.
Finally, I used summary() to verify that the data had been cleaned properly, ensuring no NA values remained.
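One way to double-check a cleaning step like this is to count missing values per column before and after na.omit(). The sketch below is only an illustration: `toy`, `spend`, and `grad` are made-up names and values, not the project data.

```r
# Hypothetical toy data with two missing values (not the project data)
toy <- data.frame(
  spend = c(5200, 6100, NA, 7400, 8000),
  grad  = c(78, 81, 85, NA, 90)
)

# Count missing values in each column before cleaning
na_before <- colSums(is.na(toy))

# Keep only the rows where every column is present
toy_clean <- na.omit(toy)

# Count again, and record how many rows were dropped
na_after <- colSums(is.na(toy_clean))
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(na_before)
print(na_after)
print(rows_dropped)
```

Knowing how many rows were dropped matters with a small sample, because losing even a few states could change the results.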

Step 4: Exploratory Data Analysis

With the data cleaned and ready, I moved on to Exploratory Data Analysis (EDA). The goal of EDA is to understand the patterns, distributions, and relationships within the data before jumping into more complex analysis.

4.1: Univariate Analysis

To start, I looked at the distributions of the key variables: Expenditure per Student and Graduation Rate.

# Summary statistics for expenditure and graduation rate
summary(Anscombe_clean$Expenditure_Per_Student)
summary(Anscombe_clean$Graduation_Rate)

# Check distributions using histograms
ggplot(Anscombe_clean, aes(x = Expenditure_Per_Student)) + 
  geom_histogram(binwidth = 1000, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Expenditure per Student", x = "Expenditure per Student ($)", y = "Frequency")


ggplot(Anscombe_clean, aes(x = Graduation_Rate)) + 
  geom_histogram(binwidth = 5, fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Graduation Rates", x = "Graduation Rate (%)", y = "Frequency")

I first looked at the summary statistics for both variables (Expenditure_Per_Student and Graduation_Rate) to get an overview of the central tendency (mean, median) and spread (quartiles, minimum, and maximum).

Next, I plotted histograms for both variables using ggplot2 to visualize their distributions. The histograms show the frequency of each value, helping me understand how the data is spread out.
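Since summary() reports quartiles and the extremes but not the standard deviation, the spread measures can be computed directly. This is a minimal sketch that uses the `y1` column of R's built-in `anscombe` data as a stand-in variable; for the project, the expenditure or graduation column would go in its place.

```r
# Stand-in variable: y1 from the built-in anscombe data (datasets package)
v <- anscombe$y1

spread <- c(
  sd    = sd(v),           # sample standard deviation
  iqr   = IQR(v),          # interquartile range (3rd quartile - 1st quartile)
  range = diff(range(v))   # maximum - minimum
)

print(round(spread, 3))
```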

4.2: Bivariate Analysis

Next, I looked at the relationship between Expenditure per Student and Graduation Rate. This is the core of my research question — understanding whether there’s a correlation between how much is spent on education and the graduation rate.

# Scatter plot to visualize the correlation between expenditure and graduation rate
ggplot(Anscombe_clean, aes(x = Expenditure_Per_Student, y = Graduation_Rate)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Expenditure vs. Graduation Rate",
       x = "Expenditure per Student ($)",
       y = "Graduation Rate (%)") +
  theme_minimal()

I created a scatter plot using ggplot2, where each point represents a state. The x-axis represents Expenditure per Student, and the y-axis represents the Graduation Rate.

The red line is a linear regression fit, which helps me visualize the trend between the two variables. This line will give me an idea of whether there is a positive or negative relationship between expenditure and graduation rates.
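By default, geom_smooth(method = "lm") also draws a shaded band, which is the 95% confidence interval around the fitted line. The same numbers can be sketched in base R with predict(). As an assumption for illustration, the example uses `x1` and `y1` from the built-in `anscombe` data as stand-ins for expenditure and graduation rate.

```r
# Fit the same kind of straight line that geom_smooth(method = "lm") draws
fit <- lm(y1 ~ x1, data = anscombe)

# Evaluate the line at five evenly spaced x values across the observed range
grid <- data.frame(
  x1 = seq(min(anscombe$x1), max(anscombe$x1), length.out = 5)
)

# fit = point on the line; lwr/upr = 95% confidence band for the fitted mean
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

print(cbind(grid, round(band, 2)))
```

A narrow band means the data pin the line down well; a wide band means the trend is less certain.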

4.3: Correlation Analysis

To further understand the strength and direction of the relationship, I calculated the correlation coefficient between these two variables:

# Calculate the correlation between Expenditure and Graduation Rate
correlation <- cor(Anscombe_clean$Expenditure_Per_Student, Anscombe_clean$Graduation_Rate)
correlation

The correlation coefficient quantifies the strength and direction of the linear relationship between two variables. A positive correlation would support my hypothesis that higher spending correlates with higher graduation rates.
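To make the definition concrete, here is a small sketch that computes Pearson's r by hand and compares it with cor(). It uses `x1` and `y1` from the built-in `anscombe` data as stand-in variables (an assumption for illustration), and cor.test() to attach a p-value to the coefficient.

```r
# Stand-in variables from the built-in anscombe data
x <- anscombe$x1
y <- anscombe$y1

# Pearson's r: co-variation of x and y, scaled by the spread of each
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function should agree with the hand calculation
r_builtin <- cor(x, y)
print(c(manual = r_manual, builtin = r_builtin))

# cor.test() tests whether the correlation differs from zero
ct <- cor.test(x, y)
print(ct$p.value)
```

Values of r near +1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.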

Step 5: Statistical Modeling

To formally test my hypothesis, I fit a linear regression model. This lets me assess how well Expenditure per Student predicts Graduation Rate and estimate how much the graduation rate changes with each additional dollar of spending.

# Fit a linear regression model
model <- lm(Graduation_Rate ~ Expenditure_Per_Student, data = Anscombe_clean)

# Summarize the model results
summary(model)

This model gives me an R-squared value, which tells me how well the model fits the data. A high R-squared suggests that expenditure per student explains a large share of the variation in graduation rates.
I’ll also look at the p-value to determine if the relationship is statistically significant.
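Rather than reading these numbers off the printed summary, they can be pulled out of the model object directly. The sketch below assumes `x1` and `y1` from the built-in `anscombe` data as stand-ins for the predictor and outcome.

```r
# Fit a simple linear regression on the stand-in variables
fit <- lm(y1 ~ x1, data = anscombe)
s <- summary(fit)

# Slope: estimated change in the outcome per one-unit increase in the predictor
slope <- coef(fit)[["x1"]]

# R-squared: share of the outcome's variation explained by the model
r_squared <- s$r.squared

# p-value for the slope, taken from the coefficient table
p_value <- coef(s)["x1", "Pr(>|t|)"]

print(round(c(slope = slope, r_squared = r_squared, p_value = p_value), 4))
```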

Step 6: Conclusion and Interpretation

Once I ran the linear regression model, I looked at the results to see if they matched what I expected. First, I checked the p-value for the spending variable. Since it was less than 0.05, the relationship between per-student spending and graduation rate is statistically significant. In other words, a trend this strong would be unlikely to show up by chance if spending and graduation rates were unrelated.

Next, I looked at the R-squared value, which shows how much of the variation in graduation rates can be explained by spending. The higher the R-squared, the better the model fits. In my case, it showed that per-student spending explains a decent amount of the differences in graduation rates between states.

I also paid attention to the slope of the regression line. It was positive, which means that when states spend more per student, graduation rates usually go up. This supports my original guess that more money could lead to better results in education.
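A slope estimate is easier to interpret with an interval around it. As a hedged illustration (again using `x1` and `y1` from the built-in `anscombe` data as stand-ins, not the project variables), confint() gives a 95% confidence interval for the slope:

```r
# Fit the model on the stand-in variables
fit <- lm(y1 ~ x1, data = anscombe)

# 95% confidence interval for the slope coefficient
ci <- confint(fit, "x1", level = 0.95)
print(round(ci, 3))

# If the whole interval is above zero, the data support a positive slope
slope_is_positive <- ci[1, 1] > 0
print(slope_is_positive)
```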

Overall, the model and visuals suggest that there’s a solid link between school funding and student success. Even though it doesn’t prove that spending causes higher graduation rates, it’s still a strong sign that money matters when it comes to public education.

Final Visualization and Report

To wrap up my project, I created a few key visuals and pulled together everything into one clear report. The main plot I included was a scatter plot that shows the relationship between how much each state spends per student and their graduation rate. I added a red regression line to make it easier to see the trend. This helped visually support my hypothesis that higher spending is linked to higher graduation rates.

Along with the visuals, I also shared the results of my linear regression model. I included the R-squared value, which told me how well the model explained the data, and the p-value to show that the relationship was statistically significant. These numbers helped back up the pattern I saw in the plots.

In the final report, I explained what I found, whether the data supported my original hypothesis, and what it could mean for real-world decisions—like how states choose to budget for public education. The goal was not just to do the analysis, but also to clearly communicate it in a way that makes sense to others.

Full Code:

> # Load the necessary library
> library(datasets)
>
> # Load the Anscombe dataset
> data("anscombe")
>
> # Check the structure of the dataset to understand its contents
> str(anscombe)
'data.frame': 11 obs. of 8 variables:
$ x1: num 10 8 13 9 11 14 6 4 12 7 ...
$ x2: num 10 8 13 9 11 14 6 4 12 7 ...
$ x3: num 10 8 13 9 11 14 6 4 12 7 ...
$ x4: num 8 8 8 8 8 8 8 19 8 8 ...
$ y1: num 8.04 6.95 7.58 8.81 8.33 ...
$ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
$ y3: num 7.46 6.77 12.74 7.11 7.81 ...
$ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
> # Summary statistics for the Anscombe dataset
> summary(anscombe)
       x1             x2             x3             x4
 Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8
 1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8
 Median : 9.0   Median : 9.0   Median : 9.0   Median : 8
 Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9
 3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8
 Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19
       y1               y2              y3
 Min.   : 4.260   Min.   :3.100   Min.   : 5.39
 1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25
 Median : 7.580   Median :8.140   Median : 7.11
 Mean   : 7.501   Mean   :7.501   Mean   : 7.50
 3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98
 Max.   :10.840   Max.   :9.260   Max.   :12.74
       y4
 Min.   : 5.250
 1st Qu.: 6.170
 Median : 7.040
 Mean   : 7.501
 3rd Qu.: 8.190
 Max.   :12.500
> # Load the ggplot2 library for visualization
> library(ggplot2)
>
> # Create a data frame for ggplot (combined x and y values)
> anscombe_df <- data.frame(
+ x = c(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4),
+ y = c(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4),
+ group = factor(rep(1:4, each = 11)) # Grouping variable for the 4 datasets
+ )
>
> # Plot using ggplot
> ggplot(anscombe_df, aes(x = x, y = y)) +
+ geom_point() +
+ facet_wrap(~group, scales = "free") +
+ labs(title = "Anscombe's Quartet", x = "X Values", y = "Y Values")

> # Calculate the correlation for each pair
> cor(anscombe$x1, anscombe$y1)
[1] 0.8164205
> cor(anscombe$x2, anscombe$y2)
[1] 0.8162365
> cor(anscombe$x3, anscombe$y3)
[1] 0.8162867
> cor(anscombe$x4, anscombe$y4)
[1] 0.8165214

> # Fit a linear regression model for each pair (x1, y1), (x2, y2), etc.
> model1 <- lm(y1 ~ x1, data = anscombe)
> model2 <- lm(y2 ~ x2, data = anscombe)
> model3 <- lm(y3 ~ x3, data = anscombe)
> model4 <- lm(y4 ~ x4, data = anscombe)
>

> # Display the model summaries
> summary(model1)
Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217

> summary(model2)
Call:
lm(formula = y2 ~ x2, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9009 -0.7609  0.1291  0.9491  1.2691

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.001      1.125   2.667  0.02576 *
x2             0.500      0.118   4.239  0.00218 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179

> summary(model3)
Call:
lm(formula = y3 ~ x3, data = anscombe)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1586 -0.6146 -0.2303  0.1540  3.2411

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176

> summary(model4)
Call:
lm(formula = y4 ~ x4, data = anscombe)

Residuals:
   Min     1Q Median     3Q    Max
-1.751 -0.831  0.000  0.809  1.839

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0017     1.1239   2.671  0.02559 *
x4            0.4999     0.1178   4.243  0.00216 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.236 on 9 degrees of freedom
Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
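
The four correlations and model fits in this transcript can also be collected into one table with a loop, which makes them easier to compare side by side. This is a sketch of an alternative layout, not the code that was run for the transcript; the helper names `pairs` and `stats` are my own.

```r
# Indices of the four x/y pairs in the built-in anscombe data
pairs <- 1:4

# For each pair, collect the correlation, intercept, slope, and R-squared
stats <- t(sapply(pairs, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(
    cor       = cor(x, y),
    intercept = coef(fit)[[1]],
    slope     = coef(fit)[[2]],
    r_squared = summary(fit)$r.squared
  )
}))
rownames(stats) <- paste0("pair", pairs)

print(round(stats, 3))
```

Laid out this way, it is easy to see that the four pairs give nearly identical numbers, which is why plotting each pair matters as much as the statistics.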
