Project By Max Motz and Alec Rovner
Website Link: https://arovn10.github.io/
This project's goal is to determine whether the large amounts of money people pay to send their children to private school rather than public school are "worth it". Overall, the project touches on many factors that people may not know private schools offer over public schools. The benefits of private schools that we examine include the significant difference in student-to-teacher ratio, standardized test scores, and future income after graduating.
Pulling files from GitHub
#Colab code that clones the GitHub repo and extracts CSVs from it
%cd /
%cd content
%rm -rf arovn10.github.io
#Clone over plain HTTPS; credentials are not stored in the notebook (authenticate separately if the repository is private)
!git clone https://github.com/arovn10/arovn10.github.io.git
%cd arovn10.github.io/
!git pull
from google.colab import drive
import gzip
#drive.mount('/content/drive')
#%cd /content/drive/My Drive/cmps3160
# !git pull
#%cd _projects/FinalProjectRovnerMotz
#test
import numpy as np
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
We initially set out to use the IPUMS data set to compare and contrast public and private school students across a host of factors, including post-grad salary, owned real estate, and veteran status, to find interesting correlations and trends in who attends each schooling system and in its long-term effects. While we are still using the IPUMS data set, we are pivoting toward a more comprehensive analysis of the similarities and differences between public and private school characteristics to determine the fundamental differences between the two types of institutions. This will involve a comparative analysis of a public and a private school dataset, both from school year 2019-2020, and an analysis of IPUMS census data on the characteristics of students and graduates who attended each type of school.
The second data set that we are using comes from NCES and involves private schools, the day-to-day attributes of those private schools, and also attributes of the students within the schools. This dataset will allow us to view how successful private school students are overall and if they attend a 4-year university after.
The third dataset that we are using also comes from NCES and is information on public schools and their day-to-day information such as student-to-teacher ratio and how many students per grade there are. This will allow us to compare and contrast the different situations of students at these schools.
The fourth dataset we are using is the Nation's Report Card dataset. It contains standardized test scores for mathematics and reading for both public and private schools. This allows us to visualize differences in students' testing abilities after they have spent some time in their respective schooling systems, since we chose grade 12 for the standardized testing scores.
Our collaboration plan is to use Google Colab to work simultaneously on a Python notebook. We will make regular commits to a GitHub repository for version control and fork or work on independent copies of the file when we are working on different aspects of it. Our group has been meeting one to two times per week to recap what we have been working on and to set goals for each of us to commit and push to the project by our next meeting. The majority of our meetings will take place in person on Wednesday nights, as we are both available then.
Through the four data sets, the question we hope to answer is "What are the fundamental differences in characteristics between public and private schools, and what is the general difference in the outcomes of students attending either? Given the data, what differences in experience are likely to occur (Part 1) and what outcomes should we expect to see (Part 2)?"
A section for dataset links and extra resources is located at the end of the notebook file for those interested in reading more on the subject.
Compared to public schools, private schools will typically have smaller school sizes, a lower student-to-teacher ratio, less demographic variation, and a similar age distribution. We believe the data will show that private school students do better academically and are more likely to succeed in their future academic and work pursuits.
#Student Data Set
fileLoc = "/content/arovn10.github.io/usa_00007.csv.gz"
a_file = gzip.open(fileLoc, "rb")
df = pd.read_csv(a_file, engine = 'python')
#df.head()
#Private School Dataset
fileLoc = "/content/arovn10.github.io/privateschool1920.csv"
df2 = pd.read_csv(fileLoc, engine = 'python')
#df2.head()
#Public School Dataset
fileLoc = "/content/arovn10.github.io/PublicSchool.csv.gz"
a_file = gzip.open(fileLoc, "rb")
df3 = pd.read_csv(a_file, engine = 'python')
#df3.head()
#Public and Private School Standardized Test Dataset
#Nations Report Card Dataset 1 - Math
fileLoc = "/content/arovn10.github.io/NDECoreExcel_Mathematics, Grade 12, Gender_20221207015704.csv"
df4 = pd.read_csv(fileLoc, engine = 'python', skiprows = 7)
new_headers = df4.iloc[0] #grab the first row for the header
df4 = df4[1:] #take the data less the header row
df4.columns = new_headers #set the header row as the df header
#Nations Report Card Dataset 2 - Reading
fileLoc = "/content/arovn10.github.io/NDECoreExcel_Reading, Grade 12, Gender_20221215035021.csv"
df5 = pd.read_csv(fileLoc, engine = 'python', skiprows = 7)
new_headers = df5.iloc[0] #grab the first row for the header
df5 = df5[1:] #take the data less the header row
df5.columns = new_headers #set the header row as the df header
Initial Issues Encountered with Datasets
A major issue we encountered while reading the datasets was that the initial dataset crashed the RAM during the reading process, as it was a huge 9GB file. We have since restricted our query to the most recent year and removed unnecessary variables so that it works in Colab. After learning from this mistake, we found datasets that match our needs more exactly and eliminated unnecessary variables before downloading.
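The same idea of dropping unneeded variables can also be applied at read time. The following is a minimal sketch, not the notebook's actual loading code: the tiny in-memory CSV and the `UNUSED_A`/`UNUSED_B` columns are made up for illustration, while `usecols`, `dtype`, and `chunksize` are standard `pandas.read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical miniature extract standing in for the real IPUMS file.
csv_text = """YEAR,GQ,SCHLTYPE,EDUCD,INCTOT,UNUSED_A,UNUSED_B
2021,1,2,101,52000,0,0
2021,1,3,101,61000,0,0
2021,3,1,63,18000,0,0
"""

wanted = ["GQ", "SCHLTYPE", "EDUCD", "INCTOT"]

# usecols skips unwanted columns while parsing; compact dtypes shrink what is kept.
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    dtype={"GQ": "int8", "SCHLTYPE": "int8", "EDUCD": "int16", "INCTOT": "int32"},
)

# If the file still does not fit in memory, chunksize yields it piece by piece.
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=wanted, chunksize=2):
    n_rows += len(chunk)

print(small.dtypes)
print(n_rows)
```

With a real extract, the `io.StringIO(...)` argument would be replaced by the file path.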
Dataset Short Description: The IPUMS dataset that we chose contains US census data in order to get the most random and generalized public and private school information along with life factors for each datapoint. This is the largest scope dataset that we use.
Variables Included
df.keys()
Variable Legend DF1:
The IPUMS dataset is a US census based data set that gives us plenty of demographic information on the population and their life factors after they attended public or private school.
Tidying the IPUMS data
df['SCHLTYPE'] = df['SCHLTYPE'].map({3: 3, 4: 3, 5: 3, 6: 3, 7: 3}).fillna(df['SCHLTYPE'])
df['SCHLTYPE'].value_counts()
tidydf1 = df.melt(value_vars=['GQ','SCHLTYPE','INCTOT', 'DEGFIELD2'])
tidydf1
Here we had to simplify the DF1 SCHLTYPE values because there are many different types of private schools recorded. Any SCHLTYPE from 3-7 is a private school, so we mapped those values all to 3 and then tidied the data. A value of 0 for SCHLTYPE means N/A, a value of 1 means not enrolled, and a value of 2 means public school.
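The recoding described above can be sketched on a small made-up Series. This version uses `Series.where` as an alternative to the `map`/`fillna` pair; it should give the same codes while keeping the integer dtype.

```python
import pandas as pd

# Made-up sample containing every SCHLTYPE code mentioned above.
codes = pd.Series([0, 1, 2, 3, 4, 5, 6, 7])

# Keep values below 3 as they are; replace everything else (3 through 7) with 3.
recoded = codes.where(codes < 3, 3)

print(recoded.tolist())
```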
tidydf1.loc[tidydf1['variable'] == 'SCHLTYPE'].loc[tidydf1['value'] >= 3]
Here we melt the four most important variables into "tidydf1". This allows us to home in on the most important variables in our IPUMS dataset. The variables chosen are Group Quarters (GQ), School Type (SCHLTYPE, public or private), Total Income of each person (INCTOT), and Field of Degree (DEGFIELD2).
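To make the melt step concrete, here is a small sketch on a made-up two-person table (the values are illustrative, not taken from the IPUMS extract). It also shows one optional variation, `id_vars`, which keeps track of which person each melted value came from.

```python
import pandas as pd

# Two hypothetical people in wide format.
wide = pd.DataFrame({"GQ": [1, 3], "SCHLTYPE": [2, 3], "INCTOT": [52000, 18000]})

# Long format: one row per (variable, value) pair.
long = wide.melt(value_vars=["GQ", "SCHLTYPE", "INCTOT"])

# Select the private-school rows with a single combined mask.
private_rows = long[(long["variable"] == "SCHLTYPE") & (long["value"] >= 3)]

# Optional: carry the original row label along so values can be traced back.
long_with_id = wide.reset_index().melt(
    id_vars="index", value_vars=["GQ", "SCHLTYPE", "INCTOT"]
)

print(long)
print(private_rows)
```

Without `id_vars`, the long table no longer records which person a value belongs to, which matters if income is later compared by school type.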
Dataset Short Description: The Private School dataset that we chose contains nationwide data on private schools, their locations, total enrollment, graduation rates, and many other useful datapoints that give great insight into the key differences between public and private schools.
Variables Included
There are 156 variables in the Private School Dataset. Obviously it would be impractical to fit all of them in a singular datatable so we chose important ones to query and tidy up.
for key in df2.keys():
    print(key)
Variable Legend DF2:
Tidying
df2.rename(columns = {'8D Percent to 4-Year College': 'Percent to 4-Year College' ,'5 Total Student Enrollment': "Total Student Enrollment",'11 Type of School': 'Type of School', '16 Hours in School Day for Students': 'Hours in School Day for Students'}, inplace = True)
tidydf2 = df2.melt(value_vars=['Percent to 4-Year College','Total Student Enrollment','Type of School', 'Hours in School Day for Students', 'Student Teacher Ratio'])
tidydf2
As with the private school dataset, the public school dataset has an exceedingly large number of variables. It would be impractical to fit all of them in a single data table, so we chose the important ones to query and tidy up.
Dataset Short Description: The Public School dataset that we chose contains nationwide data on public schools, their locations, total enrollment, graduation rates, and many other useful datapoints that give great insight into the differences between public and private schools.
Variables Included
df3.keys()
Variable Legend DF3:
Dataset Short Description: The Nation's Report Card dataset contains standardized test scores for mathematics and reading for both public and private schools. This allows us to visualize differences in students' testing abilities after they have spent some time in their respective schooling systems, since we chose grade 12 for the standardized testing scores.
df4
#Public and Private School Standardized Test Dataset
#TIDY UP DF4 SOME MORE
df4 = df4.iloc[:, 0:4]
df4['Jurisdiction'] = df4['Jurisdiction'].map({'National public': 'Public', 'National private': 'Private'})
df4.Year = df4['Year'].map({'1990¹': '1990','1990': '1990', '1992¹': '1992', '1992': '1992', '1996¹': '1996', '1996': '1996', '2000¹': '2000', '2000': '2000'})
df4 = df4.iloc[:-8]
#take the data less the last 8 rows, which are filler
df4.head(100)
#TIDY UP DF5 SOME MORE
df5 = df5.iloc[:, 0:4]
df5['Jurisdiction'] = df5['Jurisdiction'].map({'National public': 'Public', 'National private': 'Private'})
df5.Year = df5['Year'].map({'1992¹': '1992', '1992': '1992', '1994¹': '1994', '1996': '1996', '1998¹': '1998'})
df5 = df5.iloc[:-8]
#take the data less the last 8 rows, which are filler
df5 = df5.dropna()
df5.head(100)
This portion of the project analyzes some of the objective statistical differences between public and private schools.
Number of Students per School
Student Ratios Public/Private
df2['Total Student Enrollment'].mean()
(np.log(df2['Total Student Enrollment'])).plot.hist(bins=50)
The average number of kids who attend a given private school is 182
APS = df3.loc[df3['TOTAL'] > 0]
APS['TOTAL'].mean()
(np.log(APS['TOTAL'])).plot.hist(bins=50)
The average number of kids who attend a given public school is 527
APS['TOTAL'].mean()/df2['Total Student Enrollment'].mean()
There are 2.89x as many students in each public school as in each private school on average.
Number of Students per Grade
kiddist2 = pd.DataFrame().assign(Kinder=df2['4C Kindergarten Enrollment'], First=df2['4F First Grade Enrollment'], Second=df2['4G Second Grade Enrollment'], Third=df2['4H Third Grade Enrollment'], Fourth=df2['4I Fourth Grade Enrollment'], Fith=df2['4J Fifth Grade Enrollment'], Sixth=df2['4K Sixth Grade Enrollment'], Seventh=df2['4L Seventh Grade Enrollment'], Eigth=df2['4M Eighth Grade Enrollment'], Ninth=df2['4N Ninth Grade Enrollment'], Tenth=df2['4O Tenth Grade Enrollment'], Eleventh=df2['4P Eleventh Grade Enrollment'], Twelth=df2['4Q Twelfth Grade Enrollment'])
kiddist2.mean()
(kiddist2.mean()).plot.bar()
Private school enrollment per grade is noticeably higher in high school than in middle and lower school. This is most likely because schools serving only grades 9-12 are common enough to raise the high school numbers.
kiddist3 = pd.DataFrame().assign(Prekinder = df3['PK'], Kinder=df3['KG'], First=df3['G01'], Second=df3['G02'], Third=df3['G03'], Fourth=df3['G04'], Fith=df3['G05'], Sixth=df3['G06'], Seventh=df3['G07'], Eigth=df3['G08'], Ninth=df3['G09'], Tenth=df3['G10'], Eleventh=df3['G11'], Twelth=df3['G12'],Thirteen=df3['G13'])
kiddist3.mean()
(kiddist3.mean()).plot.bar()
There is a noticeable increase in public school kids in middle school and a bigger one in high school.
kiddist2.mean().corr(kiddist3.mean(), method='pearson', min_periods=None)
The correlation between the average per-grade enrollment in public schools and in private schools is 0.84.
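One detail worth knowing about this calculation: `Series.corr` aligns the two Series on their index labels and only uses labels present in both, so grades that exist in only one table (such as pre-kindergarten) do not enter the result. A minimal sketch with made-up per-grade means:

```python
import numpy as np
import pandas as pd

# Made-up per-grade means; the labels mimic the notebook's column names.
private_means = pd.Series({"Kinder": 14.0, "First": 13.0, "Ninth": 20.0, "Twelth": 19.0})
public_means = pd.Series(
    {"Prekinder": 12.0, "Kinder": 40.0, "First": 41.0, "Ninth": 60.0, "Twelth": 52.0}
)

# corr() keeps only the labels shared by both Series before computing Pearson's r.
r = private_means.corr(public_means, method="pearson")

# The same number computed by hand on the shared labels.
shared = private_means.index.intersection(public_means.index)
manual = np.corrcoef(
    private_means[shared].to_numpy(), public_means[shared].to_numpy()
)[0, 1]

print(list(shared), r, manual)
```

Because alignment is by label, the grade names in the two tables must be spelled identically for a grade to be counted.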
Student Teacher Ratio
df2['Student Teacher Ratio'].describe()
df3['STUTERATIO'].describe()
print(df3['STUTERATIO'].mean()/df2['Student Teacher Ratio'].mean())
It is clear that the student-to-teacher ratio across tens of thousands of schools is lower for private schools than public schools. Public schools have a higher ratio by a factor of 1.46.
Location
Most Common Cities
df2["Location City"].value_counts()
The most common city for private schools is Millersburg, followed by Miami and Fredericksburg.
df3["LCITY"].value_counts()
The most common city for public schools is Houston, followed by Chicago and Los Angeles.
Longitude and Latitude
Private School Distribution
privloc = pd.DataFrame().assign(Latitude = df2['Latitude'], Longitude = df2["Longitude"])
privloc
#Longitude on the x-axis and latitude on the y-axis so the plot is oriented like a map
privloc.plot.scatter(x="Longitude", y="Latitude", alpha = 0.3);
display(df2['Latitude'].mean())
display(df2['Longitude'].mean())
Public School Distribution
publoc = pd.DataFrame().assign(Latitude = df3['LATCOD'], Longitude = df3["LONCOD"])
publoc
#Longitude on the x-axis and latitude on the y-axis so the plot is oriented like a map
publoc.plot.scatter(x="Longitude", y="Latitude", alpha = 0.3);
display(df3['LATCOD'].mean())
display(df3['LONCOD'].mean())
Private School Compared to Public School coordinates
privloc["Type"] = "Private"
publoc["Type"] = "Public"
location = pd.concat([publoc,privloc], ignore_index=True)
colors = location["Type"].map({
    "Public": "blue",
    "Private": "red"
})
#Longitude on the x-axis and latitude on the y-axis so the plot is oriented like a map
location.plot.scatter(x="Longitude", y="Latitude", c=colors, alpha = 0.3);
location["Type"].value_counts()
It seems that private and public schools generally appear in the same locations, with public schools more concentrated in big cities and private schools more common elsewhere.
This portion of our project analyzes public and private school student performance using The Nation's Report Card math and reading scores, and student outcomes using the IPUMS U.S. census dataset. Neither dataset focuses mainly on private or public school data, so both have a weaker link to the overall project. However, both distinguish public school students from private school students and can therefore be used to identify and analyze potential differences.
Disclaimer: While this portion of the project attempts to analyze the performance and outcomes of private school students, it is important to disclose that we do not have access to all factors that could have an effect on students. The most prevalent of these factors is the wealth of the average private school family compared with that of the average public school family. Almost every private school charges a pricey tuition, and families who have the resources to pay most likely also have the resources to help their children succeed in school and beyond. Therefore, while we cannot say for certain how much influence private school had on student outcomes, we can measure the correlation between attending one type of school and having a certain performance and outcome.
Report Card Score Math Analysis
import seaborn as sns
df4 = df4.astype({'Average scale score':'int'})
#df4 = df4.groupby(by = ['Jurisdiction']).head()
sns.catplot(data=df4, x='Year', y='Average scale score', hue='Jurisdiction', col = 'Gender', kind = 'bar')
Report Card Score Reading Analysis
import seaborn as sns
df5 = df5.astype({'Average scale score':'int'})
#df4 = df4.groupby(by = ['Jurisdiction']).head()
sns.catplot(data=df5, x='Year', y='Average scale score', hue='Jurisdiction', col = 'Gender', kind = 'bar')
It can be seen here that although the scores are relatively stable from year to year, private schools always have a slight edge in test scores.
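The size of that edge can be read off the bar charts, but it can also be tabulated directly. The sketch below uses made-up scores in the same column layout as the tidied report card tables (`Year`, `Jurisdiction`, `Gender`, `Average scale score`); the numbers are illustrative only and are not NAEP results.

```python
import pandas as pd

# Made-up scores in the same shape as the tidied report card data.
scores = pd.DataFrame({
    "Year": ["1992", "1992", "1998", "1998"],
    "Jurisdiction": ["Public", "Private", "Public", "Private"],
    "Gender": ["Male", "Male", "Male", "Male"],
    "Average scale score": [290, 300, 288, 301],
})

# One row per (Year, Gender), one column per jurisdiction.
table = scores.pivot_table(
    index=["Year", "Gender"], columns="Jurisdiction", values="Average scale score"
)

# Private minus public: a positive gap means private schools scored higher.
table["Gap"] = table["Private"] - table["Public"]

print(table)
```

Applied to `df4` or `df5` after the `astype` step, the same pivot would give one gap per year and gender.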
Analysis of IPUMS Dataset
#The School Type variable shows the proportion of individuals in the dataset and what school they go to
#0 is N/A
#1 is Not enrolled
#2 is Public School
#3 is private school
df["SCHLTYPE"].value_counts().plot.pie(figsize=(5, 5))
#Interesting Stat
PUBS = (df["SCHLTYPE"] == 2).sum() #number of public school students
PRIS = (df["SCHLTYPE"] == 3).sum() #number of private school students
PUBS / PRIS
#598521 / 137437
There are over 4x as many public school kids in the dataset as private school kids.
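The ratio above is a simple count comparison. As a small sketch, the same calculation on a made-up Series of codes, followed by the arithmetic for the two counts quoted in the cell's comment:

```python
import pandas as pd

# Made-up SCHLTYPE codes: five public (2) and two private (3) students.
schltype = pd.Series([0, 1, 2, 2, 2, 2, 3, 1, 2, 3])

# value_counts() gives the number of rows per code.
counts = schltype.value_counts()
ratio = counts.loc[2] / counts.loc[3]

# The counts quoted in the notebook comment.
notebook_ratio = 598521 / 137437

print(ratio, round(notebook_ratio, 2))
```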
#The Group Quarters variable indicates what living situation the individual resided in
#1 is Household under 1970 definition
#2 is Additional households under 1990 definition
#3 is Group quarters--Institutions
#4 is Other group quarters
#5 is Additional households under 2000 definition
df["GQ"].value_counts()
(df["GQ"].value_counts() ** 0.2).plot.bar()
Predictive Model
Using U.S. Census data, we created a model that sets out to predict an individual's income based on if they went to public or private school with the given context of how far in the academic process they are.
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
# Features and target for the model
features = ["SCHLTYPE", "EDUCD"]
Y_train = df["INCTOT"]
X_train_dict = df[features].to_dict(orient="records")
# The individual to predict for
x_new_dict = {
    "SCHLTYPE":3, #Private School
    "EDUCD":101 #Bachelor's Degree
}
# Vectorize the feature dicts (numeric codes pass through unchanged, so no dummy columns are created)
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform([x_new_dict])
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)
# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
As you can see, the prediction of the salary of a private school kid who graduated from college is anywhere from 41,047 to 68,280 dollars.
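One possible reason the prediction moves so much with the number of neighbors (an interpretation, not something the notebook verifies): with only two coded features, every training row that has the same SCHLTYPE and EDUCD as the query sits at distance zero from it, so which k of those tied rows count as "nearest" is essentially arbitrary. The sketch below uses made-up incomes and imitates that arbitrariness by taking the first or last k matching rows; it is a simplification and does not call scikit-learn.

```python
import pandas as pd

# Made-up people; all share EDUCD 101 and differ only in school type and income.
people = pd.DataFrame({
    "SCHLTYPE": [3, 3, 3, 3, 3, 3, 2, 2],
    "EDUCD": [101] * 8,
    "INCTOT": [20000, 30000, 40000, 90000, 100000, 110000, 50000, 60000],
})

# Every exact match is at distance 0 from the query (SCHLTYPE 3, EDUCD 101).
match = people[(people["SCHLTYPE"] == 3) & (people["EDUCD"] == 101)]

k = 3
first_k = match["INCTOT"].head(k).mean()  # one arbitrary choice of k tied rows
last_k = match["INCTOT"].tail(k).mean()   # another equally valid choice

# Averaging over all exact matches does not depend on k or on row order.
by_group = people.groupby(["SCHLTYPE", "EDUCD"])["INCTOT"].mean()

print(first_k, last_k, by_group.loc[(3, 101)])
```

A group mean (or median) per SCHLTYPE and EDUCD combination is therefore a useful baseline to compare against the k-nearest-neighbors output.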
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
# Features and target for the model
features = ["SCHLTYPE", "EDUCD"]
Y_train = df["INCTOT"]
X_train_dict = df[features].to_dict(orient="records")
# The individual to predict for
x_new_dict = {
    "SCHLTYPE":2, #Public School
    "EDUCD":101 #Bachelor's Degree
}
# Vectorize the feature dicts (numeric codes pass through unchanged, so no dummy columns are created)
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform([x_new_dict])
# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)
# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
model = KNeighborsRegressor(n_neighbors=50)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
The predicted income for public school kids who also graduated from college ranges from 12,100 to 33,708 dollars (likely lower because unemployed individuals are included).
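Coded values in the income column are another thing that can shift these predictions. The sketch below assumes that INCTOT uses 9999999 as its N/A code, which should be confirmed against the codebook for the specific extract; the incomes are made up. The notebook does not state whether such values were filtered, so this is only a check worth running, not a description of what was done.

```python
import pandas as pd

NA_CODE = 9999999  # assumed N/A sentinel for INCTOT; confirm in the extract's codebook

# Made-up incomes, two of which are the sentinel rather than real amounts.
income = pd.Series([52000, NA_CODE, 0, 61000, NA_CODE, -2000])

# Drop the sentinel rows before averaging or fitting a model.
valid = income[income != NA_CODE]

print(income.mean(), valid.mean())
```

On the real data, the same mask would be applied to `df` before building `X_train_dict` and `Y_train`, so that features and target stay aligned.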
While it considers only a couple of factors, the model does line up with the notion that private school kids generally end up with higher paying jobs, whether that is a product of the school itself or of other factors such as familial wealth.
It is difficult to say whether private schools have a better overall effect on children than public schools, as this can vary depending on many factors, including the specific schools and the individual needs and abilities of the children involved. In general, private schools may have smaller class sizes and more individualized attention, which can be beneficial for some students. However, public schools are often more diverse and may provide a broader range of extracurricular activities and other opportunities for students. Ultimately, the best choice for a child's education will depend on their individual needs and the specific schools available in their area. One thing our data analysis does show is that the standardized test scores of private schools are slightly higher than those of public schools for every year and subject that we measured. For higher test scores, private school may be the better option, but it is still hard to tell whether those who attended private school have a greater degree of success in life than those who attended public school.
Private school Data - https://nces.ed.gov/surveys/pss/pssdata.asp
Research Paper on predictors of academic success in higher education - https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-020-0177-7
%%shell
jupyter nbconvert --to html /content/arovn10.github.io/RovnerMotzFinalProject.ipynb