Project By Max Motz and Alec Rovner

Website Link: https://arovn10.github.io/

Introduction¶

This projects goal is to determine whether or not the large amounts of money people pay to send their children to private school over public school is "worth it". Overall the project will graze over many factors that people do not know private schools offer over public schools. Benefits to private schools observed will include the significant difference in student-to-teacher ratio, standardized test scores, and future income after graduating.

Pulling files from Github

#Collab code that clones github and extracts CSVs from it
%cd /
%cd content
%rm -r arovn10.github.io
!git clone https://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@github.com/arovn10/arovn10.github.io.git
#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
%cd arovn10.github.io/

/
/content
Cloning into 'arovn10.github.io'...
remote: Enumerating objects: 115, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 115 (delta 23), reused 0 (delta 0), pack-reused 65
Receiving objects: 100% (115/115), 68.18 MiB | 20.34 MiB/s, done.
Resolving deltas: 100% (42/42), done.
/content/arovn10.github.io

!git pull

Already up to date.

from google.colab import drive
import gzip
#drive.mount('/content/drive')
#%cd /content/drive/My Drive/cmps3160
# !git pull
#%cd _projects/FinalProjectRovnerMotz
#test

import numpy as np
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

Question and Preliminary Info¶

Final Version Data Science: Max Motz and Alec Rovner:¶

We initially set out to use the IPUMS data set to compare and contrast public and private school students with a host of factors including post-grad salary, owned real-estate, veteran status, and other factors to find interesting correlations and trends to analyze the difference in attending individuals and long-term effects of the two schooling systems. While we are still using the IPUMS data set, we are pivoting toward a more comprehensive analysis of the similarities and differences between public and private school characteristics to determine the fundamental differences of both types of institutions. This will involve a comparative analysis between a public and private school dataset both from school year 2019-2020 and an analysis of IPUMS census data regarding the characteristics of students and graduees who attended each type of school.

The second data set that we are using comes from NCES and involves private schools, the day-to-day attributes of those private schools, and also attributes of the students within the schools. This dataset will allow us to view how successful private school students are overall and if they attend a 4-year university after.

The third dataset that we are using also comes from NCES and is information on public schools and their day-to-day information such as student-to-teacher ratio and how many students per grade there are. This will allow us to compare and contrast the different situations of students at these schools.

The fourth dataset we are using is the Nations Report Card datset. It contains standardized test scores for reading and writing for both public and private schools. This allows us to visualize the differences between the students testing abilities after being in their respective schooling system for some time as we chose grade 12 for the standardized testing scores.

Our collaboration plan is to utilize Google Drives Collab Feature to simultaneously work on a Python notebook. We will make regular commits to a github repository for version control and fork/work on independent copies of the file at times when we are both working on different aspects of the file. Our group has been meeting one to two times per week in order to recap what we have been working on and set goals for each of us to individually commit and push to the project by our next meeting. The majority of our meetings will take place in person on Wednesday night each week as we are both available then.

Through the four data sets, the question we hope to answer is "What are the fundamental difference in characteristics between public and private schools and what is the general difference in the outcomes of students attending either? Given the data, what differences in experience are likely to occur (Part 1) and what outcomes should we expect to see (Part 2)?"

A section for dataset links and extra resources is located at the end of the notebook file for those interested in reading more on the subject

Hypothesis¶

Compared to public schools, private school will typically have smaller school sizes, higher student to teacher ratio, less demographic variation, and a similar age distribution to public school. We believe the data will show that private school kids do better academically and will be more likely to have success in their future academic and work pursuits.

Initializing and Tidying every Dataset¶

#Student Data Set
fileLoc = "/content/arovn10.github.io/usa_00007.csv.gz"
a_file = gzip.open(fileLoc, "rb")
df = pd.read_csv(a_file, engine = 'python')
#df.head()

#Private School Dataset
fileLoc = "/content/arovn10.github.io/privateschool1920.csv"
df2 = pd.read_csv(fileLoc, engine = 'python')
#df2.head()
  
#Public School Dataset
fileLoc = "/content/arovn10.github.io/PublicSchool.csv.gz"
a_file = gzip.open(fileLoc, "rb")
df3 = pd.read_csv(a_file, engine = 'python')
#df3.head()

#Public and Private School Standardized Test Dataset

#Nations Report Card Dataset 1 - Math
fileLoc = "/content/arovn10.github.io/NDECoreExcel_Mathematics, Grade 12, Gender_20221207015704.csv"

df4 = pd.read_csv(fileLoc, engine = 'python', skiprows = 7)
new_headers = df4.iloc[0] #grab the first row for the header
df4 = df4[1:] #take the data less the header row
df4.columns = new_headers #set the header row as the df header

#Nations Report Card Dataset 2 - Reading
fileLoc = "/content/arovn10.github.io/NDECoreExcel_Reading, Grade 12, Gender_20221215035021.csv"

df5 = pd.read_csv(fileLoc, engine = 'python', skiprows = 7)
new_headers = df5.iloc[0] #grab the first row for the header
df5 = df5[1:] #take the data less the header row
df5.columns = new_headers #set the header row as the df header

Initial Issues Encountered with Datasets

A major issue we encountered while reading the datasets is that the initial dataset was crashing the ram during the reading process, as it was a huge 9GB files. We have since altered are query to the most recent year and removed unecessary variables from our query to have it working in Collab. After learning from this mistake, we have found datasets that are more exact with our needs and also eliminate unnecessary variables before downloading.

Datasets¶

DataSet 1: IPUMS¶

Dataset Short Description: The IPUMS dataset that we chose contains US census data in order to get the most random and generalized public and private school information along with life factors for each datapoint. This is the largest scope dataset that we use. italicized text

Variables Included

df.keys()

Index(['YEAR', 'CBSERIAL', 'GQ', 'EDUC', 'EDUCD', 'SCHLTYPE', 'DEGFIELD2',
       'DEGFIELD2D', 'EMPSTAT', 'EMPSTATD', 'INCTOT'],
      dtype='object')

Variable Legend DF1:

YEAR (Census year)
GQ (Group Quarters Status)
EDUC (Education Attainment [general version])
EDUCD (Education Attainment [detailed version])
SCHLTYPE (Public or private school)
DEGFIELD2 (Field of degree (2) [general version])
DEGFIELD2D (Field of degree (2) [detailed version])
EMPSTAT (Employment status [general version])
EMPSTATD (Employment status [detailed version])
INCTOT (Total personal income)

The IPUMS dataset is a US census based data set that gives us plenty of demographic information on the population and their life factors after they attended public or private school.

Tidying the IPUMS data

df['SCHLTYPE'] = df['SCHLTYPE'].map({3: 3, 4: 3, 5: 3, 6: 3, 7: 3}).fillna(df['SCHLTYPE'])
df['SCHLTYPE'].value_counts()

tidydf1 = df.melt(value_vars=['GQ','SCHLTYPE','INCTOT', 'DEGFIELD2'])
tidydf1

Here we had to simplify DF1 SCHLTYPE values into two numbers as there any many different types of private schools recorded. Any SCHLTYPE form 3-7 is a private school so we mapped the values all to 3 and then tidied the data. Value of 0 for SCHLTYPE mean N/A and values of 1 mean Not enrolled.

tidydf1.loc[tidydf1['variable'] == 'SCHLTYPE'].loc[tidydf1['value'] >= 3]

Here we melt the three mostimportant variables into "tidydf1". This will allow us to hone in on the most important variables in our IPUMS dataset. The variables chosen are Group Quarter, School Type(public or private), and Total Income of each person.

DataSet 2: Private Schools¶

Dataset Short Description: The Private School dataset that we chose contains nationwide data on private schools, their locations, total enrollment, graduation rates, and many other useful datapoints that give great insight into the key differences between public and private schools.

Variables Included

There are 156 variables in the Private School Dataset. Obviously it would be impractical to fit all of them in a singular datatable so we chose important ones to query and tidy up.

for key in df2.keys():
  print(key)

Method of Interview
4A Ungraded Offered
4A Ungraded Enrollment
4B Nursery/Prekindergarten Offered
4B Nursery/Prekindergarten Enrollment
4C Kindergarten Offered
4C Kindergarten Enrollment
4D Transitional Kindergarten Offered
4D Transitional Kindergarten Enrollment
4E Transitional First Grade Offered
4E Transitional First Grade Enrollment
4F First Grade Offered
4F First Grade Enrollment
4G Second Grade Offered
4G Second Grade Enrollment
4H Third Grade Offered
4H Third Grade Enrollment
4I Fourth Grade Offered
4I Fourth Grade Enrollment
4J Fifth Grade Offered
4J Fifth Grade Enrollment
4K Sixth Grade Offered
4K Sixth Grade Enrollment
4L Seventh Grade Offered
4L Seventh Grade Enrollment
4M Eighth Grade Offered
4M Eighth Grade Enrollment
4N Ninth Grade Offered
4N Ninth Grade Enrollment
4O Tenth Grade Offered
4O Tenth Grade Enrollment
4P Eleventh Grade Offered
4P Eleventh Grade Enrollment
4Q Twelfth Grade Offered
4Q Twelfth Grade Enrollment
5 Total Student Enrollment
6A Hispanic or Latino Students
6B White Students
6C Black Students
6D Asian Students
6E Native Hawaiian/Pacific Islander Students
6F American Indian/Alaska Native Students
6G Students of Two or More Races
7A Is School Coeducational
7B Number of Male Students
8A Students in 12th Grade Last Year
8B Number of 12th Graders Last Year
8D Percent to 4-Year College
9A Length of School Day for Kindergarten
9B Days Per Week for Kindergarten
10 Full-Time Teachers
10 More Than 3/4 Time Teachers
10 More Than 1/2 Time Teachers
10 More Than 1/4 Time Teachers
10 Less Than 1/4 Time Teachers
10 Total Number of Teachers
11 Type of School
12A School Supports Home Schooling
12B School Located in Private Home
13A Does School Have Religious Orientation
13B School Affiliation
13C Religious Orientation
13C Type of Catholic School
14 No Associations
14 Accelerated Christian Education
14 American Association of Christian Schools
14 Association of Christian Schools International
14 Association of Christian Teachers and Schools
14 Association of Classical and Christian Schools
14 Christian Schools International
14 Council of Islamic Schools in North America
14 Evangelical Lutheran Education Association
14 Friends Council on Education
14 General Conference of SDA Church
14 Islamic School League of America
14 Jesuit Secondary Education Association
14 National Association of Episcopal Schools
14 National Catholic Educational Association
14 National Christian School Association
14 National Society for Hebrew Day Schools
14 Oral Roberts University Educational Fellowship
14 Prizmah: Center for Jewish Day Schools
14 Southern Baptist Association of Christian Schools
14 Other Religious School Associations
14 American Montessori Society
14 Association Montessori International
14 Other Montessori Associations
14 Association of Military Colleges and Schools
14 Association of Waldorf Schools of North America
14 National Association of Private Special Education Centers
14 Other Associations for Exceptional Children
14 European Council for International Schools
14 National Association for the Education of Young Children
14 National Association of Laboratory Schools
14 National Coalition of Girls' Schools
14 Other Special Emphasis Associations
14 Alternative School Network
14 National Association of Independent Schools
14 State/Regional Independent School Association
14 National Independent Private Schools Association
14 The Association of Boarding Schools
14 Other School Associations
15 Days in School Year
16 Hours in School Day for Students
16 Minutes in School Day for Students
17 Library or Library Media Center
Permanent Identification Number
School Name 
Mailing Address
Mailing City
Mailing State Postal Code
Mailing Zip Code
Mailing Zip4 Code
School phone number
ANSI County Code
ANSI County Code + State Code
County Name
Location Address
Location City
Location State Postal Code
Location Zip Code
Location Zip4 Code
Census Region
ANSI State Code (physical location of school)
Urban-centric Locale Code
Latitude
Longitude
State Legislative District - Lower
State Legislative District - Upper
114th Congressional District
2019-20 Lowest Grade Level Recode
2019-20 Highest Grade Level Recode
Universe frame to which  school belongs
Tabulation Flag
School Typology
Affiliation
School Orientation
Diocese or archdiocese for Catholic schools
Level of Instruction
Level of Instruction.1
Total Number of Students (K-12, UG)
Size of School (K-12, UG)
Number of K-12 Teachers (FTE)
Urban-centric Community Type
Length of School Day in Hours
Number of Male Students (K-12, UG)
Total Number of Kindergarten Students
Percentage of American Indian or Alaskan Native Students
Percentage of Asian Students
Percentage of Native Hawaiian or Pacific Islander students
Percentage of Hispanic Students
Percentage of White Students
Percentage of Black Students
Percentage of Students of Two or More Races
Student Teacher Ratio
Unnamed: 155

Variable Legend DF2:

Percent to 4 Year College - Percentage of Graduates that Went to College
Total Student Enrollment - Total Population at School
Type of School - School Type
Hours in School Day for Students - Length of School Day
Student Teacher Ratio

Tidying

df2.rename(columns = {'8D Percent to 4-Year College': 'Percent to 4-Year College' ,'5 Total Student Enrollment': "Total Student Enrollment",'11 Type of School': 'Type of School', '16 Hours in School Day for Students': 'Hours in School Day for Students'}, inplace = True)

tidydf2 = df2.melt(value_vars=['Percent to 4-Year College','Total Student Enrollment','Type of School', 'Hours in School Day for Students', 'Student Teacher Ratio'])
tidydf2

Similar to the private school dataset, there is an exceedingly large amount of variables in the public. Obviously it would be impractical to fit all of them in a singular datatable so we chose important ones to query and tidy up.

DataSet 3: Public Schools¶

Dataset Short Description: The Public School dataset that we chose contains nationwide data on public schools, their locations, total enrollment, graduation rates, and many other useful datapoints that give great insight into the differences betwen public and private schools.

Variables Included

df3.keys()

Index(['X', 'Y', 'NCESSCH', 'SURVYEAR', 'STABR', 'LEAID', 'ST_LEAID',
       'LEA_NAME', 'SCH_NAME', 'LSTREET1', 'LSTREET2', 'LCITY', 'LSTATE',
       'LZIP', 'LZIP4', 'PHONE', 'GSLO', 'GSHI', 'VIRTUAL', 'TOTFRL', 'FRELCH',
       'REDLCH', 'PK', 'KG', 'G01', 'G02', 'G03', 'G04', 'G05', 'G06', 'G07',
       'G08', 'G09', 'G10', 'G11', 'G12', 'G13', 'TOTAL', 'MEMBER', 'AM', 'HI',
       'BL', 'WH', 'HP', 'TR', 'FTE', 'LATCOD', 'LONCOD', 'ULOCALE', 'NMCNTY',
       'STUTERATIO', 'TITLEI', 'STITLEI', 'AMALM', 'AMALF', 'ASALM', 'ASALF',
       'HIALM', 'HIALF', 'BLALM', 'BLALF', 'WHALM', 'WHALF', 'HPALM', 'HPALF',
       'TRALM', 'TRALF', 'TOTMENROL', 'TOTFENROL', 'STATUS', 'UG', 'AE',
       'SCHOOL_TYPE_TEXT', 'SY_STATUS_TEXT', 'SCHOOL_LEVEL', 'AS',
       'CHARTER_TEXT', 'MAGNET_TEXT', 'OBJECTID'],
      dtype='object')

Variable Legend DF3:

SCHOOL_TYPE_TEXT - School Type
Total - Total students, all grades (includes AE)
STUTERATIO - Student to Teacher Ratio

DataSet 4: Report Card Dataset¶

Dataset Short Description: The Nations Report Card dataset contains information on standardized test scores for reading and writing for both public and private schools. This allows us to visualize the differences between the students testing abilities after being in their respective schooling system for some time as we chose grade 12 for the standardized testing scores.

df4

#Public and Private School Standardized Test Dataset

#TIDY UP DF4 SOME MORE
df4 = df4.iloc[:, 0:4]
df4['Jurisdiction'] = df4['Jurisdiction'].map({'National public': 'Public', 'National private': 'Private'})
df4.Year = df4['Year'].map({'1990¹': '1990','1990': '1990', '1992¹': '1992', '1992': '1992', '1996¹': '1996', '1996': '1996', '2000¹': '2000', '2000': '2000'})
df4 = df4.iloc[:-8]
  #take the data less the las 8 rows that are filler
df4.head(100)

#TIDY UP DF5 SOME MORE
df5 = df5.iloc[:, 0:4]
df5['Jurisdiction'] = df5['Jurisdiction'].map({'National public': 'Public', 'National private': 'Private'})
df5.Year = df5['Year'].map({'1992¹': '1992', '1992': '1992', '1994¹': '1994', '1996': '1996', '1998¹': '1998'})
df5 = df5.iloc[:-8]
  #take the data less the las 8 rows that are filler
df5 = df5.dropna()
df5.head(100)

EDA Part 1: Instituonal Analysis¶

This Portion of the project analyzes some of the objective statistical differences between public and private school.

Number of Students Per a school

Student Ratios Public/Private

df2['Total Student Enrollment'].mean()

182.11417578342295

(np.log(df2['Total Student Enrollment'])).plot.hist(bins=50)

<matplotlib.axes._subplots.AxesSubplot at 0x7fe426fa3b20>

The average number of kids who attend a given private school is 182

APS = df3.loc[df3['TOTAL'] > 0]
APS['TOTAL'].mean()

526.7716811224755

(np.log(APS['TOTAL'])).plot.hist(bins=50)

<matplotlib.axes._subplots.AxesSubplot at 0x7fe426dee340>

The average number of kids who attend a given public school is 527

APS['TOTAL'].mean()/df2['Total Student Enrollment'].mean()

2.8925352947204517

There are 2.89x as many students in each public school than private school on average

Number of Students Per a grade

kiddist2 = pd.DataFrame().assign(Kinder=df2['4C Kindergarten Enrollment'], First=df2['4F First Grade Enrollment'], Second=df2['4G Second Grade Enrollment'], Third=df2['4H Third Grade Enrollment'], Fourth=df2['4I Fourth Grade Enrollment'], Fith=df2['4J Fifth Grade Enrollment'], Sixth=df2['4K Sixth Grade Enrollment'], Seventh=df2['4L Seventh Grade Enrollment'], Eigth=df2['4M Eighth Grade Enrollment'], Ninth=df2['4N Ninth Grade Enrollment'], Tenth=df2['4O Tenth Grade Enrollment'], Eleventh=df2['4P Eleventh Grade Enrollment'], Twelth=df2['4Q Twelfth Grade Enrollment'])
kiddist2.mean()

Kinder      18.237556
First       16.609326
Second      16.308787
Third       16.446030
Fourth      16.440389
Fith        16.813804
Sixth       18.011428
Seventh     18.836964
Eigth       19.000951
Ninth       33.864311
Tenth       34.956910
Eleventh    35.968309
Twelth      36.138910
dtype: float64

(kiddist2.mean()).plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe41f65bf10>

There is a noticeable increase in private school kids in highschool than middle and lower school. This is most likely because 9-12 schools are common enough to increase high school numbers.

kiddist3 = pd.DataFrame().assign(Prekinder = df3['PK'], Kinder=df3['KG'], First=df3['G01'], Second=df3['G02'], Third=df3['G03'], Fourth=df3['G04'], Fith=df3['G05'], Sixth=df3['G06'], Seventh=df3['G07'], Eigth=df3['G08'], Ninth=df3['G09'], Tenth=df3['G10'], Eleventh=df3['G11'], Twelth=df3['G12'],Thirteen=df3['G13'])
kiddist3.mean()

Prekinder     37.321986
Kinder        66.777246
First         65.120011
Second        64.911839
Third         65.732845
Fourth        66.282372
Fith          69.436625
Sixth         95.914576
Seventh      109.734960
Eigth        107.402135
Ninth        130.727898
Tenth        125.723943
Eleventh     119.206975
Twelth       117.606496
Thirteen      15.142857
dtype: float64

(kiddist3.mean()).plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe4281eb040>

There is a noticeable increase in public school kids in middle school and a bigger one in high school.

kiddist2.mean().corr(kiddist3.mean(), method='pearson', min_periods=None)

0.8374020928730136

The correlation between the ratio of kids in public and private school is 0.84

Student Teacher Ratio

df2['Student Teacher Ratio'].describe()

count    21572.000000
mean        10.564167
std         37.902679
min          0.068966
25%          6.071429
50%          9.149468
75%         12.689884
max       5230.000000
Name: Student Teacher Ratio, dtype: float64

df3['STUTERATIO'].describe()

count    86194.000000
mean        15.443857
std         39.390543
min       -100.000000
25%         12.244898
50%         14.745189
75%         17.118270
max       7100.000000
Name: STUTERATIO, dtype: float64

print(df3['STUTERATIO'].mean()/df2['Student Teacher Ratio'].mean())

1.4619095686235621

It is clear that the student-to-teacher ratio across tens of thousands of schools is lower for private schools than public schools. Public schools have a higher ratio by a factor of 1.46.

Location

Most Common Cities

df2["Location City"].value_counts()

MILLERSBURG       38
MIAMI             27
FREDERICKSBURG    24
APPLE CREEK       24
LAKEWOOD          22
                  ..
NEOSHO             1
FLORISSANT         1
BOURBON            1
TUNAS              1
CODY               1
Name: Location City, Length: 2042, dtype: int64

The most common city for Private school is Millersburg, followed by Miami and Fredericksburg

df3["LCITY"].value_counts()

HOUSTON              796
Chicago              664
Los Angeles          583
BROOKLYN             576
SAN ANTONIO          544
                    ... 
RICHBURG               1
Los Nietos             1
NORTH TARRYTOWN        1
South San Gabriel      1
MENANDS                1
Name: LCITY, Length: 14603, dtype: int64

The most common city for public school is Houston, followed by Chicacog and Los Angeles

Longitude and Lattitude

Private School Distribution

privloc = pd.DataFrame().assign(Latitude = df2['Latitude'], Longitude = df2["Longitude"])
privloc

privloc.plot.scatter(x="Latitude", y="Longitude", alpha = 0.3);

display(df2['Latitude'].mean())
display(df2['Longitude'].mean())

37.912500703040976

-90.29729882199148

Public School Distribution

publoc = pd.DataFrame().assign(Latitude = df3['LATCOD'], Longitude = df3["LONCOD"])
publoc

publoc.plot.scatter(x="Latitude", y="Longitude", alpha = 0.3);

display(df3['LATCOD'].mean())
display(df3['LONCOD'].mean())

37.83579752299039

-93.06111102597183

Private School Compared to Public School coordinates

privloc["Type"] = "Private"
publoc["Type"] = "Public"
location = pd.concat([publoc,privloc], ignore_index=True)
colors = location["Type"].map({
    "Public": "blue",
    "Private": "red"
})

location.plot.scatter(x="Latitude", y="Longitude", c=colors, alpha = 0.3);
location["Type"].value_counts()

Public     100455
Private     21572
Name: Type, dtype: int64

It seems that private school and public schools general appear in the same locations with public school being more localized to big cities and private schools being more common elsewhere.

EDA Part 2: Idividual Analysis¶

This portion of our product analyzes public and private school student performance using The Nation's Report Card math and english scores and outcomes using the IPUMS U.S. census dataset. While the main focus of both of these datasets is not Private or Public school data and thus have a weaker link to the overall project, they both have public school and private school distinctions for students and therefore can be used to identify and analyze potential differences.

Disclaimer: While this portion of the project attempts to analyze the performance and outcome of Private school students, it is important to disclose that we do not have access to all factors that could have an affect on students. The most prevalent of these factors to keep in mind is the wealth of the average private school family vs. that of the average public school family. Almost every private school has a pricy tuition and families who have the resources to pay most likely also have the resources to help their children succeed in school and onwards. Therefore, while we cannot say for certain how much of an influence private school had on their outcomes, we can measure the correlation between going to one type of school and having a certain performance and outcome.

Report Card Score Math Analysis

import seaborn as sns


df4 = df4.astype({'Average scale score':'int'})
#df4 = df4.groupby(by = ['Jurisdiction']).head()
sns.catplot(data=df4, x='Year', y='Average scale score', hue='Jurisdiction', col = 'Gender', kind = 'bar')

<seaborn.axisgrid.FacetGrid at 0x7fe4157d1c70>

Report Card Score English Analysis

import seaborn as sns


df5 = df5.astype({'Average scale score':'int'})
#df4 = df4.groupby(by = ['Jurisdiction']).head()
sns.catplot(data=df5, x='Year', y='Average scale score', hue='Jurisdiction', col = 'Gender', kind = 'bar')

<seaborn.axisgrid.FacetGrid at 0x7fe4156ce4f0>

It can be seen here that although the scores each year are relatively stable, private schools always have a slight edge on testing score.

Analysis of IPUMS Dataset

#The School Type variable shows the proportion of indiviudals in the dataset and what school they go to
#0 is N/A
#1 is Not enrolled
#2 is Public School
#3 is private school
df["SCHLTYPE"].value_counts().plot.pie(y='mass', figsize=(5, 5))

<matplotlib.axes._subplots.AxesSubplot at 0x7fe412e14c70>

#Interesting Stat
PUBS = df["SCHLTYPE"].loc[df["SCHLTYPE"]== 2].sum()/2
PRIS = df["SCHLTYPE"].loc[df["SCHLTYPE"]== 3].sum()/3
PUBS / PRIS
#598521 / 137437

4.354875324694223

There are over 4x as many public school kids in the dataset than private school kids

#The Group Quarters variable idicates what living situation the individual resided in
#1 is Household under 1970 definition
#2 is Additional households under 1990 definition
#3 is Group quarters--Institutions
#4 is Other group quarters
#5 is Additional households under 2000 definition
df["GQ"].value_counts()

1    3081285
4      78375
3      72946
2       6006
5        941
Name: GQ, dtype: int64

(df["GQ"].value_counts() ** 0.2).plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe412dca730>

Predictive Model

Using U.S. Census data, we created a model that sets out to predict an individual's income based on if they went to public or private school with the given context of how far in the academic process they are.

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Create a new variable
features = ["SCHLTYPE", "EDUCD"]
Y_train = df["INCTOT"]
X_train_dict = df[features].to_dict(orient="records")

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)

X_train = vec.transform(X_train_dict)
x_new_dict = {
    "SCHLTYPE":3, #Private School
    "EDUCD":101 #Bachelor's Degree
    
}

X_new = vec.transform([x_new_dict])

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(x_new_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([68280.])

model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([52670.])

model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([41047.])

As you can see, the prediction of the salary of a private school kid who graduated from college is anywhere from 41,047 to 68,280 dollars.

from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Create a new variable
features = ["SCHLTYPE", "EDUCD"]
Y_train = df["INCTOT"]
X_train_dict = df[features].to_dict(orient="records")

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)

X_train = vec.transform(X_train_dict)
x_new_dict = {
    "SCHLTYPE":2, #Private School
    "EDUCD":101 #Bachelor's Degree
    
}

X_new = vec.transform([x_new_dict])

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(x_new_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([12100.])

model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([14330.])

model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([30996.66666667])

model = KNeighborsRegressor(n_neighbors=50)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)

array([33708.])

And the prediction salary of public school kids who also graduated from college is anywhere from 12,100 to 33,708 dollars (smaller likely as a result of unemployed individuals being included)

While only considering a couple factors, the model does line up with the notion that private school kids generally end up with higher paying jobs - Whether that be a product of the school itself or other factors such as familial wealth

Conclusion¶

It is difficult to say whether private schools have a better overall effect on children than public schools, as this can vary depending on many factors, including the specific schools and the individual needs and abilities of the children involved. In general, private schools may have smaller class sizes and more individualized attention, which can be beneficial for some students. However, public schools are often more diverse and may provide a broader range of extracurricular activities and other opportunities for students. Ultimately, the best choice for a child's education will depend on their individual needs and the specific schools available in their area. Something that has been observed through our data analysis is that overall, the standrdized test scores of private schools are slightly higher than public schools for every year and subject that we measured. For higher test scores, private school may be the better option, but it is still hard to tell whether or not those who attended public vs. private school have a better degree of success in life.

Further Resources¶

Nations Report Card - https://www.nationsreportcard.gov/ndecore/xplore/NDE
IPUMS USA Census Data - https://usa.ipums.org/usa/
Public School Characteristics Data - https://catalog.data.gov/dataset/public-school-characteristics-2019-20-3dfd2
Private school Data - https://nces.ed.gov/surveys/pss/pssdata.asp
Research Paper on predictors of academic success in higher education - https://educationaltechnologyjournal.springeropen.com/articles/10.1186/s41239-020-0177-7
Article comparing public schools and private schools with study findings: https://www.publicschoolreview.com/blog/new-study-confirms-that-private-schools-are-no-better-than-public-schools

%%shell
jupyter nbconvert --to html /content/arovn10.github.io/RovnerMotzFinalProject.ipynb

[NbConvertApp] Converting notebook /content/arovn10.github.io/RovnerMotzFinalProject.ipynb to html
[NbConvertApp] Writing 388297 bytes to /content/arovn10.github.io/RovnerMotzFinalProject.html

	Year	Jurisdiction	Gender	Average scale score	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	1990¹	National public	Male	297	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	1990¹	National public	Female	291	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1990¹	National private	Male	300	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1990¹	National private	Female	298	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	1992¹	National public	Male	299	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	1992¹	National public	Female	296	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	1992¹	National private	Male	317	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	1992¹	National private	Female	312	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	1996¹	National public	Male	304	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	1996¹	National public	Female	302	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	1996¹	National private	Male	316	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	1996¹	National private	Female	313	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	1996	National public	Male	302	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	1996	National public	Female	299	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
15	1996	National private	Male	311	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
16	1996	National private	Female	309	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
17	2000¹	National public	Male	302	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
18	2000¹	National public	Female	298	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
19	2000¹	National private	Male	315	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20	2000¹	National private	Female	314	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
21	2000	National public	Male	301	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
22	2000	National public	Female	297	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
23	2000	National private	Male	316	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
24	2000	National private	Female	314	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
26	¹ Accommodations were not permitted for this a...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
27	NOTE: Some apparent differences between estima...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
28	SOURCE: U.S. Department of Education, Institut...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
29	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
30	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
32	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	variable	value
0	GQ	4.0
1	GQ	3.0
2	GQ	4.0
3	GQ	3.0
4	GQ	3.0
...	...	...
12958207	DEGFIELD2	0.0
12958208	DEGFIELD2	0.0
12958209	DEGFIELD2	0.0
12958210	DEGFIELD2	0.0
12958211	DEGFIELD2	0.0

	variable	value
3239567	SCHLTYPE	3.0
3239571	SCHLTYPE	3.0
3239575	SCHLTYPE	3.0
3239590	SCHLTYPE	3.0
3239605	SCHLTYPE	3.0
...	...	...
6478951	SCHLTYPE	3.0
6478952	SCHLTYPE	3.0
6478991	SCHLTYPE	3.0
6479055	SCHLTYPE	3.0
6479056	SCHLTYPE	3.0

	variable	value
0	Percent to 4-Year College	NaN
1	Percent to 4-Year College	NaN
2	Percent to 4-Year College	0.000000
3	Percent to 4-Year College	65.000000
4	Percent to 4-Year College	50.000000
...	...	...
107855	Student Teacher Ratio	10.740741
107856	Student Teacher Ratio	5.517241
107857	Student Teacher Ratio	8.000000
107858	Student Teacher Ratio	6.428571
107859	Student Teacher Ratio	5.483871

	Year	Jurisdiction	Gender	Average scale score
1	1990	Public	Male	297
2	1990	Public	Female	291
3	1990	Private	Male	300
4	1990	Private	Female	298
5	1992	Public	Male	299
6	1992	Public	Female	296
7	1992	Private	Male	317
8	1992	Private	Female	312
9	1996	Public	Male	304
10	1996	Public	Female	302
11	1996	Private	Male	316
12	1996	Private	Female	313
13	1996	Public	Male	302
14	1996	Public	Female	299
15	1996	Private	Male	311
16	1996	Private	Female	309
17	2000	Public	Male	302
18	2000	Public	Female	298
19	2000	Private	Male	315
20	2000	Private	Female	314
21	2000	Public	Male	301
22	2000	Public	Female	297
23	2000	Private	Male	316
24	2000	Private	Female	314

	Year	Jurisdiction	Gender	Average scale score
29	1998	Public	Male	281
30	1998	Public	Female	297
31	1998	Private	Male	295
32	1998	Private	Female	310
33	1994	Public	Male	279
34	1994	Public	Female	293
35	1994	Private	Male	294
36	1994	Private	Female	310
37	1992	Public	Male	285
38	1992	Public	Female	295
39	1992	Private	Male	302
40	1992	Private	Female	312