Project By Max Motz and Alec Rovner

Website Link: https://arovn10.github.io/

Introduction

This projects goal is to determine whether or not the large amounts of money people pay to send their children to private school over public school is "worth it". Overall the project will graze over many factors that people do not know private schools offer over public schools. Benefits to private schools observed will include the significant difference in student-to-teacher ratio, standardized test scores, and future income after graduating.

Pulling files from Github

In [ ]:
#Collab code that clones github and extracts CSVs from it
%cd /
%cd content
%rm -r arovn10.github.io
!git clone https://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@github.com/arovn10/arovn10.github.io.git
#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
%cd arovn10.github.io/
/
/content
Cloning into 'arovn10.github.io'...
remote: Enumerating objects: 115, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 115 (delta 23), reused 0 (delta 0), pack-reused 65
Receiving objects: 100% (115/115), 68.18 MiB | 20.34 MiB/s, done.
Resolving deltas: 100% (42/42), done.
/content/arovn10.github.io
In [ ]:
!git pull
Already up to date.
In [ ]:
from google.colab import drive
import gzip
#drive.mount('/content/drive')
#%cd /content/drive/My Drive/cmps3160
# !git pull
#%cd _projects/FinalProjectRovnerMotz
#test

import numpy as np
import pandas as pd
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen

Question and Preliminary Info

Final Version Data Science: Max Motz and Alec Rovner:

We initially set out to use the IPUMS data set to compare and contrast public and private school students with a host of factors including post-grad salary, owned real-estate, veteran status, and other factors to find interesting correlations and trends to analyze the difference in attending individuals and long-term effects of the two schooling systems. While we are still using the IPUMS data set, we are pivoting toward a more comprehensive analysis of the similarities and differences between public and private school characteristics to determine the fundamental differences of both types of institutions. This will involve a comparative analysis between a public and private school dataset both from school year 2019-2020 and an analysis of IPUMS census data regarding the characteristics of students and graduees who attended each type of school.

The second data set that we are using comes from NCES and involves private schools, the day-to-day attributes of those private schools, and also attributes of the students within the schools. This dataset will allow us to view how successful private school students are overall and if they attend a 4-year university after.

The third dataset that we are using also comes from NCES and is information on public schools and their day-to-day information such as student-to-teacher ratio and how many students per grade there are. This will allow us to compare and contrast the different situations of students at these schools.

The fourth dataset we are using is the Nations Report Card datset. It contains standardized test scores for reading and writing for both public and private schools. This allows us to visualize the differences between the students testing abilities after being in their respective schooling system for some time as we chose grade 12 for the standardized testing scores.

Our collaboration plan is to utilize Google Drives Collab Feature to simultaneously work on a Python notebook. We will make regular commits to a github repository for version control and fork/work on independent copies of the file at times when we are both working on different aspects of the file. Our group has been meeting one to two times per week in order to recap what we have been working on and set goals for each of us to individually commit and push to the project by our next meeting. The majority of our meetings will take place in person on Wednesday night each week as we are both available then.

Through the four data sets, the question we hope to answer is "What are the fundamental difference in characteristics between public and private schools and what is the general difference in the outcomes of students attending either? Given the data, what differences in experience are likely to occur (Part 1) and what outcomes should we expect to see (Part 2)?"

A section for dataset links and extra resources is located at the end of the notebook file for those interested in reading more on the subject

Hypothesis

Compared to public schools, private school will typically have smaller school sizes, higher student to teacher ratio, less demographic variation, and a similar age distribution to public school. We believe the data will show that private school kids do better academically and will be more likely to have success in their future academic and work pursuits.

Initializing and Tidying every Dataset

In [ ]:
#Student Data Set
fileLoc = "/content/arovn10.github.io/usa_00007.csv.gz"
a_file = gzip.open(fileLoc, "rb")
df = pd.read_csv(a_file, engine = 'python')
#df.head()

#Private School Dataset
fileLoc = "/content/arovn10.github.io/privateschool1920.csv"
df2 = pd.read_csv(fileLoc, engine = 'python')
#df2.head()
  
#Public School Dataset
fileLoc = "/content/arovn10.github.io/PublicSchool.csv.gz"
a_file = gzip.open(fileLoc, "rb")
df3 = pd.read_csv(a_file, engine = 'python')
#df3.head()

#Public and Private School Standardized Test Dataset

#Nations Report Card Dataset 1 - Math
fileLoc = "/content/arovn10.github.io/NDECoreExcel_Mathematics, Grade 12, Gender_20221207015704.csv"

df4 = pd.read_csv(fileLoc, engine = 'python', skiprows = 7)
new_headers = df4.iloc[0] #grab the first row for the header
df4 = df4[1:] #take the data less the header row
df4.columns = new_headers #set the header row as the df header

#Nations Report Card Dataset 2 - Reading
fileLoc = "/content/arovn10.github.io/NDECoreExcel_Reading, Grade 12, Gender_20221215035021.csv"

df5 = pd.read_csv(fileLoc, engine = 'python', skiprows = 7)
new_headers = df5.iloc[0] #grab the first row for the header
df5 = df5[1:] #take the data less the header row
df5.columns = new_headers #set the header row as the df header

Initial Issues Encountered with Datasets

A major issue we encountered while reading the datasets is that the initial dataset was crashing the ram during the reading process, as it was a huge 9GB files. We have since altered are query to the most recent year and removed unecessary variables from our query to have it working in Collab. After learning from this mistake, we have found datasets that are more exact with our needs and also eliminate unnecessary variables before downloading.

Datasets

DataSet 1: IPUMS

Dataset Short Description: The IPUMS dataset that we chose contains US census data in order to get the most random and generalized public and private school information along with life factors for each datapoint. This is the largest scope dataset that we use. italicized text

Variables Included

In [ ]:
df.keys()
Out[ ]:
Index(['YEAR', 'CBSERIAL', 'GQ', 'EDUC', 'EDUCD', 'SCHLTYPE', 'DEGFIELD2',
       'DEGFIELD2D', 'EMPSTAT', 'EMPSTATD', 'INCTOT'],
      dtype='object')

Variable Legend DF1:

  • YEAR (Census year)
  • GQ (Group Quarters Status)
  • EDUC (Education Attainment [general version])
  • EDUCD (Education Attainment [detailed version])
  • SCHLTYPE (Public or private school)
  • DEGFIELD2 (Field of degree (2) [general version])
  • DEGFIELD2D (Field of degree (2) [detailed version])
  • EMPSTAT (Employment status [general version])
  • EMPSTATD (Employment status [detailed version])
  • INCTOT (Total personal income)

The IPUMS dataset is a US census based data set that gives us plenty of demographic information on the population and their life factors after they attended public or private school.

Tidying the IPUMS data

In [ ]:
df['SCHLTYPE'] = df['SCHLTYPE'].map({3: 3, 4: 3, 5: 3, 6: 3, 7: 3}).fillna(df['SCHLTYPE'])
df['SCHLTYPE'].value_counts()

tidydf1 = df.melt(value_vars=['GQ','SCHLTYPE','INCTOT', 'DEGFIELD2'])
tidydf1
Out[ ]:
variable value
0 GQ 4.0
1 GQ 3.0
2 GQ 4.0
3 GQ 3.0
4 GQ 3.0
... ... ...
12958207 DEGFIELD2 0.0
12958208 DEGFIELD2 0.0
12958209 DEGFIELD2 0.0
12958210 DEGFIELD2 0.0
12958211 DEGFIELD2 0.0

12958212 rows × 2 columns

Here we had to simplify DF1 SCHLTYPE values into two numbers as there any many different types of private schools recorded. Any SCHLTYPE form 3-7 is a private school so we mapped the values all to 3 and then tidied the data. Value of 0 for SCHLTYPE mean N/A and values of 1 mean Not enrolled.

In [ ]:
tidydf1.loc[tidydf1['variable'] == 'SCHLTYPE'].loc[tidydf1['value'] >= 3]
Out[ ]:
variable value
3239567 SCHLTYPE 3.0
3239571 SCHLTYPE 3.0
3239575 SCHLTYPE 3.0
3239590 SCHLTYPE 3.0
3239605 SCHLTYPE 3.0
... ... ...
6478951 SCHLTYPE 3.0
6478952 SCHLTYPE 3.0
6478991 SCHLTYPE 3.0
6479055 SCHLTYPE 3.0
6479056 SCHLTYPE 3.0

137437 rows × 2 columns

Here we melt the three mostimportant variables into "tidydf1". This will allow us to hone in on the most important variables in our IPUMS dataset. The variables chosen are Group Quarter, School Type(public or private), and Total Income of each person.

DataSet 2: Private Schools

Dataset Short Description: The Private School dataset that we chose contains nationwide data on private schools, their locations, total enrollment, graduation rates, and many other useful datapoints that give great insight into the key differences between public and private schools.

Variables Included

There are 156 variables in the Private School Dataset. Obviously it would be impractical to fit all of them in a singular datatable so we chose important ones to query and tidy up.

In [ ]:
for key in df2.keys():
  print(key)
Method of Interview
4A Ungraded Offered
4A Ungraded Enrollment
4B Nursery/Prekindergarten Offered
4B Nursery/Prekindergarten Enrollment
4C Kindergarten Offered
4C Kindergarten Enrollment
4D Transitional Kindergarten Offered
4D Transitional Kindergarten Enrollment
4E Transitional First Grade Offered
4E Transitional First Grade Enrollment
4F First Grade Offered
4F First Grade Enrollment
4G Second Grade Offered
4G Second Grade Enrollment
4H Third Grade Offered
4H Third Grade Enrollment
4I Fourth Grade Offered
4I Fourth Grade Enrollment
4J Fifth Grade Offered
4J Fifth Grade Enrollment
4K Sixth Grade Offered
4K Sixth Grade Enrollment
4L Seventh Grade Offered
4L Seventh Grade Enrollment
4M Eighth Grade Offered
4M Eighth Grade Enrollment
4N Ninth Grade Offered
4N Ninth Grade Enrollment
4O Tenth Grade Offered
4O Tenth Grade Enrollment
4P Eleventh Grade Offered
4P Eleventh Grade Enrollment
4Q Twelfth Grade Offered
4Q Twelfth Grade Enrollment
5 Total Student Enrollment
6A Hispanic or Latino Students
6B White Students
6C Black Students
6D Asian Students
6E Native Hawaiian/Pacific Islander Students
6F American Indian/Alaska Native Students
6G Students of Two or More Races
7A Is School Coeducational
7B Number of Male Students
8A Students in 12th Grade Last Year
8B Number of 12th Graders Last Year
8D Percent to 4-Year College
9A Length of School Day for Kindergarten
9B Days Per Week for Kindergarten
10 Full-Time Teachers
10 More Than 3/4 Time Teachers
10 More Than 1/2 Time Teachers
10 More Than 1/4 Time Teachers
10 Less Than 1/4 Time Teachers
10 Total Number of Teachers
11 Type of School
12A School Supports Home Schooling
12B School Located in Private Home
13A Does School Have Religious Orientation
13B School Affiliation
13C Religious Orientation
13C Type of Catholic School
14 No Associations
14 Accelerated Christian Education
14 American Association of Christian Schools
14 Association of Christian Schools International
14 Association of Christian Teachers and Schools
14 Association of Classical and Christian Schools
14 Christian Schools International
14 Council of Islamic Schools in North America
14 Evangelical Lutheran Education Association
14 Friends Council on Education
14 General Conference of SDA Church
14 Islamic School League of America
14 Jesuit Secondary Education Association
14 National Association of Episcopal Schools
14 National Catholic Educational Association
14 National Christian School Association
14 National Society for Hebrew Day Schools
14 Oral Roberts University Educational Fellowship
14 Prizmah: Center for Jewish Day Schools
14 Southern Baptist Association of Christian Schools
14 Other Religious School Associations
14 American Montessori Society
14 Association Montessori International
14 Other Montessori Associations
14 Association of Military Colleges and Schools
14 Association of Waldorf Schools of North America
14 National Association of Private Special Education Centers
14 Other Associations for Exceptional Children
14 European Council for International Schools
14 National Association for the Education of Young Children
14 National Association of Laboratory Schools
14 National Coalition of Girls' Schools
14 Other Special Emphasis Associations
14 Alternative School Network
14 National Association of Independent Schools
14 State/Regional Independent School Association
14 National Independent Private Schools Association
14 The Association of Boarding Schools
14 Other School Associations
15 Days in School Year
16 Hours in School Day for Students
16 Minutes in School Day for Students
17 Library or Library Media Center
Permanent Identification Number
School Name 
Mailing Address
Mailing City
Mailing State Postal Code
Mailing Zip Code
Mailing Zip4 Code
School phone number
ANSI County Code
ANSI County Code + State Code
County Name
Location Address
Location City
Location State Postal Code
Location Zip Code
Location Zip4 Code
Census Region
ANSI State Code (physical location of school)
Urban-centric Locale Code
Latitude
Longitude
State Legislative District - Lower
State Legislative District - Upper
114th Congressional District
2019-20 Lowest Grade Level Recode
2019-20 Highest Grade Level Recode
Universe frame to which  school belongs
Tabulation Flag
School Typology
Affiliation
School Orientation
Diocese or archdiocese for Catholic schools
Level of Instruction
Level of Instruction.1
Total Number of Students (K-12, UG)
Size of School (K-12, UG)
Number of K-12 Teachers (FTE)
Urban-centric Community Type
Length of School Day in Hours
Number of Male Students (K-12, UG)
Total Number of Kindergarten Students
Percentage of American Indian or Alaskan Native Students
Percentage of Asian Students
Percentage of Native Hawaiian or Pacific Islander students
Percentage of Hispanic Students
Percentage of White Students
Percentage of Black Students
Percentage of Students of Two or More Races
Student Teacher Ratio
Unnamed: 155

Variable Legend DF2:

  • Percent to 4 Year College - Percentage of Graduates that Went to College
  • Total Student Enrollment - Total Population at School
  • Type of School - School Type
  • Hours in School Day for Students - Length of School Day
  • Student Teacher Ratio

Tidying

In [ ]:
df2.rename(columns = {'8D Percent to 4-Year College': 'Percent to 4-Year College' ,'5 Total Student Enrollment': "Total Student Enrollment",'11 Type of School': 'Type of School', '16 Hours in School Day for Students': 'Hours in School Day for Students'}, inplace = True)
In [ ]:
tidydf2 = df2.melt(value_vars=['Percent to 4-Year College','Total Student Enrollment','Type of School', 'Hours in School Day for Students', 'Student Teacher Ratio'])
tidydf2
Out[ ]:
variable value
0 Percent to 4-Year College NaN
1 Percent to 4-Year College NaN
2 Percent to 4-Year College 0.000000
3 Percent to 4-Year College 65.000000
4 Percent to 4-Year College 50.000000
... ... ...
107855 Student Teacher Ratio 10.740741
107856 Student Teacher Ratio 5.517241
107857 Student Teacher Ratio 8.000000
107858 Student Teacher Ratio 6.428571
107859 Student Teacher Ratio 5.483871

107860 rows × 2 columns

Similar to the private school dataset, there is an exceedingly large amount of variables in the public. Obviously it would be impractical to fit all of them in a singular datatable so we chose important ones to query and tidy up.

DataSet 3: Public Schools

Dataset Short Description: The Public School dataset that we chose contains nationwide data on public schools, their locations, total enrollment, graduation rates, and many other useful datapoints that give great insight into the differences betwen public and private schools.

Variables Included

In [ ]:
df3.keys()
Out[ ]:
Index(['X', 'Y', 'NCESSCH', 'SURVYEAR', 'STABR', 'LEAID', 'ST_LEAID',
       'LEA_NAME', 'SCH_NAME', 'LSTREET1', 'LSTREET2', 'LCITY', 'LSTATE',
       'LZIP', 'LZIP4', 'PHONE', 'GSLO', 'GSHI', 'VIRTUAL', 'TOTFRL', 'FRELCH',
       'REDLCH', 'PK', 'KG', 'G01', 'G02', 'G03', 'G04', 'G05', 'G06', 'G07',
       'G08', 'G09', 'G10', 'G11', 'G12', 'G13', 'TOTAL', 'MEMBER', 'AM', 'HI',
       'BL', 'WH', 'HP', 'TR', 'FTE', 'LATCOD', 'LONCOD', 'ULOCALE', 'NMCNTY',
       'STUTERATIO', 'TITLEI', 'STITLEI', 'AMALM', 'AMALF', 'ASALM', 'ASALF',
       'HIALM', 'HIALF', 'BLALM', 'BLALF', 'WHALM', 'WHALF', 'HPALM', 'HPALF',
       'TRALM', 'TRALF', 'TOTMENROL', 'TOTFENROL', 'STATUS', 'UG', 'AE',
       'SCHOOL_TYPE_TEXT', 'SY_STATUS_TEXT', 'SCHOOL_LEVEL', 'AS',
       'CHARTER_TEXT', 'MAGNET_TEXT', 'OBJECTID'],
      dtype='object')

Variable Legend DF3:

  • SCHOOL_TYPE_TEXT - School Type
  • Total - Total students, all grades (includes AE)
  • STUTERATIO - Student to Teacher Ratio

DataSet 4: Report Card Dataset

Dataset Short Description: The Nations Report Card dataset contains information on standardized test scores for reading and writing for both public and private schools. This allows us to visualize the differences between the students testing abilities after being in their respective schooling system for some time as we chose grade 12 for the standardized testing scores.

In [ ]:
df4
Out[ ]:
Year Jurisdiction Gender Average scale score NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1990¹ National public Male 297 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1990¹ National public Female 291 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1990¹ National private Male 300 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1990¹ National private Female 298 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 1992¹ National public Male 299 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 1992¹ National public Female 296 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 1992¹ National private Male 317 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 1992¹ National private Female 312 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 1996¹ National public Male 304 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 1996¹ National public Female 302 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 1996¹ National private Male 316 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 1996¹ National private Female 313 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 1996 National public Male 302 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 1996 National public Female 299 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 1996 National private Male 311 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 1996 National private Female 309 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 2000¹ National public Male 302 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 2000¹ National public Female 298 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 2000¹ National private Male 315 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 2000¹ National private Female 314 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 2000 National public Male 301 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 2000 National public Female 297 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 2000 National private Male 316 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 2000 National private Female 314 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 ¹ Accommodations were not permitted for this a... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 NOTE: Some apparent differences between estima... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 SOURCE: U.S. Department of Education, Institut... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
30 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
31 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

32 rows × 26 columns

In [ ]:
#Public and Private School Standardized Test Dataset

#TIDY UP DF4 SOME MORE
df4 = df4.iloc[:, 0:4]
df4['Jurisdiction'] = df4['Jurisdiction'].map({'National public': 'Public', 'National private': 'Private'})
df4.Year = df4['Year'].map({'1990¹': '1990','1990': '1990', '1992¹': '1992', '1992': '1992', '1996¹': '1996', '1996': '1996', '2000¹': '2000', '2000': '2000'})
df4 = df4.iloc[:-8]
  #take the data less the las 8 rows that are filler
df4.head(100)
Out[ ]:
Year Jurisdiction Gender Average scale score
1 1990 Public Male 297
2 1990 Public Female 291
3 1990 Private Male 300
4 1990 Private Female 298
5 1992 Public Male 299
6 1992 Public Female 296
7 1992 Private Male 317
8 1992 Private Female 312
9 1996 Public Male 304
10 1996 Public Female 302
11 1996 Private Male 316
12 1996 Private Female 313
13 1996 Public Male 302
14 1996 Public Female 299
15 1996 Private Male 311
16 1996 Private Female 309
17 2000 Public Male 302
18 2000 Public Female 298
19 2000 Private Male 315
20 2000 Private Female 314
21 2000 Public Male 301
22 2000 Public Female 297
23 2000 Private Male 316
24 2000 Private Female 314
In [ ]:
#TIDY UP DF5 SOME MORE
df5 = df5.iloc[:, 0:4]
df5['Jurisdiction'] = df5['Jurisdiction'].map({'National public': 'Public', 'National private': 'Private'})
df5.Year = df5['Year'].map({'1992¹': '1992', '1992': '1992', '1994¹': '1994', '1996': '1996', '1998¹': '1998'})
df5 = df5.iloc[:-8]
  #take the data less the las 8 rows that are filler
df5 = df5.dropna()
df5.head(100)
Out[ ]:
Year Jurisdiction Gender Average scale score
29 1998 Public Male 281
30 1998 Public Female 297
31 1998 Private Male 295
32 1998 Private Female 310
33 1994 Public Male 279
34 1994 Public Female 293
35 1994 Private Male 294
36 1994 Private Female 310
37 1992 Public Male 285
38 1992 Public Female 295
39 1992 Private Male 302
40 1992 Private Female 312

EDA Part 1: Instituonal Analysis

This Portion of the project analyzes some of the objective statistical differences between public and private school.

Number of Students Per a school

Student Ratios Public/Private

In [ ]:
df2['Total Student Enrollment'].mean()
Out[ ]:
182.11417578342295
In [ ]:
(np.log(df2['Total Student Enrollment'])).plot.hist(bins=50)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe426fa3b20>

The average number of kids who attend a given private school is 182

In [ ]:
APS = df3.loc[df3['TOTAL'] > 0]
APS['TOTAL'].mean()
Out[ ]:
526.7716811224755
In [ ]:
(np.log(APS['TOTAL'])).plot.hist(bins=50)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe426dee340>

The average number of kids who attend a given public school is 527

In [ ]:
APS['TOTAL'].mean()/df2['Total Student Enrollment'].mean()
Out[ ]:
2.8925352947204517

There are 2.89x as many students in each public school than private school on average

Number of Students Per a grade

In [ ]:
kiddist2 = pd.DataFrame().assign(Kinder=df2['4C Kindergarten Enrollment'], First=df2['4F First Grade Enrollment'], Second=df2['4G Second Grade Enrollment'], Third=df2['4H Third Grade Enrollment'], Fourth=df2['4I Fourth Grade Enrollment'], Fith=df2['4J Fifth Grade Enrollment'], Sixth=df2['4K Sixth Grade Enrollment'], Seventh=df2['4L Seventh Grade Enrollment'], Eigth=df2['4M Eighth Grade Enrollment'], Ninth=df2['4N Ninth Grade Enrollment'], Tenth=df2['4O Tenth Grade Enrollment'], Eleventh=df2['4P Eleventh Grade Enrollment'], Twelth=df2['4Q Twelfth Grade Enrollment'])
kiddist2.mean()
Out[ ]:
Kinder      18.237556
First       16.609326
Second      16.308787
Third       16.446030
Fourth      16.440389
Fith        16.813804
Sixth       18.011428
Seventh     18.836964
Eigth       19.000951
Ninth       33.864311
Tenth       34.956910
Eleventh    35.968309
Twelth      36.138910
dtype: float64
In [ ]:
(kiddist2.mean()).plot.bar()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe41f65bf10>

There is a noticeable increase in private school kids in highschool than middle and lower school. This is most likely because 9-12 schools are common enough to increase high school numbers.

In [ ]:
kiddist3 = pd.DataFrame().assign(Prekinder = df3['PK'], Kinder=df3['KG'], First=df3['G01'], Second=df3['G02'], Third=df3['G03'], Fourth=df3['G04'], Fith=df3['G05'], Sixth=df3['G06'], Seventh=df3['G07'], Eigth=df3['G08'], Ninth=df3['G09'], Tenth=df3['G10'], Eleventh=df3['G11'], Twelth=df3['G12'],Thirteen=df3['G13'])
kiddist3.mean()
Out[ ]:
Prekinder     37.321986
Kinder        66.777246
First         65.120011
Second        64.911839
Third         65.732845
Fourth        66.282372
Fith          69.436625
Sixth         95.914576
Seventh      109.734960
Eigth        107.402135
Ninth        130.727898
Tenth        125.723943
Eleventh     119.206975
Twelth       117.606496
Thirteen      15.142857
dtype: float64
In [ ]:
(kiddist3.mean()).plot.bar()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe4281eb040>

There is a noticeable increase in public school kids in middle school and a bigger one in high school.

In [ ]:
kiddist2.mean().corr(kiddist3.mean(), method='pearson', min_periods=None)
Out[ ]:
0.8374020928730136

The correlation between the ratio of kids in public and private school is 0.84

Student Teacher Ratio

In [ ]:
df2['Student Teacher Ratio'].describe()
Out[ ]:
count    21572.000000
mean        10.564167
std         37.902679
min          0.068966
25%          6.071429
50%          9.149468
75%         12.689884
max       5230.000000
Name: Student Teacher Ratio, dtype: float64
In [ ]:
df3['STUTERATIO'].describe()
Out[ ]:
count    86194.000000
mean        15.443857
std         39.390543
min       -100.000000
25%         12.244898
50%         14.745189
75%         17.118270
max       7100.000000
Name: STUTERATIO, dtype: float64
In [ ]:
print(df3['STUTERATIO'].mean()/df2['Student Teacher Ratio'].mean())
1.4619095686235621

It is clear that the student-to-teacher ratio across tens of thousands of schools is lower for private schools than public schools. Public schools have a higher ratio by a factor of 1.46.

Location

Most Common Cities

In [ ]:
df2["Location City"].value_counts()
Out[ ]:
MILLERSBURG       38
MIAMI             27
FREDERICKSBURG    24
APPLE CREEK       24
LAKEWOOD          22
                  ..
NEOSHO             1
FLORISSANT         1
BOURBON            1
TUNAS              1
CODY               1
Name: Location City, Length: 2042, dtype: int64

The most common city for Private school is Millersburg, followed by Miami and Fredericksburg

In [ ]:
df3["LCITY"].value_counts()
Out[ ]:
HOUSTON              796
Chicago              664
Los Angeles          583
BROOKLYN             576
SAN ANTONIO          544
                    ... 
RICHBURG               1
Los Nietos             1
NORTH TARRYTOWN        1
South San Gabriel      1
MENANDS                1
Name: LCITY, Length: 14603, dtype: int64

The most common city for public school is Houston, followed by Chicacog and Los Angeles

Longitude and Lattitude

Private School Distribution

In [ ]:
privloc = pd.DataFrame().assign(Latitude = df2['Latitude'], Longitude = df2["Longitude"])
privloc

privloc.plot.scatter(x="Latitude", y="Longitude", alpha = 0.3);
In [ ]:
display(df2['Latitude'].mean())
display(df2['Longitude'].mean())
37.912500703040976
-90.29729882199148

Public School Distribution

In [ ]:
publoc = pd.DataFrame().assign(Latitude = df3['LATCOD'], Longitude = df3["LONCOD"])
publoc

publoc.plot.scatter(x="Latitude", y="Longitude", alpha = 0.3);
In [ ]:
display(df3['LATCOD'].mean())
display(df3['LONCOD'].mean())
37.83579752299039
-93.06111102597183

Private School Compared to Public School coordinates

In [ ]:
privloc["Type"] = "Private"
publoc["Type"] = "Public"
location = pd.concat([publoc,privloc], ignore_index=True)
colors = location["Type"].map({
    "Public": "blue",
    "Private": "red"
})

location.plot.scatter(x="Latitude", y="Longitude", c=colors, alpha = 0.3);
location["Type"].value_counts()
Out[ ]:
Public     100455
Private     21572
Name: Type, dtype: int64

It seems that private school and public schools general appear in the same locations with public school being more localized to big cities and private schools being more common elsewhere.

EDA Part 2: Idividual Analysis

This portion of our product analyzes public and private school student performance using The Nation's Report Card math and english scores and outcomes using the IPUMS U.S. census dataset. While the main focus of both of these datasets is not Private or Public school data and thus have a weaker link to the overall project, they both have public school and private school distinctions for students and therefore can be used to identify and analyze potential differences.

Disclaimer: While this portion of the project attempts to analyze the performance and outcome of Private school students, it is important to disclose that we do not have access to all factors that could have an affect on students. The most prevalent of these factors to keep in mind is the wealth of the average private school family vs. that of the average public school family. Almost every private school has a pricy tuition and families who have the resources to pay most likely also have the resources to help their children succeed in school and onwards. Therefore, while we cannot say for certain how much of an influence private school had on their outcomes, we can measure the correlation between going to one type of school and having a certain performance and outcome.

Report Card Score Math Analysis

In [ ]:
import seaborn as sns


df4 = df4.astype({'Average scale score':'int'})
#df4 = df4.groupby(by = ['Jurisdiction']).head()
sns.catplot(data=df4, x='Year', y='Average scale score', hue='Jurisdiction', col = 'Gender', kind = 'bar')
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7fe4157d1c70>

Report Card Score English Analysis

In [ ]:
import seaborn as sns


df5 = df5.astype({'Average scale score':'int'})
#df4 = df4.groupby(by = ['Jurisdiction']).head()
sns.catplot(data=df5, x='Year', y='Average scale score', hue='Jurisdiction', col = 'Gender', kind = 'bar')
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7fe4156ce4f0>

It can be seen here that although the scores each year are relatively stable, private schools always have a slight edge on testing score.

Analysis of IPUMS Dataset

In [ ]:
#The School Type variable shows the proportion of indiviudals in the dataset and what school they go to
#0 is N/A
#1 is Not enrolled
#2 is Public School
#3 is private school
df["SCHLTYPE"].value_counts().plot.pie(y='mass', figsize=(5, 5))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe412e14c70>
In [ ]:
#Interesting Stat
PUBS = df["SCHLTYPE"].loc[df["SCHLTYPE"]== 2].sum()/2
PRIS = df["SCHLTYPE"].loc[df["SCHLTYPE"]== 3].sum()/3
PUBS / PRIS
#598521 / 137437
Out[ ]:
4.354875324694223

There are over 4x as many public school kids in the dataset than private school kids

In [ ]:
#The Group Quarters variable idicates what living situation the individual resided in
#1 is Household under 1970 definition
#2 is Additional households under 1990 definition
#3 is Group quarters--Institutions
#4 is Other group quarters
#5 is Additional households under 2000 definition
df["GQ"].value_counts()
Out[ ]:
1    3081285
4      78375
3      72946
2       6006
5        941
Name: GQ, dtype: int64
In [ ]:
(df["GQ"].value_counts() ** 0.2).plot.bar()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe412dca730>

Predictive Model

Using U.S. Census data, we created a model that sets out to predict an individual's income based on if they went to public or private school with the given context of how far in the academic process they are.

In [ ]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Create a new variable
features = ["SCHLTYPE", "EDUCD"]
Y_train = df["INCTOT"]
X_train_dict = df[features].to_dict(orient="records")

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)

X_train = vec.transform(X_train_dict)
x_new_dict = {
    "SCHLTYPE":3, #Private School
    "EDUCD":101 #Bachelor's Degree
    
}

X_new = vec.transform([x_new_dict])

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(x_new_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([68280.])
In [ ]:
model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([52670.])
In [ ]:
model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([41047.])

As you can see, the prediction of the salary of a private school kid who graduated from college is anywhere from 41,047 to 68,280 dollars.

In [ ]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# Create a new variable
features = ["SCHLTYPE", "EDUCD"]
Y_train = df["INCTOT"]
X_train_dict = df[features].to_dict(orient="records")

vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)

X_train = vec.transform(X_train_dict)
x_new_dict = {
    "SCHLTYPE":2, #Private School
    "EDUCD":101 #Bachelor's Degree
    
}

X_new = vec.transform([x_new_dict])

# Dummy encoding
vec = DictVectorizer(sparse=False)
vec.fit(X_train_dict)
X_train = vec.transform(X_train_dict)
X_new = vec.transform(x_new_dict)

# Standardization
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_new_sc = scaler.transform(X_new)

# K-Nearest Neighbors Model
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([12100.])
In [ ]:
model = KNeighborsRegressor(n_neighbors=10)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([14330.])
In [ ]:
model = KNeighborsRegressor(n_neighbors=30)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([30996.66666667])
In [ ]:
model = KNeighborsRegressor(n_neighbors=50)
model.fit(X_train_sc, Y_train)
model.predict(X_new_sc)
Out[ ]:
array([33708.])

And the prediction salary of public school kids who also graduated from college is anywhere from 12,100 to 33,708 dollars (smaller likely as a result of unemployed individuals being included)

While only considering a couple factors, the model does line up with the notion that private school kids generally end up with higher paying jobs - Whether that be a product of the school itself or other factors such as familial wealth

Conclusion

It is difficult to say whether private schools have a better overall effect on children than public schools, as this can vary depending on many factors, including the specific schools and the individual needs and abilities of the children involved. In general, private schools may have smaller class sizes and more individualized attention, which can be beneficial for some students. However, public schools are often more diverse and may provide a broader range of extracurricular activities and other opportunities for students. Ultimately, the best choice for a child's education will depend on their individual needs and the specific schools available in their area. Something that has been observed through our data analysis is that overall, the standrdized test scores of private schools are slightly higher than public schools for every year and subject that we measured. For higher test scores, private school may be the better option, but it is still hard to tell whether or not those who attended public vs. private school have a better degree of success in life.


Further Resources

In [ ]:
%%shell
jupyter nbconvert --to html /content/arovn10.github.io/RovnerMotzFinalProject.ipynb
[NbConvertApp] Converting notebook /content/arovn10.github.io/RovnerMotzFinalProject.ipynb to html
[NbConvertApp] Writing 388297 bytes to /content/arovn10.github.io/RovnerMotzFinalProject.html
Out[ ]: