Sunday, September 29, 2013

Breaking down the AP Calc dataset by Demographic Data

I wanted to look at which demographic variables classify a county as having AP Calculus in all of the years 2008-2013.  I decided to use the C50 package in R to analyze the data since I was having little luck with logistic regression.  Logistic regression coefficients can be hard to interpret, but decision trees are very easy to read.  The algorithm slices and dices the data and comes up with simple rules that sort the counties into groups.  Part of the output from this algorithm follows:

pop > 44542: TRUE (20)
pop <= 44542:
:...pop <= 16132: FALSE (36/3)
    pop > 16132:
    :...pHouse > 2.576285: FALSE (5)
        pHouse <= 2.576285:
        :...pctFreeRed <= 28.7193: TRUE (8)
            pctFreeRed > 28.7193:
            :...PCIncome > 30964: TRUE (3)
                PCIncome <= 30964:
                :...pctNoFather <= 17.9: FALSE (10)
                    pctNoFather > 17.9:
                    :...pSQM <= 18.2: FALSE (2)
                        pSQM > 18.2: TRUE (3)

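A TRUE means the rule classifies the county as having AP Calculus.  The number in parentheses is how many counties fall under that rule, and a second number after the slash, when present, is how many of those were misclassified.  For example, FALSE (36/3) means 36 counties landed in that leaf and 3 of them actually had AP Calculus.  Maps and commentary are after the jump.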
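To make the rules concrete, the tree above can be read as a chain of if-statements. The following is a minimal Python sketch of that reading, not the code used for the analysis (which was done in R). The thresholds and column names are copied from the output. The function name `has_ap_calc` and the example county are hypothetical and not taken from the dataset.

```python
def has_ap_calc(county):
    """Follow the printed tree from the root down to a leaf.

    `county` is a dict keyed by the column names in the tree output.
    Returns True where the tree predicts AP Calculus.
    """
    if county["pop"] > 44542:
        return True
    if county["pop"] <= 16132:
        return False
    # From here on: 16132 < pop <= 44542
    if county["pHouse"] > 2.576285:
        return False
    if county["pctFreeRed"] <= 28.7193:
        return True
    if county["PCIncome"] > 30964:
        return True
    if county["pctNoFather"] <= 17.9:
        return False
    return county["pSQM"] > 18.2


# Hypothetical mid-sized county, made up for illustration only.
example = {
    "pop": 30000,
    "pHouse": 2.4,
    "pctFreeRed": 25.0,
    "PCIncome": 28000,
    "pctNoFather": 15.0,
    "pSQM": 20.0,
}
print(has_ap_calc(example))  # True: mid-sized, small households, low pctFreeRed
```

Each early return corresponds to one leaf of the tree, so the order of the checks matters: a later test only applies to counties that fell through all the earlier ones.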



It's surprising that so many (20) can be classified based on county population alone.  Only 37 counties had AP Calculus in all of 2008-2013, and 20 of them have a population over 44542.

The next 8 counties (the pctFreeRed <= 28.7193 rule) are on the outskirts of the large-population counties and have slightly smaller populations, but their % Free and Reduced Lunch is much lower than average, and their number of persons per household is about average for the dataset.  With these 8 added to the first 20, 28 of the 37 are classified so far.
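The counts printed at the leaves of the tree can be tallied to check these totals. This is a small Python sketch that re-enters the leaf counts from the output by hand. It assumes the usual C5.0 reading of N/M as "M of the N counties in this leaf were misclassified". The variable names are my own.

```python
# (predicted label, counties in leaf, misclassified), in the order printed.
leaves = [
    (True, 20, 0),   # pop > 44542
    (False, 36, 3),  # pop <= 16132
    (False, 5, 0),   # pHouse > 2.576285
    (True, 8, 0),    # pctFreeRed <= 28.7193
    (True, 3, 0),    # PCIncome > 30964
    (False, 10, 0),  # pctNoFather <= 17.9
    (False, 2, 0),   # pSQM <= 18.2
    (True, 3, 0),    # pSQM > 18.2
]

total_counties = sum(n for _, n, _ in leaves)
predicted_true = sum(n for label, n, _ in leaves if label)
# Counties wrongly placed in a FALSE leaf actually had AP Calculus.
missed_true = sum(m for label, _, m in leaves if not label)
# Counties wrongly placed in a TRUE leaf did not have it.
false_alarms = sum(m for label, _, m in leaves if label)
actual_true = predicted_true - false_alarms + missed_true

classified_so_far = 20 + 8  # first population rule plus the pctFreeRed rule

print(total_counties, predicted_true, missed_true, actual_true, classified_so_far)
```

The tally gives 34 counties in TRUE leaves plus the 3 misclassified under the pop <= 16132 rule, which matches the 37 counties with AP Calculus, out of 87 counties in total. Note that these counts come from the same data the tree was fit to, so they say nothing about how the rules would do on new data.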
