Friday, November 1, 2013

Socioeconomic model of whether Minnesota counties had AP Calculus 2008-2013

This is a first attempt at creating a model from socioeconomic variables. I used the same variables that showed up in the decision tree: Population, People per Household, Percent Free and Reduced Lunch, Per Capita Income, Percent No Father on Birth Certificate, and Population per Square Mile.
Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)        1.837e+01  1.176e+01   1.561   0.1185
data1$pop          6.863e-05  3.416e-05   2.009   0.0445 *
data1$pHouse      -7.694e+00  3.795e+00  -2.027   0.0426 *
data1$pctFreeRed  -8.018e-02  5.991e-02  -1.338   0.1808
data1$PCIncome    -4.612e-05  1.430e-04  -0.323   0.7470
data1$pctNoFather  7.184e-02  6.560e-02   1.095   0.2735
data1$pSQM         3.925e-02  2.581e-02   1.521   0.1282
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
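
A model like this can be fit in R with glm and the binomial family. The sketch below is only an illustration: data1 and the predictor names match the output above, but the response column hasAPCalc (TRUE/FALSE for offering AP Calculus all six years) is a placeholder name.

# Sketch of the fit; hasAPCalc is a placeholder for the TRUE/FALSE response column
fit <- glm(hasAPCalc ~ pop + pHouse + pctFreeRed + PCIncome + pctNoFather + pSQM,
           data = data1, family = binomial)
summary(fit)    # prints a coefficient table like the one above
fitted(fit)     # predicted probabilities, handy for plotting the model's errors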

As a way to improve on my model, I've made two graphs that show where the model makes errors.



Sunday, September 29, 2013

Breaking down the AP Calc dataset by Demographic Data

I wanted to look at which demographic variables classify a county as having AP Calculus for all of the years 2008-2013.  I decided to use the C50 package in R to analyze the data since I was having little luck with logistic regression.  Logistic regression is hard to interpret, but decision trees are very easy to interpret: the algorithm slices and dices the data and comes up with simple rules that sort the counties into groups.  Part of the output from this algorithm follows:

pop > 44542: TRUE (20)
pop <= 44542:
:...pop <= 16132: FALSE (36/3)
    pop > 16132:
    :...pHouse > 2.576285: FALSE (5)
        pHouse <= 2.576285:
        :...pctFreeRed <= 28.7193: TRUE (8)
            pctFreeRed > 28.7193:
            :...PCIncome > 30964: TRUE (3)
                PCIncome <= 30964:
                :...pctNoFather <= 17.9: FALSE (10)
                    pctNoFather > 17.9:
                    :...pSQM <= 18.2: FALSE (2)
                        pSQM > 18.2: TRUE (3)

A TRUE means the rule classifies those counties as having AP Calculus, and the numbers in parentheses show how many counties reached that rule and, after the slash, how many of them were misclassified.  Maps and commentary are after the jump.
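
A tree like this can be grown with the C5.0 function from the C50 package. In the sketch below, data1 and the predictor names match the output, while hasAPCalc is a placeholder name for a factor marking whether the county had AP Calculus in all of 2008-2013.

library(C50)

# Sketch: one row per county; hasAPCalc is a placeholder factor (FALSE/TRUE)
predictors <- data1[, c("pop", "pHouse", "pctFreeRed",
                        "PCIncome", "pctNoFather", "pSQM")]
tree <- C5.0(x = predictors, y = data1$hasAPCalc)
summary(tree)   # prints the decision tree above, with the (n/m) leaf counts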

Saturday, September 14, 2013

Visualizing AP Calculus by county in Minnesota

When I was first thinking about this data, I assumed that each year at least one county would add AP Calculus and that no county would lose it.  That turns out not to be the case.  I'm guessing that state budget problems and district-level property tax levies play a bigger role than anything else.

2009: Added Carlton, Redwood, Traverse; dropped Isanti
2010: Added Itasca, Nicollet, Nobles, Swift; dropped Sibley, Mower, Watonwan, Waseca, Rock, Lincoln, Mille Lacs
2011: Added Sibley, Mower, Waseca, Mille Lacs; dropped Meeker, Swift, Beltrami, Cass, Polk
2012: Added Rock, Swift, Kanabec, Beltrami, Polk; dropped Itasca, Freeborn, Mille Lacs, Koochiching
2013: Added Cook, Itasca, Mille Lacs, Roseau; dropped Sibley, Rock, Pipestone, Chippewa, Renville, Traverse


Tuesday, September 10, 2013

Polynomial curve fitting

I am working through Pattern Recognition and Machine Learning by Christopher Bishop and made these graphs of the least-squares solutions to various polynomial curve fitting problems.
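
In R, the least-squares fit of a degree-M polynomial can be computed with lm and poly. The sketch below uses Bishop's running example of noisy samples from sin(2*pi*x); the sample size, noise level, and degree are arbitrary choices, not the settings behind the graphs.

# Sketch of least-squares polynomial curve fitting
set.seed(1)
N <- 10                                    # number of noisy training points
x <- seq(0, 1, length.out = N)
t <- sin(2 * pi * x) + rnorm(N, sd = 0.3)  # targets with Gaussian noise

M <- 3                                     # polynomial degree
fit <- lm(t ~ poly(x, M, raw = TRUE))      # minimizes the sum of squared errors

grid <- seq(0, 1, length.out = 200)
pred <- predict(fit, newdata = data.frame(x = grid))
plot(x, t, pch = 19)                       # training points
lines(grid, sin(2 * pi * grid), lty = 2)   # true curve
lines(grid, pred, col = "red")             # fitted polynomial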




Classification of whether a county has AP Calculus (Part 1)

I used Python's NumPy to get some information about how employment, creative jobs, arts jobs, and the ratios of creative jobs and arts jobs to employment influence whether a county has AP Calculus offered within its borders.

The dataset is called Creative Class County Codes at Data.gov.  I used the same zip codes from the earlier project and converted them to counties using a spreadsheet downloaded from UnitedStatesZipCodes.org.  The code follows:

Friday, July 19, 2013

How Kafkaesque was the Supreme Court during their 2012 term?

Update (7/20/2013): I added tables showing the relative frequency of words that are common between Metamorphosis and the 2012 Supreme Court Opinions.

Thursday, July 18, 2013

Supreme Court 2012 Term Opinion Word Frequency Analysis

The first thing that stuck out was the 'u', which might be more common in a future century when Supreme Court decisions are texted out. It shows up here because I split the text into words on anything that wasn't a letter (lower case or upper case) or an apostrophe. So the 'u', 's', and 'v' come from 'U. S. v.' text strings that have been split apart on the whitespace and periods.
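
That kind of split can be reproduced in R with strsplit and a regular expression that treats anything other than a letter or apostrophe as a word boundary; the example string below is just an illustration of how 'U. S.' turns into single-letter words.

# Sketch: split text into words on anything that is not a letter or apostrophe
text  <- "the U. S. Supreme Court's opinion"             # placeholder example
words <- unlist(strsplit(tolower(text), "[^a-z']+"))
words <- words[words != ""]                               # drop any empty tokens
words
# "the" "u" "s" "supreme" "court's" "opinion"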

Wednesday, July 17, 2013

Word frequency analysis of Kafka's Metamorphosis

This is a simple analysis of the frequency of words in Franz Kafka's Metamorphosis.
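
Counting the word frequencies can be done in R with the same kind of split plus table; a minimal sketch, where the file name is a placeholder for wherever the text of Metamorphosis is saved.

# Sketch: tally word frequencies in a plain-text file (file name is a placeholder)
text  <- tolower(paste(readLines("metamorphosis.txt"), collapse = " "))
words <- unlist(strsplit(text, "[^a-z']+"))
words <- words[words != ""]

freq <- sort(table(words), decreasing = TRUE)
head(freq, 20)       # the twenty most common words
freq / sum(freq)     # relative frequencies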

Monday, June 10, 2013

First time using ggplot2 in R

I used the packages XML, rgdal, maptools, and ggplot2. From XML I used readHTMLTable to grab a bunch of tables from the Star Tribune 100 website. I pulled the cities out of those tables, then took the Census 2010 cities shapefile and subset it down to the cities that were in Minnesota and had the same name as a city from the Star Tribune 100.

I now have the points for the cities and the outline of the state of Minnesota.  Next came the hard part. I don't know a ton about Geographic Information Systems or much about geography, so I couldn't tell what coordinate system my two layers were in.  After much stumbling around I realized that the cities were in longitude and latitude and that the outline of the state of Minnesota was in something called UTM. So I used spTransform from rgdal to change the outline's coordinate system from UTM zone 15 to longitude/latitude. After that I plotted with ggplot2 and it seems to have worked out.
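
That reprojection comes down to one spTransform call. The sketch below is only an outline of the approach: the shapefile path, layer name, and the cities_df data frame (with longitude/latitude columns for the Star Tribune 100 cities) are all placeholders.

library(rgdal)     # readOGR, spTransform, CRS
library(ggplot2)

# Read the state outline (stored in UTM zone 15) -- path and layer are placeholders
mn_outline <- readOGR(dsn = "shapefiles", layer = "mn_state_outline")
proj4string(mn_outline)                        # shows the layer's current CRS

# Reproject from UTM zone 15 to plain longitude/latitude
mn_lonlat <- spTransform(mn_outline, CRS("+proj=longlat +datum=WGS84"))

# Plot the outline and the city points with ggplot2
outline_df <- fortify(mn_lonlat)               # polygons as a plain data frame
ggplot() +
  geom_polygon(data = outline_df, aes(long, lat, group = group),
               fill = NA, colour = "grey30") +
  geom_point(data = cities_df, aes(longitude, latitude)) +
  coord_map()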

The three dots that seem like they could be artifacts or something are probably Montevideo, Fergus Falls, and Thief River Falls.

Saturday, May 25, 2013

Web Scraping for Education Data

I spent some time today and yesterday doing some data wrangling. I wondered where AP Calculus AB is offered throughout the state of Minnesota. I first went to the College Board site and found this database. I downloaded the 9 pages of results and then used readLines and regular expressions to extract the zip code for each school. There were 164 schools but only 142 unique zip codes. I used the maptools package to create the map and colored it based on the data.
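
The extraction step can be done with readLines plus regmatches; a sketch under the assumption that the results pages were saved locally (the file names and the "MN" + 5-digit pattern are placeholders for however the zip codes actually appear in the pages).

# Sketch: pull 5-digit Minnesota zip codes out of the saved result pages
files <- sprintf("collegeboard_results_page%d.html", 1:9)    # placeholder names
lines <- unlist(lapply(files, readLines))

matches <- regmatches(lines, gregexpr("MN[[:space:]]+[0-9]{5}", lines))
zips    <- unique(gsub("[^0-9]", "", unlist(matches)))       # keep digits, de-duplicate
length(zips)    # the post found 142 unique zip codes across 164 schools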

Friday, May 24, 2013

First Post(Zeroth Post?)

Since I can't currently produce results, I'm going to substitute for them by writing about my ambitious plans for the next few months on this blog. My plans are to
  • train a neural network to predict when a particular stock will split and then use Javascript to update stock prices daily and predict the split.
  • train a neural network to recognize handwritten numbers and letters and then use an HTML5 Canvas to gather data to predict on.
  • put out a cool animation or graphic with Octave or R once a week.
Longer range, when (if?) I figure out audio capture via HTML or Javascript, it would be neat to train a neural network to detect accents.