Summary: Implement a Symbolic Regression package in R capable of discovering implicit equation relationships in input data by using Genetic Programming.
Description:
Symbolic Regression (SR) is a valuable tool for both analytic and explanatory studies of experimental data [3], and it is notably missing from R’s toolset. Ideally, given a set of labels and multivariate data {y_i, x_i}, regression would find a general symbolic relationship of the form f(x, y) = 0, judged by goodness-of-fit measures (e.g. F statistics) adjusted for degrees of freedom. However, until the infrastructure for general relationships exists (adinr), the syrfr package project will first concentrate on regressing only functions of the form f(x_1, ..., x_n) = y instead of general relations.
SR assists in finding such functions f through Genetic Programming: it evolves a population of candidate solutions through crossover, mutation and recombination, and evaluates each candidate with a suitable fitness function.
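The evolutionary loop described above can be illustrated with a minimal sketch in R. It is not the syrfr design: every function name (random_expr, fitness, mutate, crossover, evolve), the operator set, the target function y = x^2 + x, and the population settings are assumptions made up for illustration. Candidates are stored as ordinary R call objects, fitness is plain RMSE with no degrees-of-freedom adjustment, and selection is simple truncation with elitism.

```r
# Illustrative GP-style symbolic regression; all names and settings are hypothetical.
set.seed(1)

# Training data for an assumed target, y = x^2 + x.
x <- seq(-2, 2, length.out = 41)
y <- x^2 + x

# Building blocks: binary operators and terminals.
ops <- c("+", "-", "*")
terminals <- list(quote(x), 1, 2)

# Grow a random expression tree with at most `depth` operator levels.
random_expr <- function(depth) {
  if (depth <= 0 || runif(1) < 0.3) {
    return(terminals[[sample(length(terminals), 1)]])
  }
  call(sample(ops, 1), random_expr(depth - 1), random_expr(depth - 1))
}

# Number of nodes in a tree (operators plus terminals).
expr_size <- function(e) {
  if (!is.call(e)) return(1)
  1 + sum(vapply(as.list(e)[-1], expr_size, numeric(1)))
}

# Fitness: root mean squared error on the training data (lower is better).
fitness <- function(expr) {
  pred <- rep_len(eval(expr, list(x = x)), length(y))  # constants give a scalar
  err <- sqrt(mean((pred - y)^2))
  if (is.finite(err)) err else Inf
}

# Mutation: replace a randomly chosen subtree with a fresh random one.
mutate <- function(expr, depth = 2) {
  if (!is.call(expr) || runif(1) < 0.3) return(random_expr(depth))
  i <- sample(2:3, 1)  # left or right argument of the binary call
  expr[[i]] <- mutate(expr[[i]], depth)
  expr
}

# Pick a random subtree of an expression.
random_subtree <- function(expr) {
  if (!is.call(expr) || runif(1) < 0.5) return(expr)
  random_subtree(expr[[sample(2:3, 1)]])
}

# Crossover: graft a random subtree of `b` into a random position of `a`.
crossover <- function(a, b) {
  if (!is.call(a) || runif(1) < 0.3) return(random_subtree(b))
  i <- sample(2:3, 1)
  a[[i]] <- crossover(a[[i]], b)
  a
}

# Evolve a population; the best quarter survives unchanged each generation.
evolve <- function(pop_size = 200, generations = 30, max_size = 31) {
  pop <- replicate(pop_size, random_expr(3), simplify = FALSE)
  for (g in seq_len(generations)) {
    scores <- vapply(pop, fitness, numeric(1))
    elite <- pop[order(scores)][seq_len(pop_size %/% 4)]
    children <- lapply(seq_len(pop_size - length(elite)), function(k) {
      parents <- elite[sample(length(elite), 2, replace = TRUE)]
      child <- crossover(parents[[1]], parents[[2]])
      if (runif(1) < 0.2) child <- mutate(child)
      # Crude bloat control: oversized children are replaced by new random trees.
      if (expr_size(child) > max_size) child <- random_expr(3)
      child
    })
    pop <- c(elite, children)
  }
  scores <- vapply(pop, fitness, numeric(1))
  list(expr = pop[[which.min(scores)]], rmse = min(scores))
}

best <- evolve()
print(best$expr)
print(best$rmse)
```

Because R calls are already trees, subtrees can be read and replaced with [[ ]] indexing, which keeps the crossover and mutation operators short. A real package would likely need a fitness measure that penalises model size, as the description's reference to degrees of freedom suggests.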
The motivation for a Symbolic Regression system for R is twofold:
In short,
To be implemented:
Later:
Roadmap
References
[1] Schmidt M., Lipson H. (2009), “Symbolic Regression of Implicit Equations,” Genetic Programming Theory and Practice, Vol. 7, Chapter 5, pp. 73-85.
[2] Gustafson S., Burke E. K., Krasnogor N., “On Improving Genetic Programming for Symbolic Regression,” Technical Report.
[3] Schmidt M., Lipson H. (2009) “Distilling Free-Form Natural Laws from Experimental Data,” Science, Vol. 324, no. 5923, pp. 81 - 85.
[4] Vladislavleva E. (2008), “Model-based Problem Solving through Symbolic Regression via Pareto Genetic Programming,” Tilburg University, Netherlands.
Skills required: Knowledge of Symbolic Regression. R and C++ programming skills. Ability to use a Version Control System (svn/git). (Idea suggested on 3rd March, 2010 IST)
Test: A test an applicant has to pass in order to qualify for the topic
Proposed test from James Salsman 2010/03/27:
(note: all of these questions are optional except for number 4)
1. Describe each of the following terms as they relate to statistical regression: categorical, periodic, continuous, bimodal, log-normal, logistic, Gompertz, and nonlinear.
2. Explain which parts of http://bit.ly/tablecurve were adopted in SigmaPlot and which weren’t.
3. Use the ‘outliers’ package to improve a regression fit maintaining the correct extrapolation confidence intervals between those with and without outlier exclusions in proportion to the confidence that the outliers were reasonably excluded. (Show your R transcript.)
4. Explain the relationship between degrees of freedom and correlated independent variables.
5. Critique http://sites.google.com/site/gptips4matlab/
6. Use anova to compare the goodness-of-fit of an SSfpl nls fit with that of a linear model of your choice. How can you characterize the degree-of-freedom-adjusted goodness of fit of nonlinear models?
7. Describe the relationship between probabilities used to regress categorical dependent variables – as in http://www.wmitchell.edu/faculty/kritzer/files/BerryBook-1986.pdf – and Bayes’ theorem.
8. A Fourier transform can use a spectrum to model periodicity, but how would you model a modulus range, e.g., c(0,1,2,3,4,0,1,2,3,4,0,1,...)?
9. Read http://datamonster.sbs.arizona.edu/communication/faculty/Hullett_overestimation.pdf and explain why stepwise regression in R prints the F statistic with two different degrees of freedom.
10. What do you think is the most efficient way to store trees representing model functions in R, assuming you wanted to search a space of 100,000 of them?
Mentor: James Salsman (2010/03/08). No student, because stepwise regression is not yet properly adjusted.