Statistical and Machine Learning Models to Predict Programming Performance
This thesis details a longitudinal study on factors that influence introductory
programming success and on the development of machine learning
models to predict incoming student performance. Although numerous
studies have developed models to predict programming success, these models
have struggled to achieve high accuracy in predicting the likely performance of
incoming students. Our approach overcomes this by providing a machine
learning technique, using a set of three significant factors, that can predict
whether students will be ‘weak’ or ‘strong’ programmers with approximately
80% accuracy after only three weeks of programming experience.
This thesis makes three fundamental contributions. The first contribution
is a longitudinal study identifying factors that influence introductory
programming success, investigating 25 factors at four different institutions.
Evidence of the importance of mathematics, comfort-level and computer
game-playing as predictors of programming performance is provided. A
number of new instruments were developed by the author, and a programming
self-esteem measure was shown to outperform previous comparable
comfort-level measures in predicting programming performance.
The second contribution of the thesis is an analysis of the use of machine
learning (ML) algorithms to predict introductory programming performance.
It is a first attempt to investigate the effectiveness of a variety of ML
algorithms for this task. The ML models built as part of this
research are the most effective models developed so far. The models are
effective even when students have just commenced a programming module.
Consequently, timely interventions can be put in place to prevent struggling
students from failing.
The third contribution of the thesis is the recommendation of an algorithm,
based on detailed statistical analysis, that should be used by the
computer science education community to predict the likely performance of
incoming students. Optimisations were carried out to investigate whether
prediction accuracy could be further increased, and an ensemble algorithm, StackingC,
was shown to improve prediction performance.
The factors identified in this thesis and the associated machine learning
models provide a means to accurately predict programming performance
when students have completed only preliminary programming concepts.
This has not previously been possible.