Investigating Active Learning for Short-Answer Scoring Horbach and Palmer 2016
Comparison of different item selection methods
Especially in early parts of the learning curve until about 500 items are labeled, uncertainty-based methods show improvement over the random baseline.
The picture changes a bit when we look at the performance of AL methods per prompt and with different seed selection methods. Most noticeable is the wide variety in the performance of the sample selection methods across the various prompts.
If any method yields a substantial improvement, it is an uncertainty-based method. On average, boosted entropy gives the highest gains in both seed selection settings. Comparing random to equal seed selection, performance is rather consistently better when AL starts with a seed set that covers all classes equally. Experiment 1 shows a clear benefit for using equal rather than random seeds.
Experiment 2: The influence of seeds
With the large random seed set, margin and entropy sampling perform slightly better than with the small random seed set (curiously, boosted entropy does not), but still below the small equal seed set. However, the trend across items is not completely clear.
We still take it as an indicator that seeds of good quality cannot be outweighed by quantity.
Experiment 3: The influence of batch sizes
Therefore we test an alternative setup where we sample and label 20 items per batch before retraining.
Compared to the varying batch size setup (numbers in parentheses), performance goes down, indicating that fine-grained sampling really does provide a benefit, especially early in the learning process. Whereas larger batch sizes may lead to selection of instances in the same region of uncertainty, a smaller batch size allows the system to resolve a certain region of uncertainty with fewer labeled training instances.
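The batch-size effect can be made concrete with a minimal sketch of top-k selection. The function name `select_batch` and the uncertainty scores are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
def select_batch(uncertainty, batch_size):
    """Return the indices of the batch_size most uncertain pool items.

    uncertainty: one score per unlabeled item (higher = less confident).
    """
    ranked = sorted(range(len(uncertainty)),
                    key=lambda i: uncertainty[i],
                    reverse=True)
    return ranked[:batch_size]


# Items 1 and 3 have nearly the same score, so they plausibly sit in the
# same uncertain region. A batch of 2 labels both before retraining;
# a batch of 1 labels item 1, retrains, and only then re-scores item 3.
scores = [0.1, 0.9, 0.5, 0.8]
print(select_batch(scores, 2))
print(select_batch(scores, 1))
```

With a batch size of 1 the model is retrained after every label, so the second pick is made with updated scores rather than stale ones.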
Room for improvement
An open question is how best to select, or even generate, equally distributed seed sets. One might argue whether an automated approach is necessary: perhaps an experienced teacher could easily browse through the data in a time-efficient way to select clear examples of low-, mid-, and high-scoring answers as seeds.
The variability of AL performance across prompts clearly and strongly points to the need for a better understanding of how attributes of data sets affect the outcome of AL methods. A solution for predicting which AL settings are suitable for a given data set is an open problem for AL in general.
What about sets of 100-200 questions?
lemma 1- to 4-grams to capture lexical content of answers
character 2- to 4-grams to account for spelling errors and morphological variation
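A minimal sketch of the two feature families listed above. It assumes the answer has already been lemmatized (the `lemmas` list stands in for the output of whatever lemmatizer is used) and that features are stored as counts; both are assumptions of this sketch, and the names `ngrams` and `extract_features` are hypothetical.

```python
def ngrams(seq, n_min, n_max):
    """Return all contiguous n-grams of seq for n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            out.append(tuple(seq[i:i + n]))
    return out


def extract_features(lemmas):
    """Count lemma 1- to 4-grams and character 2- to 4-grams."""
    feats = {}
    # Lemma n-grams capture the lexical content of the answer.
    for gram in ngrams(lemmas, 1, 4):
        key = "lem:" + " ".join(gram)
        feats[key] = feats.get(key, 0) + 1
    # Character n-grams are robust to misspellings and inflection.
    text = " ".join(lemmas)
    for gram in ngrams(text, 2, 4):
        key = "chr:" + "".join(gram)
        feats[key] = feats.get(key, 0) + 1
    return feats


feats = extract_features(["the", "cat", "sleep"])
print(len(feats))
```

Because misspelled words still share most of their character n-grams with the correct form, the `chr:` features overlap even when the `lem:` features do not.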
SMO (SVM) in Weka
Active learning algorithm
Classifier confidence is computed for each item in the unlabeled data, and the one with the highest entropy (lowest confidence) is selected for labeling.
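The selection step above can be sketched as follows, assuming the classifier exposes a probability distribution over classes for each unlabeled item. The function names and the example probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a class probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_by_entropy(pool_probs):
    """Return the index of the pool item with the highest entropy."""
    return max(range(len(pool_probs)),
               key=lambda i: entropy(pool_probs[i]))


pool_probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.30, 0.30],  # spread over all three classes
    [0.60, 0.40, 0.00],  # torn between two classes
]
print(select_by_entropy(pool_probs))
```

Entropy is highest when probability mass is spread over all classes, so the second item is chosen over the one that is torn between only two classes.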
Boosted Entropy Sampling
We adopt their method of boosted entropy sampling, where per-label weights are incorporated into the entropy computation, in order to favor items more likely to belong to a minority class.
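These notes do not record the exact weighting formula, so the following is only a sketch of the general idea: multiply each class's entropy term by a weight that grows as the class gets rarer among the items labeled so far. The smoothed inverse-frequency weights and all names here are assumptions of this sketch, not the paper's definition.

```python
import math


def class_weights(labeled, classes):
    """Smoothed inverse-frequency weight per class (illustrative choice)."""
    total = len(labeled) + len(classes)
    return {c: total / (labeled.count(c) + 1) for c in classes}


def weighted_entropy(probs, weights):
    """Entropy with each class term scaled by that class's weight.

    probs: dict mapping class label -> predicted probability.
    """
    return -sum(weights[c] * p * math.log(p)
                for c, p in probs.items() if p > 0)


labeled = ["0", "0", "0", "1"]          # class "2" not seen yet
weights = class_weights(labeled, ["0", "1", "2"])
a = {"0": 0.5, "1": 0.5, "2": 0.0}
b = {"0": 0.5, "1": 0.0, "2": 0.5}      # involves the rarest class
print(weighted_entropy(a, weights) < weighted_entropy(b, weights))
```

Items `a` and `b` have identical plain entropy; the weights break the tie in favor of the item that may belong to the minority class.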
This method tends to select instances that lie on the decision border between two classes, instead of items at the intersection of all classes.
aims to select instances that cover as much of the feature space as possible, i.e. that are as diverse as possible
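One common way to pursue this goal is to pick the pool item that is least similar to anything already labeled. The sketch below uses cosine similarity over dense feature vectors; the similarity measure, the selection rule, and the names are assumptions for illustration, not necessarily what the paper implements.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def select_most_diverse(pool, labeled):
    """Return the index of the pool item least similar to the labeled set.

    labeled must be non-empty (for example, the seed set).
    """
    def max_sim(x):
        return max(cosine(x, y) for y in labeled)
    return min(range(len(pool)), key=lambda i: max_sim(pool[i]))


labeled = [[1.0, 0.0]]
pool = [[1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
print(select_most_diverse(pool, labeled))
```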
results in selection of items near the center of the pool
(a) random seed selection
(b) equal seed selection
Number of items: In the small seed set condition, and for both random and equal selection methods, 10 individual seed sets per prompt are chosen, each with either 3 or 4 seeds (corresponding to the number of classes per prompt). We repeat this process for the large seed set condition, this time selecting 20 items per seed set.
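The two seed selection conditions can be sketched as follows. The sketch assumes gold labels are available for the items seeds are drawn from; the function names and the toy label distribution are hypothetical.

```python
import random
from collections import defaultdict


def random_seeds(labels, k, rng):
    """Draw k seed indices uniformly at random, ignoring class labels."""
    return rng.sample(range(len(labels)), k)


def equal_seeds(labels, per_class, rng):
    """Draw per_class seed indices from every class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    seeds = []
    for y in sorted(by_class):
        # Raises ValueError if a class has fewer than per_class items.
        seeds.extend(rng.sample(by_class[y], per_class))
    return seeds


labels = [0] * 8 + [1] * 3 + [2] * 2    # skewed three-class prompt
rng = random.Random(0)
small_equal = equal_seeds(labels, 1, rng)
print(sorted(labels[i] for i in small_equal))
```

With one seed per class, the small equal seed set has as many items as the prompt has classes, matching the 3 or 4 seeds described above. A random draw of the same size from a skewed prompt can easily miss a class entirely.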
varying batch sizes...