ETS: Domain Adaptation and Stacking for Short Answer Scoring (Heilman and Madnani, 2013).
Features
Baseline features
Four types of lexically-driven text similarity measures; each is computed by comparing the learner response to both the expected answer(s) and the question, resulting in eight features in total. They are described more fully by Dzikovska et al. (2012).
Intercept feature
Always equal to one, which, in combination with the domain adaptation technique described in §3.2, allows the system to model the a priori distribution over classes for each domain and item.
Binary indicator features for the following types of n-grams (a feature-extraction sketch follows this list):
lowercased word n-grams in the response text for n ∈ {1, 2, 3}
lowercased word n-grams in the response text for n ∈ {4, 5, . . . , 11}, grouped into 10,000 bins by hashing and using a modulo operation (i.e., the “hashing trick”) (Weinberger et al., 2009)
lowercased character n-grams in the response text for n ∈ {5, 6, 7, 8}
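A minimal sketch of how these indicator features might be extracted, assuming a dictionary-of-features representation and whitespace tokenization; the 10,000 bins and n-gram ranges follow the description above, but the hash function (md5) and feature names are illustrative, not the authors' implementation.

```python
import hashlib

def ngram_indicator_features(response, num_bins=10000):
    """Binary indicators over lowercased word and character n-grams."""
    text = response.lower()
    tokens = text.split()
    feats = {}

    # Word n-grams for n in {1, 2, 3}, kept as explicit named features.
    for n in range(1, 4):
        for i in range(len(tokens) - n + 1):
            feats["word_%d=%s" % (n, " ".join(tokens[i:i + n]))] = 1.0

    # Word n-grams for n in {4, ..., 11}, hashed into 10,000 bins
    # (the "hashing trick"); md5 is used only to get a stable hash value.
    for n in range(4, 12):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            bin_id = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % num_bins
            feats["hashed_word_ngram_bin=%d" % bin_id] = 1.0

    # Character n-grams for n in {5, 6, 7, 8}.
    for n in range(5, 9):
        for i in range(len(text) - n + 1):
            feats["char_%d=%s" % (n, text[i:i + n])] = 1.0

    return feats
```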
Text Similarity Features
The maximum of the smoothed, uncased BLEU (Papineni et al., 2002) scores obtained by comparing the student response to each correct reference answer. We also include the word n-gram precision and recall values for n ∈ {1, 2, 3, 4} for the maximally similar reference answer (a BLEU sketch follows this list).
The maximum of the smoothed, uncased BLEU scores obtained by comparing the student response to each correct student answer in the training set...
The maximum PERP (Heilman and Madnani, 2012) score obtained by comparing the student response to the correct reference answers
The maximum PERP score obtained by comparing the student response to the correct student answers
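A sketch of the maximum smoothed BLEU feature, using NLTK's sentence-level BLEU as a stand-in for whatever BLEU implementation the system used; whitespace tokenization and the particular smoothing method are assumptions. The PERP features rely on the trained metric of Heilman and Madnani (2012) and are not sketched here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

_smooth = SmoothingFunction().method1

def max_bleu_feature(response, reference_answers):
    """Maximum smoothed, uncased BLEU of the response against each reference answer."""
    hyp = response.lower().split()
    best = 0.0
    for ref in reference_answers:
        ref_tokens = ref.lower().split()
        # Score against one reference at a time so the maximum is well defined.
        score = sentence_bleu([ref_tokens], hyp, smoothing_function=_smooth)
        best = max(best, score)
    return {"max_bleu_vs_reference": best}
```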
Model (supervised)
Logistic regression with L2 regularization
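A sketch of the learner, assuming scikit-learn's LogisticRegression (whose penalty defaults to L2) over vectorized feature dictionaries; the regularization strength C and any tuning procedure are not given in these notes.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_scorer(feature_dicts, labels):
    """Fit an L2-regularized logistic regression over dictionary features."""
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(feature_dicts)
    model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    model.fit(X, labels)
    return vectorizer, model

def score_responses(vectorizer, model, feature_dicts):
    """Predict a class label for each unseen response's feature dictionary."""
    return model.predict(vectorizer.transform(feature_dicts))
```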
Run 1: This run included the baseline (§3.1.1), intercept (§3.1.2), and the text-similarity features (§3.1.4) that compare student responses to reference answers (but not those that compare to scored student responses in the training set).
Run 2: This run included the baseline (§3.1.1), intercept (§3.1.2), and n-gram features (§3.1.3).
Run 3: This run included all features.
Domain adaptation...
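The notes leave §3.2 unspecified; the sketch below assumes a "frustratingly easy" feature-augmentation scheme in the style of Daumé III (2007), in which every feature (including the intercept) gets a shared copy plus a domain-specific copy, so the model can learn, for example, a per-item class prior. The function and domain names here are hypothetical.

```python
def augment_for_domain(feats, domain_id):
    """Assumed Daumé-style feature augmentation: one shared copy plus one
    domain-specific copy of every feature, so weights can specialize per item."""
    augmented = {}
    for name, value in feats.items():
        augmented["shared__" + name] = value
        augmented[domain_id + "__" + name] = value
    return augmented

# Example: the intercept feature gets both a shared and a per-item copy,
# letting the model capture each item's a priori class distribution.
example = augment_for_domain({"intercept": 1.0}, "item_42")
```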
Results
Interestingly, the differences in performance between the unseen answers task and the other tasks were somewhat larger for the SciEntsBank dataset than for the Beetle dataset. We speculate that this result is because the SciEntsBank data covered a more diverse set of topics.
It appears that features of the other student responses improve performance for the unseen answers task. For example, the full system (Run 3) performed better than Run 1, which did not include features of other student responses, on the unseen answers task for both Beetle and SciEntsBank.
However, it is not clear whether the PERP and BLEU features improve performance. The full system (Run 3) did not always outperform Run 2, which did not include these features.
Room for improvement
whether student response features or reference answer similarity features are more useful in general, and whether there are any systematic differences between human-machine and human-human disagreements