conversation system evaluation
CRS (Conversational Recommender System)
Effectiveness of RS
Item Diversity
Shannon Diversity Index: Measures diversity based on the distribution of items in the list, with higher entropy indicating greater diversity.
Average Cosine Similarity: Calculates the similarity between items in the recommendation list, with lower average similarity indicating higher diversity.
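The two list-diversity metrics above can be sketched directly; this is a minimal illustration in plain Python (function names and the item-embedding input format are assumptions, not a fixed API):

```python
import math
from collections import Counter

def shannon_diversity(items):
    """Shannon entropy over the item (or category) distribution in a
    recommendation list; higher entropy indicates greater diversity."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def average_cosine_similarity(vectors):
    """Mean pairwise cosine similarity between item embedding vectors;
    lower values indicate a more diverse list."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    pairs = [(i, j) for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    return sum(cos(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
```

A uniform list such as `["a", "a", "b", "b"]` yields entropy ln(2), while a list of identical embeddings yields average similarity 1.0 (minimal diversity).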
Coverage Rate: The proportion of recommended items relative to the entire item pool.
New Item Coverage: The percentage of items recommended that the user has not previously encountered.
Item Sparsity: The proportion of items in the recommendation list that the user has not interacted with before.
Intra-List Diversity (ILD): Computes the similarity among recommended items; lower values indicate higher diversity.
Coverage: Proportion of content categories covered by the recommendations.
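The two coverage variants above (item-pool coverage and category coverage) reduce to set ratios; a minimal sketch, where the catalog/category mappings are assumed inputs:

```python
def coverage_rate(recommended, catalog):
    """Proportion of the full item pool that appears in the
    recommendation lists (catalog coverage)."""
    return len(set(recommended) & set(catalog)) / len(set(catalog))

def category_coverage(recommended, item_to_category, all_categories):
    """Proportion of content categories covered by the recommended items;
    `item_to_category` maps item id -> category label (assumed schema)."""
    covered = {item_to_category[i] for i in recommended if i in item_to_category}
    return len(covered) / len(set(all_categories))
```

For example, recommending 2 items out of a 4-item catalog gives a coverage rate of 0.5.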
Novelty
Expected Popularity Complement (EPC): Measures the proportion of low-popularity items in the recommendations.
Discovery Rate: Proportion of recommendations users identify as new or interesting.
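EPC can be made concrete as the average popularity complement of the recommended items. This sketch assumes popularity is the item's interaction count normalized by the maximum count — one common choice, not the only formulation:

```python
def expected_popularity_complement(recommended, interaction_counts):
    """EPC sketch: mean of (1 - normalized popularity) over the list.
    `interaction_counts` maps item -> number of interactions (assumed
    schema); higher EPC means the list favors long-tail items."""
    max_count = max(interaction_counts.values())
    return sum(1 - interaction_counts.get(i, 0) / max_count
               for i in recommended) / len(recommended)
```

Recommending one maximally popular item and one never-interacted item yields an EPC of 0.5.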
Context Compatibility
Dialogue Consistency: Semantic coherence between the recommendations and the dialogue context, evaluated with semantic similarity models.
Accuracy
Matching the user's historical needs or preferences
Discovery of new interests
Explainability
Explainable Rate
Guo, Shuyu, et al. "Towards explainable conversational recommender systems." Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023.
Persuasiveness
Measuring "why" in recommender systems: a comprehensive survey on the evaluation of explainable recommendation
Transparency
Temporal meta-path guided explainable recommendation
conversation quality
error rate:
input error
output error
Comprehension
End-to-end
success rate
slot error
average dialogue length
context Coherence and Topic coherence
Information Sufficiency
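Several of the end-to-end measures above are simple ratios over dialogue logs. A minimal sketch, where the log schema (a `success` flag, a `turns` list, slot→value dicts) is an illustrative assumption:

```python
def dialogue_success_rate(dialogues):
    """Fraction of dialogues whose task was completed successfully."""
    return sum(1 for d in dialogues if d["success"]) / len(dialogues)

def average_dialogue_length(dialogues):
    """Mean number of turns per dialogue."""
    return sum(len(d["turns"]) for d in dialogues) / len(dialogues)

def slot_error_rate(predicted_slots, gold_slots):
    """Slot error rate sketch: (substitutions + insertions + deletions)
    divided by the number of gold slots, comparing slot->value dicts."""
    subs = sum(1 for s in gold_slots
               if s in predicted_slots and predicted_slots[s] != gold_slots[s])
    dels = sum(1 for s in gold_slots if s not in predicted_slots)
    ins = sum(1 for s in predicted_slots if s not in gold_slots)
    return (subs + ins + dels) / len(gold_slots)
```

For instance, missing one gold slot and hallucinating one extra slot against two gold slots gives a slot error rate of 1.0.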
Component-level
NLG
BLEU
METEOR
Human evaluation
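BLEU, the first NLG metric above, is modified n-gram precision with a brevity penalty. A single-reference, uniform-weight sketch follows; in practice a library such as sacreBLEU or NLTK should be used instead:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())
        total = max(sum(hyp.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_avg)
```

An exact match scores 1.0; a hypothesis sharing no n-grams with the reference scores 0.0.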
NLU
NER
Semantic Matching
Intent Recognition
Humanity
User satisfaction
reality
language quality
consistency, naturalness or fluency
BLEU, NIST score
efficiency of task
Interaction counts
task completion rates
Task completion times or session times
Proportion Metrics
Similarity Metrics / IR Metrics (precision, recall, ...)
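The standard IR metrics above are usually reported at a list cutoff k; a minimal sketch (function names are illustrative):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    return sum(1 for i in top_k if i in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items retrieved within the top-k."""
    top_k = recommended[:k]
    return sum(1 for i in top_k if i in relevant) / len(relevant)
```

With a ranked list ["a", "b", "c", "d"] and relevant set {"a", "c", "e"}: precision@2 = 0.5, recall@4 = 2/3.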
Click rate
Positive/negative correction rate
human feedback
human feedback / classification model / GAN
Classification metrics