From Commit Message Generation to History-Aware Commit Message Completion

Aleksandra Eliseeva1, Yaroslav Sokolov2, Egor Bogomolov1, Yaroslav Golubev1, Danny Dig1,3, Timofey Bryksin1
1JetBrains Research, 2JetBrains, 3University of Colorado Boulder

Overview

    We explore two ideas for personalizing the output of commit message generation (CMG) approaches:
  • Shifting the focus to commit message completion: a prefix provided by the user might steer CMG approaches toward more relevant predictions and might even capture some of the commit message conventions of the current user or project.
  • Utilizing commit message history as additional context: previous commit messages might guide CMG approaches to follow the commit message conventions of the current user or project.
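The two settings above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual pipeline: the word-boundary split ratio, the separator, and the ordering of history messages are all assumptions made for the example.

```python
# Illustrative sketch of the two input settings (all formatting
# choices here are assumptions, not the paper's implementation).

def split_for_completion(message: str, prefix_ratio: float = 0.5) -> tuple[str, str]:
    """Split a commit message into a user-written prefix and the
    remainder a model should complete (split on a word boundary)."""
    words = message.split()
    cut = max(1, int(len(words) * prefix_ratio))
    return " ".join(words[:cut]), " ".join(words[cut:])

def build_history_input(diff: str, history: list[str], sep: str = "\n") -> str:
    """Prepend previous commit messages to the diff as extra context."""
    return sep.join([*history, diff])

# Completion setting: the model sees the diff plus the prefix.
prefix, target = split_for_completion("Fix NPE in commit message parser")
# History setting: the model sees prior messages plus the diff.
context = build_history_input("<diff>", ["Fix typo in README", "Add parser tests"])
```

Here `prefix` is "Fix NPE in" and `target` is "commit message parser"; the history-aware input simply concatenates earlier messages before the diff.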

Motivational example for the ideas proposed in our paper. CMG = commit message generation; CMC = commit message completion; CMG + history = commit message generation with commit message history as additional context.

    To evaluate our two ideas, we set the following criteria for the dataset:
  • Suitable for experiments with commit history: provides the necessary metadata for commits; preserves commit history to a reasonable extent.
  • Diverse: incorporates commit messages with a variety of different conventions and writing styles.

We observed that most of the existing CMG datasets do not meet these criteria: they either lack diversity due to the extensive filtering of commit diffs and messages, or significantly alter the original commit history.

Hence, we built a novel large-scale dataset that overcomes these issues – 📜 CommitChronicle 🔮, available on Zenodo and on HuggingFace Hub!

Experimental Results

  • Models. We experiment with three CMG approaches (CodeT5, CodeReviewer, RACE) and an LLM (GPT-3.5-turbo).
  • Data. We use two subsets of our CommitChronicle dataset for evaluation:
    • \(CMG_{test}\) – around \(200\)k examples; used for experiments with CMG approaches
    • \(LLM_{test}\) – around \(4\)k examples; used for experiments with an LLM

For further details, refer to our paper.

RQs and Key Findings

⬇️ Click on the buttons to expand corresponding subsections!

RQ A1. How do state-of-the-art CMG approaches perform in the completion setting?

RQ A2. How do LLMs perform in comparison with state-of-the-art CMG approaches?

RQ B1. How does using commit message history as an additional input affect the models’ quality?

RQ B2. How do state-of-the-art CMG approaches perform with and without common data filtering steps?

Full Results

Due to the variety of models and configurations in our experiments, we share only a selected subset of the results in our paper. You can find the comprehensive results in this section (also available in our repository as JSONLines files).


\(CMG_{test}\) Experiments



\(LLM_{test}\) Experiments



Filters Experiments

Citation

TODO