LLM-SR aims to generate a controllable and interpretable reasoning process through step-by-step inference. In this task, we focus on a fine-grained analysis of the Chain-of-Thought (CoT) process, which enables a more detailed evaluation of LLMs and contributes to Process Reward Modeling, thereby supporting the generation of more coherent and accurate reasoning processes.
To this end, the task requires generating "question_parsing" and "cot_parsing" results from the content of "question" and "cot" (the latter produced by Llama-3-8B-Instruct) for each given question. Question parsing extracts all conditions necessary for solving the question. CoT parsing identifies all "statements" and their corresponding "evidence" within the context of the question conditions and the given CoT content. Finally, for each extracted statement-evidence pair, a verification judgment is required: does the evidence sufficiently support the statement?
Focusing on the LLM’s capacity for fine-grained question analysis and deduction from given conditions, we provide only 24 training examples to illustrate the data format and question types. Furthermore, participants may only use Llama-3-8B-Instruct as their backbone model.
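As a rough illustration only, the backbone model could be prompted along the following lines; the prompt wording, generation settings, and JSON-output convention below are our own assumptions and are not prescribed by the task.

```python
# Sketch of prompting the required backbone (Llama-3-8B-Instruct) for the two parsing subtasks.
# Assumptions: prompt wording, decoding settings, and the JSON-output convention are ours.
import json
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

def parse_example(question: str, cot: str) -> dict:
    """Ask the model for question_parsing and cot_parsing as a single JSON object."""
    messages = [
        {"role": "system", "content": "Extract the question conditions, the CoT statements "
                                      "with their evidence, and a true/false verification. "
                                      "Answer with a single JSON object."},
        {"role": "user", "content": f"Question:\n{question}\n\nChain of thought:\n{cot}\n\n"
                                    "Return JSON with keys 'question_parsing' and 'cot_parsing'."},
    ]
    output = generator(messages, max_new_tokens=1024, do_sample=False)
    # In practice the raw generation may need more robust JSON extraction than this.
    return json.loads(output[0]["generated_text"][-1]["content"])
```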
You're welcome to join our Slack community! Feel free to ask questions and connect with us.
We introduce a fine-grained Chain-of-Thought (CoT) analysis dataset derived from LogiQA, comprising 24 annotated examples that constitute the training set. Each example is supplemented with question parsing and CoT parsing annotations.
A data sample:
{ "question": "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.The school has only 7 students participating in this activity, and each person happens to go to one of these two countries.Considering the specialty of each student, this activity must meet the following conditions? (1) If G goes to the UK, then H To the United States.(2) If L goes to the UK, both M and U go to the US.(3) The country W went to was different from the country Z went to.(4) The country where U goes is different from the country where G goes.(5) If Z goes to the UK, then H also goes to the UK.\nIf G goes to the United States, which of the following must be true?\nA.H go to the UK\nB.L go to America\nC.M go to the UK\nD.W go to America", "question_parsing": [ "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.", "each person happens to go to one of these two countries", "If G goes to the UK, then H To the United States", "If L goes to the UK, both M and U go to the US", "The country W went to was different from the country Z went to", "The country where U goes is different from the country where G goes", "If Z goes to the UK, then H also goes to the UK", "G goes to the United States" ], "answer": "b", "id": 162, "cot": "Since G goes to the United States, we need to analyze the conditions that follow. Condition (1) is not applicable since G is going to the US. Condition (2) is also not applicable since L's destination is not specified. Condition (3) does not provide any information about H, M, U, or W. Condition (4) states that U's destination is different from G's, which is the US, so U must go to the UK. Condition (5) is not applicable since Z's destination is not specified.", "cot_parsing": [ { "statement": "Condition (1) is not applicable", "evidence": "Condition (1): If G goes to the UK, then H To the United States. | G is going to the US", "Verification": "false" }, { "statement": "Condition (2) is also not applicable", "evidence": "Condition (2): If L goes to the UK, both M and U go to the US. | L's destination is not specified", "Verification": "false" }, { "statement": "Condition (3) does not provide any information about H, M, U, or W", "evidence": "Condition (3): The country W went to was different from the country Z went to.", "Verification": "false" }, { "statement": "U must go to the UK", "evidence": "Condition (4): The country where U goes is different from the country where G goes. | Condition (4) states that U's destination is different from G's, which is the US", "Verification": "true" }, { "statement": "Condition (5) is not applicable", "evidence": "Condition (5): If Z goes to the UK, then H also goes to the UK. | Z's destination is not specified", "Verification": "true" } ], "sel_idx": 92 },
If the "statement" can be logically deduced from the "evidence," then the "verification" is considered true; otherwise, the "verification" is false.
We’ve refined our training dataset. Please download the latest version from Google Drive (Final_Selection_Train_v2.json).
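Assuming the released file is a JSON list of examples in the format of the sample above, it can be inspected with a few lines of Python:

```python
import json

# Load the 24 annotated training examples (file name as released on Google Drive).
with open("Final_Selection_Train_v2.json", encoding="utf-8") as f:
    train_data = json.load(f)

for example in train_data:
    print(example["id"], example["answer"])
    print("conditions:", example["question_parsing"])
    for step in example["cot_parsing"]:
        # Note: the verification key is capitalized ("Verification") in the annotations.
        print(step["statement"], "->", step["Verification"])
```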
This task consists of two parts: Question parsing and CoT parsing.
Question parsing involves extracting all relevant conditions required to solve the problem. The Macro F1 score metric is used to evaluate question parsing performance.
The process of extracting statements and evidence is similar to discourse parsing. Correctly extracting statements and evidence from the CoT is the crucial first step. Next, the pairwise relationship between a statement and its corresponding evidence is assessed (a statement should be paired with the evidence that supports it in the CoT). Both semantic and lexical similarity are used to judge whether predicted statements and evidence match the references. The final evaluation metric is the Macro F1 score, applied to both statement extraction and statement-evidence pair extraction.
Finally, once a statement-evidence pair is correctly extracted, it is evaluated to determine whether the evidence can logically deduce the statement. The Macro F1 score metric is again used for this evaluation.
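The official scoring logic is defined by the released eval.py; the sketch below only illustrates the general idea of threshold-based matching followed by an F1 computation, with a simple lexical similarity standing in for the combined semantic and lexical similarity used by the organizers.

```python
from difflib import SequenceMatcher

def similarity(pred: str, gold: str) -> float:
    # Placeholder lexical similarity; the official script also uses semantic similarity.
    return SequenceMatcher(None, pred.lower(), gold.lower()).ratio()

def match_f1(preds: list[str], golds: list[str], threshold: float) -> float:
    """Greedily match each prediction to an unused gold item and score F1."""
    used, tp = set(), 0
    for p in preds:
        best_j, best_s = None, 0.0
        for j, g in enumerate(golds):
            if j in used:
                continue
            s = similarity(p, g)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is not None and best_s >= threshold:
            used.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Macro F1 averages the per-example F1 over the test set, e.g. for question parsing:
# macro_f1 = sum(match_f1(p, g, 0.95) for p, g in zip(pred_lists, gold_lists)) / len(gold_lists)
```

The threshold values in this sketch correspond to the --question_threshold, --statement_threshold, and --relation_threshold options of the evaluation script described next.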
The evaluation code is available on GitHub. Participants can run the evaluation script with the following command:
python eval.py --prediction '/path/to/prediction.json' --reference '/path/to/reference.json' --question_threshold 0.95 --statement_threshold 0.9 --relation_threshold 0.9
Participants can now access the sample file for the test set (Public_Test_A.json) via Google Drive. This file contains 50 sample test cases to aid in model evaluation and development. Participants are required to submit the prediction results for the 50 test set samples (results.json) to CodaBench, where the platform will automatically evaluate and score the submitted predictions. Additionally, we will use the model checkpoint files (*.ckpt/*.pt/*.safetensor/...) and executable scripts (*.py) provided by the participants to generate predictions for an additional set of 100 test samples.
Participants can now submit their prediction results for Public_Test_A.json to CodaBench to obtain evaluation scores.
We use several large language models without fine-tuning as baselines for comparison. The prediction results of these baseline models are available on Google Drive (test_result_XXX-icl.json). Their performance is summarized as follows:
| Models | Question_F1 | Statement_F1 | Statement_Evidence_F1 | Reasoning_F1 |
|---|---|---|---|---|
| DeepSeek-R1-INT4 | 0.8187 | 0.4484 | 0.1242 | 0.1079 |
| Llama-3-8B-Instruct | 0.7301 | 0.4240 | 0.1810 | 0.1032 |
Our challenge seeks to investigate the structural reasoning capabilities of large language models (LLMs), particularly in low-resource settings. To this end, we introduce a fine-grained Chain-of-Thought (CoT) analysis dataset derived from LogiQA, comprising 24 annotated examples that constitute the training set. Each example is supplemented with question parsing and CoT parsing annotations.
Participants may use the provided training set to develop their own structural reasoning models and make predictions on the test set. Note that the use of additional data for training is permitted, and leveraging additional models to assist in data annotation or preprocessing is also allowed. Participants are required to apply their trained models to generate predictions on the test set and present the results in the specified format.
Each test data point includes a question, an answer, an ID, and a chain of thought (CoT). Participants need to predict question_parsing and cot_parsing for each data point. Participants must successfully submit all of the following files for a submission to be considered valid:
Please submit the predicted results as a single JSON file, and ensure that the submitted file is named "results.json".
{ "question": "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.The school has only 7 students participating in this activity, and each person happens to go to one of these two countries.Considering the specialty of each student, this activity must meet the following conditions? (1) If G goes to the UK, then H To the United States.(2) If L goes to the UK, both M and U go to the US.(3) The country W went to was different from the country Z went to.(4) The country where U goes is different from the country where G goes.(5) If Z goes to the UK, then H also goes to the UK.\nIf G goes to the United States, which of the following must be true?\nA.H go to the UK\nB.L go to America\nC.M go to the UK\nD.W go to America", "question_parsing": [ "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.", "each person happens to go to one of these two countries", "If G goes to the UK, then H To the United States", "If L goes to the UK, both M and U go to the US", "The country W went to was different from the country Z went to", "The country where U goes is different from the country where G goes", "If Z goes to the UK, then H also goes to the UK", "G goes to the United States" ], "answer": "b", "id": 162, "cot": "Since G goes to the United States, we need to analyze the conditions that follow. Condition (1) is not applicable since G is going to the US. Condition (2) is also not applicable since L's destination is not specified. Condition (3) does not provide any information about H, M, U, or W. Condition (4) states that U's destination is different from G's, which is the US, so U must go to the UK. Condition (5) is not applicable since Z's destination is not specified.", "cot_parsing": [ { "statement": "Condition (1) is not applicable", "evidence": "Condition (1): If G goes to the UK, then H To the United States. | G is going to the US", "Verification": "false" }, { "statement": "Condition (2) is also not applicable", "evidence": "Condition (2): If L goes to the UK, both M and U go to the US. | L's destination is not specified", "Verification": "false" }, { "statement": "Condition (3) does not provide any information about H, M, U, or W", "evidence": "Condition (3): The country W went to was different from the country Z went to.", "Verification": "false" }, { "statement": "U must go to the UK", "evidence": "Condition (4): The country where U goes is different from the country where G goes. | Condition (4) states that U's destination is different from G's, which is the US", "Verification": "true" }, { "statement": "Condition (5) is not applicable", "evidence": "Condition (5): If Z goes to the UK, then H also goes to the UK. | Z's destination is not specified", "Verification": "true" } ], "sel_idx": 92 },
Please note: The submission deadline is at 11:59 p.m. (Anywhere on Earth) of the stated deadline date.
| Event | Date |
|---|---|
| Training data and participant instruction release for all shared tasks | February 10, 2025 |
| Evaluation deadline for all shared tasks | |
| Notification of all shared tasks | April 5, 2025 |
| Shared-task paper submission deadline | April 20, 2025 |
| Acceptance notification of shared-task papers | April 30, 2025 |
| Camera ready paper deadline | May 16, 2025 |
Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to write a technical paper for submission to the XLLM Workshop of ACL 2025.
Zixia Jia (Beijing Institute for General Artificial Intelligence, Beijing, China)
Zilong Zheng (Beijing Institute for General Artificial Intelligence, Beijing, China)
Yang Liu (Beijing Institute for General Artificial Intelligence, Beijing, China)
Jiaqi Li (Beijing Institute for General Artificial Intelligence, Beijing, China)
Jun Bai (Beijing Institute for General Artificial Intelligence, Beijing, China)