Introduction

LLM-SR aims to generate a controllable and interpretable reasoning process through step-by-step inference. In this task, we focus on fine-grained analysis of the Chain-of-Thought (CoT) process, which enables a more detailed evaluation of LLMs and contributes to Process Reward Modeling, thereby supporting the generation of more coherent and accurate reasoning processes.

To achieve this, the task requires generating "question_parsing" and "cot_parsing" results for each given question, based on the content of its "question" and "cot" (the latter produced by Llama-3-8B-Instruct). Question parsing extracts all conditions necessary for solving the question. CoT parsing identifies all "statements" and their corresponding "evidence" within the context of the question conditions and the given CoT content. Finally, for each extracted statement-evidence pair, a verification must be given indicating whether the evidence sufficiently supports the statement.

To focus on the LLM's capacity for fine-grained question analysis and deduction from given conditions, we provide only 24 training examples, which illustrate the data format and question types. Furthermore, participants may only use Llama-3-8B-Instruct as their backbone model.

You're welcome to join our Slack community; feel free to ask questions and connect with us!

Dataset

We introduce a fine-grained Chain-of-Thought (CoT) analysis dataset derived from LogiQA, comprising 24 annotated examples that constitute the training set. Each example is supplemented with question parsing and CoT parsing annotations.

A data sample:

{
    "question": "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.The school has only 7 students participating in this activity, and each person happens to go to one of these two countries.Considering the specialty of each student, this activity must meet the following conditions? (1) If G goes to the UK, then H To the United States.(2) If L goes to the UK, both M and U go to the US.(3) The country W went to was different from the country Z went to.(4) The country where U goes is different from the country where G goes.(5) If Z goes to the UK, then H also goes to the UK.\nIf G goes to the United States, which of the following must be true?\nA.H go to the UK\nB.L go to America\nC.M go to the UK\nD.W go to America",
    "question_parsing": [
        "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.",
        "each person happens to go to one of these two countries",
        "If G goes to the UK, then H To the United States",
        "If L goes to the UK, both M and U go to the US",
        "The country W went to was different from the country Z went to",
        "The country where U goes is different from the country where G goes",
        "If Z goes to the UK, then H also goes to the UK",
        "G goes to the United States"
    ],
    "answer": "b",
    "id": 162,
    "cot": "Since G goes to the United States, we need to analyze the conditions that follow. Condition (1) is not applicable since G is going to the US. Condition (2) is also not applicable since L's destination is not specified. Condition (3) does not provide any information about H, M, U, or W. Condition (4) states that U's destination is different from G's, which is the US, so U must go to the UK. Condition (5) is not applicable since Z's destination is not specified.",
    "cot_parsing": [
        {
            "statement": "Condition (1) is not applicable",
            "evidence": "Condition (1): If G goes to the UK, then H To the United States. | G is going to the US",
            "Verification": "false"
        },
        {
            "statement": "Condition (2) is also not applicable",
            "evidence": "Condition (2): If L goes to the UK, both M and U go to the US. | L's destination is not specified",
            "Verification": "false"
        },
        {
            "statement": "Condition (3) does not provide any information about H, M, U, or W",
            "evidence": "Condition (3): The country W went to was different from the country Z went to.",
            "Verification": "false"
        },
        {
            "statement": "U must go to the UK",
            "evidence": "Condition (4): The country where U goes is different from the country where G goes. | Condition (4) states that U's destination is different from G's, which is the US",
            "Verification": "true"
        },
        {
            "statement": "Condition (5) is not applicable",
            "evidence": "Condition (5): If Z goes to the UK, then H also goes to the UK. | Z's destination is not specified",
            "Verification": "true"
        }
    ],
    "sel_idx": 92
},

If the "statement" can be logically deduced from the "evidence," then the "verification" is considered true; otherwise, the "verification" is false.

We’ve refined our training datasets. Please download the latest version from Google Drive (Final_Selection_Train_v2.json).

Challenge Task Definition and Metrics

This task consists of two parts: Question parsing and CoT parsing.

Question parsing involves extracting all relevant conditions required to solve the problem. The Macro F1 score is used to evaluate question parsing performance.

Extracting statements and evidence is similar to discourse parsing. First, statements and evidence must be correctly extracted from the CoT. Next, the pairwise relationship between each statement and its corresponding evidence is assessed (a statement should be paired with its related evidence from the CoT). Both semantic and lexical similarity are used to evaluate the accuracy of statement and evidence predictions. The final evaluation metric is the Macro F1 score, applied to both statement extraction and statement-evidence pair extraction.

Finally, once a statement-evidence pair is correctly extracted, it is evaluated to determine whether the statement can be logically deduced from the evidence. The Macro F1 score is again used for this evaluation.
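As a rough illustration only (the released eval.py on GitHub is the authoritative scorer), threshold-based matching and F1 scoring could look like the following sketch, in which the similarity function is a simple lexical placeholder and the thresholds correspond to the options of the official script shown below:

# Simplified sketch of threshold-based matching and F1 scoring; not the official scorer.
from difflib import SequenceMatcher
from typing import List

def similarity(a: str, b: str) -> float:
    # Placeholder lexical similarity; the official metric also considers semantic similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_f1(predicted: List[str], reference: List[str], threshold: float = 0.95) -> float:
    # A prediction counts as correct if it matches some reference item above the threshold,
    # and vice versa for recall; F1 combines the two.
    if not predicted or not reference:
        return 0.0
    tp_pred = sum(any(similarity(p, r) >= threshold for r in reference) for p in predicted)
    tp_ref = sum(any(similarity(r, p) >= threshold for p in predicted) for r in reference)
    precision = tp_pred / len(predicted)
    recall = tp_ref / len(reference)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)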

Baseline & Code

The evaluation code is available on GitHub. Participants can run the evaluation script with the following command:

                            
python eval.py --prediction '/path/to/prediction.json' --reference '/path/to/reference.json' --question_threshold 0.95 --statement_threshold 0.9 --relation_threshold 0.9
                        

Participants can now access the sample file for the test set (Public_Test_A.json) via Google Drive. This file contains 50 sample test cases to aid in model evaluation and development. Participants are required to submit the prediction results for the 50 test set samples (results.json) to CodaBench, where the platform will automatically evaluate and score the submitted predictions. Additionally, we will use the model checkpoint files (*.ckpt/*.pt/*.safetensors/...) and executable scripts (*.py) provided by the participants to generate predictions for an additional set of 100 test samples.

Participants can now submit their prediction results for Public_Test_A.json to CodaBench to obtain evaluation scores.

We use several large language models without fine-tuning as baselines for comparison. The prediction results of these baseline models are available on Google Drive (test_result_XXX-icl.json). Their performance is summarized as follows:

Models                Question_F1   Statement_F1   Statement_Evidence_F1   Reasoning_F1
DeepSeek-R1-INT4      0.8187        0.4484         0.1242                  0.1079
Llama-3-8B-Instruct   0.7301        0.4240         0.1810                  0.1032

Submission

Our challenge seeks to investigate the structure reasoning capabilities of large language models (LLMs), particularly in low-resource settings, using the fine-grained CoT analysis dataset described above (24 annotated training examples derived from LogiQA, each supplemented with question parsing and CoT parsing annotations).

Participants may use the provided training set to develop their own structure reasoning models and make predictions on the test set. Note that using additional data for training is permitted, and leveraging additional models to assist with data annotation or preprocessing is allowed. Participants are required to apply their trained models to generate predictions on the test set and present the results in the specified format.
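As a rough, untested illustration of using the required backbone zero-shot, the sketch below prompts Llama-3-8B-Instruct via Hugging Face transformers and returns its raw output, which would still need to be post-processed into the required fields. The model ID, prompt, and decoding settings are our own placeholders, not prescribed by the task:

# Hypothetical sketch of producing parsing results with the Llama-3-8B-Instruct backbone.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # the required backbone
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

def parse_example(question: str, cot: str) -> str:
    # Prompt the backbone to emit question_parsing and cot_parsing as JSON text.
    messages = [
        {"role": "system", "content": "Extract the question conditions and the statement-evidence pairs as JSON."},
        {"role": "user", "content": f"Question:\n{question}\n\nCoT:\n{cot}"},
    ]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
    return tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)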

Each test data point includes a question, answer, ID, and chain of thought (CoT). Participants need to predict the question_parsing and cot_parsing fields. To be considered valid, a submission must include all of the following files:

  1. Prediction result file (results.json): This file should contain the prediction results of the model on the test set, formatted according to the specified requirements.
  2. Model weight file (*.ckpt, *.bin): Participants are required to provide the trained model weight file. The file should be uploaded to cloud storage, and the corresponding link must be recorded in the link.txt file.
  3. Executable script file (*.py): An executable script file must be provided, which will be used in conjunction with the submitted model weights to verify the correctness of the provided results. The file should be uploaded to cloud storage, and the corresponding link must be recorded in the link.txt file.
The prediction result file (results.json) and the link file (link.txt) must be submitted via CodaBench, which evaluates the submitted predictions in real time. We do not restrict the format of the weight files; participants can use formats such as .safetensors, .pt, or others. However, participants must ensure that the submitted executable script can successfully load the model weights and run correctly.

Please submit your predicted results as a single JSON file and ensure that the submitted file is named "results.json". The expected format is shown below:

{
    "question": "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.The school has only 7 students participating in this activity, and each person happens to go to one of these two countries.Considering the specialty of each student, this activity must meet the following conditions? (1) If G goes to the UK, then H To the United States.(2) If L goes to the UK, both M and U go to the US.(3) The country W went to was different from the country Z went to.(4) The country where U goes is different from the country where G goes.(5) If Z goes to the UK, then H also goes to the UK.\nIf G goes to the United States, which of the following must be true?\nA.H go to the UK\nB.L go to America\nC.M go to the UK\nD.W go to America",
    "question_parsing": [
            "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.",
            "each person happens to go to one of these two countries",
            "If G goes to the UK, then H To the United States",
            "If L goes to the UK, both M and U go to the US",
            "The country W went to was different from the country Z went to",
            "The country where U goes is different from the country where G goes",
            "If Z goes to the UK, then H also goes to the UK",
            "G goes to the United States"
        ],
    "answer": "b",
    "id": 162,
    "cot": "Since G goes to the United States, we need to analyze the conditions that follow. Condition (1) is not applicable since G is going to the US. Condition (2) is also not applicable since L's destination is not specified. Condition (3) does not provide any information about H, M, U, or W. Condition (4) states that U's destination is different from G's, which is the US, so U must go to the UK. Condition (5) is not applicable since Z's destination is not specified.",
    "cot_parsing": [
        {
            "statement": "Condition (1) is not applicable",
            "evidence": "Condition (1): If G goes to the UK, then H To the United States. | G is going to the US",
            "Verification": "false"
        },
        {
            "statement": "Condition (2) is also not applicable",
            "evidence": "Condition (2): If L goes to the UK, both M and U go to the US. | L's destination is not specified",
            "Verification": "false"
        },
        {
            "statement": "Condition (3) does not provide any information about H, M, U, or W",
            "evidence": "Condition (3): The country W went to was different from the country Z went to.",
            "Verification": "false"
        },
        {
            "statement": "U must go to the UK",
            "evidence": "Condition (4): The country where U goes is different from the country where G goes. | Condition (4) states that U's destination is different from G's, which is the US",
            "Verification": "true"
        },
        {
            "statement": "Condition (5) is not applicable",
            "evidence": "Condition (5): If Z goes to the UK, then H also goes to the UK. | Z's destination is not specified",
            "Verification": "true"
        }
    ],
    "sel_idx": 92
},                        
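For illustration only, the following sketch assembles predictions in the format shown above and writes them to results.json; predict_fields is a hypothetical placeholder for the participant's own inference code, and we assume the test file is a JSON list of records:

# Minimal sketch: attach predicted fields to each test record and write results.json.
import json

def predict_fields(question: str, cot: str):
    # Hypothetical stand-in for the participant's own model inference;
    # it should return the predicted question_parsing list and cot_parsing list.
    return [], []

with open("Public_Test_A.json", "r", encoding="utf-8") as f:
    test_data = json.load(f)

results = []
for example in test_data:
    example["question_parsing"], example["cot_parsing"] = predict_fields(example["question"], example["cot"])
    results.append(example)

with open("results.json", "w", encoding="utf-8") as f:   # the file must be named results.json
    json.dump(results, f, ensure_ascii=False, indent=4)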

Timeline

Please note: The submission deadline is at 11:59 p.m. (Anywhere on Earth) of the stated deadline date.

Training data and participant instruction release for all shared tasks: February 10, 2025
Evaluation deadline for all shared tasks: April 10, 2025 (extended from March 30)
Notification of all shared tasks: April 5, 2025
Shared-task paper submission deadline: April 20, 2025
Acceptance notification of shared-task papers: April 30, 2025
Camera-ready paper deadline: May 16, 2025

Award of Top-ranking Participants

Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to write a technical paper for submission to the XLLM Workshop of ACL 2025.

Organizers

Zixia Jia (Beijing Institute for General Artificial Intelligence, Beijing, China)

Zilong Zheng (Beijing Institute for General Artificial Intelligence, Beijing, China)

Yang Liu (Beijing Institute for General Artificial Intelligence, Beijing, China)

Jiaqi Li (Beijing Institute for General Artificial Intelligence, Beijing, China)

Jun Bai (Beijing Institute for General Artificial Intelligence, Beijing, China)
