Introduction

LLM-SR aims to generate a controllable and interpretable reasoning process through step-by-step inference. In this task, we focus on a fine-grained analysis of the Chain-of-Thought (CoT) process, which enables a more detailed evaluation of LLMs and contributes to Process Reward Modeling, thereby supporting the generation of more coherent and accurate reasoning.

To achieve this, the task requires generating "question_parsing" and "cot_parsing" results from the "question" and the "cot" (produced by Llama-3-8B-Instruct) of each given example. Question parsing extracts all conditions necessary for solving the question. CoT parsing identifies every "statement" and its corresponding "evidence" within the context of the question conditions and the given CoT content. Finally, for each extracted statement-evidence pair, a verification conclusion must be produced that indicates whether the evidence sufficiently supports the statement.

Focusing on the LLM's capacity for fine-grained question analysis and deduction from given conditions, we provide only 24 training examples to illustrate the data format and question types. Furthermore, participants may only use Llama-3-8B-Instruct as their backbone model.
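As a starting point, question parsing can be cast as a constrained prompting problem over the backbone. Below is a minimal, hypothetical sketch using the Hugging Face transformers chat pipeline; the prompt wording and JSON post-processing are our assumptions, not an official baseline.

import json
import torch
import transformers

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

pipe = transformers.pipeline(
    "text-generation",
    model=MODEL_ID,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

def parse_question(question: str) -> list[str]:
    # Ask the backbone to list every condition needed to solve the question.
    messages = [
        {"role": "system",
         "content": "Extract all conditions necessary for solving the question. "
                    "Reply with a JSON list of strings and nothing else."},
        {"role": "user", "content": question},
    ]
    out = pipe(messages, max_new_tokens=512, do_sample=False)
    # The last message in the returned conversation is the model's reply.
    return json.loads(out[0]["generated_text"][-1]["content"])

Greedy decoding (do_sample=False) keeps the extracted conditions deterministic across runs.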

Dataset

We introduce a fine-grained Chain-of-Thought (CoT) analysis dataset derived from LogiQA, comprising 24 annotated examples that constitute the training set. Each example is supplemented with question parsing and CoT parsing annotations.

A data sample:

{
    "question": "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.The school has only 7 students participating in this activity, and each person happens to go to one of these two countries.Considering the specialty of each student, this activity must meet the following conditions? (1) If G goes to the UK, then H To the United States.(2) If L goes to the UK, both M and U go to the US.(3) The country W went to was different from the country Z went to.(4) The country where U goes is different from the country where G goes.(5) If Z goes to the UK, then H also goes to the UK.\nIf G goes to the United States, which of the following must be true?\nA.H go to the UK\nB.L go to America\nC.M go to the UK\nD.W go to America",
    "question_parsing": [
        "There are 7 outstanding students G, H, L, M, U, W and Z in a school.During the summer vacation, the school will send them to the United Kingdom and the United States for inspection.",
        "each person happens to go to one of these two countries",
        "If G goes to the UK, then H To the United States",
        "If L goes to the UK, both M and U go to the US",
        "The country W went to was different from the country Z went to",
        "The country where U goes is different from the country where G goes",
        "If Z goes to the UK, then H also goes to the UK",
        "G goes to the United States"
    ],
    "answer": "b",
    "id": 162,
    "cot": "Since G goes to the United States, we need to analyze the conditions that follow. Condition (1) is not applicable since G is going to the US. Condition (2) is also not applicable since L's destination is not specified. Condition (3) does not provide any information about H, M, U, or W. Condition (4) states that U's destination is different from G's, which is the US, so U must go to the UK. Condition (5) is not applicable since Z's destination is not specified.",
    "cot_parsing": [
        {
            "statement": "Condition (1) is not applicable",
            "evidence": "Condition (1): If G goes to the UK, then H To the United States. | G is going to the US",
            "Verification": "false"
        },
        {
            "statement": "Condition (2) is also not applicable",
            "evidence": "Condition (2): If L goes to the UK, both M and U go to the US. | L's destination is not specified",
            "Verification": "false"
        },
        {
            "statement": "Condition (3) does not provide any information about H, M, U, or W",
            "evidence": "Condition (3): The country W went to was different from the country Z went to.",
            "Verification": "false"
        },
        {
            "statement": "U must go to the UK",
            "evidence": "Condition (4): The country where U goes is different from the country where G goes. | Condition (4) states that U's destination is different from G's, which is the US",
            "Verification": "true"
        },
        {
            "statement": "Condition (5) is not applicable",
            "evidence": "Condition (5): If Z goes to the UK, then H also goes to the UK. | Z's destination is not specified",
            "Verification": "true"
        }
    ],
    "sel_idx": 92
}

If the "statement" can be logically deduced from the "evidence", the "Verification" is "true"; otherwise, it is "false".
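This rule can be operationalised by prompting the backbone once per pair. A hypothetical sketch, reusing the pipe object from the Introduction (the prompt wording and the thresholding of the reply are our assumptions):

def verify(statement: str, evidence: str) -> str:
    # Ask the backbone whether the evidence alone logically entails the statement.
    messages = [
        {"role": "system",
         "content": "Answer exactly 'true' if the statement can be logically "
                    "deduced from the evidence alone, otherwise answer 'false'."},
        {"role": "user", "content": f"Statement: {statement}\nEvidence: {evidence}"},
    ]
    out = pipe(messages, max_new_tokens=4, do_sample=False)
    reply = out[0]["generated_text"][-1]["content"].strip().lower()
    return "true" if reply.startswith("true") else "false"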

The datasets are available on Google Drive
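Once downloaded, the training file can be inspected with a few lines of Python; the local filename train.json is an assumption:

import json

with open("train.json", encoding="utf-8") as f:  # assumed local filename
    train = json.load(f)

print(len(train))                   # expected: 24 training examples
example = train[0]
print(example["question_parsing"])  # list of extracted conditions
print(example["cot_parsing"][0])    # first statement-evidence pair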

Challenge Task Definition and Metrics

The challenge task definition and metrics will be released as soon as possible.

Evaluation

Accuracy

Accuracy is an evaluation metric that directly reflects whether the LLM can correctly answer a given question through its CoT:

$$ Accuracy = \frac{TP + TN}{ TP + TN + FP + FN } $$

F1 score

Although accuracy captures overall correctness, it is a poor indicator when the samples are imbalanced. Therefore, in addition to accuracy, we use the F1 score to comprehensively evaluate the reasoning ability of the model. The F1 score is calculated from Precision \(P\) and Recall \(R\):

$$ P = \frac{TP}{ TP + FP } $$

$$ R = \frac{TP}{TP + FN} $$

$$ F1 = \frac{2 \times P \times R }{ P + R } $$

  • \(TP\): True Positives, the number of positive samples correctly predicted as positive.
  • \(FP\): False Positives, the number of negative samples incorrectly predicted as positive.
  • \(FN\): False Negatives, the number of positive samples incorrectly predicted as negative.
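For concreteness, a minimal sketch of the metrics above, with illustrative placeholder counts:

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Fraction of all samples that the model classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(accuracy(tp=40, tn=30, fp=20, fn=10))  # 0.7
print(f1_score(tp=40, fp=20, fn=10))         # ~0.727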

Baseline & Code

Baseline code and the full data will be published as soon as possible.

Submission

Our challenge investigates the structure reasoning capabilities of large language models (LLMs), particularly in low-resource settings, using the fine-grained Chain-of-Thought (CoT) analysis dataset described in the Dataset section above.

Participants may use the provided training set to develop their own structure reasoning models and make predictions on the test set. Note that the use of additional data for training is permitted. Participants must apply their trained models to the test set and present the results in the specified format.

Participants must successfully submit all of the following files for their submission to be considered valid:

  1. Prediction result file (result.json): This file should contain the prediction results of the model on the test set, formatted according to the specified requirements.
  2. Model weight file (*.ckpt, *.bin): Participants are required to provide the trained model weight file. The file should be uploaded to cloud storage, and the corresponding link must be recorded in the link.txt file.
  3. Executable script file (*.py): An executable script file must be provided, which will be used in conjunction with the submitted model weights to verify the correctness of the provided results. The file should be uploaded to cloud storage, and the corresponding link must be recorded in the link.txt file.

The prediction result file (result.json) and the link file (link.txt) must be submitted via CodaBench, which evaluates the submitted predictions in real time.

Please submit the predicted results as a single JSON file, result.json. Each entry must follow the same schema as the training sample shown in the Dataset section above (question, question_parsing, answer, id, cot, cot_parsing, sel_idx).

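Before uploading, a quick structural check on result.json can catch formatting slips. A minimal sketch; the exact validation rules are our assumption, not an official checker:

import json

REQUIRED_PAIR_KEYS = {"statement", "evidence", "Verification"}

with open("result.json", encoding="utf-8") as f:
    results = json.load(f)

for entry in results:
    # Every entry needs a list of parsed conditions ...
    assert isinstance(entry["question_parsing"], list), f"id {entry['id']}: question_parsing must be a list"
    # ... and every statement-evidence pair needs all required keys.
    for pair in entry["cot_parsing"]:
        missing = REQUIRED_PAIR_KEYS - pair.keys()
        assert not missing, f"id {entry['id']}: missing keys {missing}"
        assert pair["Verification"] in {"true", "false"}

print(f"OK: {len(results)} entries")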

Timeline

Please note: The submission deadline is at 11:59 p.m. (Anywhere on Earth) of the stated deadline date.

  • Training data and participant instruction release for all shared tasks: February 10, 2025
  • Evaluation deadline for all shared tasks: March 30, 2025
  • Notification of all shared tasks: April 5, 2025
  • Shared-task paper submission deadline: April 12, 2025
  • Acceptance notification of shared-task papers: April 30, 2025
  • Camera-ready paper deadline: May 16, 2025

Award of Top-ranking Participants

Top-ranked participants in this competition will receive a certificate of achievement and will be recommended to write a technical paper for submission to the XLLM Workshop of ACL 2025.

Organizers

Zixia Jia (Beijing Institute for General Artificial Intelligence, Beijing, China)

Zilong Zheng (Beijing Institute for General Artificial Intelligence, Beijing, China)

Shuyi Zhang (Beijing Institute for General Artificial Intelligence, Beijing, China)

Zhenbin Chen (Beijing Institute for General Artificial Intelligence, Beijing, China)
