about the train data

fghjdj opened this issue 5 months ago · 0 comments

Sorry, I'm confused about how to get start_token_idx and end_token_idx. For example (line 9204):
{
  "text": "Q: A group of students wants to distribute 237 pencils and 894 erasers evenly among themselves. What is the maximum number of students in the group that can receive the same number of pencils and erasers?\nA: The maximum number of students is the greatest common divisor of 237 and 894, which is 3.",
  "start_token_idx": [
    84
  ],
  "end_token_idx": [
    85
  ],
  "tar_eq": [
    "(237,894)=3"
  ],
  "tar_number": [
    "3"
  ]
}
0: Q
1: :
2: (space)
3: A
4: (space)
5: group
6: (space)
7: of
8: (space)
9: students
10: (space)
11: wants
12: (space)
13: to
14: (space)
15: distribute
16: (space)
17: 237
18: (space)
19: pencils
20: (space)
21: and
22: (space)
23: 894
24: (space)
25: erasers
26: (space)
27: evenly
28: (space)
29: among
30: (space)
31: themselves
32: (period)
33: (space)
34: What
35: (space)
36: is
37: (space)
38: the
39: (space)
40: maximum
41: (space)
42: number
43: (space)
44: of
45: (space)
46: students
47: (space)
48: in
49: (space)
50: the
51: (space)
52: group
53: (space)
54: that
55: (space)
56: can
57: (space)
58: receive
59: (space)
60: the
61: (space)
62: same
63: (space)
64: number
65: (space)
66: of
67: (space)
68: pencils
69: (space)
70: and
71: (space)
72: erasers
73: (question mark)
74: (newline)
75: A
76: (colon)
77: (space)
78: The
79: (space)
80: maximum
81: (space)
82: number
83: (space)
84: of
85: (space)
86: students
87: (space)
88: is
89: (space)
90: the
91: (space)
92: greatest
93: (space)
94: common
95: (space)
96: divisor
97: (space)
98: of
99: (space)
100: 237
101: (space)
102: and
103: (space)
104: 894
105: (comma)
106: (space)
107: which
108: (space)
109: is
110: (space)
111: 3
I don't think "of" is the start token. Maybe LLaMA splits a sentence into tokens in a different way. Could you please give more details on how to get the correct data, for example the code used to generate it?
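For reference, the listing above corresponds to a simple split into word runs, single whitespace characters, and single punctuation marks. The sketch below reproduces that counting. Note that `naive_split` is a hypothetical helper written only for this illustration; it is not the repository's tokenizer and not LLaMA's.

```python
import re

# Example text from line 9204 of the training data, as quoted above.
text = (
    "Q: A group of students wants to distribute 237 pencils and 894 erasers "
    "evenly among themselves. What is the maximum number of students in the "
    "group that can receive the same number of pencils and erasers?\n"
    "A: The maximum number of students is the greatest common divisor of "
    "237 and 894, which is 3."
)


def naive_split(s):
    """Hypothetical helper: split into word runs, single whitespace
    characters, and single punctuation marks (NOT a subword tokenizer)."""
    return re.findall(r"\w+|\s|[^\w\s]", s)


tokens = naive_split(text)

# Token found at the recorded start_token_idx under this naive split.
print(repr(tokens[84]))

# Position of the standalone answer token "3" under this naive split.
print(tokens.index("3"))
```

Under this split, index 84 is `of` and the answer `3` sits at index 111, so the recorded 84/85 do not line up with it. LLaMA uses a subword (SentencePiece BPE) tokenizer, so the indices in the data presumably come from a different segmentation, which is why the generating code would help.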