hikopensource/DAVAR-Lab-OCR

Generated table

["", "<td", " colspan="2"", ">", "", "<td", " colspan="2"", ">", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "<td", " rowspan="2"", ">", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "<td", " rowspan="2"", ">", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""]

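The table structure I generated contains no <thead> or <tbody> tokens. Are these tokens strictly required? Without them, the following function breaks: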
def get_headbody(html_str):
    """Calculating number of bboxes belonging to "t-head" and "t-body" respectively

    Args:
        html_str(str): html representing table structure

    Returns:
        int: number of bboxes belonging to "t-head"
        int: number of bboxes belonging to "t-body"
    """
    # html_code = ''.join(html_str)
    # html_str = list('''<html><body><table>%s</table></body></html>''' % html_code)

    s_h, e_h = html_str.index('<thead>'), html_str.index('</thead>')
    s_b, e_b = html_str.index('<tbody>'), html_str.index('</tbody>')
    num_h = html_str[s_h + 1:e_h].count('</td>')
    num_b = html_str[s_b + 1:e_b].count('</td>')
    return num_h, num_b

This function fails on my output, because the token list contains neither '<thead>' nor '<tbody>'.
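For anyone hitting the same error, here is a minimal sketch of a more tolerant variant. It is not part of DAVAR-Lab-OCR: the name `get_headbody_safe` is hypothetical, it assumes the structure is a list of tokens in which every cell ends with its own `'</td>'` token, and it assumes that counting all cells as body cells is acceptable when no section markers are present. Whether the rest of the pipeline accepts a head count of 0 is something I have not verified.

```python
def get_headbody_safe(html_tokens):
    """Count cells in the head and body sections of a tokenized table.

    Hypothetical helper, not part of DAVAR-Lab-OCR.
    Assumes html_tokens is a list of structure tokens where each cell
    is closed by a separate '</td>' token.
    """
    tokens = list(html_tokens)

    def count_cells(open_tag, close_tag):
        # Return the number of '</td>' tokens between the two tags,
        # or None when either tag is missing.
        if open_tag in tokens and close_tag in tokens:
            start, end = tokens.index(open_tag), tokens.index(close_tag)
            return tokens[start + 1:end].count('</td>')
        return None

    num_h = count_cells('<thead>', '</thead>')
    num_b = count_cells('<tbody>', '</tbody>')

    if num_h is None and num_b is None:
        # Assumption: with no section markers, treat every cell as body.
        return 0, tokens.count('</td>')

    # A single missing section contributes zero cells.
    return num_h or 0, num_b or 0


# Structure without <thead>/<tbody>, similar in shape to the one in this issue
no_sections = ['<tr>', '<td>', '</td>', '<td', ' colspan="2"', '>', '</td>', '</tr>']
print(get_headbody_safe(no_sections))
```

The alternative is to keep `get_headbody` unchanged and make sure the generated annotations wrap the header rows in `<thead>`...`</thead>` and the remaining rows in `<tbody>`...`</tbody>`, which is the format the original function expects.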