flat_headings_list is not flat
isoboroff commented
soboroff$ ipython3
Python 3.6.6 (default, Jun 28 2018, 05:43:53)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from trec_car import read_data
In [2]: fp = open('/home/collections/news-track-2018-wikipedia/all-enwiki-201708
...: 20/all-enwiki-20170820.cbor', 'rb')
In [3]: a = read_data.iter_annotations(fp)
In [4]: page = a.__next__()
In [5]: page.flat_headings_list()
Out[5]:
[[<trec_car.read_data.Section at 0x106b684e0>,
<trec_car.read_data.Section at 0x106b6b278>],
[<trec_car.read_data.Section at 0x106b68518>,
<trec_car.read_data.Section at 0x106b6bf98>],
[<trec_car.read_data.Section at 0x106b68518>,
<trec_car.read_data.Section at 0x106b73160>],
[<trec_car.read_data.Section at 0x106b68518>,
<trec_car.read_data.Section at 0x106b73208>],
[<trec_car.read_data.Section at 0x106b6bf28>],
[<trec_car.read_data.Section at 0x106b73fd0>,
<trec_car.read_data.Section at 0x106d780b8>],
[<trec_car.read_data.Section at 0x106b73fd0>,
<trec_car.read_data.Section at 0x106d7b160>],
[<trec_car.read_data.Section at 0x106b73fd0>,
<trec_car.read_data.Section at 0x106d80080>],
[<trec_car.read_data.Section at 0x106d78080>],
[<trec_car.read_data.Section at 0x106d803c8>],
[<trec_car.read_data.Section at 0x106d80e10>],
[<trec_car.read_data.Section at 0x106d80e48>],
[<trec_car.read_data.Section at 0x106d80e80>],
[<trec_car.read_data.Section at 0x106d83080>]]
In [6]: import itertools
In [7]: itertools.chain.from_iterable(page.flat_headings_list())
Out[7]: <itertools.chain at 0x106d9e0b8>
In [8]: list(itertools.chain.from_iterable(page.flat_headings_list()))
Out[8]:
[<trec_car.read_data.Section at 0x106b684e0>,
<trec_car.read_data.Section at 0x106b6b278>,
<trec_car.read_data.Section at 0x106b68518>,
<trec_car.read_data.Section at 0x106b6bf98>,
<trec_car.read_data.Section at 0x106b68518>,
<trec_car.read_data.Section at 0x106b73160>,
<trec_car.read_data.Section at 0x106b68518>,
<trec_car.read_data.Section at 0x106b73208>,
<trec_car.read_data.Section at 0x106b6bf28>,
<trec_car.read_data.Section at 0x106b73fd0>,
<trec_car.read_data.Section at 0x106d780b8>,
<trec_car.read_data.Section at 0x106b73fd0>,
<trec_car.read_data.Section at 0x106d7b160>,
<trec_car.read_data.Section at 0x106b73fd0>,
<trec_car.read_data.Section at 0x106d80080>,
<trec_car.read_data.Section at 0x106d78080>,
<trec_car.read_data.Section at 0x106d803c8>,
<trec_car.read_data.Section at 0x106d80e10>,
<trec_car.read_data.Section at 0x106d80e48>,
<trec_car.read_data.Section at 0x106d80e80>,
<trec_car.read_data.Section at 0x106d83080>]
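For readers following along: in the output above, the same object address (for example `0x106b68518`) appears as the first element of several inner lists, which suggests that each inner list is a path of headings from a top-level section down to a nested one. Chaining the paths together therefore repeats the shared parents. A minimal sketch of flattening while keeping each section once is below. The `Section` class and the `unique_sections` helper are hypothetical stand-ins for illustration, not part of trec_car.

```python
from itertools import chain


class Section:
    """Hypothetical stand-in for trec_car.read_data.Section (illustration only)."""

    def __init__(self, heading):
        self.heading = heading


def unique_sections(paths):
    """Flatten a list of heading paths, keeping each Section object once,
    in the order it is first seen. Identity (id) is used, not equality."""
    seen = set()
    result = []
    for section in chain.from_iterable(paths):
        if id(section) not in seen:
            seen.add(id(section))
            result.append(section)
    return result


# Toy data shaped like the output above: a shared parent starts several paths.
history, early, modern, refs = (
    Section(h) for h in ("History", "Early", "Modern", "References")
)
paths = [[history, early], [history, modern], [refs]]

flat = list(chain.from_iterable(paths))
print([s.heading for s in flat])
# ['History', 'Early', 'History', 'Modern', 'References']
print([s.heading for s in unique_sections(paths)])
# ['History', 'Early', 'Modern', 'References']
```

Whether de-duplicated sections or full paths are the right result depends on the use case; the paths keep the parent context that a plain flat list loses.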
laura-dietz commented
Hi Ian,
I am sorry it does not do what you expect.
If in doubt, don't use the convenience functions such as
flat_headings_list. Instead, get the page skeleton and traverse it yourself.
I regret ever offering those, and I don't have the manpower to
maintain them properly.
BTW: it would be easier to help you if you would provide useful output
such as str(section). Sadly, Python's list/map str does not call
str(element).
Laura
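To illustrate the two suggestions in this reply, here is a minimal sketch of walking a page skeleton by hand and printing each section with `str()`. It runs on hypothetical stand-in classes rather than on trec_car itself: the `heading` and `children` attribute names, the `Para` class, and the `iter_sections` helper are assumptions made for the example, so check them against the library's `read_data` module before relying on them.

```python
class Para:
    """Hypothetical stand-in for a paragraph node in a page skeleton."""

    def __init__(self, text):
        self.text = text


class Section:
    """Hypothetical stand-in for a section node (illustration only)."""

    def __init__(self, heading, children=()):
        self.heading = heading
        self.children = list(children)

    def __str__(self):
        return "Section(" + self.heading + ")"


def iter_sections(skeleton, depth=0):
    """Yield (depth, section) pairs in document order, skipping non-sections."""
    for node in skeleton:
        if isinstance(node, Section):
            yield depth, node
            yield from iter_sections(node.children, depth + 1)


skeleton = [
    Para("Lead paragraph"),
    Section("History", [Para("Some text"), Section("Early years")]),
    Section("References"),
]

sections = [section for _, section in iter_sections(skeleton)]

# Printing the list uses repr() on each element, so __str__ is not called.
print(sections)

# Calling str() on each element explicitly gives readable output.
print([str(section) for section in sections])
# ['Section(History)', 'Section(Early years)', 'Section(References)']

# An indented outline, one heading per line.
for depth, section in iter_sections(skeleton):
    print("  " * depth + section.heading)
```

The generator keeps the nesting depth alongside each section, so the caller can decide whether to flatten, build paths, or print an outline.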
On 07/20/2018 09:54 AM, Ian Soboroff wrote:
> (quoted copy of the IPython session from the original report, trimmed)