google-research-datasets/natural-questions

How to get "Long Answer Candidates" from Wikipedia page source (HTML)

anshkumar opened this issue · 0 comments

I want to convert the source code of a Wikipedia webpage into the format you provide in the competition. For example, if we look at this webpage, I want to convert it into the following format (as provided by you in the competition):

{
  "example_id": "-1220107454853145579",
  "question_text": "who is the south african high commissioner in london",
  "document_text": "High Commission of South Africa , London - wikipedia <H1> High Commission of South Africa , London </H1> <Table> <Tr> <Th_colspan=\"2\"> High Commission of South Africa in London </Th> </Tr> <Tr> <Td_colspan=\"2\"> </Td> </Tr> <Tr> <Th> Location </Th> <Td> Trafalgar Square , London </Td> </Tr> <Tr> <Th> Address </Th> <Td> Trafalgar Square , London , WC2N 5DP </Td> </Tr> <Tr> <Th> Coordinates </Th> <Td> 51 ° 30 ′ 30 '' N 0 ° 07 ′ 37 '' W  /  51.5082 ° N 0.1269 ° W  / 51.5082 ; - 0.1269 Coordinates : 51 ° 30 ′ 30 '' N 0 ° 07 ′ 37 '' W  /  51.5082 ° N 0.1269 ° W  / 51.5082 ; - 0.1269 </Td> </Tr> <Tr> <Th> High Commissioner </Th> <Td> Vacant </Td> </Tr> </Table> Balcony of South Africa House <P> The High Commission of South Africa in London is the diplomatic mission from South Africa to the United Kingdom . It is located at South Africa House , a building on Trafalgar Square , London . As well as containing the offices of the High Commissioner , the building also hosts the South African consulate . It has been a Grade II * Listed Building since 1982 . </P> <H2> Contents </H2> <Ul> <Li> 1 History </Li> <Li> 2 See also </Li> <Li> 3 References </Li> <Li> 4 External links </Li> </Ul> <H2> History ( edit ) </H2> <P> South Africa House was built by Holland , Hannen & Cubitts in the 1930s on the site of what had been Morley 's Hotel until it was demolished in 1936 . The building was designed by Sir Herbert Baker , with architectural sculpture by Coert Steynberg and Sir Charles Wheeler , and opened in 1933 . The building was acquired by the government of South Africa as its main diplomatic presence in the UK . During World War II , Prime Minister Jan Smuts lived there while conducting South Africa 's war plans . </P> <P> In 1961 , South Africa became a republic , and withdrew from the Commonwealth due to its policy of racial segregation . Accordingly , the building became an Embassy , rather than a High Commission . 
During the 1980s , the building , which was one of the only South African diplomatic missions in a public area , was targeted by protesters from around the world . During the 1990 Poll Tax Riots , the building was set alight by rioters , although not seriously damaged . </P> <P> The first fully free democratic elections in South Africa were held on the 27 April 1994 , and 4 days later , the country rejoined the Commonwealth , 33 years to the day after it withdrew upon becoming a republic . Along with country 's diplomatic missions in other Commonwealth countries , the mission once again became a High Commission . </P> <P> Today , South Africa House is no longer a controversial site , and is the focal point of South African culture in the UK . South African President Nelson Mandela appeared on the balcony of South Africa House in 1996 , as part of his official UK state visit . In 2001 , Mandela again appeared on the balcony of South Africa House to mark the seventh anniversary of Freedom Day , when the apartheid system was officially abolished . </P> <H2> See also ( edit ) </H2> <Ul> <Li> List of diplomatic missions of South Africa </Li> <Li> High Commission of Canada to the United Kingdom </Li> <Li> High Commission of Uganda , London </Li> </Ul> <H2> References ( edit ) </H2> <Table> <Tr> <Td> </Td> <Td> Wikimedia Commons has media related to South Africa House , London . </Td> </Tr> </Table> <Ol> <Li> ^ Jump up to : `` The London Diplomatic List '' ( PDF ) . 14 December 2013 . Archived from the original ( PDF ) on 11 December 2013 . </Li> <Li> Jump up ^ Historic England . `` Details from listed building database ( 1066238 ) '' . National Heritage List for England . Retrieved 28 September 2015 . </Li> <Li> Jump up ^ Cubitts 1810 -- 1975 , published 1975 </Li> <Li> Jump up ^ `` The east side of Trafalgar Square '' . BHO . Retrieved 22 November 2015 . </Li> <Li> Jump up ^ Palliser , David Michael ; Clark , Peter ; Daunton , Martin J. ( 2000 ) . 
The Cambridge Urban History of Britain : 1840 -- 1950 . Cambridge University Press . p. 126 . </Li> <Li> ^ Jump up to : South Africa returns to the Commonwealth fold , The Independent , 31 May 1994 </Li> <Li> Jump up ^ Burns , Danny ( 1992 ) . Poll tax rebellion . AK Press . p. 90 . </Li> <Li> Jump up ^ United Kingdom of Great Britain and Northern Ireland , Department of International Relations and Cooperation </Li> <Li> Jump up ^ Hero 's welcome for Mandela at concert . BBC News . April 30 , 2001 . </Li> </Ol> <H2> External links ( edit ) </H2> <Ul> <Li> Official site </Li> </Ul> <Table> <Tr> <Th_colspan=\"2\"> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> Diplomatic missions in the United Kingdom </Th> </Tr> <Tr> <Th> Africa </Th> <Td> <Ul> <Li> Algeria </Li> <Li> Angola </Li> <Li> Botswana </Li> <Li> Burundi </Li> <Li> Cameroon </Li> <Li> Democratic Republic of the Congo </Li> <Li> Egypt </Li> <Li> Equatorial Guinea </Li> <Li> Eritrea </Li> <Li> Ethiopia </Li> <Li> Gabon </Li> <Li> The Gambia </Li> <Li> Ghana </Li> <Li> Guinea </Li> <Li> Ivory Coast </Li> <Li> Kenya </Li> <Li> Lesotho </Li> <Li> Liberia </Li> <Li> Libya </Li> <Li> Malawi </Li> <Li> Mauritania </Li> <Li> Mauritius </Li> <Li> Morocco </Li> <Li> Mozambique </Li> <Li> Namibia </Li> <Li> Nigeria </Li> <Li> Rwanda </Li> <Li> Senegal </Li> <Li> Seychelles </Li> <Li> Sierra Leone </Li> <Li> South Africa </Li> <Li> South Sudan </Li> <Li> Sudan </Li> <Li> Swaziland </Li> <Li> Tanzania </Li> <Li> Togo </Li> <Li> Tunisia </Li> <Li> Uganda </Li> <Li> Zambia </Li> <Li> Zimbabwe </Li> </Ul> </Td> </Tr> <Tr> <Th> Americas </Th> <Td> <Ul> <Li> Antigua and Barbuda </Li> <Li> Argentina </Li> <Li> The Bahamas </Li> <Li> Barbados </Li> <Li> Belize </Li> <Li> Bolivia </Li> <Li> Brazil </Li> <Li> Canada </Li> <Li> Chile </Li> <Li> Colombia </Li> <Li> Costa Rica </Li> <Li> Cuba </Li> <Li> Dominica </Li> <Li> Dominican Republic </Li> <Li> Ecuador </Li> <Li> El Salvador </Li> <Li> Grenada </Li> <Li> Guatemala </Li> <Li> 
Guyana </Li> <Li> Haiti </Li> <Li> Honduras </Li> <Li> Jamaica </Li> <Li> Mexico </Li> <Li> Nicaragua </Li> <Li> Panama </Li> <Li> Paraguay </Li> <Li> Peru </Li> <Li> Saint Kitts and Nevis </Li> <Li> Saint Lucia </Li> <Li> Saint Vincent and the Grenadines </Li> <Li> Trinidad and Tobago </Li> <Li> United States of America </Li> <Li> Uruguay </Li> <Li> Venezuela </Li> </Ul> </Td> </Tr> <Tr> <Th> Asia </Th> <Td> <Ul> <Li> Afghanistan </Li> <Li> Armenia </Li> <Li> Azerbaijan </Li> <Li> Bahrain </Li> <Li> Bangladesh </Li> <Li> Brunei </Li> <Li> Cambodia </Li> <Li> China </Li> <Li> East Timor </Li> <Li> Georgia </Li> <Li> India </Li> <Li> Indonesia </Li> <Li> Iran </Li> <Li> Iraq </Li> <Li> Israel </Li> <Li> Japan </Li> <Li> Jordan </Li> <Li> Kazakhstan </Li> <Li> Kuwait </Li> <Li> Kyrgyzstan </Li> <Li> Laos </Li> <Li> Lebanon </Li> <Li> Malaysia </Li> <Li> Maldives </Li> <Li> Mongolia </Li> <Li> Myanmar </Li> <Li> Nepal </Li> <Li> North Korea </Li> <Li> Oman </Li> <Li> Pakistan </Li> <Li> The Philippines </Li> <Li> Qatar </Li> <Li> Saudi Arabia </Li> <Li> Singapore </Li> <Li> South Korea </Li> <Li> Sri Lanka </Li> <Li> Syria </Li> <Li> Tajikistan </Li> <Li> Thailand </Li> <Li> Turkey </Li> <Li> Turkmenistan </Li> <Li> United Arab Emirates </Li> <Li> Uzbekistan </Li> <Li> Vietnam </Li> <Li> Yemen </Li> </Ul> </Td> </Tr> <Tr> <Th> Europe </Th> <Td> <Ul> <Li> Albania </Li> <Li> Austria </Li> <Li> Belarus </Li> <Li> Belgium </Li> <Li> Bosnia and Herzegovina </Li> <Li> Bulgaria </Li> <Li> Croatia </Li> <Li> Cyprus </Li> <Li> Czech Republic </Li> <Li> Denmark </Li> <Li> Estonia </Li> <Li> Finland </Li> <Li> France </Li> <Li> Germany </Li> <Li> Greece </Li> <Li> Hungary </Li> <Li> Iceland </Li> <Li> Ireland </Li> <Li> Italy </Li> <Li> Kosovo </Li> <Li> Latvia </Li> <Li> Lithuania </Li> <Li> Luxembourg </Li> <Li> Macedonia </Li> <Li> Malta </Li> <Li> Moldova </Li> <Li> Monaco </Li> <Li> Montenegro </Li> <Li> The Netherlands </Li> <Li> Norway </Li> <Li> Poland </Li> <Li> 
Portugal </Li> <Li> Romania </Li> <Li> Russia </Li> <Li> Serbia </Li> <Li> Slovakia </Li> <Li> Slovenia </Li> <Li> Spain </Li> <Li> Sweden </Li> <Li> Switzerland </Li> <Li> Ukraine </Li> <Li> Vatican City ( Apostolic Nunciature ) </Li> </Ul> </Td> </Tr> <Tr> <Th> Oceania </Th> <Td> <Ul> <Li> Australia </Li> <Li> Fiji </Li> <Li> New Zealand </Li> <Li> Papua New Guinea </Li> <Li> Tonga </Li> </Ul> </Td> </Tr> <Tr> <Th> States with limited recognition </Th> <Td> <Ul> <Li> North Cyprus </Li> <Li> Palestine </Li> <Li> Taiwan </Li> </Ul> </Td> </Tr> <Tr> <Th> De facto independent states </Th> <Td> <Ul> <Li> Somaliland </Li> </Ul> </Td> </Tr> <Tr> <Th> British Overseas Territories </Th> <Td> <Ul> <Li> Anguilla </Li> <Li> Bermuda </Li> <Li> British Virgin Islands </Li> <Li> Cayman Islands </Li> <Li> Falkland Islands </Li> <Li> Gibraltar </Li> <Li> Montserrat </Li> <Li> Saint Helena </Li> <Li> Tristan da Cunha </Li> <Li> Turks and Caicos Islands </Li> </Ul> </Td> </Tr> <Tr> <Th> Other economies with their own representations </Th> <Td> Hong Kong </Td> </Tr> <Tr> <Th> International organisations </Th> <Td> <Ul> <Li> Arab League </Li> <Li> European Union </Li> <Li> International Organisation for Migration </Li> <Li> United Nations <Ul> <Li> UNHCR </Li> <Li> World Food Programme </Li> </Ul> </Li> <Li> World Bank </Li> </Ul> </Td> </Tr> </Table> <Table> <Tr> <Th_colspan=\"3\"> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> Trafalgar Square , London </Th> </Tr> <Tr> <Th> Buildings </Th> <Td> <Table> <Tr> <Th> Current </Th> <Td> <Ul> <Li> Clockwise from North : National Gallery </Li> <Li> St Martin - in - the - Fields </Li> <Li> South Africa House </Li> <Li> Drummonds Bank </Li> <Li> Admiralty Arch </Li> <Li> Uganda House <Ul> <Li> Embassy of Burundi </Li> <Li> High Commission of Uganda </Li> </Ul> </Li> <Li> Canadian Pacific building </Li> <Li> Admiralty ( pub ) </Li> <Li> Canada House </Li> </Ul> </Td> </Tr> <Tr> <Th> Former </Th> <Td> <Ul> <Li> Morley 's Hotel </Li> <Li> 
Northumberland House </Li> <Li> Royal Mews </Li> </Ul> </Td> </Tr> </Table> </Td> <Td> </Td> </Tr> <Tr> <Th> Statues </Th> <Td> <Table> <Tr> <Th> Plinths </Th> <Td> <Ul> <Li> SE : Henry Havelock </Li> <Li> SW : Charles Napier </Li> <Li> NE : George IV </Li> <Li> NW : Fourth plinth </Li> </Ul> </Td> </Tr> <Tr> <Th> Busts </Th> <Td> <Ul> <Li> Lord Beatty </Li> <Li> Lord Jellicoe </Li> <Li> Lord Cunningham </Li> </Ul> </Td> </Tr> <Tr> <Th> Other </Th> <Td> <Ul> <Li> Charles I <Ul> <Li> Charing Cross </Li> </Ul> </Li> <Li> Nelson 's Column </Li> <Li> James II </Li> <Li> George Washington </Li> </Ul> </Td> </Tr> </Table> </Td> </Tr> <Tr> <Th> Adjacent streets </Th> <Td> <Ul> <Li> Charing Cross Road </Li> <Li> Cockspur Street </Li> <Li> Northumberland Avenue </Li> <Li> Strand </Li> <Li> Whitehall </Li> </Ul> </Td> </Tr> <Tr> <Th> People </Th> <Td> <Table> <Tr> <Th> Architects </Th> <Td> <Ul> <Li> Charles Barry </Li> <Li> Norman Foster </Li> <Li> Edwin Lutyens </Li> <Li> John Nash </Li> </Ul> </Td> </Tr> <Tr> <Th> Fourth plinth sculptors </Th> <Td> <Ul> <Li> Elmgreen and Dragset </Li> <Li> Katharina Fritsch <Ul> <Li> Hahn / Cock </Li> </Ul> </Li> <Li> Antony Gormley <Ul> <Li> One & Other </Li> </Ul> </Li> <Li> Marc Quinn </Li> <Li> Thomas Schütte </Li> <Li> Yinka Shonibare </Li> <Li> Mark Wallinger </Li> <Li> Rachel Whiteread </Li> <Li> Bill Woodrow </Li> </Ul> </Td> </Tr> </Table> </Td> </Tr> <Tr> <Th> Events </Th> <Td> <Ul> <Li> Poll Tax Riots </Li> </Ul> </Td> </Tr> <Tr> <Th> Miscellaneous </Th> <Td> <Ul> <Li> Christmas tree </Li> </Ul> </Td> </Tr> <Tr> <Td_colspan=\"3\"> <Ul> <Li> </Li> <Li> Commons </Li> </Ul> </Td> </Tr> </Table> Retrieved from `` https://en.wikipedia.org/w/index.php?title=High_Commission_of_South_Africa,_London&oldid=850142361 '' Categories : <Ul> <Li> Diplomatic missions in London </Li> <Li> Trafalgar Square </Li> <Li> Diplomatic missions of South Africa </Li> <Li> Herbert Baker buildings and structures </Li> <Li> South Africa -- United Kingdom 
relations </Li> <Li> South Africa and the Commonwealth of Nations </Li> <Li> Grade II * listed buildings in the City of Westminster </Li> <Li> Buildings and structures completed in 1933 </Li> </Ul> <Ul> <Li> </Li> <Li> </Li> </Ul> <H2> </H2> <H3> </H3> <Ul> <Li> </Li> <Li> Talk </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> </Ul> <H3> </H3> <H3> </H3> <Ul> <Li> </Li> <Li> Contents </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> About Wikipedia </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> </Li> </Ul> <H3> </H3> <Ul> <Li> Afrikaans </Li> </Ul> Edit links <Ul> <Li> This page was last edited on 13 July 2018 , at 22 : 10 ( UTC ) . </Li> <Li> </Li> </Ul> <Ul> <Li> </Li> <Li> About Wikipedia </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> <Li> </Li> </Ul> <Ul> <Li> </Li> <Li> </Li> </Ul>",
  "long_answer_candidates": [
    {
      "end_token": 136,
      "start_token": 18,
      "top_level": true
    },
    {
      "end_token": 30,
      "start_token": 19,
      "top_level": false
    },
    {
      "end_token": 45,
      "start_token": 34,
      "top_level": false
    },
    {
      "end_token": 59,
      "start_token": 45,
      "top_level": false
    },
    {
      "end_token": 126,
      "start_token": 59,
      "top_level": false
    },
    {
      "end_token": 135,
      "start_token": 126,
      "top_level": false
    },
    {
      "end_token": 211,
      "start_token": 141,
      "top_level": true
    },
    {
      "end_token": 336,
      "start_token": 240,
      "top_level": true
    },
    {
      "end_token": 425,
      "start_token": 336,
      "top_level": true
    },
    {
      "end_token": 488,
      "start_token": 425,
      "top_level": true
    },
    {
      "end_token": 570,
      "start_token": 488,
      "top_level": true
    }
  ]
}
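
For reference, the offsets in "long_answer_candidates" appear to index into "document_text" split on single spaces, with "end_token" exclusive: in the record above, tokens 18 to 135 run from <Table> to the matching </Table>. A minimal sketch of reading a candidate back out, using a small hand-made record (the helper name `candidate_tokens` and the toy data are illustrative, not part of the dataset tooling):

```python
def candidate_tokens(example, candidate):
    """Return the tokens covered by one long-answer candidate.

    Assumes document_text is split on single spaces and that
    end_token is exclusive, as the record above suggests.
    """
    tokens = example["document_text"].split(" ")
    return tokens[candidate["start_token"]:candidate["end_token"]]


# Hand-made record in the same shape as the one above (not real data).
example = {
    "document_text": "Title <P> Some text . </P> <Table> <Tr> <Td> x </Td> </Tr> </Table>",
    "long_answer_candidates": [
        {"start_token": 1, "end_token": 6, "top_level": True},    # <P> ... </P>
        {"start_token": 6, "end_token": 13, "top_level": True},   # <Table> ... </Table>
        {"start_token": 7, "end_token": 12, "top_level": False},  # <Tr> ... </Tr>
    ],
}

for cand in example["long_answer_candidates"]:
    print(cand["top_level"], " ".join(candidate_tokens(example, cand)))
```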

My main question is: can I get the script that converts the source code of a webpage into "document_text" and "long_answer_candidates"?
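
To make the request concrete, here is a rough sketch of the kind of conversion I mean, written with Python's standard `html.parser`. It is not the script used to build the dataset. The tag sets, the regex tokenizer, the rule that skips candidates with no text, and the names `NQLikeConverter` and `convert` are all assumptions inferred from the example above, and it expects well-formed HTML with explicit closing tags.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets, guessed from the example record; not an official list.
KEPT_TAGS = {"h1", "h2", "h3", "table", "tr", "th", "td", "p", "ul", "ol", "li"}
CANDIDATE_TAGS = {"table", "tr", "p", "ul", "ol", "li"}
SKIPPED_TAGS = {"script", "style"}


class NQLikeConverter(HTMLParser):
    """Flattens HTML into tokens and records candidate token spans."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []
        self.candidates = []
        self._open = []        # stack of (tag, start_token, text_count_at_open)
        self._text_count = 0   # number of text (non-tag) tokens seen so far
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
            return
        if tag not in KEPT_TAGS:
            return
        name = tag.capitalize()
        colspan = dict(attrs).get("colspan")
        if colspan:
            name += f'_colspan="{colspan}"'  # mimics <Th_colspan="2">
        if tag in CANDIDATE_TAGS:
            self._open.append((tag, len(self.tokens), self._text_count))
        self.tokens.append(f"<{name}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if tag not in KEPT_TAGS:
            return
        self.tokens.append(f"</{tag.capitalize()}>")
        if tag in CANDIDATE_TAGS and self._open and self._open[-1][0] == tag:
            _, start, text_at_open = self._open.pop()
            if self._text_count > text_at_open:  # assumed: skip empty elements
                self.candidates.append({
                    "end_token": len(self.tokens),  # exclusive
                    "start_token": start,
                    "top_level": not self._open,    # no enclosing candidate
                })

    def handle_data(self, data):
        if self._skip_depth:
            return
        # Crude tokenizer (assumption): words, and each punctuation mark alone.
        words = re.findall(r"\w+|[^\w\s]", data)
        self.tokens.extend(words)
        self._text_count += len(words)


def convert(html):
    """Return a dict with document_text and long_answer_candidates."""
    parser = NQLikeConverter()
    parser.feed(html)
    parser.close()
    ordered = sorted(parser.candidates,
                     key=lambda c: (c["start_token"], -c["end_token"]))
    return {"document_text": " ".join(parser.tokens),
            "long_answer_candidates": ordered}


html = ("<table><tr><th>Location</th><td>Trafalgar Square, London</td></tr>"
        "<tr><td></td></tr></table><p>Hello world.</p>")
result = convert(html)
print(result["document_text"])
print(result["long_answer_candidates"])
```

The real tokenization clearly differs from this regex (the record above contains tokens such as `''` and `--`), so offsets produced this way would not match the released data exactly, which is why I am asking for the original script.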