barrust/mediawiki

Getting data from table

SalvatoreRa opened this issue · 10 comments

Very nice package.

I am trying to write a script that, for a TV series, extracts the content of the season episodes:

from mediawiki import MediaWiki
wikipedia = MediaWiki()
p = wikipedia.page('Andor_(TV_series)')
p.sections
p.content

The content does not include the episode text (on the page it is inside a table). I have also tried
p.table_of_contents['Episodes']['Season 1 (2022)']

which returns an empty structure.

Thank you very much for your help

I am glad that you find the package useful! I haven't been able to find an endpoint in the wiki API that pulls information directly from tables, but you could use BeautifulSoup to parse the HTML directly.

Something like:

from bs4 import BeautifulSoup
from mediawiki import MediaWiki

wikipedia = MediaWiki()
p = wikipedia.page('Andor_(TV_series)')

soup = BeautifulSoup(p.html, "html.parser")
episodes = soup.find("table", {"class": "wikiepisodetable"})

# Do something to parse the table as per the documentation on bs4

I hope this is helpful!

Thank you for your reply,

I have used BeautifulSoup:

import requests
from bs4 import BeautifulSoup

def text_recovery(url):
    # Request the page at the given URL
    response = requests.get(str(url))

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the first table that contains the episode summaries
    table = soup.find('table', {'class': 'wikiepisodetable'})
    text = []

    # Some pages have no such table; return an empty list in that case
    if table is None:
        return text

    # Iterate over the rows in the table
    for row in table.find_all('tr'):
        # Find the cells in each row
        cells = row.find_all('td')

        # A row with a single cell holds the episode summary
        if len(cells) == 1:
            # Collect the summary text
            text.append(cells[0].text)
    return text

This works for the Andor page. However, I have realized that not all pages are the same, and I was wondering if there is a way to extract the same information in an agnostic way: something that, given a series x, provides the text of the episode descriptions.

Sadly, not that I know of, as I haven't been able to find a MediaWiki API that can help with that.

I will have to check whether the content or wikitext output could help.

Is there maybe a way to interact with the database? For example, is the information stored in some sort of SQL or GraphQL database behind the wiki?

Not through this Python package, as it is just a wrapper for the API and doesn't have access to the back-end system, only to what is provided through the API.

I understand, thank you for your help

The p.wikitext property might be helpful as it has this type of information:

===Season 1 (2022)===
{{Episode table |background=#804A41 |overall= |title= |director= |writer= |airdate= |released=y |episodes=
{{Episode list
 |EpisodeNumber   = 1
 |Title           = Kassa 
 |DirectedBy      = [[Toby Haynes]]
 |WrittenBy       = [[Tony Gilroy]]
 |OriginalAirDate = {{Start date|2022|9|21}}
 |ShortSummary    = Five years before the Battle of Yavin, Cassian Andor looks for his missing sister in the industrial planet of Morlana One. While investigating, Cassian is antagonized by two officers. An altercation ensues, leading to Cassian accidentally killing one officer and murdering the other. He flees to the planet Ferrix and attempts to hide his involvement by convincing his adopted mother Maarva's droid, B2EMO, and his friend, Brasso, to cover for him. Having a Starpath Unit (a valuable piece of Imperial navigation technology), Cassian asks his friend Bix to connect him with a black market buyer. Bix agrees and contacts the buyer. Meanwhile, Bix's boyfriend, Timm, is suspicious of Andor. To improve his report to the Imperial authorities, Morlana One's chief inspector of security elects to cover up the murders. However, his deputy, the dutiful Syril Karn, is determined to solve the case. He identifies Cassian's ship, traces it to Ferrix and learns that the fugitive is from the planet Kenari. In a flashback, a younger Cassian, known as Kassa, and his tribe on Kenari decide to investigate a crashed ship. Kassa rebuffs his younger sister's efforts to join them, leaving her behind to guard their encampment. 
 |LineColor       = 804A41
}}
...
}}

This means the wikitext could also be parsed to get the text; I still haven't seen an API that pulls tables directly.
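A minimal sketch of parsing that wikitext with the standard library only. It assumes each template parameter sits on its own line, as in the excerpt above; `extract_episodes` and `PARAM_RE` are hypothetical names introduced for illustration and are not part of the package. Pages that lay out their templates differently will need a more robust parser.

```python
import re

# Matches a template parameter written on its own line, e.g. " |Title = Kassa".
PARAM_RE = re.compile(r"^\s*\|\s*(\w+)\s*=\s*(.*?)\s*$")


def extract_episodes(wikitext):
    """Collect the parameters of each {{Episode list}} template as a dict.

    Assumption: one parameter per line and a closing "}}" on its own line.
    """
    episodes = []
    current = None   # parameters of the template being read, if any
    last_key = None  # last parameter seen, used for wrapped values
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("{{episode list"):
            current, last_key = {}, None
        elif current is not None and stripped == "}}":
            episodes.append(current)
            current = None
        elif current is not None:
            match = PARAM_RE.match(line)
            if match:
                last_key = match.group(1)
                current[last_key] = match.group(2)
            elif last_key and stripped:
                # A value that wraps onto the next line is appended.
                current[last_key] += " " + stripped
    return episodes
```

With the page object from earlier in the thread, something like `[ep.get("ShortSummary") for ep in extract_episodes(p.wikitext)]` should give the summaries, provided the page follows this layout.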

I will try p.wikitext!

However, I still have to find a way to handle the case where the episodes (and the table) are on another page. The problem with the wiki pages is that the format is not uniform.
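One possible way to follow episodes onto another page, sketched under the assumption that the series page points to its episode list with a `{{Main|...}}` hatnote, which is a common Wikipedia convention but not a guaranteed one. `linked_main_pages` and `MAIN_RE` are hypothetical names introduced for illustration.

```python
import re

# Matches hatnotes such as {{Main|List of Example episodes}} or {{main|A|B}}.
MAIN_RE = re.compile(r"\{\{\s*main(?: article)?\s*\|([^{}]*)\}\}", re.IGNORECASE)


def linked_main_pages(wikitext):
    """Return page titles named in {{Main|...}} hatnotes, in order of appearance."""
    titles = []
    for match in MAIN_RE.finditer(wikitext):
        for part in match.group(1).split("|"):
            title = part.strip()
            # Skip empty entries, named parameters such as "l1=label", and repeats.
            if title and "=" not in title and title not in titles:
                titles.append(title)
    return titles
```

Each returned title could then be passed to `wikipedia.page(title)` and its `wikitext` parsed the same way as the series page. Pages that link their episode lists some other way would not be found by this.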

Yes, the one drawback is that it isn't always standardized.

Good luck!

Yes, it is a pity, since there is so much interesting information in the wiki for model training or for building apps.

Thank you very much for your help!