Getting data from table
SalvatoreRa opened this issue · 10 comments
Very nice package.
I am trying to write a script that, for a TV series, extracts the content of the season episodes:
from mediawiki import MediaWiki
wikipedia = MediaWiki()
p = wikipedia.page('Andor_(TV_series)')
p.sections
p.content
The content does not include that text (on the page it is inside a table), and I have also tried
p.table_of_contents['Episodes']['Season 1 (2022)']
which returns an empty structure
Thank you very much for your help
I am glad that you find the package useful! I haven't been able to find a way to pull information from tables directly through the MediaWiki API, but you could use BeautifulSoup to parse the HTML directly.
Something like:
from bs4 import BeautifulSoup
from mediawiki import MediaWiki
wikipedia = MediaWiki()
p = wikipedia.page('Andor_(TV_series)')
soup = BeautifulSoup(p.html, "html.parser")
episodes = soup.find("table", {"class": "wikiepisodetable"})
# Do something to parse the table as per the documentation on bs4
I hope this is helpful!
Thank you for your reply,
I have used beautifulsoup:
import requests
from bs4 import BeautifulSoup

def text_recovery(url):
    # Make a request to the URL
    response = requests.get(str(url))
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find the table that contains the episode summaries
    table = soup.find('table', {'class': 'wikiepisodetable'})
    text = []
    # Iterate over the rows in the table
    for row in table.find_all('tr'):
        # Find the cells in each row
        cells = row.find_all('td')
        # Summary rows contain a single cell
        if len(cells) == 1:
            # Extract the episode summary and collect it
            episode_summary = cells[0].text
            text.append(episode_summary)
    return text
This works for the Andor page; however, I have realized that not all pages are the same, and I was wondering if there is a way to extract the same information in an agnostic way: something that, given a series x, provides the text of the episodes' descriptions.
Sadly, not that I know of, as I haven't been able to find a MediaWiki API that can help with that.
I will have to look at the contents or wikitext output to see if that could help.
Is there maybe a way to interact with the database? Like, is the information in some sort of SQL or GraphQL database of the wiki?
Not through this Python package, as it is just a wrapper for the API and doesn't have access to the back-end system, just what is provided through the API.
I understand, thank you for your help
The p.wikitext property might be helpful, as it has this type of information:
===Season 1 (2022)===
{{Episode table |background=#804A41 |overall= |title= |director= |writer= |airdate= |released=y |episodes=
{{Episode list
|EpisodeNumber = 1
|Title = Kassa
|DirectedBy = [[Toby Haynes]]
|WrittenBy = [[Tony Gilroy]]
|OriginalAirDate = {{Start date|2022|9|21}}
|ShortSummary = Five years before the Battle of Yavin, Cassian Andor looks for his missing sister in the industrial planet of Morlana One. While investigating, Cassian is antagonized by two officers. An altercation ensues, leading to Cassian accidentally killing one officer and murdering the other. He flees to the planet Ferrix and attempts to hide his involvement by convincing his adopted mother Maarva's droid, B2EMO, and his friend, Brasso, to cover for him. Having a Starpath Unit (a valuable piece of Imperial navigation technology), Cassian asks his friend Bix to connect him with a black market buyer. Bix agrees and contacts the buyer. Meanwhile, Bix's boyfriend, Timm, is suspicious of Andor. To improve his report to the Imperial authorities, Morlana One's chief inspector of security elects to cover up the murders. However, his deputy, the dutiful Syril Karn, is determined to solve the case. He identifies Cassian's ship, traces it to Ferrix and learns that the fugitive is from the planet Kenari. In a flashback, a younger Cassian, known as Kassa, and his tribe on Kenari decide to investigate a crashed ship. Kassa rebuffs his younger sister's efforts to join them, leaving her behind to guard their encampment.
|LineColor = 804A41
}}
...
}}
This means it could also be used to parse the text; I still haven't seen a way to pull tables directly from the API.
I would try p.wikitext!
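For example, since the summaries sit in the ShortSummary fields of the {{Episode list}} templates, a rough regex sketch could pull them out of the wikitext (the helper name is mine; for anything beyond a quick experiment, a real wikitext parser such as mwparserfromhell would be more robust):

```python
import re

def short_summaries(wikitext):
    """Collect the |ShortSummary = values from {{Episode list}} wikitext.

    A simple regex sketch: captures everything after |ShortSummary =
    up to the next |parameter line or the closing braces.
    """
    pattern = re.compile(
        r"\|\s*ShortSummary\s*=\s*(.*?)(?=\n\s*\||\n\s*\}\})",
        re.DOTALL,
    )
    return [s.strip() for s in pattern.findall(wikitext)]

# A trimmed-down sample in the shape of the wikitext above
sample = """{{Episode list
|EpisodeNumber = 1
|Title = Kassa
|ShortSummary = Cassian looks for his missing sister.
|LineColor = 804A41
}}"""
print(short_summaries(sample))  # ['Cassian looks for his missing sister.']
```

You would feed it p.wikitext instead of the sample, of course.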
However, I still have to find a way to handle cases where the episodes (and the table) are on another page. The problem with the wiki pages is that the format is not uniform.
Yes, that is the one drawback: it isn't always standardized.
Good luck!
Yes, it is a pity, since there is so much interesting information on the wiki for model training or building apps.
Thank you very much for your help!