Incomplete list of categories
Fetzii opened this issue · 1 comments
- spikex version: 0.5.2
- Python version: 3.9.7
- Operating System: Windows 10
Description
I want to get all categories of a page, but most of them are missing from the result.
What I Did
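from spikex.wikigraph import load as wg_load
# Assumption: the original report omitted this line; "dewiki_core" is a guess based on the German output.
wg = wg_load("dewiki_core")
page = "Peking_2022"
categories = wg.get_categories(page, distance=1)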
What I get: ['Category:Olympische_Winterspiele_2022']
The output I expect: ['Austragung der Olympischen Winterspiele', 'Olympische Winterspiele 2022', 'Sport (Hebei)', 'Sportveranstaltung 2022', 'Sportveranstaltung in Peking', 'Wikipedia:Veraltet nach Jahr 2022', 'Zukünftige Sportveranstaltung']
Proof: https://de.wikipedia.org/wiki/Olympische_Winterspiele_2022
I created a categorylinks dictionary from categorylinks.sql.gz, in which the keys are page_ids and each value is the list of categories for that page. I used your functions to get the page_id, page_id = self.get_pageid(self.redirect(page)), and looked it up in my categorylinks dictionary. With this method I get the expected output. If the current behaviour is not intended, I suspect there is a problem with how categorylinks.sql.gz is processed on your side.
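For reference, the dictionary described above could be built roughly like this. It is only a minimal sketch, not spikex code: the names parse_categorylinks and load_categorylinks are hypothetical, and it assumes the dump layout in which each row starts with cl_from (the page id) followed by cl_to (the category title). The regular expression is a shortcut rather than a full SQL parser, so unusual sort keys could in principle confuse it.

```python
import gzip
import re
from collections import defaultdict

# Matches the first two fields of each row tuple: (cl_from,'cl_to',
# Assumption: cl_to is the second column of the dumped table.
ROW_START = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)',")


def parse_categorylinks(lines):
    """Map page_id -> list of category titles from SQL INSERT lines."""
    links = defaultdict(list)
    for line in lines:
        # Only INSERT statements carry row data; skip DDL and comments.
        if not line.startswith("INSERT INTO"):
            continue
        for page_id, category in ROW_START.findall(line):
            # Drop the backslash from escaped characters (e.g. \' becomes ').
            links[int(page_id)].append(re.sub(r"\\(.)", r"\1", category))
    return dict(links)


def load_categorylinks(path):
    """Read a gzipped dump such as categorylinks.sql.gz line by line."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return parse_categorylinks(fh)
```

With a page_id obtained as described above, the lookup is then just links.get(page_id, []) on the dictionary returned by load_categorylinks.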
I'm facing the same problem with ptwiki_core.