Status visible, old, and deleted fics
htnyquist opened this issue · 5 comments
Hello,
I noticed that every fic in the index.json has 'status': 'visible', even the ones that have been deleted or that are not accessible publicly.
The submitted and published fields are also always true as far as I can tell.
But the data for some of those stories is pretty inconsistent with the rest of the archive. I think most of those are deleted stories, but I have no way to exclude them since every story is marked visible and published in the archive.
Some of the problems with not being able to exclude deleted stories:
- Stories on the site all have a 'series' tag (e.g. MLP-FiM or EQG), but there are ~45k fics in the archive that don't have one. Many of those seem to be old deleted stories, but there's no reliable way to know.
- Note that there are also non-deleted stories that have inconsistent tags! Story 31718 has the MLP:FiM tag on the site, but not in the archive
- The non-story data won't be up to date: the author object will be full of NULL values on some stories, but not others (even though the author's account is still active).
- It makes it harder to use fimfarchive as a data source in general. For example I saw the search GUI that works offline, but if I wanted something like this as a webpage that links to the real site, I'd need a way to filter dead links.
So, is it intended that status, submitted and published are always truthy?
Is there a way to filter out deleted fics that I missed, and is it normal that some of the non-deleted fics' tags don't match what's on the site?
All of the metadata (except for archive) come directly from Fimfiction. So, when something goes missing it's essentially frozen in time. It's not really by design, but I just haven't done all that much when it comes to cleaning things up.
There is a way for you to filter out stories that are no longer available though! You can do this by looking into the (somewhat poorly named) timestamps in archive. Note that some of these are null since I haven't retroactively added things such as creation dates.
- date_checked: When did we last try to update the story?
- date_created: When was the story added to the archive?
- date_fetched: When was the last update of the metadata?
- date_updated: When was the last update of the content?
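The other timestamps will exactly match date_checked if a change happened for that version of the archive. So, to check if a story was publicly available on Fimfiction you could compare date_checked to date_fetched. If the two are equal, the metadata was successfully refreshed at the last check, which means the story was still reachable at that point.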
>>> from fimfarchive.fetchers import FimfarchiveFetcher
>>>
>>> def was_available(story):
...     archive = story.meta['archive']
...     date_checked = archive['date_checked']
...     date_fetched = archive['date_fetched']
...
...     return date_checked == date_fetched
...
>>>
>>> fetcher = FimfarchiveFetcher('fimfarchive.zip')
>>> sum(1 for story in fetcher if was_available(story))
131322
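If you would rather work from the parsed index.json than go through the fetcher, the same comparison can probably be sketched as below. This is only a rough sketch: it assumes the index is a JSON object mapping story IDs to metadata, each with an archive object holding the timestamps listed above. The sample entries, their timestamp values, and the filter_available helper are made up for illustration. Null timestamps are treated as "unknown" and excluded.

```python
import json

# Hypothetical sample shaped like the index is assumed to be:
# story ID -> metadata, with the timestamps under "archive".
SAMPLE_INDEX = """
{
  "1": {"archive": {"date_checked": "2020-01-01T00:00:00+00:00",
                    "date_fetched": "2020-01-01T00:00:00+00:00"}},
  "2": {"archive": {"date_checked": "2020-01-01T00:00:00+00:00",
                    "date_fetched": "2018-06-01T00:00:00+00:00"}},
  "3": {"archive": {"date_checked": null, "date_fetched": null}}
}
"""


def was_available(meta):
    """Return True if the metadata was refreshed at the last check."""
    archive = meta.get('archive') or {}
    date_checked = archive.get('date_checked')
    date_fetched = archive.get('date_fetched')

    # Some timestamps are null; without both values we cannot tell.
    if date_checked is None or date_fetched is None:
        return False

    return date_checked == date_fetched


def filter_available(index):
    """Keep only the entries that pass was_available (hypothetical helper)."""
    return {key: meta for key, meta in index.items() if was_available(meta)}


index = json.loads(SAMPLE_INDEX)
available = filter_available(index)
print(sorted(available))
```

Excluding entries with null timestamps is a conservative choice; you may prefer to keep them and flag them as unverified instead.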
Hope that helps!
Also, I'll be looking into story 31718.
It seems you might have found a rather significant bug, so thank you!
Thanks, date_fetched and date_checked sound like exactly what I need!
I am curious about the case of 31718, but ultimately if it's limited to the series tag that's something I can deal with by just assuming any fic without one is automatically 'MLP-FiM' (I assume that's what Fimfic did retroactively).
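For what it's worth, here is a minimal sketch of that fallback. It assumes each story's metadata has a tags list of objects with type and name fields, which I have not verified against the archive, and it uses 'MLP-FiM' as a placeholder name rather than whatever the site actually calls the tag. The helper names are hypothetical. It is a heuristic only, since a missing series tag might have other causes.

```python
# Placeholder tag; the real series tag name and fields may differ.
DEFAULT_SERIES = {'type': 'series', 'name': 'MLP-FiM'}


def series_tags(meta):
    """Return the tags whose type is 'series' (assumed tag layout)."""
    tags = meta.get('tags') or []
    return [tag for tag in tags if tag.get('type') == 'series']


def with_default_series(meta):
    """Return meta unchanged if it has a series tag, else a patched copy."""
    if series_tags(meta):
        return meta

    # Copy instead of mutating, so the archive data stays as fetched.
    patched = dict(meta)
    patched['tags'] = list(meta.get('tags') or []) + [dict(DEFAULT_SERIES)]
    return patched


tagged = {'tags': [{'type': 'series', 'name': 'EQG'}]}
untagged = {'tags': [{'type': 'genre', 'name': 'Comedy'}]}

print(series_tags(with_default_series(tagged)))
print(series_tags(with_default_series(untagged)))
```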
I kept track of 31718 during the last archive update, and it seems to have updated fine without me doing anything. I believe what happened was that the story had been unpublished for a while. It probably got published again some time after the previous release.
There seem to be comments hinting at that as well!