JockeTF/fimfarchive

Status visible, old, and deleted fics

htnyquist opened this issue · 5 comments

Hello,
I noticed that every fic in the index.json has 'status': 'visible', even the ones that have been deleted or that are not accessible publicly.
The submitted and published fields are also always true as far as I can tell.

But the data for some of those stories is pretty inconsistent with the rest of the archive. I think most of those are deleted stories, but I have no way to exclude them since every story is marked visible and published in the archive.
Some of the problems with not being able to exclude deleted stories:

  • Stories on the site all have a 'series' tag (e.g. MLP-FiM or EQG), but there are ~45k fics in the archive that don't have one. Many of those seem to be old deleted stories, but there's no reliable way to know.
    • Note that there are also non-deleted stories that have inconsistent tags! Story 31718 has the MLP:FiM tag on the site, but not in the archive.
  • The non-story data won't be up to date: the author object will be full of NULL values on some stories, but not others (even though the author's account is still active).
  • It makes it harder to use fimfarchive as a data source in general. For example I saw the search GUI that works offline, but if I wanted something like this as a webpage that links to the real site, I'd need a way to filter dead links.

So, is it intended that status, submitted and published are always truthy?
Is there a way to filter out deleted fics that I missed, and is it normal that some of the non-deleted fics' tags don't match what's on the site?

All of the metadata (except for archive) comes directly from Fimfiction. So, when something goes missing it's essentially frozen in time. It's not really by design, but I just haven't done all that much when it comes to cleaning things up.

There is a way for you to filter out stories that are no longer available though! You can do this by looking into the (somewhat poorly named) timestamps in archive. Note that some of these are null since I haven't retroactively added things such as creation dates.

  • date_checked: When did we last try to update the story?
  • date_created: When was the story added to the archive?
  • date_fetched: When was the last update of the metadata?
  • date_updated: When was the last update of the content?

The other timestamps will exactly match date_checked if a change happened for that version of the archive. So, to check if a story was publicly available on Fimfiction the last time it was checked, you could compare date_checked to date_fetched: if the two are equal, the metadata was refreshed during that check.

>>> from fimfarchive.fetchers import FimfarchiveFetcher
>>> 
>>> def was_available(story):
...     # Equal timestamps mean the metadata was refreshed on the last check.
...     archive = story.meta['archive']
...     date_checked = archive['date_checked']
...     date_fetched = archive['date_fetched']
... 
...     return date_checked == date_fetched
... 
>>> 
>>> fetcher = FimfarchiveFetcher('fimfarchive.zip')
>>> sum(1 for story in fetcher if was_available(story))
131322
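
The same comparison can also be sketched on plain metadata dictionaries, without the fetcher. This is only a minimal sketch: the key names come from the timestamps listed above, but the sample entries and their timestamp strings are made up, and reporting null timestamps as "unknown" is an assumption rather than documented archive behaviour.

```python
from typing import Any, Mapping, Optional


def was_available(meta: Mapping[str, Any]) -> Optional[bool]:
    """Compare date_checked to date_fetched; None means we cannot tell."""
    archive = meta.get("archive") or {}
    checked = archive.get("date_checked")
    fetched = archive.get("date_fetched")

    # Assumption: a null timestamp gives us nothing to compare.
    if checked is None or fetched is None:
        return None

    return checked == fetched


# Made-up sample entries for illustration; not real archive data.
sample = {
    "1": {"archive": {"date_checked": "2020-01-01T00:00:00+00:00",
                      "date_fetched": "2020-01-01T00:00:00+00:00"}},
    "2": {"archive": {"date_checked": "2020-01-01T00:00:00+00:00",
                      "date_fetched": "2018-06-01T00:00:00+00:00"}},
    "3": {"archive": {"date_checked": None, "date_fetched": None}},
}

available = [key for key, meta in sample.items() if was_available(meta)]
unknown = [key for key, meta in sample.items() if was_available(meta) is None]
```

Keeping "unknown" separate from "not available" avoids counting entries with two null timestamps as available, which a bare equality check would do.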

Hope that helps!

Also, I'll be looking into story 31718.

It seems you might have found a rather significant bug, so thank you!

Thanks, date_fetched and date_checked sound like exactly what I need!
I am curious about the case of 31718, but ultimately if it's limited to the series tag that's something I can deal with by just assuming any fic without one is automatically 'MLP-FiM' (I assume that's what Fimfic did retroactively).
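
A rough sketch of that fallback, in case it is useful to others. It assumes each story's metadata has a 'tags' list of dictionaries with 'name' and 'type' keys, and that series tags use the type 'series'; that shape is an assumption here and should be checked against the actual index.json. The helper name series_names and the default label are hypothetical.

```python
from typing import Any, List, Mapping

# Hypothetical default label, taken from the wording used in this thread.
DEFAULT_SERIES = "MLP-FiM"


def series_names(meta: Mapping[str, Any]) -> List[str]:
    """Return the story's series tag names, or the default if it has none."""
    tags = meta.get("tags") or []
    # Assumed tag shape: {"name": ..., "type": ...}.
    names = [tag["name"] for tag in tags if tag.get("type") == "series"]
    return names or [DEFAULT_SERIES]


# Made-up metadata for illustration; not real archive entries.
tagged = {"tags": [{"name": "Equestria Girls", "type": "series"}]}
untagged = {"tags": [{"name": "Romance", "type": "genre"}]}
```

With these samples, series_names(tagged) keeps the existing series tag, while series_names(untagged) falls back to the default.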

@htnyquist

I kept track of 31718 during the last archive update, and it seems to have updated fine without me doing anything. I believe what happened was that the story had been unpublished for a while. It probably got published again some time after the previous release.

There seem to be comments hinting at that as well!