kurtmckee/feedparser

What is your recommended way to convert feedparser's date representation to a datetime object?

slidenerd opened this issue · 1 comment

I think this question belongs here rather than on Stack Overflow because, as the library author, you would be able to answer it best.

Issues I referenced before asking
#212
#51

Problem

  • feedparser returns a string representation of the published date under published and a struct_time representation of the same date under published_parsed
  • I am not able to store either of these directly in Postgres because asyncpg needs a datetime object

How to reproduce this problem


def md5(text):
    import hashlib
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def fetch():
    import feedparser
    data = feedparser.parse('https://cointelegraph.com/rss')
    return data

async def insert(rows):
    import asyncpg
    async with asyncpg.create_pool(user='postgres', database='postgres') as pool:
        async with pool.acquire() as conn:
            results = await conn.executemany('INSERT INTO test (feed_item_id, pubdate) VALUES($1, $2)', rows)
            print(results)

async def main():
    data = fetch()
    first_entry = data.entries[0]
    await insert([(md5(first_entry.guid), first_entry.published)])
    await insert([(md5(first_entry.guid), first_entry.published_parsed)])

import asyncio
asyncio.run(main())

Both insert calls above fail.

What have I found so far?

I found three methods, but each seems to have a limitation.

Method 1

Convert it with strptime

import feedparser
data = feedparser.parse('https://cointelegraph.com/rss')
pubdate = data.entries[0].published
pubdate_parsed = data.entries[0].published_parsed


>>> pubdate
'Thu, 04 Aug 2022 06:53:42 +0100'

I could do this:


>>> from datetime import datetime
>>> method1 = datetime.strptime(pubdate, '%a, %d %b %Y %H:%M:%S %z')
>>> method1
datetime.datetime(2022, 8, 4, 6, 53, 42, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))

I am guessing this would raise an error if a feed returns a date in a different format, and I am not sure whether it works when a leap second is added.
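For dates in the RFC 822 style shown above, the standard library's email.utils.parsedate_to_datetime is a possible alternative to a hand-written strptime format. The following is a minimal sketch, not a recommendation from the feedparser author; the helper name parse_rfc822 is hypothetical, and it assumes that returning None for an unparseable string is acceptable.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_rfc822(text):
    """Return a datetime for an RFC 822-style date string, or None if parsing fails.

    parse_rfc822 is a hypothetical helper name, not part of feedparser.
    """
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for unparseable input,
        # newer ones raise ValueError.
        return None


parsed = parse_rfc822('Thu, 04 Aug 2022 06:53:42 +0100')
print(parsed)
print(parse_rfc822('not a date'))
```

This only covers RFC 822-style strings. Feeds that use other formats, such as the ISO 8601 dates common in Atom feeds, would still need separate handling.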

Method 2


>>> from time import mktime
>>> datetime.fromtimestamp(mktime(pubdate_parsed))
datetime.datetime(2022, 8, 4, 5, 53, 42)

This seems to lose the timezone information completely, or am I wrong about that? What happens here when DST is in effect?
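One caveat with Method 2: time.mktime interprets its argument as local time, so the result depends on the timezone and DST rules of the machine running the code. feedparser's documentation describes the parsed tuple as UTC, in which case calendar.timegm is the matching inverse. A minimal sketch, using a hand-built struct_time (an assumption standing in for published_parsed) so that it runs without a network call:

```python
import calendar
import time
from datetime import datetime, timezone

# Stand-in for entry.published_parsed: 2022-08-04 05:53:42 UTC (a Thursday, day 216).
utc_tuple = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))

# timegm treats the tuple as UTC, independent of the local timezone or DST.
seconds = calendar.timegm(utc_tuple)
aware = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(aware)  # 2022-08-04 05:53:42+00:00
```

The original offset of the feed (+0100 in the example above) is not stored in the tuple, so the result is always expressed in UTC.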

Method 3
This requires a third-party library called dateutil, as shown in the answer below:
https://stackoverflow.com/a/18726020/5371505

Question

  • What is the most robust way to convert the published or published_parsed output that feedparser generates into a datetime object?
  • Can it be done without a third-party library such as dateutil?
  • Is there any native undocumented approach to get a datetime object directly from feedparser that I am not aware of?

Thank you for your time

I'm not the developer, but they do document it here: https://feedparser.readthedocs.io/en/latest/date-parsing.html#advanced-date

Different feed types and versions use wildly different date formats. Universal Feed Parser will attempt to auto-detect the date format used in any date element, and parse it into a standard Python 9-tuple in UTC

So I believe that to create a timezone-aware datetime object, you would do something like:

from calendar import timegm
from datetime import datetime, timezone
# The parsed tuple is in UTC, so use timegm (mktime would treat it as local time)
datetime.fromtimestamp(timegm(pubdate_parsed), timezone.utc)
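For rows headed to a database, the conversion could be wrapped in a small helper that builds the datetime directly from the first six fields of the tuple. This is a sketch under stated assumptions, not feedparser's own API: the name entry_datetime is hypothetical, it assumes the entry supports dict-style .get, it falls back to updated_parsed when published_parsed is missing, and it clamps a leap second (60) to 59 because datetime does not accept 60.

```python
import time
from datetime import datetime, timezone


def entry_datetime(entry, keys=("published_parsed", "updated_parsed")):
    """Return a UTC-aware datetime for a feed entry, or None if no date was parsed.

    entry_datetime is a hypothetical helper, not part of feedparser.
    """
    for key in keys:
        parsed = entry.get(key)
        if parsed is not None:
            year, month, day, hour, minute, second = parsed[:6]
            # datetime only accepts seconds 0..59, so clamp a leap second.
            return datetime(year, month, day, hour, minute, min(second, 59),
                            tzinfo=timezone.utc)
    return None


# A plain dict stands in for a feedparser entry here.
sample = {"published_parsed": time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))}
print(entry_datetime(sample))
print(entry_datetime({}))
```

As far as I understand, asyncpg expects an aware datetime for a timestamptz column and a naive one for a plain timestamp column, so for the latter the result may need .replace(tzinfo=None) before inserting.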