What is your recommended way to convert feedparser's date representation to a datetime object?
slidenerd opened this issue · 1 comments
I think this question belongs here rather than on Stack Overflow because, as the library author, you would be able to answer it best.
Issues I referenced before asking
#212
#51
Problem
- feedparser returns a string representation of the published date under published and a struct_time representation of the same date under published_parsed
- I cannot store either of these directly in Postgres because asyncpg requires a datetime object for the column
How to reproduce this problem
def md5(text):
    import hashlib
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def fetch():
    import feedparser
    data = feedparser.parse('https://cointelegraph.com/rss')
    return data

async def insert(rows):
    import asyncpg
    async with asyncpg.create_pool(user='postgres', database='postgres') as pool:
        async with pool.acquire() as conn:
            results = await conn.executemany('INSERT INTO test (feed_item_id, pubdate) VALUES($1, $2)', rows)
            print(results)

async def main():
    data = fetch()
    first_entry = data.entries[0]
    await insert([(md5(first_entry.guid), first_entry.published)])
    await insert([(md5(first_entry.guid), first_entry.published_parsed)])

import asyncio
asyncio.run(main())
Both insert statements above fail.
What have I found so far?
I found three methods, but each seems to have a limitation.
Method 1
Convert it with strptime
import feedparser
data = feedparser.parse('https://cointelegraph.com/rss')
pubdate = data.entries[0].published
pubdate_parsed = data.entries[0].published_parsed
>>> pubdate
'Thu, 04 Aug 2022 06:53:42 +0100'
I could do this:
>>> from datetime import datetime
>>> method1 = datetime.strptime(pubdate, '%a, %d %b %Y %H:%M:%S %z')
>>> method1
datetime.datetime(2022, 8, 4, 6, 53, 42, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))
I am guessing this would raise an error if some feed returns a date in a different format. I am also not sure whether this works when an extra leap second is added.
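One standard-library alternative for Method 1, sketched here on the assumption that the feed emits RFC 822-style dates like the one above, is `email.utils.parsedate_to_datetime`. It is less strict than a fixed `strptime` pattern and keeps the UTC offset. The helper name `parse_rfc822` is hypothetical, and returning `None` on failure is only an illustrative choice.

```python
from email.utils import parsedate_to_datetime


def parse_rfc822(text):
    """Return a datetime for an RFC 822-style date string, or None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Depending on the Python version, an unparseable string raises
        # either TypeError or ValueError, so both are caught here.
        return None


parsed = parse_rfc822('Thu, 04 Aug 2022 06:53:42 +0100')
print(parsed.isoformat())  # 2022-08-04T06:53:42+01:00
```

This only covers RFC 822-style strings. Feeds that use other formats (for example ISO 8601 dates in Atom) would still need `published_parsed` or another parser.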
Method 2
>>> from time import mktime
>>> from datetime import datetime
>>> datetime.fromtimestamp(mktime(pubdate_parsed))
datetime.datetime(2022, 8, 4, 5, 53, 42)
This seems to lose the timezone information completely, or am I wrong about it? What happens here if DST is in effect?
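On the Method 2 question: as far as I can tell, `mktime` interprets the tuple as local time, while feedparser documents `published_parsed` as UTC, so the round trip through `mktime` is where the timezone and DST confusion comes from. A minimal sketch that avoids local time entirely is below. The hand-built `struct_time` is a stand-in for a real feed entry, used only for illustration.

```python
import time
from datetime import datetime, timezone

# Stand-in for data.entries[0].published_parsed (feedparser documents it as UTC).
pubdate_parsed = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))

# Take the first six fields (year through second) and label the result as UTC.
method2_utc = datetime(*pubdate_parsed[:6], tzinfo=timezone.utc)
print(method2_utc.isoformat())  # 2022-08-04T05:53:42+00:00
```

Because no local-time conversion happens, the machine's DST setting has no effect. Note that the `datetime` constructor only accepts seconds from 0 to 59, so a tuple carrying a leap second (60) would raise `ValueError`.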
Method 3
Requires a third-party library called dateutil, as shown here:
https://stackoverflow.com/a/18726020/5371505
Question
- What is the most robust way to convert the published or published_parsed output that feedparser generates into a datetime object?
- Can it be done without a third-party library such as dateutil?
- Is there any native, undocumented approach to get a datetime object directly from feedparser that I am not aware of?
Thank you for your time
I'm not the developer, but they do document it here: https://feedparser.readthedocs.io/en/latest/date-parsing.html#advanced-date
Different feed types and versions use wildly different date formats. Universal Feed Parser will attempt to auto-detect the date format used in any date element, and parse it into a standard Python 9-tuple in UTC
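So I believe that to create a timezone-aware datetime object, you would do something like this (using calendar.timegm rather than time.mktime, because mktime would interpret the UTC tuple as local time):
from calendar import timegm
from datetime import datetime, timezone
datetime.fromtimestamp(timegm(pubdate_parsed), timezone.utc)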
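Tying this back to the original asyncpg insert, here is a rough sketch of building the rows with UTC-aware datetimes. `to_utc_datetime`, `build_rows`, and the plain-dict sample entry are hypothetical names for illustration. It assumes the `pubdate` column is `timestamptz` (as far as I know, asyncpg expects a naive datetime for a plain `timestamp` column) and that entries support `.get()` the way dicts do.

```python
import hashlib
from calendar import timegm
from datetime import datetime, timezone


def to_utc_datetime(parsed):
    """Convert a UTC 9-tuple (such as published_parsed) to an aware datetime."""
    # timegm treats the tuple as UTC, unlike time.mktime, which assumes local time.
    return datetime.fromtimestamp(timegm(parsed), timezone.utc)


def build_rows(entries):
    """Yield (feed_item_id, pubdate) tuples; pubdate is None when no date was parsed."""
    for entry in entries:
        parsed = entry.get('published_parsed')
        pubdate = to_utc_datetime(parsed) if parsed else None
        item_id = hashlib.md5(entry['guid'].encode('utf-8')).hexdigest()
        yield item_id, pubdate


# Hypothetical entry standing in for data.entries[0].
sample = [{'guid': 'example-guid',
           'published_parsed': (2022, 8, 4, 5, 53, 42, 3, 216, 0)}]
rows = list(build_rows(sample))
print(rows[0][1].isoformat())  # 2022-08-04T05:53:42+00:00
```

The resulting `rows` list has the same shape as the argument passed to `conn.executemany(...)` in the original snippet.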