Implement background jobs for updating database
Closed this issue · 4 comments
Right now I think the request has to do too many things, so Heroku times out. Maybe use redirects to different pages to stop it from timing out.
So Heroku's timeout limit is 30 seconds. This can be shortened but not increased.
Possible solutions:
1. Pagination / Ajax
So that the data is loaded in pages. (This is more of a JavaScript solution. Not sure you can really do this in Python - the closest thing that comes to mind is the streaming_template module in Flask.)
2. Yield + Response
Maybe by utilising yield, Heroku will consider the request completed even though the data is being gradually fed through.
3. After request function
4. Background jobs
- Read this article on background jobs (seems highly relevant).
- This guide is specifically for Python implementation using RQ.
- Perhaps this is relevant for the zip file download as well. Might be worth giving some thought to what requests/tasks there are in my program that take a significant amount of time to complete, and requests/tasks that will take longer as I scale the program.
5. Websockets (not sure if this is the right solution)
These articles may be relevant:
Alternatively...
Use a totally different approach for this audio player - i.e. use a YouTube embed, add custom controls, and use display: none on the element.
To do
- Fix this issue when fetching beat details
- Is this going to be an issue when downloading stems too? I don't think I've tested that
It seems the yield + response approach might work according to this article on Heroku's website, which says:
An application has an initial 30 second window to respond with a single byte back to the client. However, each byte transmitted thereafter (either received from the client or sent by your application) resets a rolling 55 second window. If no data is sent during the 55 second window, the connection will be terminated.
If you’re sending a streaming response, such as with server-sent events, you’ll need to detect when the client has hung up, and make sure your app server closes the connection promptly. If the server keeps the connection open for 55 seconds without sending any data, you’ll see a request timeout.
Just ensure that the yield/response method closes the connection once all the data has been sent.
Look into using yield to stream the response back to the client: https://flask.palletsprojects.com/en/2.1.x/patterns/streaming/#basic-usage
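For reference, a minimal sketch of that streaming pattern from the Flask docs could look like the following; generate_rows is a placeholder name for whatever slow per-item work the view actually does:

from flask import Flask, Response, stream_with_context

app = Flask(__name__)

@app.route('/videos')
def stream_videos():
    def generate():
        # Each chunk yielded here is sent to the client as it is produced,
        # which should keep resetting Heroku's rolling 55 second window.
        for row in generate_rows():  # placeholder for the slow per-item work
            yield row + '\n'
    # stream_with_context keeps the request context available while streaming.
    return Response(stream_with_context(generate()), mimetype='text/plain')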
EDIT:
It's possible I didn't implement it correctly, but it didn't seem to work.
Okay, I have tested this out and the background jobs approach works. I tested this with:
- RQ
- Heroku Redis
- Heroku PostgreSQL
I need to split add_uploads_to_database down into small chunks that we can queue individually.
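For reference, the enqueueing side with RQ looks roughly like this. This is only a sketch: it uses the stage functions from the ORIGINAL block further down, REDIS_URL comes from the Heroku Redis add-on, the worker process is started separately with `rq worker`, and start_database_update is a hypothetical name.

import os
from redis import Redis
from rq import Queue

redis_conn = Redis.from_url(os.environ['REDIS_URL'])
q = Queue(connection=redis_conn)

def start_database_update():
    # Quick work can stay in the request; everything slow goes to the worker dyno,
    # chained with depends_on so the stages still run in order.
    clear_database()
    job = q.enqueue(fetch_videos)
    job = q.enqueue(fetch_audio_urls, depends_on=job)
    q.enqueue(fetch_urls_for_files, depends_on=job)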
Ideally we'll build our program up in stages, e.g. a separate function for each of the following:
1. Clear database (no need to queue)
clear_database()
2. Fetch video IDs
Look into how long this process takes, but I can't see a way to easily break this up into queueable functions. I don't think it's a big issue though; if I recall correctly, the API is reasonably quick. Incorporate some retries in case it doesn't work for whatever reason (see the retry sketch after the code below).
def get_videos():
    video_id_list = fetch_upload_ids()
    keep_looping = True
    while keep_looping:
        # The videos().list endpoint accepts at most 50 IDs per call, so work in batches.
        if len(video_id_list) < 50:
            request = youtube.videos().list(
                part="snippet,contentDetails",
                id=video_id_list[0:len(video_id_list)]
            )
            keep_looping = False
        else:
            request = youtube.videos().list(
                part="snippet,contentDetails",
                id=video_id_list[0:50]
            )
            video_id_list = video_id_list[50:]
        try:
            response = request.execute()
        except HttpError as e:  # from googleapiclient.errors
            print('Error response status code : {0}, reason : {1}'.format(e.status_code, e.error_details))
            continue  # skip this batch rather than falling through with no response
        # This takes those details and adds them to our database.
        for video in response['items']:
            video_to_add = Videos(
                video_id=video['id'],
                video_title=video['snippet']['title'],
                video_publishedAt=video['snippet']['publishedAt'],
                video_thumbnail=video['snippet']['thumbnails']['medium']['url'],
                video_description=video['snippet']['description'],
                video_beat_name=process_description(video['snippet']['description'], 'Beat name'),
                video_tags=process_description(video['snippet']['description'], 'Tags')
            )
            db.session.add(video_to_add)
    db.session.commit()
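The retries mentioned above aren't in the code yet; a minimal sketch could wrap request.execute() like this (the attempt count and delay are arbitrary, and execute_with_retries is a hypothetical helper name):

import time
from googleapiclient.errors import HttpError

def execute_with_retries(request, max_attempts=3, delay=5):
    # Retry the API call a few times before giving up on the batch.
    for attempt in range(1, max_attempts + 1):
        try:
            return request.execute()
        except HttpError as e:
            print(f'Attempt {attempt} failed: {e}')
            if attempt == max_attempts:
                raise
            time.sleep(delay)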
3. Fetch audio URLs
def get_audio_url(video_id):
    print(f'\nFetching audio for {video_id}...')
    verified = False
    # Pafy sometimes produces dead links; this loop keeps regenerating until a valid link is returned.
    while not verified:
        video_object = pafy_modified.new(video_id)
        audio_url = video_object.getbestaudio().url_https
        response = requests.head(audio_url)
        if response.status_code == 200:
            print(f'Status code: {response.status_code} (Link works)')
            verified = True
        elif response.status_code == 403:
            print(f'Status code: {response.status_code} (Dead link, generate a new one.)')
        else:
            print(f'Status code: {response.status_code} (Other HTTP error.)')
    return audio_url
def fetch_audio_urls():
    for video in Videos.query.all():
        video.audio_url = get_audio_url(video.video_id)
    try:
        db.session.commit()
    except Exception:
        db.session.rollback()
4. Fetch URLs for the mixdown and stems files.
beat_folder_id = return_directory(start_folder_id)
for i in beat_folder_id.keys():
    # Look the beat folder up once per beat instead of on every access.
    folder_contents = return_directory(beat_folder_id[i])
    if 'Mixdown' in folder_contents:
        try:
            video = Videos.query.filter_by(video_beat_name=i).first()
            video.beat_mixdowns = folder_contents['Mixdown']
            db.session.commit()
        except Exception:
            db.session.rollback()
            print('Error 1: Video not in database. Or other error.')
    if 'Stems' in folder_contents:
        try:
            video = Videos.query.filter_by(video_beat_name=i).first()
            video.beat_stems = folder_contents['Stems']
            db.session.commit()
        except Exception:
            db.session.rollback()
            print('Error 2: Video not in database. Or other error.')
ORIGINAL:
def add_uploads_to_database():
    clear_database()
    fetch_videos()
    fetch_audio_urls()
    fetch_urls_for_files()
Ideally, each job will be broken down to the iteration level, wherever possible.
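For example, instead of fetch_audio_urls looping over every video inside one job, the loop could just enqueue one job per video. This is only a sketch: it assumes the same Queue object as in the RQ snippet above, that the worker pushes a Flask app context before touching db, and enqueue_audio_url_jobs / set_audio_url are hypothetical names.

def enqueue_audio_url_jobs():
    # One short job per video: each is independently retryable and well under the timeout.
    for video in Videos.query.all():
        q.enqueue(set_audio_url, video.video_id)

def set_audio_url(video_id):
    # Runs on the worker: fetch and commit a single audio URL.
    video = Videos.query.filter_by(video_id=video_id).first()
    video.audio_url = get_audio_url(video_id)
    db.session.commit()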