miguelgrinberg/microblog

Exporting a huge number of database records

Tes3awy opened this issue · 3 comments

Hi @miguelgrinberg,

I am trying to export a huge dataset of around 800,000 database records that will be extended to 1 million records any time soon.

microblog/app/tasks.py

Lines 28 to 56 in b0d5e84

def export_posts(user_id):
try:
user = db.session.get(User, user_id)
_set_task_progress(0)
data = []
i = 0
total_posts = db.session.scalar(sa.select(sa.func.count()).select_from(
user.posts.select().subquery()))
for post in db.session.scalars(user.posts.select().order_by(
Post.timestamp.asc())):
data.append({'body': post.body,
'timestamp': post.timestamp.isoformat() + 'Z'})
time.sleep(5)
i += 1
_set_task_progress(100 * i // total_posts)
send_email(
'[Microblog] Your blog posts',
sender=app.config['ADMINS'][0], recipients=[user.email],
text_body=render_template('email/export_posts.txt', user=user),
html_body=render_template('email/export_posts.html', user=user),
attachments=[('posts.json', 'application/json',
json.dumps({'posts': data}, indent=4))],
sync=True)
except Exception:
_set_task_progress(100)
app.logger.error('Unhandled exception', exc_info=sys.exc_info())
finally:
_set_task_progress(100)

I have the same function, export_posts. But, as you have guessed already, it is taking way to much time to export the database records.

Do you have any suggestions for speeding up the export process?

Also, how can I make the export downloadable from the browser directly instead of being sent via email? I am exporting the data to a CSV file instead of a JSON file.

For such a large export anything you do in Python is going to be painfully slow. The most performant option is to use an export tool offered by your database.

You can write the exported file to a designated directory, maybe using the user ID as filename. Then you can add a download route under authentication that uses send_file to return this file as a download.

Can you help me with an example of your suggestion?

I don't have anything to share. For serving the csv file you can see the send_file() docs from Flask. For generating a CSV efficiently you have to look at what options your database provides. Postgres has a pg_dump command, and also a COPY statement in their SQL implementation. There is a mysqldump as well for MySQL.