Sentiment calculation takes too long.
uclatommy commented
The bottleneck is the call to pd.DataFrame.from_records, which accounts for 91.2% of the run time. Here's a line profile run:
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     5                                             def test():
     6         1            8        8.0      0.0      binsize=timedelta(seconds=60)
     7         1       679937   679937.0      1.3      start = tesla_feels._feels.start
     8         1        36364    36364.0      0.1      end = tesla_feels._feels.end
     9         1            7        7.0      0.0      second = timedelta(seconds=1)
    10         1       258377   258377.0      0.5      df = tesla_feels._feels.tweet_dates
    11         1        16114    16114.0      0.0      df = df.groupby(pd.TimeGrouper(freq=f'{int(binsize/second)}S')).size()
    12         1         3457     3457.0      0.0      df = df[df != 0]
    13         1          116      116.0      0.0      conn = sqlite3.connect(tesla_feels._feels._db, detect_types=sqlite3.PARSE_DECLTYPES)
    14         1            4        4.0      0.0      c = conn.cursor()
    15         1            1        1.0      0.0      c.execute(
    16         1            1        1.0      0.0          "SELECT * FROM tweets WHERE created_at >= ? AND created_at <= ?",
    17         1          228      228.0      0.0          (start, end)
    18                                                 )
    19         1          121      121.0      0.0      print(f'fetchbin from {start} to {end}')
    20     13085        20477        1.6      0.0      for i in range(len(df)):
    21     13084      3537510      270.4      6.8          d = c.fetchmany(df.iloc[i])
    22     13084        30744        2.3      0.1          cols = tesla_feels._feels.fields
    23     13084     47730430     3648.0     91.2          yld = pd.DataFrame.from_records(data=d, columns=cols)
    24     13084        23556        1.8      0.0          yield yld
    25         1            3        3.0      0.0      c.close()
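For anyone who wants to play with the access pattern in that profile (one SELECT, then one fetchmany per non-empty bin) without the real database, here is a minimal standard-library sketch. The table layout, the sample rows, the ISO-string timestamps, and the name iter_bins are all assumptions made for illustration, not the project's code. It yields plain row lists instead of DataFrames, and it adds an ORDER BY so the per-bin counts line up with the fetch order.

```python
import sqlite3
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical stand-in for the tweets table; timestamps are stored as ISO strings.
START = datetime(2017, 1, 1)
BINSIZE = timedelta(seconds=60)
OFFSETS = [5, 30, 59, 185, 190]  # seconds after START; bins 1 and 2 stay empty

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (created_at TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [((START + timedelta(seconds=s)).isoformat(" "), f"tweet {s}") for s in OFFSETS],
)

WHERE = "WHERE created_at >= ? AND created_at <= ?"


def iter_bins(conn, start, end, binsize):
    """Yield (bin_start, rows) for each non-empty bin from one ordered query."""
    bounds = (start.isoformat(" "), end.isoformat(" "))
    cur = conn.cursor()
    # Count rows per bin (the profiled code gets these counts from a pandas groupby).
    counts = Counter()
    for (ts,) in cur.execute(f"SELECT created_at FROM tweets {WHERE}", bounds):
        index = (datetime.fromisoformat(ts) - start) // binsize
        counts[start + index * binsize] += 1
    # Run the data query once, then pull exactly one bin's worth of rows per step.
    cur.execute(
        f"SELECT created_at, text FROM tweets {WHERE} ORDER BY created_at", bounds
    )
    for bin_start in sorted(counts):
        yield bin_start, cur.fetchmany(counts[bin_start])
    cur.close()


bins = list(iter_bins(conn, START, START + timedelta(minutes=10), BINSIZE))
for bin_start, rows in bins:
    print(bin_start, len(rows))
```

Since the row lists come back cheaply, converting a bin to a DataFrame could be deferred to the places that actually need one. Whether that helps in the real code is untested here.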
uclatommy commented
Note: using a generator over a single SQL query is at least 10x faster than running one query per bin (79.5 s vs. 1009.1 s total in the two profiles below).
Here is a profile for using the generator:
Timer unit: 1e-06 s
Total time: 79.4734 s
File: <ipython-input-3-2c6bc1357af2>
Function: sentiment at line 27
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
    27                                             def sentiment():
    28         1       146423   146423.0      0.2      end = tesla_feels._feels.end
    29         1            7        7.0      0.0      sentiments = tesla_feels.sentiments(delta_time=tesla_feels._bin_size)
    30     13085     79301469     6060.5     99.8      for s in sentiments:
    31     13084        25507        1.9      0.0          tesla_feels._sentiment = s
    32         1            2        2.0      0.0      tesla_feels._latest_calc = end
Here it is using tweets_between iteratively:
Timer unit: 1e-06 s
Total time: 1009.1 s
File: <ipython-input-3-7b4a334764ce>
Function: sentiment2 at line 34
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
    34                                             def sentiment2():
    35         1       102277   102277.0      0.0      end = tesla_feels._feels.end
    36         1        40708    40708.0      0.0      start = tesla_feels._feels.start
    37         1            2        2.0      0.0      cur = start
    38     15909        22207        1.4      0.0      while cur<end:
    39     15908    971749792    61085.6     96.3          tweets = tesla_feels._feels.tweets_between(cur, cur+tesla_feels._bin_size)
    40     15908     37122133     2333.6      3.7          tesla_feels._sentiment = tesla_feels.model_sentiment(tweets, tesla_feels._sentiment)
    41     15908        65102        4.1      0.0          cur = cur + tesla_feels._bin_size
    42         1            1        1.0      0.0      tesla_feels._latest_calc = end
The generator seems pretty well optimized in comparison.
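To make the two strategies concrete, here is a small standard-library sketch that runs both against an in-memory table and checks that they return the same bins. Everything in it is an assumption for illustration: the table layout, the sample rows, the half-open bin intervals, and the function names. per_bin_queries only mimics the shape of the tweets_between loop and is not the library's implementation, and single_query groups rows client-side with itertools.groupby rather than fetchmany.

```python
import sqlite3
from datetime import datetime, timedelta
from itertools import groupby

# Hypothetical sample data; timestamps are stored as ISO strings.
START = datetime(2017, 1, 1)
END = START + timedelta(minutes=10)
BINSIZE = timedelta(seconds=60)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (created_at TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [
        ((START + timedelta(seconds=s)).isoformat(" "), f"tweet {s}")
        for s in (5, 30, 59, 185, 190)
    ],
)

QUERY = (
    "SELECT created_at, text FROM tweets "
    "WHERE created_at >= ? AND created_at < ? ORDER BY created_at"
)


def per_bin_queries(conn, start, end, binsize):
    """Issue one SELECT per bin, including bins that turn out to be empty."""
    out = []
    cur = start
    while cur < end:
        nxt = cur + binsize
        rows = conn.execute(QUERY, (cur.isoformat(" "), nxt.isoformat(" "))).fetchall()
        if rows:
            out.append((cur, rows))
        cur = nxt
    return out


def single_query(conn, start, end, binsize):
    """Issue one SELECT for the whole range and split it into bins in Python."""

    def bin_of(row):
        index = (datetime.fromisoformat(row[0]) - start) // binsize
        return start + index * binsize

    rows = conn.execute(QUERY, (start.isoformat(" "), end.isoformat(" ")))
    return [(b, list(group)) for b, group in groupby(rows, key=bin_of)]


slow = per_bin_queries(conn, START, END, BINSIZE)
fast = single_query(conn, START, END, BINSIZE)
print(slow == fast, len(fast))
```

In the profiles above, the per-bin loop ran 15908 times against 13084 iterations for the generator, presumably because it also visits empty bins. If created_at is not indexed, each per-bin query may also scan the whole table, which would be consistent with the gap, but nothing in the profiles confirms that.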