uclatommy/tweetfeels

Sentiment calculation takes too long.

Issue closed · 1 comment

The bottleneck is the call to pd.DataFrame.from_records, which runs once per bin. Here's a line-profiler run:


Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                           def test():
     6         1            8      8.0      0.0      binsize=timedelta(seconds=60)
     7         1       679937 679937.0      1.3      start = tesla_feels._feels.start
     8         1        36364  36364.0      0.1      end = tesla_feels._feels.end
     9         1            7      7.0      0.0      second = timedelta(seconds=1)
    10         1       258377 258377.0      0.5      df = tesla_feels._feels.tweet_dates
    11         1        16114  16114.0      0.0      df = df.groupby(pd.TimeGrouper(freq=f'{int(binsize/second)}S')).size()
    12         1         3457   3457.0      0.0      df = df[df != 0]
    13         1          116    116.0      0.0      conn = sqlite3.connect(tesla_feels._feels._db, detect_types=sqlite3.PARSE_DECLTYPES)
    14         1            4      4.0      0.0      c = conn.cursor()
    15         1            1      1.0      0.0      c.execute(
    16         1            1      1.0      0.0          "SELECT * FROM tweets WHERE created_at >= ? AND created_at <= ?",
    17         1          228    228.0      0.0          (start, end)
    18                                                   )
    19         1          121    121.0      0.0      print(f'fetchbin from {start} to {end}')
    20     13085        20477      1.6      0.0      for i in range(len(df)):
    21     13084      3537510    270.4      6.8          d = c.fetchmany(df.iloc[i])
    22     13084        30744      2.3      0.1          cols = tesla_feels._feels.fields
    23     13084     47730430   3648.0     91.2          yld = pd.DataFrame.from_records(data=d, columns=cols)
    24     13084        23556      1.8      0.0          yield yld
    25         1            3      3.0      0.0      c.close()
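
One possible way to avoid building a DataFrame for every bin is to build it once and let pandas split it. The following is only a sketch, not the code used in tweetfeels: the function name `binned_frames` is hypothetical, it assumes a `tweets` table whose `created_at` column holds ISO-formatted text, and it uses `pd.Grouper` in place of `pd.TimeGrouper`, which was removed in later pandas releases.

```python
import sqlite3
from datetime import timedelta

import pandas as pd


def binned_frames(conn, binsize=timedelta(seconds=60)):
    """Yield one DataFrame per non-empty time bin (hypothetical helper)."""
    # Build the DataFrame a single time instead of once per bin.
    df = pd.read_sql_query(
        "SELECT * FROM tweets ORDER BY created_at",
        conn,
        parse_dates=["created_at"],  # assumes created_at is stored as text
    )
    # Group rows into fixed-width time bins on the created_at column.
    for _, chunk in df.groupby(pd.Grouper(key="created_at", freq=binsize)):
        if not chunk.empty:  # skip bins that contain no tweets
            yield chunk
```

This trades memory for speed, since the whole query result is held in one DataFrame; whether that is acceptable depends on the size of the database.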

Note: running a generator over a single SQL query, instead of issuing one query per bin, is faster by a factor of at least 10 (79.5 s versus 1009.1 s total in the two profiles below).

Here is a profile for using the generator:

Timer unit: 1e-06 s

Total time: 79.4734 s
File: <ipython-input-3-2c6bc1357af2>
Function: sentiment at line 27

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    27                                           def sentiment():
    28         1       146423 146423.0      0.2      end = tesla_feels._feels.end
    29         1            7      7.0      0.0      sentiments = tesla_feels.sentiments(delta_time=tesla_feels._bin_size)
    30     13085     79301469   6060.5     99.8      for s in sentiments:
    31     13084        25507      1.9      0.0          tesla_feels._sentiment = s
    32         1            2      2.0      0.0      tesla_feels._latest_calc = end

Here is the profile when calling tweets_between once per bin:

Timer unit: 1e-06 s

Total time: 1009.1 s
File: <ipython-input-3-7b4a334764ce>
Function: sentiment2 at line 34

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    34                                           def sentiment2():
    35         1       102277 102277.0      0.0      end = tesla_feels._feels.end
    36         1        40708  40708.0      0.0      start = tesla_feels._feels.start
    37         1            2      2.0      0.0      cur = start
    38     15909        22207      1.4      0.0      while cur<end:
    39     15908    971749792  61085.6     96.3          tweets = tesla_feels._feels.tweets_between(cur, cur+tesla_feels._bin_size)
    40     15908     37122133   2333.6      3.7          tesla_feels._sentiment = tesla_feels.model_sentiment(tweets, tesla_feels._sentiment)
    41     15908        65102      4.1      0.0          cur = cur + tesla_feels._bin_size
    42         1            1      1.0      0.0      tesla_feels._latest_calc = end

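Nearly all of the time in the iterative version (96.3%) is spent inside tweets_between, so the generator seems pretty well optimized in comparison.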
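
To make the two strategies concrete, here is a minimal sketch using only `sqlite3`. It is an illustration under stated assumptions, not the tweetfeels implementation: the function names, the `text` column, and the storage of `created_at` as ISO-formatted text are all hypothetical. Both generators yield the same non-empty bins; the first runs a single query and splits rows as they stream from the cursor, the second issues a new query for every bin.

```python
import sqlite3
from datetime import datetime, timedelta

SQL = (
    "SELECT created_at, text FROM tweets "
    "WHERE created_at >= ? AND created_at < ? ORDER BY created_at"
)


def _iso(ts):
    # Timestamps are compared as ISO-8601 text (an assumption of this sketch).
    return ts.isoformat(sep=" ")


def bins_single_query(conn, start, end, binsize):
    """Run one query and split the streamed rows into time bins."""
    edge, batch = start + binsize, []
    for row in conn.execute(SQL, (_iso(start), _iso(end))):
        ts = datetime.fromisoformat(row[0])
        while ts >= edge:  # row belongs to a later bin: flush and advance
            if batch:
                yield batch
                batch = []
            edge += binsize
        batch.append(row)
    if batch:
        yield batch


def bins_query_per_bin(conn, start, end, binsize):
    """Run one query per bin, skipping bins that return no rows."""
    cur = start
    while cur < end:
        upper = min(cur + binsize, end)
        rows = conn.execute(SQL, (_iso(cur), _iso(upper))).fetchall()
        if rows:
            yield rows
        cur += binsize
```

The per-bin version pays the query overhead once for every bin, including empty ones, which is consistent with the higher hit count (15908 versus 13084) in the profiles above.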