patrickfrey/strus

Check how expensive it is to calculate or estimate the combined df's of features (the number of documents where all expression features occur)

Closed this issue · 2 comments

Currently, the df of an expression feature is simply inherited from one of its child expressions: the rarest child for AND, the most frequent child for OR. That value is only an upper bound (AND) or a lower bound (OR) on the true df. Maybe the iterators over the sets of documents (as ranges) in which each feature occurs could be used to calculate or estimate a more accurate value.
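To make the gap between the inherited value and the true value concrete, here is a minimal sketch in Python. It is not strus code: the function names `inherited_df` and `exact_and_df` are hypothetical, and posting lists are assumed to be plain sorted lists of document numbers, with a binary search standing in for an iterator's skip operation.

```python
from bisect import bisect_left


def inherited_df(child_dfs, op):
    """df as currently inherited: rarest child for AND, most frequent for OR."""
    if op == "AND":
        return min(child_dfs)
    if op == "OR":
        return max(child_dfs)
    raise ValueError("unknown operator: " + op)


def exact_and_df(lists):
    """Count documents that occur in every sorted posting list.

    Each list keeps a cursor that only moves forward; bisect_left plays the
    role of a 'skip to first docno >= candidate' call on an iterator.
    """
    if not lists or any(not p for p in lists):
        return 0
    positions = [0] * len(lists)
    candidate = lists[0][0]
    count = 0
    while True:
        agreed = True
        for i, postings in enumerate(lists):
            positions[i] = bisect_left(postings, candidate, positions[i])
            if positions[i] == len(postings):
                return count  # one list is exhausted, no more matches
            if postings[positions[i]] != candidate:
                # This list skipped past the candidate: restart with its docno.
                candidate = postings[positions[i]]
                agreed = False
                break
        if agreed:
            count += 1
            candidate += 1


# Illustrative data only.
a = [1, 3, 5, 7, 9]
b = [3, 4, 5, 9, 12]
c = [5, 9, 20]
print(inherited_df([len(a), len(b), len(c)], "AND"))  # 3
print(exact_and_df([a, b, c]))                        # 2
```

The exact count has to walk the rarest list in full and skip in every other list for each candidate, which is the cost this issue asks about.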

Estimated df calculation is too expensive because it requires a statistically significant number of random accesses. Random access kills performance.
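The cost can be illustrated with a sampling estimator, again only as a sketch under assumptions: `estimate_and_df` and `contains` are hypothetical names, posting lists are sorted in-memory Python lists, and each membership probe stands in for one random access into an index.

```python
import random
from bisect import bisect_left


def contains(postings, docno):
    """Membership probe by binary search; models one random access."""
    i = bisect_left(postings, docno)
    return i < len(postings) and postings[i] == docno


def estimate_and_df(lists, sample_size, rng):
    """Estimate the AND df by sampling documents from the rarest list.

    Returns (estimate, probes), where probes counts the random accesses made.
    """
    rarest = min(lists, key=len)
    if not rarest:
        return 0.0, 0
    others = [p for p in lists if p is not rarest]
    k = min(sample_size, len(rarest))
    hits = 0
    probes = 0
    for docno in rng.sample(rarest, k):
        found = True
        for postings in others:
            probes += 1
            if not contains(postings, docno):
                found = False
                break
        if found:
            hits += 1
    # Scale the hit rate of the sample up to the size of the rarest list.
    return len(rarest) * hits / k, probes


# Illustrative data only.
a = [1, 3, 5, 7, 9]
b = [3, 4, 5, 9, 12]
c = [5, 9, 20]
estimate, probes = estimate_and_df([a, b, c], 3, random.Random(0))
print(estimate, probes)
```

Every sampled document costs up to one probe per other child expression, so a sample large enough to be meaningful translates directly into that many random accesses.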