lareferencia/dspace-stats-collector

Posible optimizacion en consultas a Solr

Closed this issue · 3 comments

Estimados, las consultas que se hacen a solr son del tipo q=time:$INICIO TO *&sort:time asc&rows=$COUNT lo que hace que solr antes de devolver los resultados deba ordenar todos los resultados que coincidan (que pueden ser muchos millones) y retornar los primeros $COUNT items.
En nuestro caso el total son casi 18.000.000 que deben ser ordenados y de los cuales sólo usamos los primeros 1000.

Creo que si en lugar de hacer un q=time:$INICIO TO *:* se hiciese un q=time:$INICIO TO $INICIO+30DAYS o, si esto no funciona, un q=time:$INICIO TO $FIN dónde $FIN=$INICIO +30 días.
De esta forma la lógica de la aplicación seguiria igual pero las consultas serían mucho mas livianas, ya que ordenarían menos de 100.000 items (en nuestro caso).

A continuación incluyo un log ejemplo de consultas consecutivas en una misma ejecución del script. La primera demora mucho más, y las que siguen tardan bastante pero mucho menos que la primera.

INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T08%3A12%3A32.603Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 42.967 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T08%3A56%3A58.637Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 7.497 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T09%3A27%3A30.269Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 8.001 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T09%3A52%3A25.991Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 14.053 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T10%3A19%3A18.154Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 10.029 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T10%3A40%3A34.113Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 11.473 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T10%3A58%3A47.606Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 7.854 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T11%3A19%3A16.405Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 8.726 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T11%3A37%3A55.804Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 11.113 seconds, with status 200
INFO: Finished 'http://my_solr_url:8080/solr/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T11%3A56%3A12.844Z%22+TO+%2A%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 9.229 seconds, with status 200

Se agregó la expresión "date math" +30DAYS en la consulta a SOLR cuando no se especifica un untilDate

Hice una prueba con esta version 0.5.5 y un tamaño de página de solr de 1000 rows y los resultados son muy buenos. Ejemplo

INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T17%3A22%3A55.298Z%22+TO+2020-11-16T17%3A22%3A55.298Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 30.213 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T17%3A36%3A12.999Z%22+TO+2020-11-16T17%3A36%3A12.999Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 0.607 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T17%3A49%3A26.524Z%22+TO+2020-11-16T17%3A49%3A26.524Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 1.095 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T18%3A02%3A21.081Z%22+TO+2020-11-16T18%3A02%3A21.081Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 3.070 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T18%3A15%3A50.009Z%22+TO+2020-11-16T18%3A15%3A50.009Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 3.275 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T18%3A29%3A31.618Z%22+TO+2020-11-16T18%3A29%3A31.618Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 0.616 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T18%3A42%3A38.118Z%22+TO+2020-11-16T18%3A42%3A38.118Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 2.484 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T18%3A56%3A08.805Z%22+TO+2020-11-16T18%3A56%3A08.805Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 0.725 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T19%3A10%3A09.69Z%22+TO+2020-11-16T19%3A10%3A09.69Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 3.156 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T19%3A24%3A52.742Z%22+TO+2020-11-16T19%3A24%3A52.742Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=1000' (get) with body '' in 3.522 seconds, with status 200
INFO: Finished 'http://solr_url/statistics/select/?q=%2A+%2Btime%3A%7B%222020-11-16T19%3A38%3A46.337Z%22+TO+2020-11-16T19%3A38%3A46.337Z%2B30DAYS%5D&sort=time+asc&start=0&wt=json&fq=%2Bstatistics_type%3A%22view%22+%2Btype%3A%280+OR+2%29&fl=id%2Cip%2CowningItem%2Creferrer%2Ctime%2Ctype%2CuserAgent&rows=0' (get) with body '' in 0.583 seconds, with status 200
INFO: DSpace Stats Collector finished processing 9539 events from 2020-11-16T17:22:55.298Z to 2020-11-16T19:38:46.337Z. Breakdown: 9531 events sent succesfully, 8 events discarted as robot

Es decir, de un promedio de 13 segundos por request que tenía antes ahora bajamos a casi 1.9 (numeros calculados en el aire).
Ahora creo otro ticket para subir el rows size de solr :D
Muchas gracias!