markusressel/py-image-dedup

README overhaul

markusressel opened this issue · 7 comments

Some things about how py-image-dedup works changed since v1.0.0 and the README needs some guidance on how to use the docker-compose file. A big overhaul of the README is necessary.

Specify

  • since a fork of image-match that supports elasticsearch v6 as well as v7 is now used, the cumbersome package dependency section for it can be removed
  • how the daemon works
  • how and what statistics are exposed
  • how to use with docker-compose

Any chance for a quick update that just shows what the index creation call should be with v7? For those of us who aren't seasoned ElasticSearch users it is totally not clear how we need to change it.

For v7, just omit the image node:

curl -X PUT "192.168.2.115:9200/images?pretty" -H "Content-Type: application/json" -d "
    {
      \"mappings\": {
        \"properties\": {
          \"path\": {
            \"type\": \"keyword\",
            \"ignore_above\": 256
          }
        }
      }
    }"

Otherwise (for v6) you have to insert a type node under mappings, either image or _doc, like this:

curl -X PUT "192.168.2.115:9200/images?pretty" -H "Content-Type: application/json" -d "
    {
      \"mappings\": {
        \"_doc\": {
          \"properties\": {
            \"path\": {
              \"type\": \"keyword\",
              \"ignore_above\": 256
            }
          }
        }
      }
    }"
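The only difference between the two request bodies above is whether `properties` is wrapped in a type node. A minimal sketch of that rule in Python follows; the function name `build_index_body` and its parameters are made up for illustration and are not part of py-image-dedup.

```python
import json


def build_index_body(es_major_version, doc_type="_doc"):
    """Return the index-creation body for the `images` index.

    `es_major_version` and `doc_type` are illustrative parameters,
    not part of py-image-dedup's API.
    """
    properties = {
        "path": {"type": "keyword", "ignore_above": 256},
    }
    if es_major_version >= 7:
        # v7: `properties` sits directly under `mappings`, with no type node.
        return {"mappings": {"properties": properties}}
    # v6: `properties` is nested under a type node such as `_doc` or `image`.
    return {"mappings": {doc_type: {"properties": properties}}}


if __name__ == "__main__":
    for version in (6, 7):
        print(f"v{version}:")
        print(json.dumps(build_index_body(version), indent=2))
```

The printed JSON can be passed as the `-d` payload of the curl calls shown above.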

The WIP version of py-image-dedup is able to create such an index automatically. I currently just do not have the time to get to it :(

Awesome. That worked. Very helpful for those of us who have never used ElasticSearch.

Actually, while it seemed to create it, after running the script processing thousands of images I still seem to have an empty index. It looks like it is fetching against it but not inserting into it. Any best way to debug why it might not be adding images to the index?

curl 'localhost:9200/images/_stats'
{"_shards":{"total":2,"successful":1,"failed":0},"_all":{"primaries":{"docs":{"count":0,"deleted":0},"store":{"size_in_bytes":283},"indexing":{"index_total":0,"index_time_in_millis":0,"index_current":0,"index_failed":0,"delete_total":0,"delete_time_in_millis":0,"delete_current":0,"noop_update_total":0,"is_throttled":false,"throttle_time_in_millis":0},"get":{"total":0,"time_in_millis":0,"exists_total":0,"exists_time_in_millis":0,"missing_total":0,"missing_time_in_millis":0,"current":0},"search":{"open_contexts":0,"query_total":4874494,"query_time_in_millis":125150,"query_current":0,"fetch_total":4874494,"fetch_time_in_millis":33225,"fetch_current":0,"scroll_total":249803,"scroll_time_in_millis":32582,"scroll_current":0,"suggest_total":0,"suggest_time_in_millis":0,"suggest_current":0},"merges":{"current":0,"current_docs":0,"current_size_in_bytes":0,"total":0,"total_time_in_millis":0,"total_docs":0,"total_size_in_bytes":0,"total_stopped_time_in_millis":0,"total_throttled_time_in_millis":0,"total_auto_throttle_in_bytes":20971520},"refresh":{"total":2,"total_time_in_millis":0,"external_total":2,"external_total_time_in_millis":0,"listeners":0},"flush":{"total":1,"periodic":0,"total_time_in_millis":0},"warmer":{"current":0,"total":1,"total_time_in_millis":0},"query_cache":{"memory_size_in_bytes":0,"total_count":0,"hit_count":0,"miss_count":0,"cache_size":0,"cache_count":0,"evictions":0},"fielddata":{"memory_size_in_bytes":0,"evictions":0},"completion":{"size_in_bytes":0},"segments":{"count":0,"memory_in_bytes":0,"terms_memory_in_bytes":0,"stored_fields_memory_in_bytes":0,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":0,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":0,"index_writer_memory_in_bytes":0,"version_map_memory_in_bytes":0,"fixed_bit_set_memory_in_bytes":0,"max_unsafe_auto_id_timestamp":-1,"file_sizes":{}},"translog":{"operations":0,"size_in_bytes":110,"uncommitted_operations":0,"uncommitted_size_in_bytes":110,"earliest_last_modified_age":0},"req
uest_cache":{"memory_size_in_bytes":1387,"evictions":0,"hit_count":10,"miss_count":2},"recovery":{"current_as_source":0,"current_as_target":0,"throttle_time_in_millis":0}},"total":{"docs":{"count":0,"deleted":0},"store":{"size_in_bytes":283},"indexing":{"index_total":0,"index_time_in_millis":0,"index_current":0,"index_failed":0,"delete_total":0,"delete_time_in_millis":0,"delete_current":0,"noop_update_total":0,"is_throttled":false,"throttle_time_in_millis":0},"get":{"total":0,"time_in_millis":0,"exists_total":0,"exists_time_in_millis":0,"missing_total":0,"missing_time_in_millis":0,"current":0},"search":{"open_contexts":0,"query_total":4874494,"query_time_in_millis":125150,"query_current":0,"fetch_total":4874494,"fetch_time_in_millis":33225,"fetch_current":0,"scroll_total":249803,"scroll_time_in_millis":32582,"scroll_current":0,"suggest_total":0,"suggest_time_in_millis":0,"suggest_current":0},"merges":{"current":0,"current_docs":0,"current_size_in_bytes":0,"total":0,"total_time_in_millis":0,"total_docs":0,"total_size_in_bytes":0,"total_stopped_time_in_millis":0,"total_throttled_time_in_millis":0,"total_auto_throttle_in_bytes":20971520},"refresh":{"total":2,"total_time_in_millis":0,"external_total":2,"external_total_time_in_millis":0,"listeners":0},"flush":{"total":1,"periodic":0,"total_time_in_millis":0},"warmer":{"current":0,"total":1,"total_time_in_millis":0},"query_cache":{"memory_size_in_bytes":0,"total_count":0,"hit_count":0,"miss_count":0,"cache_size":0,"cache_count":0,"evictions":0},"fielddata":{"memory_size_in_bytes":0,"evictions":0},"completion":{"size_in_bytes":0},"segments":{"count":0,"memory_in_bytes":0,"terms_memory_in_bytes":0,"stored_fields_memory_in_bytes":0,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":0,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":0,"index_writer_memory_in_bytes":0,"version_map_memory_in_bytes":0,"fixed_bit_set_memory_in_bytes":0,"max_unsafe_auto_id_timestamp":-1,"file_sizes":{}},"translog":{"operations":0,"siz
e_in_bytes":110,"uncommitted_operations":0,"uncommitted_size_in_bytes":110,"earliest_last_modified_age":0},"request_cache":{"memory_size_in_bytes":1387,"evictions":0,"hit_count":10,"miss_count":2},"recovery":{"current_as_source":0,"current_as_target":0,"throttle_time_in_millis":0}}},"indices":{"images":{"uuid":"h_EB5_h6SoKo1_Ls4zFj3w","primaries":{"docs":{"count":0,"deleted":0},"store":{"size_in_bytes":283},"indexing":{"index_total":0,"index_time_in_millis":0,"index_current":0,"index_failed":0,"delete_total":0,"delete_time_in_millis":0,"delete_current":0,"noop_update_total":0,"is_throttled":false,"throttle_time_in_millis":0},"get":{"total":0,"time_in_millis":0,"exists_total":0,"exists_time_in_millis":0,"missing_total":0,"missing_time_in_millis":0,"current":0},"search":{"open_contexts":0,"query_total":4874494,"query_time_in_millis":125150,"query_current":0,"fetch_total":4874494,"fetch_time_in_millis":33225,"fetch_current":0,"scroll_total":249803,"scroll_time_in_millis":32582,"scroll_current":0,"suggest_total":0,"suggest_time_in_millis":0,"suggest_current":0},"merges":{"current":0,"current_docs":0,"current_size_in_bytes":0,"total":0,"total_time_in_millis":0,"total_docs":0,"total_size_in_bytes":0,"total_stopped_time_in_millis":0,"total_throttled_time_in_millis":0,"total_auto_throttle_in_bytes":20971520},"refresh":{"total":2,"total_time_in_millis":0,"external_total":2,"external_total_time_in_millis":0,"listeners":0},"flush":{"total":1,"periodic":0,"total_time_in_millis":0},"warmer":{"current":0,"total":1,"total_time_in_millis":0},"query_cache":{"memory_size_in_bytes":0,"total_count":0,"hit_count":0,"miss_count":0,"cache_size":0,"cache_count":0,"evictions":0},"fielddata":{"memory_size_in_bytes":0,"evictions":0},"completion":{"size_in_bytes":0},"segments":{"count":0,"memory_in_bytes":0,"terms_memory_in_bytes":0,"stored_fields_memory_in_bytes":0,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":0,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":0,"index_write
r_memory_in_bytes":0,"version_map_memory_in_bytes":0,"fixed_bit_set_memory_in_bytes":0,"max_unsafe_auto_id_timestamp":-1,"file_sizes":{}},"translog":{"operations":0,"size_in_bytes":110,"uncommitted_operations":0,"uncommitted_size_in_bytes":110,"earliest_last_modified_age":0},"request_cache":{"memory_size_in_bytes":1387,"evictions":0,"hit_count":10,"miss_count":2},"recovery":{"current_as_source":0,"current_as_target":0,"throttle_time_in_millis":0}},"total":{"docs":{"count":0,"deleted":0},"store":{"size_in_bytes":283},"indexing":{"index_total":0,"index_time_in_millis":0,"index_current":0,"index_failed":0,"delete_total":0,"delete_time_in_millis":0,"delete_current":0,"noop_update_total":0,"is_throttled":false,"throttle_time_in_millis":0},"get":{"total":0,"time_in_millis":0,"exists_total":0,"exists_time_in_millis":0,"missing_total":0,"missing_time_in_millis":0,"current":0},"search":{"open_contexts":0,"query_total":4874494,"query_time_in_millis":125150,"query_current":0,"fetch_total":4874494,"fetch_time_in_millis":33225,"fetch_current":0,"scroll_total":249803,"scroll_time_in_millis":32582,"scroll_current":0,"suggest_total":0,"suggest_time_in_millis":0,"suggest_current":0},"merges":{"current":0,"current_docs":0,"current_size_in_bytes":0,"total":0,"total_time_in_millis":0,"total_docs":0,"total_size_in_bytes":0,"total_stopped_time_in_millis":0,"total_throttled_time_in_millis":0,"total_auto_throttle_in_bytes":20971520},"refresh":{"total":2,"total_time_in_millis":0,"external_total":2,"external_total_time_in_millis":0,"listeners":0},"flush":{"total":1,"periodic":0,"total_time_in_millis":0},"warmer":{"current":0,"total":1,"total_time_in_millis":0},"query_cache":{"memory_size_in_bytes":0,"total_count":0,"hit_count":0,"miss_count":0,"cache_size":0,"cache_count":0,"evictions":0},"fielddata":{"memory_size_in_bytes":0,"evictions":0},"completion":{"size_in_bytes":0},"segments":{"count":0,"memory_in_bytes":0,"terms_memory_in_bytes":0,"stored_fields_memory_in_bytes":0,"term_vectors_memo
ry_in_bytes":0,"norms_memory_in_bytes":0,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":0,"index_writer_memory_in_bytes":0,"version_map_memory_in_bytes":0,"fixed_bit_set_memory_in_bytes":0,"max_unsafe_auto_id_timestamp":-1,"file_sizes":{}},"translog":{"operations":0,"size_in_bytes":110,"uncommitted_operations":0,"uncommitted_size_in_bytes":110,"earliest_last_modified_age":0},"request_cache":{"memory_size_in_bytes":1387,"evictions":0,"hit_count":10,"miss_count":2},"recovery":{"current_as_source":0,"current_as_target":0,"throttle_time_in_millis":0}}}}}
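Most of that response is noise for this question. A small sketch that pulls out only the counters relevant to "is anything being written?" is shown here; the helper name `summarize_index_stats` is hypothetical, and the sample is a trimmed copy of the response above.

```python
import json


def summarize_index_stats(stats, index="images"):
    """Pick the counters that show whether documents are being written."""
    primaries = stats["indices"][index]["primaries"]
    return {
        "docs": primaries["docs"]["count"],
        "index_total": primaries["indexing"]["index_total"],
        "index_failed": primaries["indexing"]["index_failed"],
        "query_total": primaries["search"]["query_total"],
    }


if __name__ == "__main__":
    # Trimmed copy of the `_stats` response, keeping only the fields read above.
    raw = """{"indices": {"images": {"primaries": {
        "docs": {"count": 0, "deleted": 0},
        "indexing": {"index_total": 0, "index_failed": 0},
        "search": {"query_total": 4874494}
    }}}}"""
    print(summarize_index_stats(json.loads(raw)))
    # prints {'docs': 0, 'index_total': 0, 'index_failed': 0, 'query_total': 4874494}
```

A `docs` count and `index_total` of 0 next to millions of queries suggests that searches reach the index but no write has completed.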

I suspect it is because there are 400 errors on insert, although it isn't clear why:
POST http://localhost:9200/images/image?refresh=false [status:400 request:0.004s]
POST http://localhost:9200/images/image?refresh=false [status:400 request:0.004s]
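The log lines only show the status code. Elasticsearch normally returns a JSON body with the reason for a 400, so re-sending one failing request by hand and reading that body is usually the quickest way to see why. A sketch of extracting the reason is below; `summarize_es_error` is a hypothetical helper, and the sample body is invented to show the shape only, not the actual error from this thread.

```python
import json


def summarize_es_error(body):
    """Return 'status type: reason' from an Elasticsearch error response body."""
    payload = json.loads(body)
    error = payload.get("error", {})
    if isinstance(error, str):
        # Some responses carry the error as a plain string.
        return f"{payload.get('status')} {error}"
    return f"{payload.get('status')} {error.get('type')}: {error.get('reason')}"


if __name__ == "__main__":
    # Invented sample body, for illustration of the response shape only.
    sample = json.dumps({
        "error": {
            "root_cause": [{"type": "example_exception", "reason": "example reason"}],
            "type": "example_exception",
            "reason": "example reason",
        },
        "status": 400,
    })
    print(summarize_es_error(sample))
```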

py-image-dedup probably doesn't use the correct request format for your version of Elasticsearch. v1.0.0 cannot work around this without changing the code. You can try the latest version from master, which should detect your Elasticsearch version automatically. There is no release for that version yet; it's on my TODO list.
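For reference, one way to detect the version is to read `version.number` from the JSON that Elasticsearch returns for a GET on its root URL. The sketch below assumes that approach; it is not necessarily how master does it, and `es_major_version` is a hypothetical name.

```python
def es_major_version(root_info):
    """Extract the major version from the parsed JSON of a GET on the cluster root URL."""
    return int(root_info["version"]["number"].split(".")[0])


if __name__ == "__main__":
    # Shape of the relevant part of the root response; the number is an example.
    example = {"version": {"number": "7.5.1"}}
    print(es_major_version(example))  # prints 7
```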

@jasontitus I have invested a couple hours, updated dependencies and fixed related stuff. I have not yet released a new version since it doesn't feel polished enough yet, but you can try the latest version from master or dockerhub if you want to give it a try.