ankane/pdscan

Not working with elasticsearch aliases

Opened this issue · 2 comments

Hi, I'm using filtered aliases to segment the data. They should behave like indices, but for some reason they are being ignored.

Go to Kibana Devtools and run the following:

DELETE test_animals

POST test_animals/_doc
{
  "name": "Peter Parker",
  "type": "dog"
}
POST test_animals/_doc
{
  "name": "Michael Jordan",
  "type": "cat"
}
POST _aliases 
{
  "actions": [
    {
      "add": {
        "index": "test_animals",
        "alias": "dogs",
        "filter": {
          "term": {
            "type.keyword": "dog"
          }
        }
      }
    },
    {
      "add": {
        "index": "test_animals",
        "alias": "cats",
        "filter": {
          "term": {
            "type.keyword": "cat"
          }
        }
      }
    }
  ]
}

This creates two virtual indices (filtered aliases) on top of one original index, which should make it possible to run pdscan against segments of the data:

./pdscan elasticsearch+https://gustavo:gustavo@my-deployment-xxx:9200/cats --show-all --format ndjson --show-data
It should report only "Michael Jordan", but it reports matches from the whole test_animals index.

Found 1 index to scan, sampling 10000 documents from each...

{"identifier":"test_animals.name","name":"surname","match_type":"value","confidence":"low","matches":["Michael Jordan","Peter Parker"],"matches_count":2}

Is this expected? How can I segment by a certain field?

Thanks!
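
To check which target a given run actually scanned, the ndjson line above can be parsed and the part of the identifier before the first dot compared with the name passed on the command line. This is only a rough sketch: the `finding` struct and the `scannedTarget` helper are names I made up for illustration, the struct covers only fields visible in the output above, and it assumes the index or alias name itself contains no dots.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// finding mirrors some of the fields visible in the ndjson line above
// (hypothetical type, not part of pdscan).
type finding struct {
	Identifier   string   `json:"identifier"`
	Name         string   `json:"name"`
	Matches      []string `json:"matches"`
	MatchesCount int      `json:"matches_count"`
}

// scannedTarget returns the text before the first dot of the identifier,
// which is the index or alias name reported in the output
// (assumption: that name contains no dots).
func scannedTarget(line string) (string, finding, error) {
	var f finding
	if err := json.Unmarshal([]byte(line), &f); err != nil {
		return "", f, err
	}
	target, _, _ := strings.Cut(f.Identifier, ".")
	return target, f, nil
}

func main() {
	line := `{"identifier":"test_animals.name","name":"surname","match_type":"value","confidence":"low","matches":["Michael Jordan","Peter Parker"],"matches_count":2}`

	target, f, err := scannedTarget(line)
	if err != nil {
		panic(err)
	}
	// The run was started against "cats", so anything else here
	// means the alias was resolved to the underlying index.
	fmt.Println("scanned:", target, "matches:", f.MatchesCount)
}
```

For the output above this gives `test_animals` with 2 matches, whereas a scan limited to the `cats` alias would be expected to show a single match.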

I refactored FetchTables() to this:

func (a ElasticsearchAdapter) FetchTables() ([]table, error) {
	tables := []table{}

	es := a.DB

	// try to fetch aliases first
	res, err := es.Cat.Aliases(
		es.Cat.Aliases.WithName([]string{a.indices}...),
		es.Cat.Aliases.WithFormat("json"),
	)

	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	err = checkResult(res)
	if err != nil {
		return nil, err
	}

	var r []interface{}
	if err := json.NewDecoder(res.Body).Decode(&r); err != nil {
		return nil, fmt.Errorf("error parsing the response body: %s", err)
	}

	aliasMap := make(map[string]bool)

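	for _, alias := range r {
		// skip malformed entries instead of panicking on a failed type assertion
		entry, ok := alias.(map[string]interface{})
		if !ok {
			continue
		}
		if aliasName, ok := entry["alias"].(string); ok {
			aliasMap[aliasName] = true
		}
	}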

	if len(aliasMap) > 0 {
		for aliasName := range aliasMap {
			tables = append(tables, table{Schema: "", Name: aliasName})
		}
		return tables, nil
	}

	// fallback to fetching indices
	res, err = es.Cat.Indices(
		es.Cat.Indices.WithIndex([]string{a.indices}...),
		es.Cat.Indices.WithS("index"),
		es.Cat.Indices.WithFormat("json"),
	)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()

	err = checkResult(res)
	if err != nil {
		return nil, err
	}

	if err := json.NewDecoder(res.Body).Decode(&r); err != nil {
		return nil, fmt.Errorf("error parsing the response body: %s", err)
	}

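	for _, index := range r {
		// skip entries that are not JSON objects
		entry, ok := index.(map[string]interface{})
		if !ok {
			continue
		}
		indexName, ok := entry["index"].(string)

		// skip malformed entries and system indices
		if ok && indexName != "" && indexName[0] != '.' {
			tables = append(tables, table{Schema: "", Name: indexName})
		}
	}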

	return tables, nil
}

Now it can handle aliases: if it finds aliases it uses those directly, and if not, it falls back to the default behavior.

It is nice that it also uses the alias name in the detection output:

.../cats --show-all
catsFound 1 index to scan, sampling 10000 documents from each...

cats.name: found last names (1 document, low confidence)

Probably not the best code (I'm not a Go dev), but maybe it's useful to start a discussion.
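
The alias-first, index-fallback selection described above can also be pulled out into a small pure function, which makes it easier to test without a cluster. This is a sketch only: `pickTables` is a hypothetical helper name, and `table` here is a stand-in with the same two fields used in the snippet above, not the real pdscan type.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// table is a stand-in for the type used in FetchTables above.
type table struct {
	Schema string
	Name   string
}

// pickTables (hypothetical helper) returns one table per alias if any
// aliases matched, otherwise one table per non-system index.
func pickTables(aliases, indices []string) []table {
	seen := make(map[string]bool)
	var names []string

	// aliases win: the same alias can be listed once per backing index,
	// so duplicates are dropped
	for _, a := range aliases {
		if !seen[a] {
			seen[a] = true
			names = append(names, a)
		}
	}

	// fall back to indices, skipping system indices (leading dot)
	if len(names) == 0 {
		for _, idx := range indices {
			if !strings.HasPrefix(idx, ".") && !seen[idx] {
				seen[idx] = true
				names = append(names, idx)
			}
		}
	}

	// sort so the scan order does not depend on map iteration order
	sort.Strings(names)

	tables := make([]table, 0, len(names))
	for _, n := range names {
		tables = append(tables, table{Schema: "", Name: n})
	}
	return tables
}

func main() {
	fmt.Println(pickTables([]string{"cats", "cats"}, []string{"test_animals"}))
	fmt.Println(pickTables(nil, []string{".kibana", "test_animals"}))
}
```

Sorting the names is a deliberate difference from the refactored function: ranging over a Go map yields keys in an unspecified order, so the scan order there can change between runs.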

This approach will not work with data streams.

This one will:

// try to fetch alias first
	res, err := es.Indices.GetAlias(es.Indices.GetAlias.WithName([]string{a.indices}...))
	if err != nil {
...

Basically, if there is an alias it will use it. Of course, a more elegant solution should be implemented to support multiple indices, but it is enough for me.
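
For anyone picking this up, here is a rough sketch of how the body returned by the get-alias call could be turned into alias names. It assumes the usual response shape of `GET _alias/<name>`, where each top-level key is an index (or data stream) and its `aliases` object is keyed by alias name. `aliasNames` is a hypothetical helper, not something that exists in pdscan, and the sample body is hand-written from the alias definitions at the top of this issue rather than captured from a cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// aliasResponse models the assumed shape of the get-alias response:
// {"<index>": {"aliases": {"<alias>": {...}}}}
type aliasResponse map[string]struct {
	Aliases map[string]json.RawMessage `json:"aliases"`
}

// aliasNames (hypothetical helper) collects the distinct alias names
// from a get-alias response body, sorted for stable output.
func aliasNames(body []byte) ([]string, error) {
	var resp aliasResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("error parsing the response body: %w", err)
	}

	seen := make(map[string]bool)
	var names []string
	for _, entry := range resp {
		for name := range entry.Aliases {
			if !seen[name] {
				seen[name] = true
				names = append(names, name)
			}
		}
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	// hand-written sample matching the "cats" alias defined earlier
	body := []byte(`{"test_animals":{"aliases":{"cats":{"filter":{"term":{"type.keyword":"cat"}}}}}}`)

	names, err := aliasNames(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

One thing to watch: as far as I can tell, asking for an alias that does not exist returns a 404 rather than an empty body, so that status would need to be treated as "no alias" before falling back to the index lookup instead of being returned as an error.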