winkjs/wink-bm25-text-search

addDoc for duplicate documents

ittaboba opened this issue · 5 comments

I have seen that if I try to add a document with an existing id, it throws this error:

if ( documents[ id ] !== undefined ) {
  throw Error( 'winkBM25S: Duplicate document encountered: ' + JSON.stringify( id ) );
}

I need to handle document updates. I am wondering if it's possible to update the model if I know both the original and updated versions of a document.

Thank you!

Documents cannot be added after consolidation. While adding a document, a duplicate doc id is detected to prevent any unintentional mistake in providing the doc id. If you wish to update a document, first reset and then upload the documents again.
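The reset-and-reload route can be sketched roughly as below. This is only a sketch: `rebuildIndex` is a hypothetical helper, `engine` is assumed to be a wink-bm25-text-search instance exposing `reset()`, `defineConfig()`, `addDoc()` and `consolidate()`, and it assumes that `reset()` also clears the configuration, so the config is applied again.

```javascript
// Hypothetical helper (not part of the library): rebuild the whole index
// from a fresh list of documents after one of them has changed.
// `engine` is assumed to be a wink-bm25-text-search instance and `config`
// the same object that was originally passed to defineConfig().
function rebuildIndex(engine, config, docs) {
  engine.reset();               // drop everything learned so far
  engine.defineConfig(config);  // assumption: reset() also clears the config
  docs.forEach(function (doc) {
    // Each document is added under its own id as the unique key.
    engine.addDoc(doc, doc.id);
  });
  engine.consolidate();
  return docs.length;           // number of documents re-indexed
}
```

Every update pays for a full re-index of the corpus, which is the cost raised in the next comment.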

Thanks for replying @sanjayaksaxena

Yes, I meant before consolidation. It could be very expensive for me to reset on every upload. I am considering writing a custom removeDoc(id) before doing a new addDoc(id) for the same id.

In your opinion, if I start with the following docs and JSON

[
  {"id": "id_0", "title": "hey, how are you?", "body": "good"},
  {"id": "id_1", "title": "hey", "body": "hey"},
  {"id": "id_2", "title": "hey", "body": "bad"}
]

[
  {
    "fldWeights": {
      "title": 1,
      "body": 2
    },
    "bm25Params": {
      "k1": 1.2,
      "b": 0.75,
      "k": 1
    },
    "ovFldNames": []
  },
  {
    "totalCorpusLength": 9,
    "totalDocs": 3,
    "consolidated": false
  },
  {
    "id_0": {
      "freq": {
        "0": 0.1335,
        "1": 1.3486
      },
      "fieldValues": {},
      "length": 3
    },
    "id_1": {
      "freq": {
        "0": 0.2098
      },
      "fieldValues": {},
      "length": 3
    },
    "id_2": {
      "freq": {
        "0": 0.1335,
        "2": 1.3486
      },
      "fieldValues": {},
      "length": 3
    }
  },
  [
    [
      "id_0",
      "id_1",
      "id_2"
    ],
    [
      "id_0"
    ],
    [
      "id_2"
    ]
  ],
  3,
  {
    "hey": 0,
    "good": 1,
    "bad": 2
  },
  {},
  [],
  []
]

Then remove id_0

[
  {"id": "id_1", "title": "hey", "body": "hey"},
  {"id": "id_2", "title": "hey", "body": "bad"}
]

Are there any drawbacks to having a JSON like the following when I consolidate or add a new doc? (e.g. the empty array in invertedIdx and the "good" token in the token2Index object, even though it no longer appears in any document)

[
  {
    "fldWeights": {
      "title": 1,
      "body": 2
    },
    "bm25Params": {
      "k1": 1.2,
      "b": 0.75,
      "k": 1
    },
    "ovFldNames": []
  },
  {
    "totalCorpusLength": 6,
    "totalDocs": 2,
    "consolidated": false
  },
  {
    "id_1": {
      "freq": {
        "0": 0.2098
      },
      "fieldValues": {},
      "length": 3
    },
    "id_2": {
      "freq": {
        "0": 0.1335,
        "2": 1.3486
      },
      "fieldValues": {},
      "length": 3
    }
  },
  [
    [
      "id_1",
      "id_2"
    ],
    [],
    [
      "id_2"
    ]
  ],
  3,
  {
    "hey": 0,
    "good": 1,
    "bad": 2
  },
  {},
  [],
  []
]

P.S. I know it needs a minimum of 3 documents to work.

Thanks a lot!

Hello @ittaboba

Deleting before consolidation requires undoing all the operations performed in addDoc(), except for token2Index, which we may leave untouched: a bit of extra data there should not matter much. I think invertedIdx must be handled, i.e. remove the doc id from the arrays mapped to each token contained in that document.

Ideally we should allow incremental updates even after consolidation. It's something that has been on our minds, but there is so much backlog to handle!
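The invertedIdx handling described above can be sketched as follows. This is not the library's code: `createState` and `removeDocFromState` are hypothetical names, and the state object is a simplified stand-in shaped like the exported JSON earlier in this thread. The sketch visits only the posting lists of tokens that occur in the removed document and leaves token2Index alone.

```javascript
// Simplified stand-in for the module's private state, using the values
// from the exported JSON in this thread. Not the library's actual code.
function createState() {
  return {
    totalCorpusLength: 9,
    totalDocs: 3,
    documents: {
      id_0: { freq: { 0: 0.1335, 1: 1.3486 }, length: 3 },
      id_1: { freq: { 0: 0.2098 }, length: 3 },
      id_2: { freq: { 0: 0.1335, 2: 1.3486 }, length: 3 }
    },
    invertedIdx: [['id_0', 'id_1', 'id_2'], ['id_0'], ['id_2']],
    token2Index: { hey: 0, good: 1, bad: 2 }
  };
}

// Hypothetical helper: undo what adding `id` did, except for token2Index.
function removeDocFromState(state, id) {
  var doc = state.documents[id];
  if (doc === undefined) {
    return false; // unknown id: nothing to undo
  }
  state.totalCorpusLength -= doc.length;
  state.totalDocs -= 1;
  // The keys of doc.freq are the indexes of the tokens in this document,
  // so only those posting lists need to be filtered.
  Object.keys(doc.freq).forEach(function (t) {
    state.invertedIdx[t] = state.invertedIdx[t].filter(function (docId) {
      return docId !== id;
    });
  });
  delete state.documents[id];
  return true;
}
```

Calling `removeDocFromState(createState(), 'id_0')` leaves the posting list for "good" empty while its entry in token2Index stays in place, matching the second JSON posted above.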

Best,
Sanjaya

Thanks @sanjayaksaxena for your hard work, this is super cool!

Here's the simple function I wrote:

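var removeDoc = function ( id ) {

    if ( consolidated ) {
      throw Error( 'winkBM25S: post consolidation removing is not possible!' );
    }

    // Guard against unknown ids; documents[ id ].length would otherwise throw a TypeError.
    if ( documents[ id ] === undefined ) {
      throw Error( 'winkBM25S: document not found: ' + JSON.stringify( id ) );
    }

    // Undo the corpus statistics that addDoc() accumulated for this document.
    totalCorpusLength -= documents[ id ].length;
    totalDocs -= 1;

    delete documents[ id ];

    // Drop the id from every posting list; token2Index is left untouched.
    for ( var i = 0, imax = invertedIdx.length; i < imax; i += 1 ) {
      invertedIdx[ i ] = invertedIdx[ i ].filter( ( docId ) => docId !== id );
    }
};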

I get the same scores whether I remove a document with this function or run addDoc from scratch on all the documents minus the one removed.

Empty invertedIdx arrays and stale token2Index entries don't seem to affect scores. During consolidation you iterate over documents to calculate documents[ id ].freq[ t ] (line 423), so idf[ t ], which is calculated from invertedIdx[ i ].length, is never used for a token whose array is empty, provided the document has also been removed from the JSON as I do.
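The reasoning above can be checked with a small model. This is a sketch under assumptions, not the library's consolidation code: `weightsByToken` is a hypothetical function, the idf formula is a common BM25 variant with an additive constant k, and the raw frequencies are illustrative values inferred from the field weights in this thread. It compares the state left behind by a removal with the state obtained by adding the two remaining documents to a fresh model.

```javascript
var params = { k1: 1.2, b: 0.75, k: 1 };

// State after removing id_0: "good" keeps index 1 with an empty posting list.
var afterRemoval = {
  totalDocs: 2, totalCorpusLength: 6,
  token2Index: { hey: 0, good: 1, bad: 2 },
  invertedIdx: [['id_1', 'id_2'], [], ['id_2']],
  documents: {
    id_1: { freq: { 0: 3 }, length: 3 },
    id_2: { freq: { 0: 1, 2: 2 }, length: 3 }
  }
};

// Same two documents added to a fresh model: "bad" now gets index 1.
var fromScratch = {
  totalDocs: 2, totalCorpusLength: 6,
  token2Index: { hey: 0, bad: 1 },
  invertedIdx: [['id_1', 'id_2'], ['id_2']],
  documents: {
    id_1: { freq: { 0: 3 }, length: 3 },
    id_2: { freq: { 0: 1, 1: 2 }, length: 3 }
  }
};

// Hypothetical, simplified consolidation: returns BM25 weights keyed by
// document id and token text, so differently indexed states can be compared.
function weightsByToken(state, p) {
  var names = {};
  Object.keys(state.token2Index).forEach(function (tok) {
    names[state.token2Index[tok]] = tok;
  });
  // An idf is computed for every posting list, including empty ones.
  var idf = state.invertedIdx.map(function (postings) {
    var n = postings.length;
    return Math.log(((state.totalDocs - n + 0.5) / (n + 0.5)) + p.k);
  });
  var avgLength = state.totalCorpusLength / state.totalDocs;
  var out = {};
  Object.keys(state.documents).forEach(function (id) {
    var doc = state.documents[id];
    var norm = (1 - p.b) + p.b * (doc.length / avgLength);
    out[id] = {};
    // Only tokens present in a surviving document are read here, so the
    // idf of an empty posting list is never used.
    Object.keys(doc.freq).forEach(function (t) {
      var tf = doc.freq[t];
      out[id][names[t]] = idf[t] * ((p.k1 + 1) * tf) / (p.k1 * norm + tf);
    });
  });
  return out;
}

console.log(JSON.stringify(weightsByToken(afterRemoval, params)) ===
  JSON.stringify(weightsByToken(fromScratch, params)));
```

In this model the idf of "good" is still computed, but no remaining document has a freq entry for its index, so it never reaches a weight.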

Please let me know if you see any drawbacks I have missed.

Thank you,
Lorenzo

Hi @ittaboba

Not able to spot any drawbacks.

Best,
Sanjaya