I found that some source files were not properly indexed.

Question

RosePasta opened this issue 5 years ago · 4 comments

First of all, thank you very much for sharing your package.

The shared index files use two fields, "contents" and "path".

In the eclipse.jdt.core project, "contents" is missing for the 1.java file.
(Another file shows the same problem.)

Below is the code snippet I used to verify this.

IndexReader reader = DirectoryReader.open(FSDirectory.open((new File(indexFolder)).toPath()));
// Check the fields
Fields fields = MultiFields.getFields(reader);
Iterator<String> iterator = fields.iterator();
while (iterator.hasNext()) {
    System.out.println(iterator.next());
}

for (int i = 0; i < reader.numDocs(); i++) {
    Document doc = reader.document(i);
    System.out.println(i + " " + doc.get("path") + " " + doc.get("contents"));
    Terms tfVector = reader.getTermVector(i, "contents");
    TermsEnum iter = tfVector.iterator();
    for (int j = 0; j < tfVector.size(); j++) {
        System.out.print(iter.next().utf8ToString() + " ");
    }
    System.out.println();
}

The output is below.

0 F:\MyWorks\Thesis Works\Crowdsource_Knowledge_Base\M4CPBugs\experiment\corpus\norm-class\eclipse.jdt.core\1.java null

1.java contains many terms, so "contents" should not be null.

Could you check whether I am doing something wrong?

Answer 1 · 2019-06-21T06:15:12.000Z

Thanks for letting me know, @RosePasta. I will look into it.

Answer 2 · 2019-07-03T00:31:14.000Z

Hi @RosePasta

Lucene changes its data structures often, so code written against one version goes obsolete quickly.
But if you are looking for a TF or IDF calculation, here is some code that might help you.

public static final String FIELD_CONTENTS = "contents";

// keys, tfMap and getIDF are defined elsewhere (not shown here)
IndexReader reader = DirectoryReader.open(FSDirectory
        .open(new File(indexFolder).toPath()));
Fields fields = MultiFields.getFields(reader);

// collect every term of every indexed field
for (String field : fields) {
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator();
    BytesRef bytesRef;
    while ((bytesRef = termsEnum.next()) != null) {
        String term = bytesRef.utf8ToString();
        this.keys.add(term);
    }
}

// now get their TF and DF
int N = reader.numDocs();
long sumTotalTermFreq = 0; // long, because totalTermFreq returns a long
for (String term : this.keys) {
    Term t = new Term(FIELD_CONTENTS, term);
    // calculating the TF
    long totalTermFreq = reader.totalTermFreq(t);
    if (!tfMap.containsKey(term)) {
        tfMap.put(term, totalTermFreq);
        sumTotalTermFreq += totalTermFreq;
    }

    // calculating the IDF
    int docFreq = reader.docFreq(t);
    double idf = getIDF(N, docFreq);
}
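
The fragment above depends on helpers that are not shown (keys, tfMap, getIDF). To make clear what the three quantities mean, here is a minimal sketch in plain Java with no Lucene dependency. It computes the total term frequency, the document frequency, and an IDF over a tiny in-memory corpus. The class name TfIdfSketch, its method names, and the formula ln(N / df) are assumptions for illustration only. The actual getIDF may use a different variant (for example, a smoothed one).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the package discussed in this thread.
class TfIdfSketch {

    // Total number of occurrences of each term across all documents
    // (the same idea as IndexReader.totalTermFreq for one field).
    public static Map<String, Long> totalTermFreq(List<List<String>> docs) {
        Map<String, Long> tf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1L, Long::sum);
            }
        }
        return tf;
    }

    // Number of documents that contain each term at least once
    // (the same idea as IndexReader.docFreq).
    public static Map<String, Integer> docFreq(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so a term counts once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // ASSUMPTION: plain ln(N / df); returns 0 when the term occurs in no document.
    public static double idf(int numDocs, int docFreq) {
        if (docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // Each inner list stands for the tokenized "contents" of one document.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("index", "reader", "index"),
                Arrays.asList("reader", "term"),
                Arrays.asList("index"));

        Map<String, Long> tf = totalTermFreq(docs);
        Map<String, Integer> df = docFreq(docs);
        for (String term : tf.keySet()) {
            System.out.println(term + " tf=" + tf.get(term)
                    + " df=" + df.get(term)
                    + " idf=" + idf(docs.size(), df.get(term)));
        }
    }
}
```

In this sketch each document is a list of tokens, so totalTermFreq counts repeats while docFreq counts a term once per document. With Lucene, the same numbers come from the index itself, as in the fragment above.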

Answer 3 · 2019-07-09T06:12:10.000Z

Thank you. 👍

Answer 4 · 2019-11-18T02:37:13.000Z

OK, closing as resolved then.