I found that some source files were not properly indexed.
RosePasta opened this issue · 4 comments
First of all, thank you very much for sharing your package.
While using it, I found that some source files do not seem to be indexed properly.
The shared index uses two fields, "contents" and "path".
In the eclipse.jdt.core project, the "contents" field comes back empty for the 1.java file.
(Another file showed the same problem.)
Below is a snippet of code I used to verify this.
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexFolder).toPath()));
// List the field names present in the index
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    System.out.println(field);
}
for (int i = 0; i < reader.numDocs(); i++) {
    // Print the values returned for both fields of this document
    Document doc = reader.document(i);
    System.out.println(i + " " + doc.get("path") + " " + doc.get("contents"));
    // Print the terms in this document's "contents" term vector, if one exists
    Terms tfVector = reader.getTermVector(i, "contents");
    if (tfVector != null) {
        TermsEnum iter = tfVector.iterator();
        BytesRef term;
        while ((term = iter.next()) != null) {
            System.out.print(term.utf8ToString() + " ");
        }
    }
    System.out.println();
}
The output is below.
0 F:\MyWorks\Thesis Works\Crowdsource_Knowledge_Base\M4CPBugs\experiment\corpus\norm-class\eclipse.jdt.core\1.java null
1.java contains many terms, so "contents" should not be null.
Could you check whether I made a mistake?
Thanks for letting me know, @RosePasta. I will look into it.
Hi @RosePasta
Lucene changes its data structures often, so code written against it goes obsolete frequently.
But if you are looking for TF or IDF calculation, here is some code that might help.
public static final String FIELD_CONTENTS = "contents";

IndexReader reader = DirectoryReader.open(FSDirectory
        .open(new File(indexFolder).toPath()));
// collect every term from every field of the index
Fields fields = MultiFields.getFields(reader);
for (String field : fields) {
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator();
    BytesRef bytesRef;
    while ((bytesRef = termsEnum.next()) != null) {
        this.keys.add(bytesRef.utf8ToString());
    }
}
// now get their TF and DF
int N = reader.numDocs();
long sumTotalTermFreq = 0;
for (String term : this.keys) {
    Term t = new Term(FIELD_CONTENTS, term);
    // calculating the TF (total occurrences across all documents)
    long totalTermFreq = reader.totalTermFreq(t);
    if (!tfMap.containsKey(term)) {
        tfMap.put(term, totalTermFreq);
        sumTotalTermFreq += totalTermFreq;
    }
    // calculating the IDF
    int docFreq = reader.docFreq(t);
    double idf = getIDF(N, docFreq);
}
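The getIDF helper is not shown in the snippet above. As a rough sketch only, assuming the common log(N / docFreq) form, the plain-Java example below shows how the three quantities relate on a tiny in-memory corpus, without needing Lucene. The class name TfIdfSketch, the sample documents, and the IDF formula itself are assumptions made for illustration and may differ from what the package actually does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TfIdfSketch {

    // Hypothetical stand-in for getIDF(N, docFreq); the formula is an assumption.
    public static double getIDF(int numDocs, int docFreq) {
        if (docFreq <= 0) {
            return 0.0; // term not present in any document: avoid division by zero
        }
        return Math.log((double) numDocs / docFreq);
    }

    // Total occurrences of each term across all documents
    // (the quantity totalTermFreq reports for a single term).
    public static Map<String, Long> totalTermFreq(List<List<String>> docs) {
        Map<String, Long> tf = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1L, Long::sum);
            }
        }
        return tf;
    }

    // Number of documents containing each term at least once
    // (the quantity docFreq reports for a single term).
    public static Map<String, Integer> docFreq(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<>(doc);
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        // Three already-tokenized sample documents (made up for this example)
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("index", "reader", "index"),
                Arrays.asList("reader", "term"),
                Arrays.asList("term", "vector", "term"));
        Map<String, Long> tf = totalTermFreq(docs);
        Map<String, Integer> df = docFreq(docs);
        for (String term : tf.keySet()) {
            System.out.println(term + " tf=" + tf.get(term) + " df=" + df.get(term)
                    + " idf=" + getIDF(docs.size(), df.get(term)));
        }
    }
}
```

With this formula, a term that occurs in every document gets an IDF of 0, and rarer terms get larger values. Other variants (for example with smoothing added to the denominator) are also in common use.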
Thank you. 👍
OK, closing as resolved then.