We use LLMs to extract relevant knowledge entities from cybersecurity-related texts. Taking a subset of arXiv preprints on cybersecurity as our data, we compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce knowledge entities that adequately reflect the cybersecurity context.
To install the dependencies when CUDA is available:

`python3 -m pip install -r requirements_cuda.txt --extra-index-url https://pypi.nvidia.com`

Otherwise:

`python3 -m pip install -r requirements.txt`
To extract the keywords:

`python3 pipeline_eeke.py`
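As a rough illustration of what this step does, here is a minimal keyphrase-extraction sketch using two of the extractors listed below (Yake and KeyBERT). It is not the actual `pipeline_eeke.py` logic, and the sample text is invented.

```python
# Illustrative keyphrase extraction with two of the listed extractors;
# not the actual pipeline_eeke.py logic.
import yake
from keybert import KeyBERT

text = (
    "We study malware detection on encrypted network traffic "
    "using transformer-based intrusion detection models."
)

# YAKE: unsupervised, statistics-based keyphrase extraction
yake_keywords = yake.KeywordExtractor(lan="en", n=2, top=5).extract_keywords(text)
print([kw for kw, _score in yake_keywords])

# KeyBERT: embedding-based keyphrase extraction
keybert_keywords = KeyBERT().extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=5)
print([kw for kw, _score in keybert_keywords])
```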
To draw the manifolds with the different embeddings:

`python3 manifold.py`

and, for an interactive plot:

`python3 manifold_plotly.py`
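For context, here is a minimal sketch of such a 2D projection, assuming sentence-transformers embeddings and t-SNE; the actual scripts may use other embeddings and projection methods, and the keyword list is hypothetical. The interactive variant would render the same coordinates with Plotly instead of Matplotlib.

```python
# Sketch of a 2D manifold projection of keyword embeddings; the repo's
# manifold.py may use different embeddings and projection methods.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

keywords = ["malware", "phishing", "ransomware", "buffer overflow", "firewall"]  # hypothetical

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(keywords)

# t-SNE requires perplexity < number of samples
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), kw in zip(coords, keywords):
    plt.annotate(kw, (x, y))
plt.title("2D projection of keyword embeddings")
plt.savefig("manifold_example.png")
```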
To plot the cross-correlation of the models:

`python3 cross_correlation_matrix_model.py`
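A hedged sketch of what such a matrix can look like: the average cosine similarity between the embedded keywords of each pair of extractors, drawn as a heatmap. The extractor outputs below are invented, and the actual script may aggregate differently.

```python
# Sketch of a model cross-correlation matrix: average cosine similarity
# between the embedded keywords of each pair of extractors.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Hypothetical outputs of three extractors on the same corpus
extractor_keywords = {
    "yake": ["malware detection", "network traffic", "intrusion detection"],
    "keybert": ["malware", "network security", "anomaly detection"],
    "bert-ner": ["DDoS", "Linux", "OpenSSL"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")
names = list(extractor_keywords)
embs = {name: model.encode(kws) for name, kws in extractor_keywords.items()}

# Average cosine similarity over all keyword pairs of two extractors
matrix = np.array(
    [[cosine_similarity(embs[a], embs[b]).mean() for b in names] for a in names]
)

plt.imshow(matrix, cmap="viridis")
plt.xticks(range(len(names)), names, rotation=45)
plt.yticks(range(len(names)), names)
plt.colorbar(label="avg cosine similarity")
plt.tight_layout()
plt.savefig("cross_correlation_example.png")
```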
- When running the scripts on large data, you may run into issues such as long runtimes or excessive memory usage.
- Constants at the top of each script let you adjust its parameters (a hypothetical example is sketched below).
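The constant names and values below are illustrative only, not the repo's actual parameters:

```python
# Hypothetical configuration constants of the kind found at the top of the
# scripts; tune these before running on large corpora.
MAX_DOCUMENTS = 1000    # cap on the number of abstracts processed
TOP_N_KEYWORDS = 10     # keywords kept per document and extractor
BATCH_SIZE = 32         # lower this if the scripts use too much memory
```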
- Python 3.10
- Yake (KPE): https://pypi.org/project/yake/
- KeyBERT (KPE): https://pypi.org/project/keybert/
- Electra-base conll03 (NER): https://huggingface.co/bhadresh-savani/electra-base-discriminator-finetuned-conll03-english
- XLM-RoBERTa-base OntoNotes5 (NER + NUM): https://huggingface.co/asahi417/tner-xlm-roberta-base-ontonotes5
- BERT COCA-docusco (TokC): https://huggingface.co/browndw/docusco-bert
- BERT-large-cased conll03 (NER): https://huggingface.co/dslim/bert-large-NER
- DistilBERT-base-uncased conll03 (NER): https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english
- RoBERTa-large conll03 (NER): https://huggingface.co/Jean-Baptiste/roberta-large-ner-english
- BERT-large-uncased conll03 (NER): https://huggingface.co/Jorgeutd/bert-large-uncased-finetuned-ner
- KBIR inspec (KPE): https://huggingface.co/ml6team/keyphrase-extraction-kbir-inspec
- KBIR kpcrowd (KPE): https://huggingface.co/ml6team/keyphrase-extraction-kbir-kpcrowd
- XLM-RoBERTa-large conll03 (NER): https://huggingface.co/xlm-roberta-large-finetuned-conll03-english
- BERT-base-uncased (NER + CON R): https://huggingface.co/yanekyuk/bert-uncased-keyword-discriminator
- BERT-base-uncased (KPE): https://huggingface.co/yanekyuk/bert-uncased-keyword-extractor
- spaCy-large OntoNotes5 (NnE): https://pypi.org/project/spacy/
- spaCy-transformers OntoNotes5 (NnE): https://pypi.org/project/spacy-transformers/
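To show how such extractors are typically loaded, here is a minimal sketch using one Hugging Face NER model from the list and spaCy (assuming `en_core_web_lg` as the OntoNotes5-trained large model); the sample sentence is invented.

```python
# Sketch of loading two of the listed extractors; output formats differ
# between Hugging Face pipelines and spaCy.
import spacy
from transformers import pipeline

text = "Attackers exploited an OpenSSL flaw to deploy ransomware on Linux servers."

# NER with a Hugging Face model from the list above
ner = pipeline("ner", model="dslim/bert-large-NER", aggregation_strategy="simple")
print([(e["word"], e["entity_group"]) for e in ner(text)])

# Named entities with spaCy (en_core_web_lg, trained on OntoNotes5)
nlp = spacy.load("en_core_web_lg")
print([(ent.text, ent.label_) for ent in nlp(text).ents])
```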
All the plots generated by the scripts can be found in the `results` folder. Here is some information about its contents:
- `cluster_coeff`: cluster coefficient of an extractor with a specific embedding
- `manifolds`: 2D projection of the extractors with a specific embedding
- `manifolds_html`: same as before, but the plots are interactive
- `model_correlation_*`: hierarchical clustering on the average cosine similarity between each pair of extractors (see the sketch below)
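A hedged sketch of that last step: hierarchical clustering over an average-cosine-similarity matrix, using distance = 1 - similarity. The `names` and `matrix` values are hypothetical.

```python
# Sketch of hierarchical clustering over an average cosine-similarity matrix,
# as in model_correlation_*; names and values are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

names = ["yake", "keybert", "bert-ner"]
matrix = np.array([[1.0, 0.8, 0.3],
                   [0.8, 1.0, 0.4],
                   [0.3, 0.4, 1.0]])  # hypothetical avg cosine similarities

distance = 1.0 - matrix
np.fill_diagonal(distance, 0.0)            # squareform requires a zero diagonal
Z = linkage(squareform(distance), method="average")

dendrogram(Z, labels=names)
plt.ylabel("1 - avg cosine similarity")
plt.savefig("model_correlation_example.png")
```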