We use LLMs to extract relevant knowledge entities from cybersecurity-related texts. Taking a subset of arXiv preprints on cybersecurity as our data, we compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce knowledge entities that adequately reflect the cybersecurity context.
To install the dependencies when CUDA is available:

`python3 -m pip install -r requirements_cuda.txt --extra-index-url https://pypi.nvidia.com`

Otherwise:

`python3 -m pip install -r requirements.txt`
To extract the keywords:

`python3 pipeline_eeke.py`
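As a rough illustration of what this step does, here is a minimal keyphrase-extraction sketch using two of the extractors listed below (Yake and KeyBERT). It is not the actual `pipeline_eeke.py` logic, and the sample text is invented.

```python
# Illustrative keyphrase extraction with two of the listed extractors;
# not the actual pipeline_eeke.py logic.
import yake
from keybert import KeyBERT

text = (
    "We study malware detection on encrypted network traffic "
    "using transformer-based intrusion detection models."
)

# YAKE: unsupervised, statistics-based keyphrase extraction
yake_keywords = yake.KeywordExtractor(lan="en", n=2, top=5).extract_keywords(text)
print([kw for kw, _score in yake_keywords])

# KeyBERT: embedding-based keyphrase extraction
keybert_keywords = KeyBERT().extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=5)
print([kw for kw, _score in keybert_keywords])
```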
To draw the manifolds with the different embeddings:

`python3 manifold.py`

and, for an interactive plot:

`python3 manifold_plotly.py`
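For context, here is a minimal sketch of such a 2D projection, assuming sentence-transformers embeddings and t-SNE; the actual scripts may use other embeddings and projection methods, and the keyword list is hypothetical. The interactive variant would render the same coordinates with Plotly instead of Matplotlib.

```python
# Sketch of a 2D manifold projection of keyword embeddings; the repo's
# manifold.py may use different embeddings and projection methods.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

keywords = ["malware", "phishing", "ransomware", "buffer overflow", "firewall"]  # hypothetical

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(keywords)

# t-SNE requires perplexity < number of samples
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), kw in zip(coords, keywords):
    plt.annotate(kw, (x, y))
plt.title("2D projection of keyword embeddings")
plt.savefig("manifold_example.png")
```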
To plot the cross-correlation of the models:

`python3 cross_correlation_matrix_model.py`
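A hedged sketch of what such a matrix can look like: the average cosine similarity between the embedded keywords of each pair of extractors, drawn as a heatmap. The extractor outputs below are invented, and the actual script may aggregate differently.

```python
# Sketch of a model cross-correlation matrix: average cosine similarity
# between the embedded keywords of each pair of extractors.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Hypothetical outputs of three extractors on the same corpus
extractor_keywords = {
    "yake": ["malware detection", "network traffic", "intrusion detection"],
    "keybert": ["malware", "network security", "anomaly detection"],
    "bert-ner": ["DDoS", "Linux", "OpenSSL"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")
names = list(extractor_keywords)
embs = {name: model.encode(kws) for name, kws in extractor_keywords.items()}

# Average cosine similarity over all keyword pairs of two extractors
matrix = np.array(
    [[cosine_similarity(embs[a], embs[b]).mean() for b in names] for a in names]
)

plt.imshow(matrix, cmap="viridis")
plt.xticks(range(len(names)), names, rotation=45)
plt.yticks(range(len(names)), names)
plt.colorbar(label="avg cosine similarity")
plt.tight_layout()
plt.savefig("cross_correlation_example.png")
```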
- When running the scripts on large data, you may run into issues such as long runtimes or excessive memory usage.
- Constants at the top of each script let you adjust its parameters (a hypothetical example is sketched below).
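The constant names and values below are illustrative only, not the repo's actual parameters:

```python
# Hypothetical configuration constants of the kind found at the top of the
# scripts; tune these before running on large corpora.
MAX_DOCUMENTS = 1000    # cap on the number of abstracts processed
TOP_N_KEYWORDS = 10     # keywords kept per document and extractor
BATCH_SIZE = 32         # lower this if the scripts use too much memory
```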
- Python 3.10
- Yake (KPE): https://pypi.org/project/yake/
- KeyBERT (KPE): https://pypi.org/project/keybert/
- Electra-base conll03 (NER): https://huggingface.co/bhadresh-savani/electra-base-discriminator-finetuned-conll03-english
- XLM-RoBERTa-base OntoNotes5 (NER + NUM): https://huggingface.co/asahi417/tner-xlm-roberta-base-ontonotes5
- BERT COCA-docusco (TokC): https://huggingface.co/browndw/docusco-bert
- BERT-large-cased conll03 (NER): https://huggingface.co/dslim/bert-large-NER
- DistilBERT-base-uncased conll03 (NER): https://huggingface.co/elastic/distilbert-base-uncased-finetuned-conll03-english
- RoBERTa-large conll03 (NER): https://huggingface.co/Jean-Baptiste/roberta-large-ner-english
- BERT-large-uncased conll03 (NER): https://huggingface.co/Jorgeutd/bert-large-uncased-finetuned-ner
- KBIR inspec (KPE): https://huggingface.co/ml6team/keyphrase-extraction-kbir-inspec
- KBIR kpcrowd (KPE): https://huggingface.co/ml6team/keyphrase-extraction-kbir-kpcrowd
- XLM-RoBERTa-large conll03 (NER): https://huggingface.co/xlm-roberta-large-finetuned-conll03-english
- BERT-base-uncased (NER + CON R): https://huggingface.co/yanekyuk/bert-uncased-keyword-discriminator
- BERT-base-uncased (KPE): https://huggingface.co/yanekyuk/bert-uncased-keyword-extractor
- spaCy-large OntoNotes5 (NnE): https://pypi.org/project/spacy/
- spaCy-transformers OntoNotes5 (NnE): https://pypi.org/project/spacy-transformers/
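To show how such extractors are typically loaded, here is a minimal sketch using one Hugging Face NER model from the list and spaCy (assuming `en_core_web_lg` as the OntoNotes5-trained large model); the sample sentence is invented.

```python
# Sketch of loading two of the listed extractors; output formats differ
# between Hugging Face pipelines and spaCy.
import spacy
from transformers import pipeline

text = "Attackers exploited an OpenSSL flaw to deploy ransomware on Linux servers."

# NER with a Hugging Face model from the list above
ner = pipeline("ner", model="dslim/bert-large-NER", aggregation_strategy="simple")
print([(e["word"], e["entity_group"]) for e in ner(text)])

# Named entities with spaCy (en_core_web_lg, trained on OntoNotes5)
nlp = spacy.load("en_core_web_lg")
print([(ent.text, ent.label_) for ent in nlp(text).ents])
```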
All the plots generated by the scripts can be found in the `results` folder. Here is some information about its contents:
- `cluster_coeff`: cluster coefficient of an extractor with a specific embedding
- `manifolds`: 2D projection of the extractors with a specific embedding
- `manifolds_html`: same as before, but the plots are interactive
- `model_correlation_*`: hierarchical clustering on the average cosine similarity between each pair of extractors (see the sketch below)
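A hedged sketch of that last step: hierarchical clustering over an average-cosine-similarity matrix, using distance = 1 - similarity. The `names` and `matrix` values are hypothetical.

```python
# Sketch of hierarchical clustering over an average cosine-similarity matrix,
# as in model_correlation_*; names and values are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

names = ["yake", "keybert", "bert-ner"]
matrix = np.array([[1.0, 0.8, 0.3],
                   [0.8, 1.0, 0.4],
                   [0.3, 0.4, 1.0]])  # hypothetical avg cosine similarities

distance = 1.0 - matrix
np.fill_diagonal(distance, 0.0)            # squareform requires a zero diagonal
Z = linkage(squareform(distance), method="average")

dendrogram(Z, labels=names)
plt.ylabel("1 - avg cosine similarity")
plt.savefig("model_correlation_example.png")
```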