0_LLM-based_entity_extraction_CySec

We use LLMs to extract relevant knowledge entities from cybersecurity-related texts. We use a subset of arXiv preprints on cybersecurity as our data and compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce good knowledge entities that reflect the cybersecurity context.

Installation

To install the dependencies when CUDA is available

python3 -m pip install -r requirements_cuda.txt --extra-index-url https://pypi.nvidia.com

Otherwise

python3 -m pip install -r requirements.txt
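If you would rather pick the requirements file automatically, the following is a minimal sketch. It assumes that finding the `nvidia-smi` executable on the PATH is a good enough proxy for CUDA being available; that is a heuristic, not something the repository checks. The helper name `pip_arguments` is hypothetical.

```python
import shutil


def pip_arguments(cuda_available=None):
    """Return the pip arguments for the matching requirements file.

    Assumption: `nvidia-smi` on the PATH is used as a proxy for CUDA.
    """
    if cuda_available is None:
        cuda_available = shutil.which("nvidia-smi") is not None
    if cuda_available:
        return ["install", "-r", "requirements_cuda.txt",
                "--extra-index-url", "https://pypi.nvidia.com"]
    return ["install", "-r", "requirements.txt"]


# Print the command instead of running it, so nothing is installed by accident.
print("python3 -m pip " + " ".join(pip_arguments()))
```

The printed line can then be copied into a shell once you have confirmed it matches your machine.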

Usage

To extract the keywords

python3 pipeline_eeke.py

To draw the manifolds with the different embeddings

python3 manifold.py

and

python3 manifold_plotly.py

for an interactive plot.

To plot the cross correlation of the models

python3 cross_correlation_matrix_model.py

Remarks

  • When running the scripts on large data, you may run into issues such as long run times or high memory use.
  • Each file has constants at the top that you can edit to change the script's parameters.
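One common way to keep memory use bounded on large inputs is to process the documents in fixed-size batches, with the batch size set by a constant at the top of the file. The sketch below only illustrates that pattern: `BATCH_SIZE`, `MAX_DOCUMENTS`, and `iter_batches` are hypothetical names, not the constants actually defined in the scripts.

```python
from itertools import islice

# Hypothetical tunables, mirroring the "constants at the top of the file" pattern.
BATCH_SIZE = 3       # number of documents handled at once
MAX_DOCUMENTS = 7    # cap on the total number of documents (None for no cap)


def iter_batches(documents, batch_size=BATCH_SIZE, limit=MAX_DOCUMENTS):
    """Yield lists of at most `batch_size` documents, stopping after `limit`."""
    iterator = islice(iter(documents), limit)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


abstracts = (f"abstract {i}" for i in range(10))  # placeholder texts
batch_sizes = [len(batch) for batch in iter_batches(abstracts)]
print(batch_sizes)  # [3, 3, 1]
```

Because `iter_batches` consumes a generator lazily, only one batch of documents is held in memory at a time.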

Requirements

  • Python 3.10

Model name

Plot data

All the plots generated by the scripts can be found in the results folder.

Here is some information about its contents:

  • cluster_coeff: Cluster coefficient of an extractor with a specific embedding
  • manifolds: 2D projection of the extractors with a specific embedding
  • manifolds_html: Same as manifolds, but the plots are interactive
  • model_correlation_*: Hierarchical clustering on the average cosine similarity between each pair of extractors
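To make the last item more concrete, here is a minimal pure-Python sketch of hierarchical clustering on average cosine similarity. It rests on assumptions that may not match the scripts: each extractor is represented by one embedding vector per document, the similarity between two extractors is the mean cosine similarity over documents, and clusters are merged with average linkage on the distance 1 - similarity. The extractor names and toy vectors are hypothetical, and this is not the code in cross_correlation_matrix_model.py.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_similarity(emb_a, emb_b):
    """Mean cosine similarity over documents (one vector per document)."""
    return sum(cosine(u, v) for u, v in zip(emb_a, emb_b)) / len(emb_a)


def average_linkage(names, sim):
    """Agglomerative clustering on distance 1 - similarity; returns the merges in order."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                dist = sum(1.0 - sim[a][b] for a, b in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], round(dist, 3)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy data: three hypothetical extractors, two documents, 2-D embeddings.
embeddings = {
    "extractor_a": [[1.0, 0.0], [0.0, 1.0]],
    "extractor_b": [[0.9, 0.1], [0.1, 0.9]],
    "extractor_c": [[0.0, 1.0], [1.0, 0.0]],
}
names = list(embeddings)
sim = {a: {b: average_similarity(embeddings[a], embeddings[b]) for b in names}
       for a in names}
merges = average_linkage(names, sim)
print(merges[0][:2])  # (('extractor_a',), ('extractor_b',))
```

In this toy example the two extractors with nearly identical embeddings are merged first, which is the kind of grouping a dendrogram over the similarity matrix would show.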