Towards Effective Paraphrasing for Information Disguise

This repository contains the code for our paper "Towards Effective Paraphrasing for Information Disguise", accepted at ECIR 2023.

Repository structure:

  • code/beam_search_code/Disguise Text.ipynb: shows how our model disguises a true sentence (the query)
  • code/beam_search_code/beam_helper: contains all the helper modules for our model
    • beam_utils.py: contains the code for single-level phrase substitution, beam search, constituency parse tree creation, etc.
    • synonyms_store.py: contains the code to get the synonyms of a term in the counter-fitted vector space
    • faiss_fetch.py: contains the code for initializing DPR and fetching the top K relevant documents
    • perplexity_calculation.py: contains the code initiating the perplexity calculation
    • fetch_use_scores.py: contains the code to compute the Universal Sentence Encoder (USE) embedding of a given piece of text
  • code/beam_search_code/counter-fitted-vectors.txt: counter-fitted vectors used for fetching synonyms
  • data/all_syns.json: contains the 10 nearest neighbours of every term in the dictionary (the nearest neighbours were calculated by running Facebook AI Similarity Search (FAISS) on the vectors in counter-fitted-vectors.txt)
  • sql_lite_dbs/<name>.db: expected location of the database containing the metadata and contents of the document store (used by DPR)
  • code/faiss_indexes/<name>.faiss: expected location of the FAISS index holding the vectors for the documents in the document store
  • code/faiss_indexes/exp_with_two_thou_short.json: expected location of the configuration file whose parameters describe how to read the ".faiss" file
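The nearest-neighbour lookup behind data/all_syns.json can be illustrated with a minimal sketch. It is not the repository's code: it swaps FAISS for a brute-force cosine search in NumPy, and it assumes each line of the vector file has the form `word v1 v2 ...`. The function names `load_vectors` and `nearest_neighbours` and the toy vectors are hypothetical.

```python
import numpy as np


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines (assumed format) into words and a unit-normalised matrix."""
    words, rows = [], []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    matrix = np.asarray(rows, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero vectors
    return words, matrix / norms


def nearest_neighbours(term, words, matrix, k=10):
    """Return up to k words closest to `term` by cosine similarity, excluding `term` itself."""
    idx = words.index(term)
    sims = matrix @ matrix[idx]  # cosine similarity, since rows are unit length
    order = [i for i in np.argsort(-sims) if i != idx]
    return [words[i] for i in order[:k]]


# Toy 2-dimensional vectors standing in for counter-fitted-vectors.txt.
toy = ["good 1.0 0.0", "great 0.9 0.1", "bad -1.0 0.0", "fine 0.7 0.7"]
words, matrix = load_vectors(toy)
neighbours = nearest_neighbours("good", words, matrix, k=2)
print(neighbours)  # ['great', 'fine']
```

For the real file, the same functions could be fed `open("counter-fitted-vectors.txt", encoding="utf-8")`, although a brute-force search over the full vocabulary is slower than a FAISS index.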

Requirements

The conda environment for this codebase is specified in adversarial_search.yaml (it can be created with `conda env create -f adversarial_search.yaml`). We use Haystack's DPR implementation.

Attack parameters that can be modified or passed to the BeamSearch class in beam_utils.py

Parameter name, followed by its description:
MAX_DEPTH
  • Number of levels in the beam search tree, i.e., the maximum number of phrase substitutions allowed in the query.
ALPHA_VAL
  • Weighting parameter that balances semantic similarity to the original query against locatability.
  • It is used in the calculation of score for a node in the BeamSearchTree.
  • See the section `Algorithm Explanations` in the paper for the details.
NUM_PERPLEXITY_NODES_TO_EXPAND
  • Number of nodes in the Constituency Parse Tree to be considered for attacking.
  • Corresponds to the parameter "P" in STEP 3 of Section 3.1 of the paper.
BeamWidth
  • Maximum number of nodes at each level of the beam tree.
NUM_FAISS_DOCS_TO_RETRIEVE
  • Maximum number of relevant documents to fetch for the query; the fetched documents are then checked for the presence of the source document.
SIMILARITY_CUT_OFF_THRESHOLD
  • Candidates whose similarity to the original sentence is below `SIMILARITY_CUT_OFF_THRESHOLD` are filtered out.
  • Corresponds to `epsilon` in the paper
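
To show how these parameters interact, here is a simplified beam search sketch. It is not the BeamSearch class from beam_utils.py: it omits the constituency parse tree and the perplexity-based node selection, and the scoring formula `alpha * similarity + (1 - alpha) * (1 - locatability)` is an assumption made for illustration, not the formula from the paper. The callables `expand`, `similarity`, and `locatability`, and the toy data, are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple


def beam_search(
    query: str,
    expand: Callable[[str], List[str]],
    similarity: Callable[[str, str], float],
    locatability: Callable[[str], float],
    max_depth: int = 3,               # plays the role of MAX_DEPTH
    beam_width: int = 2,              # plays the role of BeamWidth
    alpha: float = 0.5,               # plays the role of ALPHA_VAL
    similarity_cut_off: float = 0.5,  # plays the role of SIMILARITY_CUT_OFF_THRESHOLD
) -> Tuple[str, float]:
    def score(candidate: str) -> float:
        # Assumed scoring: reward similarity to the query and low locatability.
        return alpha * similarity(query, candidate) + (1 - alpha) * (1 - locatability(candidate))

    beam = [query]
    best = (query, score(query))
    for _ in range(max_depth):
        # One more substitution for every node currently in the beam.
        candidates = {c for node in beam for c in expand(node)}
        # Drop candidates that drift too far from the original query.
        kept = [c for c in candidates if similarity(query, c) >= similarity_cut_off]
        if not kept:
            break
        # Keep the highest-scoring nodes; ties are broken alphabetically.
        beam = sorted(kept, key=lambda c: (-score(c), c))[:beam_width]
        if score(beam[0]) > best[1]:
            best = (beam[0], score(beam[0]))
    return best


# Toy stand-ins: word-level synonyms, positional word overlap as similarity,
# and a fake retriever that locates the source only when "fox" is present.
SYNONYMS = {"quick": ["fast"], "fox": ["vixen"]}


def expand(sentence: str) -> List[str]:
    tokens = sentence.split()
    return [
        " ".join(tokens[:i] + [syn] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
        for syn in SYNONYMS.get(tok, [])
    ]


def similarity(a: str, b: str) -> float:
    pairs = list(zip(a.split(), b.split()))
    return sum(x == y for x, y in pairs) / len(pairs)


def locatability(sentence: str) -> float:
    return 1.0 if "fox" in sentence.split() else 0.0


best = beam_search("the quick fox", expand, similarity, locatability, max_depth=2)
print(best[0])  # the quick vixen
```

In this toy run the second level produces only "the fast vixen", whose similarity of 1/3 falls below the cut-off, so the search stops early and returns the best candidate from the first level.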