Building your own phrasebank. ✨
This repository provides an accessible phrase bank: a collection of frequently used phrases that can be used, for example, in the auto-complete function of an IDE. (Note: this library does not provide an IDE or auto-complete functionality; it offers ready-to-use phrasebanks.)
Moreover, this repository includes features for constructing a phrase bank from a provided text or an open corpus.
You can further customize the phrasebank to your needs, for example for a particular discipline, a particular style (descriptive, analytical, persuasive, or critical), or a particular section (abstract, body text), as long as you can find good source texts.
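To make the auto-complete use case concrete, here is a minimal sketch of prefix lookup over a sorted phrase list. It is not part of this library: the `complete` helper and the sample phrases are hypothetical, and a real editor integration would load one of the phrasebank files instead.

```python
from bisect import bisect_left

def complete(prefix, sorted_phrases, limit=5):
    """Return up to `limit` phrases that start with `prefix`.

    `sorted_phrases` must be sorted; matching phrases are then contiguous,
    so a binary search finds the first candidate.
    """
    start = bisect_left(sorted_phrases, prefix)
    matches = []
    for phrase in sorted_phrases[start:start + limit]:
        if not phrase.startswith(prefix):
            break
        matches.append(phrase)
    return matches

# Hypothetical sample entries standing in for a phrasebank file.
phrases = sorted([
    "as a result of",
    "as well as the",
    "in the context of",
    "in the present study",
    "on the other hand",
])

print(complete("in the", phrases))
```

Keeping the list sorted lets each lookup run as a binary search rather than a full scan, which matters once a phrasebank grows to several thousand lines.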
The Elsevier OA CC-BY corpus contains 40k articles from Elsevier's journals, spanning Arts, Business, STEM, and Social Sciences [1].
No. | Phrasebank | Source | N-gram Length | Lines | Comments |
---|---|---|---|---|---|
1 | academic_phrasebank | Book Academic Phrasebank 2014 | 2-5 | 2,190 | Extracted from PDF (Zhihao, 2024) |
2 | elsevier_phrasebank | Corpus Elsevier OA CC-BY 2020 | 2-6 | 3,792 | Extracted by n-gram (Zhihao, 2024) |
3 | bawe_1000.csv | Corpus British Academic Written English | 4-6 | 1,000 | Because the corpus is not openly accessible, only the 1,000 most frequent phrases are listed here (Zhihao, 2024) |
4 | academic_word_list | Academic Word List Coxhead (2000) | 1 | 570 | The 570 words for academic English (excluding the 2,000 most frequent words) |
5 | elsevier_awl | Nos. 2 and 4 | 2-6 | 994 | The Elsevier phrases that contain AWL words (Zhihao, 2024) |
6 | elsevier_ENVI_EART | No. 2 | 2-7 | 3,700 | Environment & Earth Science 3,700 collection (Zhihao, 2024) |
7 | elsevier_PSYC_SOCI | No. 2 | 2-7 | 3,700 | Social Science & Psychology 3,700 collection (Zhihao, 2024) |
8 | elsevier_MEDI | No. 2 | 2-7 | 3,700 | Medicine 3,700 collection (Zhihao, 2024) |
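Row 5 combines a phrasebank with the Academic Word List. How that file was actually produced is not documented here; the sketch below shows one plausible approach, assuming a simple whole-word match. The `contains_awl_word` helper and the sample data are hypothetical, and the three words are only a stand-in for the full 570-entry list.

```python
def contains_awl_word(phrase, awl_words):
    """Return True if any whitespace-separated token of `phrase` is in `awl_words`."""
    return any(token in awl_words for token in phrase.lower().split())

# Stand-in for the Academic Word List (the real list has 570 entries).
awl_words = {"analysis", "data", "significant"}

# Stand-in for lines read from a phrasebank file.
phrases = [
    "the analysis of the data",
    "on the other hand",
    "a significant effect on",
    "at the end of",
]

# Keep only the phrases that contain at least one AWL word.
awl_phrases = [p for p in phrases if contains_awl_word(p, awl_words)]
print(awl_phrases)
```

An exact-match filter like this misses inflected forms that are absent from the word list, so a stricter or looser rule (for example, lemmatizing tokens first) may be needed depending on how the list is stored.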
No. | Phrasebank | Source | N-gram Length | Lines | Comments |
---|---|---|---|---|---|
1 | google-10000-english | Google Books Corpus | 1 | 10,000 | The 10,000 most common English words from Google Books Corpus |
2 | Wordlist 1200.txt | Internet | 1 | 2,000 | The 2,000 most common English words |
No. | Phrasebank | Source | N-gram Length | Lines | Comments |
---|---|---|---|---|---|
1 | emoji | | 1 | 745 | (Zhihao, 2024) |
You can download the ready-made phrasebanks from the tables above. If you need a custom one, read on.
pip install openphrasebank
Below is an example based on n-gram frequency. More examples, such as extracting phrases from a PDF, are available in the documentation.
import openphrasebank as opb

# Load the corpus and tokenize the selected fields; tokens are cached to disk
tokens_gen = opb.load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by",
                                        subject_areas=['PSYC', 'SOCI'],
                                        keys=['title', 'abstract', 'body_text'],
                                        save_cache=True,
                                        cache_file='temp_tokens.json')

# Count n-grams for each length
# (assumes the call returns the frequencies keyed by n, as used below)
n_values = [1, 2, 3, 4, 5, 6, 7, 8]
ngram_freqs = opb.generate_multiple_ngrams(tokens_gen, n_values)

# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}

# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = opb.filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)

# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))

# Write the sorted phrases to a text file, one phrase per line
with open('../elsevier_phrasebank_PSYC_SOCI.txt', 'w', encoding='utf-8') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
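For readers who want to see what frequency-based extraction does conceptually, here is a minimal standard-library sketch. It is not the implementation behind `generate_multiple_ngrams` or `filter_frequent_ngrams`; the `count_ngrams` and `top_ngrams` helpers are hypothetical, and their `limit` and `min_freq` parameters only mirror the arguments above in spirit.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of `n` consecutive tokens, joined by spaces."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def top_ngrams(counts, limit, min_freq=1):
    """Take the `limit` most frequent n-grams, then drop those below `min_freq`."""
    return [(gram, c) for gram, c in counts.most_common(limit) if c >= min_freq]

# Tiny stand-in for a tokenized corpus.
tokens = "the results of the study show that the results of the survey".split()

bigram_counts = count_ngrams(tokens, 2)
frequent_bigrams = top_ngrams(bigram_counts, limit=5, min_freq=2)

for gram, count in frequent_bigrams:
    print(count, gram)
```

On a real corpus the limits and the minimum frequency trade coverage against noise: a low `min_freq` admits one-off word sequences, while a high one keeps only formulaic phrases.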
You can contribute either a phrasebank or code. See our contributing guidelines.
Phrasebank | Issues |
---|---|
academic_phrasebank | Because tables in the PDF file were not handled properly, many sentences were not extracted correctly. (zhihao) |
elsevier_phrasebank | |
Footnotes
1. Over 20 disciplines; see orieg/elsevier-oa-cc-by · Datasets at Hugging Face.