NCEAS-DF-Semantics-Project
- Contributors: Samantha Csik
- Contact: scsik@nceas.ucsb.edu
Overview
In order to improve data discoverablity within the Arctic Data Center (ADC), we are beginning to incorporate semantic annotations into the data curation process. A current need is to evaluate metadata across the ADC's data holdings for commonly used (and perhaps "semantically important") terms, which may provide useful for constructing and/or expanding upon currently referenced ontologies.
This repository provides code for:
- querying Arctic Data Center datapackage metadata (titles, keywords, abstracts, and entity- & attribute-level information)
- text mining and data wrangling necessary for extracting commonly used terms across various metadata fields
- visualizing term frequencies
Getting Started
Scripts are numbered in the order they are to be run.
Repository Structure
NCEAS-DF-Semantics-Project
|_code
|_old
|_reports
|_data
|_ADC_semantic_annotations_review
|_attributes_query_eatocsv
|_extracted_attributes
|_fullQuery2020-09-13
|_xml
|_identifiers
|_queries
|_old
|_text_mining
|_filtered_token_counts
|_unnested_tokens
|_weighted_scores
|_figures
Code
0_libraries.R
: packages required in subsequent scripts0_functions.R
: custom functions for data wrangling & plotting; information regarding function purpose and arguments is included in the script1a_queries.R
: uses solr query to extract package identifiers, titles, keywords, abstracts, authors from ADC data holdings1b_download_EA_metadata_by_identifier.R
: uses package identifiers to extract attribute-level information from ADC data holdings and returns data in tidy format (one attribute per row)2_unnest_tokens.R
: uses the tidytext package to separate titles, keywords, and abstracts into individual tokens, bigrams, and trigrams3_filterStopWords_count_tokens.R
: removes stop_words (commonly used words from established lexicons) and missing rows of information (NAs); calculates term (individual word, bigram, trigram) frequency counts, the number of unique identifiers those terms are found in, and the number of unique authors that use those terms4_plot_token_frequencies.R
: visualizes term frequencies, arranged by count and alphabetically5_calculate_weighted_scores.R
: calculates a single score to represent the "importance" of each term, taking into accout term frequency, prevalence across data packages, and number of unqiue authors using that term
Data
data/queries/fullQuery_titleKeywordsAbstractAuthors2020-09-28.csv
:
* solr query of all most recently updated Arctic Data Center holdings, as of 2020-09-28
identifier
:rightsHolder
:abstract
:author
:title
:keywords
:
data/attributes_query_eatocsv/extracted_attributes/fullQuery2020-09-13_attributes.csv
:
* data/text_mining/unnested_tokens
:
* data/text_mining/filtered_token_counts
:
* data/text_mining/weighted_scores
:
* Software
These analyses were performed in R (version 3.6.3). See SessionInfo for dependencies.
Acknowledgements
Work on this project was supported by: ...