howisonlab/software-mentions-dataset-analysis

change uuid to paper_id (and other naming changes)

Opened this issue · 8 comments

For the parquet files, can we standardize the column names a little?

e.g., mentions uses uuid, but papers uses paperId

Prefer all column names to be snake_case as well.

And can we rename columns like documentContextAttributes.created.value to purpose_created_doc_context and purpose_created_mention_context?
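A minimal sketch (in Python/pandas) of the kind of rename mapping meant here; the uuid → paper_id and paperId → paper_id renames are from above, the mention-context source column name is a guess:

```python
import pandas as pd

# Hypothetical rename mapping from the current names to the proposed snake_case ones;
# adjust to the actual columns in the parquet files.
RENAMES = {
    "uuid": "paper_id",
    "paperId": "paper_id",
    "documentContextAttributes.created.value": "purpose_created_doc_context",
    # source name below is a guess at the mention-context counterpart
    "mentionContextAttributes.created.value": "purpose_created_mention_context",
}

mentions = pd.read_parquet("mentions.parquet")  # illustrative path
mentions = mentions.rename(columns={k: v for k, v in RENAMES.items() if k in mentions.columns})
mentions.to_parquet("mentions_renamed.parquet", index=False)
```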

Although if we are going to follow the ERD we would have a new table

purpose_assessments(
  mention_id,
  purpose [used|created|shared],
  context [local|document],
  certainty_percent)

which raises the question of what the mention_id is; I guess it's a combination of paper_id and mention_index.
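A sketch of how that table and a composite mention_id could be derived from the current wide mentions file; the six wide column names and the key columns are assumptions, not the actual schema:

```python
import pandas as pd

mentions = pd.read_parquet("mentions.parquet")  # illustrative path

# Assumed composite key: mention_id = paper_id + mention_index
mentions["mention_id"] = (
    mentions["paper_id"].astype(str) + ":" + mentions["mention_index"].astype(str)
)

# Build the long purpose_assessments table from six assumed wide columns,
# one per (purpose, context) pair.
wide_cols = {
    ("used", "document"): "purpose_used_doc_context",
    ("used", "local"): "purpose_used_mention_context",
    ("created", "document"): "purpose_created_doc_context",
    ("created", "local"): "purpose_created_mention_context",
    ("shared", "document"): "purpose_shared_doc_context",
    ("shared", "local"): "purpose_shared_mention_context",
}

rows = []
for (purpose, context), col in wide_cols.items():
    rows.append(pd.DataFrame({
        "mention_id": mentions["mention_id"],
        "purpose": purpose,
        "context": context,
        "certainty_percent": mentions[col],
    }))
purpose_assessments = pd.concat(rows, ignore_index=True)
```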

Done in data model spec:

Proposed tables are in light blue. Other boxes are potential tables, but for now are essentially just primary keys.

[image: SoftCite Data Model (3)]

Actually ... accommodating parse_type:
[image: SoftCite Data Model (4)]

Now part of the spec; closing this and will export and upload once done with #18.

The "Parse" table above is debatable... won't include for now.

Looking at options, we could create a nested type for PurposeAssessment. Not going to pursue for now.

A Paper can have multiple Parses, each for a different source representation (pdf, jats, xml, etc.). A Parse results in identifying multiple SoftwareMentions (with full text context, raw and normalized name, version, url), each of which has a number of PurposeAssessments (used, created, etc. in both a local and a document context).

Paper has_many SourceFile
SourceFile has_many Parse
Parse has_many SoftwareMention
SoftwareMention has_many PurposeAssessment
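Read as a join path, the hierarchy would look something like the sketch below; the file paths and foreign-key column names are assumed for illustration:

```python
import pandas as pd

# Illustrative paths; assumed [table]_id foreign keys on each child table.
papers = pd.read_parquet("papers.parquet")
source_files = pd.read_parquet("source_files.parquet")
parses = pd.read_parquet("parses.parquet")
mentions = pd.read_parquet("software_mentions.parquet")
assessments = pd.read_parquet("purpose_assessments.parquet")

# Walk the has_many chain from Paper down to PurposeAssessment.
full = (
    papers
    .merge(source_files, on="paper_id")
    .merge(parses, on="source_file_id")
    .merge(mentions, on="parse_id")
    .merge(assessments, on="software_mention_id")
)
```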

A Paper can have multiple SourceFiles (e.g., pdf, jats, xml), each of which was parsed, resulting in multiple SoftwareMentions (with full text context, raw and normalized name, version, url). Each SoftwareMention has a number of PurposeAssessments, resulting from assessments about whether the software in the mention was used, created, etc. Some assessments draw on only the local context of the single mention, while others draw on all the mentions of that piece of software in the document.

A Paper (doi, title, ...) has many identified SoftwareMentions (software_name, full_text_where_found, ...), each of which has six PurposeAssessments (used, created, shared in local or document context).

Papers have metadata like doi, title, journal name.
SoftwareMentions have metadata like raw and normalized software name, url, version, as well as full text context snippet in which the mention was identified.
PurposeAssessments are about whether the software in the mention was assessed to be used, created, or shared by the paper. Each of these three was assessed in two ways: using just the local context of the single mention, and across the document, drawing on all the mentions of that piece of software in the paper.

(used, created, shared, in both a local mention context and a document context)

We should likely have descriptions of the columns stored as metadata. Need to look into how languages other than Go access this.
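One option (a sketch, assuming pyarrow on the Python side; the columns and descriptions are illustrative) is to attach descriptions as Parquet field metadata, which non-Go readers can then see:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write: attach a description to each field's metadata (illustrative columns).
schema = pa.schema([
    pa.field("paper_id", pa.string(),
             metadata={"description": "Identifier of the paper"}),
    pa.field("certainty_percent", pa.float64(),
             metadata={"description": "Certainty of this purpose assessment, 0-100"}),
])
table = pa.table({"paper_id": ["10.1234/abc"], "certainty_percent": [87.5]}, schema=schema)
pq.write_table(table, "example.parquet")

# Read back: any reader that exposes field metadata sees the descriptions.
read_schema = pq.read_schema("example.parquet")
print(read_schema.field("paper_id").metadata)  # {b'description': b'Identifier of the paper'}
```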

For ID columns, use the form [table]_id.

A difficult bit here is communicating what the primary key is for a table, as all of these have composite primary keys. We're going to merge each composite key into a single column but keep the original columns as well, so that we have a single primary key column. We're keeping the other columns (e.g., paper_id), even though they duplicate information in the new software_mention_id column, because they'll be a common GROUP BY target.
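A sketch of that design, assuming the composite key for the mentions table is (paper_id, mention_index):

```python
import pandas as pd

mentions = pd.read_parquet("software_mentions.parquet")  # illustrative path

# Merge the composite key into one primary-key column...
mentions["software_mention_id"] = (
    mentions["paper_id"].astype(str) + ":" + mentions["mention_index"].astype(str)
)

# ...but keep paper_id, since it remains a common GROUP BY target.
mentions_per_paper = mentions.groupby("paper_id")["software_mention_id"].nunique()
```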

Some open questions:

  • What is the default parse type? I'm assuming it's the one that lacks a specific label.
  • Is there a consistent threshold at which certainty_percent flips "is_purpose" between "true" and "false"? If so, we don't need to keep is_purpose; otherwise, it needs to stay as its own column.
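A quick way to check the second question (a sketch; it assumes is_purpose is stored as a boolean alongside certainty_percent, and the candidate thresholds are guesses):

```python
import pandas as pd

assessments = pd.read_parquet("purpose_assessments.parquet")  # illustrative path

# If is_purpose is always certainty_percent above some fixed cutoff,
# the column is redundant; otherwise it carries its own information.
for threshold in (50, 75, 90):
    consistent = (assessments["is_purpose"] == (assessments["certainty_percent"] > threshold)).all()
    print(threshold, consistent)
```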

Change "parse_type" to "source_file_type" since it's not really different parses, but different source files which happen to represent the same paper.