howisonlab/software-mentions-dataset-analysis

change uuid to paper_id (and other naming changes)

Opened this issue · 8 comments

For the parquet files, can we standardize the column names a little?

e.g., mentions uses uuid, but papers uses paperId

Prefer all column names to be snake_case as well.

And can we rename columns like documentContextAttributes.created.value to purpose_created_doc_context and purpose_created_mention_context?
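A minimal sketch (in Python/pandas) of the kind of rename mapping meant here; the uuid → paper_id and paperId → paper_id renames are from above, the mention-context source column name is a guess:

```python
import pandas as pd

# Hypothetical rename mapping from the current names to the proposed snake_case ones;
# adjust to the actual columns in the parquet files.
RENAMES = {
    "uuid": "paper_id",
    "paperId": "paper_id",
    "documentContextAttributes.created.value": "purpose_created_doc_context",
    # source name below is a guess at the mention-context counterpart
    "mentionContextAttributes.created.value": "purpose_created_mention_context",
}

mentions = pd.read_parquet("mentions.parquet")  # illustrative path
mentions = mentions.rename(columns={k: v for k, v in RENAMES.items() if k in mentions.columns})
mentions.to_parquet("mentions_renamed.parquet", index=False)
```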

Although if we are going to follow the ERD we would have a new table

purpose_assessments(
  mention_id,
  purpose [used|created|shared],
  context [local|document],
  certainty_percent)

which raises the question of what the mention_id is; I guess it's a combination of paper_id and mention_index.
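A sketch of how that table and a composite mention_id could be derived from the current wide mentions file; the six wide column names and the key columns are assumptions, not the actual schema:

```python
import pandas as pd

mentions = pd.read_parquet("mentions.parquet")  # illustrative path

# Assumed composite key: mention_id = paper_id + mention_index
mentions["mention_id"] = (
    mentions["paper_id"].astype(str) + ":" + mentions["mention_index"].astype(str)
)

# Build the long purpose_assessments table from six assumed wide columns,
# one per (purpose, context) pair.
wide_cols = {
    ("used", "document"): "purpose_used_doc_context",
    ("used", "local"): "purpose_used_mention_context",
    ("created", "document"): "purpose_created_doc_context",
    ("created", "local"): "purpose_created_mention_context",
    ("shared", "document"): "purpose_shared_doc_context",
    ("shared", "local"): "purpose_shared_mention_context",
}

rows = []
for (purpose, context), col in wide_cols.items():
    rows.append(pd.DataFrame({
        "mention_id": mentions["mention_id"],
        "purpose": purpose,
        "context": context,
        "certainty_percent": mentions[col],
    }))
purpose_assessments = pd.concat(rows, ignore_index=True)
```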

Done in data model spec:

Proposed tables are in light blue. Other boxes are potential tables, but for now are essentially just primary keys.

[image: SoftCite Data Model (3)]

Actually ... accommodating parse_type:
[image: SoftCite Data Model (4)]

Now part of the spec; closing this and will export and upload once done with #18.

The "Parse" table above is debatable... won't include for now.

Looking at options, we could create a nested type for PurposeAssessment. Not going to pursue for now.

A Paper can have multiple Parses, each for a different source representation (pdf, jats, xml, etc.). A Parse results in identifying multiple SoftwareMentions (with full text context, raw and normalized name, version, url), each of which has a number of PurposeAssessments (used, created, etc. in both a local and a document context).

Paper has_many SourceFile
SourceFile has_many Parse
Parse has_many SoftwareMention
SoftwareMention has_many PurposeAssessment
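Read as a join path, the hierarchy would look something like the sketch below; the file paths and foreign-key column names are assumed for illustration:

```python
import pandas as pd

# Illustrative paths; assumed [table]_id foreign keys on each child table.
papers = pd.read_parquet("papers.parquet")
source_files = pd.read_parquet("source_files.parquet")
parses = pd.read_parquet("parses.parquet")
mentions = pd.read_parquet("software_mentions.parquet")
assessments = pd.read_parquet("purpose_assessments.parquet")

# Walk the has_many chain from Paper down to PurposeAssessment.
full = (
    papers
    .merge(source_files, on="paper_id")
    .merge(parses, on="source_file_id")
    .merge(mentions, on="parse_id")
    .merge(assessments, on="software_mention_id")
)
```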

A Paper can have multiple SourceFiles (e.g., pdf, jats, xml), each of which was parsed, resulting in multiple SoftwareMentions (with full text context, raw and normalized name, version, url). Each SoftwareMention has a number of PurposeAssessments, resulting from assessments about whether the software in the mention was used, created, etc. Some assessments draw on only the local context of the single mention, while others draw on all the mentions of that piece of software in the document.

A Paper (doi, title, ...) has many identified SoftwareMentions (software_name, full_text_where_found, ...), each of which has six PurposeAssessments (used, created, shared in local or document context).

Papers have metadata like doi, title, journal name.
SoftwareMentions have metadata like raw and normalized software name, url, version, as well as full text context snippet in which the mention was identified.
PurposeAssessments are about whether the software in the mention was assessed to be used, created, or shared by the paper. Each of these three was assessed in two ways: using just the local context of the single mention, and across the document, drawing on all the mentions of that piece of software in the paper.

(used, created, shared, in both a local mention context and a document context)

We should likely have descriptions of the columns stored as metadata. Need to look into how languages other than Go access this.
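One option (a sketch, assuming pyarrow on the Python side; the columns and descriptions are illustrative) is to attach descriptions as Parquet field metadata, which non-Go readers can then see:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write: attach a description to each field's metadata (illustrative columns).
schema = pa.schema([
    pa.field("paper_id", pa.string(),
             metadata={"description": "Identifier of the paper"}),
    pa.field("certainty_percent", pa.float64(),
             metadata={"description": "Certainty of this purpose assessment, 0-100"}),
])
table = pa.table({"paper_id": ["10.1234/abc"], "certainty_percent": [87.5]}, schema=schema)
pq.write_table(table, "example.parquet")

# Read back: any reader that exposes field metadata sees the descriptions.
read_schema = pq.read_schema("example.parquet")
print(read_schema.field("paper_id").metadata)  # {b'description': b'Identifier of the paper'}
```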

For ID columns, use the form [table]_id.

A difficult bit here is communicating what the primary key is for a table, as all of these have composite primary keys. We're going to merge each composite key into a single column but keep the original columns as well, so that we have a single primary key column. We're keeping the other columns (e.g., paper_id), even though they duplicate information in the new software_mention_id column, because they'll be a common GROUP BY target.
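A sketch of that design, assuming the composite key for the mentions table is (paper_id, mention_index):

```python
import pandas as pd

mentions = pd.read_parquet("software_mentions.parquet")  # illustrative path

# Merge the composite key into one primary-key column...
mentions["software_mention_id"] = (
    mentions["paper_id"].astype(str) + ":" + mentions["mention_index"].astype(str)
)

# ...but keep paper_id, since it remains a common GROUP BY target.
mentions_per_paper = mentions.groupby("paper_id")["software_mention_id"].nunique()
```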

Some open questions:

  • What is the default parse type? I'm assuming it's the one that lacks a specific label.
  • Is there a consistent threshold at which certainty_percent flips "is_purpose" between "true" and "false"? If so, we don't need to keep is_purpose; otherwise, it needs to stay as its own column.
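A quick way to check the second question (a sketch; it assumes is_purpose is stored as a boolean alongside certainty_percent, and the candidate thresholds are guesses):

```python
import pandas as pd

assessments = pd.read_parquet("purpose_assessments.parquet")  # illustrative path

# If is_purpose is always certainty_percent above some fixed cutoff,
# the column is redundant; otherwise it carries its own information.
for threshold in (50, 75, 90):
    consistent = (assessments["is_purpose"] == (assessments["certainty_percent"] > threshold)).all()
    print(threshold, consistent)
```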

Change "parse_type" to "source_file_type" since it's not really different parses, but different source files which happen to represent the same paper.