ncihtan/htan-portal

Enable display and search of channel names

adamjtaylor opened this issue · 6 comments

Objective:

Implement a feature on the HTAN portal to display harmonized target names for multiplexed tissue imaging data. This aims to assist researchers in easily locating and identifying datasets with specific antibody markers.

User stories:

As a cancer researcher interested in HTAN multiplexed tissue imaging data, I want to view a list of antibody targets and channels for images on the HTAN portal and use filters to search for datasets based on these attributes, so that I can easily locate datasets with specific markers relevant to my research.

As a cancer researcher, I want to identify HTAN imaging datasets where antibodies CD45, CD8, and CD4 were targeted, so that I can specifically identify cytotoxic and helper T cell populations for my studies.

Background:

Currently, channel metadata is not easily exposed to or searchable by users. It was also not validated at ingestion, so it is poorly structured. @adamjtaylor is exploring an LLM approach with Llama 3 for harmonizing target names, which seems promising.
To support this work and provide an MVP solution for users, this issue focuses on creating a method to display these names effectively on the portal.

For the MVP:

  • Mapping File Creation:
    • We'll need a file that links entity IDs to channel names.
  • Portal Integration:
    • Add a new column in the file tab called Targets.
    • This column will list targets as a single string, like ['DNA', 'CD45', 'CD8', 'CD4'].
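To illustrate how a Targets column could support the CD45/CD8/CD4 user story, here is a minimal Python sketch. The record shape, the `targets` key, and the `has_all_targets` helper are hypothetical and not taken from the portal code; case-insensitive matching is also an assumption.

```python
from typing import Iterable


def has_all_targets(targets: Iterable[str], required: Iterable[str]) -> bool:
    """Return True if every required target is present (case-insensitive)."""
    available = {t.strip().lower() for t in targets}
    return all(r.strip().lower() in available for r in required)


# Hypothetical file records; "targets" stands in for the proposed Targets column.
files = [
    {"entityId": "syn1234", "targets": ["Target1", "Target2"]},
    {"entityId": "syn53284675", "targets": ["DNA", "CD8", "CD45", "CD4", "Ki-67"]},
]

# Keep only files where CD45, CD8, and CD4 were all targeted.
matches = [
    f["entityId"]
    for f in files
    if has_all_targets(f["targets"], ["CD45", "CD8", "CD4"])
]
print(matches)
```

Exact matching like this only works well once target names are harmonized; with the as-provided names, variants of the same marker would be missed.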

Looking Ahead:

Eventually, we want to incorporate these target names directly into the dataset metadata. Starting with this simpler display feature will help us lay the groundwork for future enhancements.

@inodb let's have a quick think about which mapping file setup would be best and about any backend changes needed to enable this. I am hoping this is simply a join operation between the mapping file and the master JSON.

One option would be a mapping file like this:

{
  "syn1234": ["Target1", "Target2"],
  "syn53284675": ["DNA", "CD8", "CD45", "CD4", "Ki-67"]
}

I think this is extensible enough to start with the original, as-provided target names and switch to harmonized ones in due course.
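The join mentioned above could look roughly like the following Python sketch. It assumes the master JSON can be read as a list of file records keyed by `entityId`; that structure, the `Targets` key, and the `add_targets` helper are assumptions for illustration, not the portal's actual data model.

```python
import json
from typing import Dict, List

# Mapping in the shape proposed above: entity ID -> list of target names.
mapping_json = """
{
  "syn1234": ["Target1", "Target2"],
  "syn53284675": ["DNA", "CD8", "CD45", "CD4", "Ki-67"]
}
"""


def add_targets(files: List[dict], mapping: Dict[str, List[str]]) -> List[dict]:
    """Left-join file records to the mapping; files without an entry get []."""
    return [{**f, "Targets": mapping.get(f["entityId"], [])} for f in files]


mapping = json.loads(mapping_json)

# Hypothetical stand-in for the file records in the master JSON.
files = [{"entityId": "syn53284675"}, {"entityId": "syn9999"}]

joined = add_targets(files, mapping)
print(joined)
```

Because the join only depends on the entity ID, swapping the as-provided names for harmonized ones later would mean replacing the mapping file without touching the join logic.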

The following BigQuery query gets us a table close to what we need:

SELECT 
    e.entityId,
    cm.Channel_Metadata_ID, 
    STRING_AGG(attribute.attributeValue, ", ") AS channel_names,
    
FROM 
    `htan-dcc.ISB_CGC_r5.channel_metadata` cm,
    UNNEST(cm.channel_attributes) AS attribute
INNER JOIN 
    `htan-dcc.released.entities_v5_1` e ON cm.Channel_Metadata_ID = e.channel_metadata_synapseId
WHERE 
    attribute.attributeName = 'Channel Name'
AND attribute.attributeValue NOT IN  ('Red','Green','Blue')
GROUP BY 
    cm.Channel_Metadata_ID, e.entityId
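For reference, rows from a query like the one above could be turned into the proposed mapping file with a small Python sketch such as this. The sample row, its `syn0000` channel metadata ID, and the `rows_to_mapping` helper are hypothetical; the sketch assumes `channel_names` is the ", "-joined string produced by `STRING_AGG` and that no channel name itself contains a comma.

```python
import json
from typing import Dict, Iterable, List


def rows_to_mapping(rows: Iterable[dict]) -> Dict[str, List[str]]:
    """Build {entityId: [channel names]} from query result rows."""
    mapping: Dict[str, List[str]] = {}
    for row in rows:
        # Split the aggregated string back into individual names.
        names = [n.strip() for n in row["channel_names"].split(",") if n.strip()]
        # An entity may appear in more than one row; append rather than overwrite.
        mapping.setdefault(row["entityId"], []).extend(names)
    return mapping


# Hypothetical row shaped like the query output (IDs are placeholders).
rows = [
    {
        "entityId": "syn53284675",
        "Channel_Metadata_ID": "syn0000",
        "channel_names": "DNA, CD8, CD45, CD4, Ki-67",
    }
]

mapping = rows_to_mapping(rows)
print(json.dumps(mapping, indent=2))
```

If channel names can contain commas, returning an array from the query (for example with `ARRAY_AGG`) instead of a joined string would avoid the split step.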

@inodb I'd like to move forward with discussing how to implement this portal side so I can ensure outputs are prepared correctly.

@adamjtaylor the BigQuery table looks good to me! We already have a way to pull from BigQuery directly and store the result, so I don't think you need to provide anything else.

OK. So I will look to push a new table back to BQ that has entityId, Channel_Metadata_ID, and a new column, harmonized_channel_names.

I'll point you to that once it is complete.