Classify website visitor queries into measurable categories

Chart and respond to trends in information seeking by classifying your web visitor queries against a taxonomy and an ontology, using completely anonymized data.

Search analytics data for large web sites is often too verbose and too inconsistent to analyze. One "portal" site studied receives around 150,000 "clicks" per month from search-engine results pages, and around 100,000 queries per month from internal site search. Examining the visitor queries reveals many variations on the same concepts, making the content difficult to analyze and summarize. For this reason, many web managers do not look for meaning in the search terms their site visitors use.

We should put our completely anonymized search queries into "buckets" of broader topics, so subject matter experts (SMEs) have a way to understand how customers are currently seeking information within that SME's topic area. With this view, the SMEs can examine whether existing content should be modified to improve its findability, and whether new content should be added to fill gaps in customer needs. Health and medical analytics managers can use the Unified Medical Language System (UMLS) Semantic Network and Semantic Groups as these buckets, to better serve customer information needs.
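A minimal sketch of the bucketing idea, assuming the query log has already been aggregated into (query, count) rows. The `TERM_TO_BUCKET` lookup, the `normalize` and `bucket_counts` helpers, and the sample rows are hypothetical placeholders standing in for curated tagging files; they are not the pilot's actual files or code. The bucket labels are borrowed from the UMLS Semantic Groups purely for illustration.

```python
from collections import Counter

# Hypothetical lookup: normalized search term -> broader topic ("bucket").
# In practice this would be loaded from curated tagging files.
TERM_TO_BUCKET = {
    "ibuprofen": "Chemicals & Drugs",
    "advil": "Chemicals & Drugs",
    "diabetes": "Disorders",
    "type 2 diabetes": "Disorders",
}

def normalize(query):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(query.lower().split())

def bucket_counts(rows):
    """Sum search counts per bucket; unmatched queries go to 'Untagged'."""
    totals = Counter()
    for query, count in rows:
        bucket = TERM_TO_BUCKET.get(normalize(query), "Untagged")
        totals[bucket] += count
    return totals

# Made-up aggregated log rows: (query text, number of searches).
rows = [("Ibuprofen", 40), ("advil ", 10), ("Type 2  Diabetes", 25), ("parking", 5)]
totals = bucket_counts(rows)
```

Exact-match lookup after normalization is the simplest possible design; it only shows how many surface variants can collapse into one reportable bucket.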

Search represents a direct expression of our customers’ intent. We should use this data to improve our staff’s awareness of what our customers need from us.

Use cases

  1. A web analyst could say to a product owner, "Did you know that last month, 30 percent of your home page searches were in some way about drugs? Should we take action on this? How might we improve task completion and reduce time on task for this type of information need?"
  2. We should cluster and analyze trends we know about. For multi-faceted topics that directly relate to our mission, we should create customized analyses to collect the disparate keywords people might search for into a single bucket. How can we create a better match between user interest and the content we manage for this topic? Where might we improve our site structure and navigation?
  3. We should focus staff work on new trends, as the trends emerge. When something new starts to happen that can be matched to our mission statement, we should deploy social media posts on the new topic immediately, and start new content projects to address the emerging information need.
  4. On a longer time scale, anyone publishing to the web might want to ask: how are we preparing to support voice search? Understanding how people search for information will help us understand how to adapt to this possible next-generation technology.
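The first and third use cases both come down to comparing each bucket's share of search volume across reporting periods. The sketch below is one way that might look, assuming per-bucket totals are already available for two months. The function names, the `min_gain` threshold of five percentage points, and all counts are made up for illustration.

```python
def bucket_shares(bucket_totals):
    """Return each bucket's share of total search volume as a fraction."""
    total = sum(bucket_totals.values())
    if total == 0:
        return {}
    return {bucket: count / total for bucket, count in bucket_totals.items()}

def emerging_buckets(previous, current, min_gain=0.05):
    """List buckets whose share grew by at least min_gain since last period."""
    prev_shares = bucket_shares(previous)
    curr_shares = bucket_shares(current)
    gains = {b: s - prev_shares.get(b, 0.0) for b, s in curr_shares.items()}
    return sorted(b for b, gain in gains.items() if gain >= min_gain)

# Made-up per-bucket totals for two consecutive months.
last_month = {"Chemicals & Drugs": 200, "Disorders": 700, "Untagged": 100}
this_month = {"Chemicals & Drugs": 300, "Disorders": 600, "Untagged": 100}

shares = bucket_shares(this_month)
rising = emerging_buckets(last_month, this_month)
```

With these invented numbers, the drug-related bucket holds 30 percent of this month's volume and is the only bucket flagged as rising, which is the kind of statement an analyst could bring to a product owner.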

Pilot project results

72% of search volume (for October 2019) was tagged with broader-topic names within 3 minutes, after multiple iterations that updated the tagging files. This was 205,633 of 282,387 searches (72%). Because the logs are already aggregated, this corresponds to 30,604 of 89,476 rows (34%). The untagged terms are those searched, on average, less than once per day over the month; they are often low-frequency, multiple-concept searches.
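The gap between the two percentages follows from aggregation: each row is one distinct query with a search count, so a small number of high-frequency rows can account for most of the volume. A short sketch of both measures, using a made-up toy log plus the pilot totals quoted above (the `coverage` helper is hypothetical):

```python
def coverage(rows):
    """rows: (search_count, is_tagged) pairs. Returns (volume_share, row_share)."""
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one row")
    total_volume = sum(count for count, _ in rows)
    tagged_volume = sum(count for count, tagged in rows if tagged)
    tagged_rows = sum(1 for _, tagged in rows if tagged)
    return tagged_volume / total_volume, tagged_rows / len(rows)

# Toy aggregated log: one frequent tagged query, three rare untagged ones.
toy = [(90, True), (5, False), (3, False), (2, False)]
volume_share, row_share = coverage(toy)

# Pilot totals as reported in the text (October 2019).
pilot_volume_share = 205_633 / 282_387  # about 0.728, reported as 72%
pilot_row_share = 30_604 / 89_476       # about 0.342, reported as 34%
```

In the toy log, tagging a single row out of four covers 90 percent of the volume, which mirrors the pilot's pattern of high volume coverage from a minority of rows.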

During the pilot we did not create supplemental files for the MetaMapLite or CSpell tools. Creating these files would improve results.

Screenshots

screen to upload file

UMLS Semantic Types Categories

Workflow

Only partially implemented during this Codeathon.


Dependencies

Tools

Yet to be integrated; may be useful:

  • Medical language abbreviations
  • Scikit-Learn multi-class classifier

Additional output

Search strings input used for MetaMap and FuzzyWuzzy

Future work

  • Implement a tagging interface that provides suggestions for untagged queries above a frequency threshold, to facilitate manual tagging.
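One possible shape for the suggestion step is sketched below. It uses Python's standard-library `difflib.get_close_matches` as a stand-in for a dedicated fuzzy matcher such as FuzzyWuzzy, so it runs without extra dependencies. The `suggest_tags` function, the `min_count` threshold of 30 (roughly once per day over a month), the `cutoff` of 0.8, and the sample data are all assumptions for illustration, not part of the pilot.

```python
from difflib import get_close_matches

def suggest_tags(untagged, tagged_terms, min_count=30, cutoff=0.8):
    """Suggest buckets for frequent untagged queries via close string matches.

    untagged: dict of query -> search count
    tagged_terms: dict of already-tagged term -> bucket name
    """
    suggestions = {}
    for query, count in untagged.items():
        if count < min_count:
            continue  # below the frequency threshold; not worth manual review
        matches = get_close_matches(query, list(tagged_terms), n=3, cutoff=cutoff)
        if matches:
            suggestions[query] = [(m, tagged_terms[m]) for m in matches]
    return suggestions

# Made-up data: known tagged terms and untagged queries with counts.
tagged_terms = {"ibuprofen": "Chemicals & Drugs", "diabetes": "Disorders"}
untagged = {"ibuprofin": 120, "diabetis": 45, "parking": 200, "ibuprophen": 3}
suggestions = suggest_tags(untagged, tagged_terms)
```

Here the two frequent misspellings each receive one suggested bucket, the frequent but unrelated query receives none and would stay in the manual queue, and the rare misspelling is skipped by the threshold. A person would still confirm each suggestion before it enters the tagging files.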

Influences and thanks

People

  • Dan Wendling, team lead, NLM/LO/PSD
  • Dmitry Revoe, NLM/NCBI/MGV
  • Victor Cid, NLM/LHC/CgSB
  • Laritza Rodriguez, NLM/LHC/CSB
  • Wenya Rowe, NLM/NCBCI/CBB
  • Rachit Bhatia, NLM/OCCS/STB

Past work

https://github.com/NCBI-Hackathons/Semantic-search-log-analysis-pipeline