datalad-datasets/ohbm2020-posters

Allow keyword search somehow?

nicholst opened this issue · 5 comments

I just created this CSV file of all of the keywords for each abstract for my own use.

Not sure if it's possible to integrate somehow, but thought I'd throw it your way.

Bash scratching below

# Fetch the listing of all posters
wget -q -O AllPosters.html 'https://ww4.aievolution.com/hbm2001/index.cfm?do=abs.pubSearchAbstracts'
# Extract the abstract numbers and the IDs passed to previewAbstract()
grep abstractnumber AllPosters.html | sed 's/^.*abstractnumber">//;s/<.*$//' > AbsNum.txt
grep javascript:previewAbs AllPosters.html | sed 's/^.*previewAbstract(//;s/).*$//' > PosterNum.txt
echo "AbsNum,PostNum,Keywords" > Keywords.csv
paste -d , AbsNum.txt PosterNum.txt | while IFS= read -r In ; do
    echo "$In"
    AbsNum=${In%,*}
    PostNum=${In#*,}
    # Fetch the individual abstract page
    wget -q -O out.html "https://ww4.aievolution.com/hbm2001/index.cfm?do=abs.viewAbs&abs=$PostNum"
    printf '%s,%s,' "$AbsNum" "$PostNum" >> Keywords.csv
    # Print the trimmed lines between "Keywords:" and the next <br />, then join them with "|"
    # (gsed is GNU sed under its usual macOS name; on Linux, plain sed is typically GNU sed)
    gawk '$0~/<br .>/{On=0};{sub(/^[ \t]+/, "");sub(/[ \t]+$/, "");if(On){print $0}};$0~/>Keywords:</{On=1}' out.html | grep -Ev 'div>|^$' | gsed -z 's/\n/|/g;s/|$//' >> Keywords.csv
    echo "" >> Keywords.csv
done
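
In case it helps anyone who just wants to search the file locally, here is a minimal sketch of a lookup over the resulting Keywords.csv. The function name `search_keywords` is made up for illustration, and it assumes the layout produced above: a header row, then AbsNum, PostNum, and a "|"-separated keyword field, which may itself contain commas.

```shell
# Hypothetical helper: print the rows of a keywords CSV whose keyword field
# contains TERM (case-insensitive substring match).
# Usage: search_keywords TERM [CSV_FILE]   (CSV_FILE defaults to Keywords.csv)
search_keywords() {
    term=$1
    csv=${2:-Keywords.csv}
    awk -F, -v term="$term" '
        NR == 1 { next }                      # skip the header row
        {
            kw = $0
            sub(/^[^,]*,[^,]*,/, "", kw)      # drop the first two columns
            if (index(tolower(kw), tolower(term)) > 0)
                print $1 "," $2 "," kw
        }' "$csv"
}
```

For example, `search_keywords connectivity` would list every row whose keywords mention "connectivity", whatever the capitalization.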

Adding a small-font tag list underneath each line would be useful, at least for Ctrl-F. I'm not sure how easy it would be to add it as a filter in search.

Just add a column (field in .json) and it will participate in search!

Should be really easy to add. Thank you @nicholst for the script! We could save that additional TSV, and since the poster number is there, it would be easy to add, as was already done for PDFs as of #27.
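
A rough sketch of the "save as TSV" step, assuming the Keywords.csv layout from the script above. The function name `keywords_csv_to_tsv` is hypothetical, and how the repository actually ingests the file is not shown here. Only the first two commas on each line are turned into tabs, so any commas inside the keyword field are left alone.

```shell
# Hypothetical helper: convert the three-column keywords CSV read from stdin
# to TSV on stdout, replacing only the first two commas on each line.
keywords_csv_to_tsv() {
    awk '{ sub(/,/, "\t"); sub(/,/, "\t"); print }'
}
```

It could be run as `keywords_csv_to_tsv < Keywords.csv > Keywords.tsv`; the header row is converted along with the data rows.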

Does anyone (@effigies @nicholst @rmarkello) have the interest/time to push this forward, or should I (but a bit later)?

I'll post here if I start on this. Have a couple other things I need to get to first. :-)

@yarikoptic can the additional field in the JSON be a list of strings, or does it need to be a single string to be searchable?

Closed by #54 ... great work team!!