Literature Search helper
Started work on this in wos_pubsearch.R
- Using package `rwos` to access the Web of Science API
- Due to recent changes to the Web of Science API, the more built-out package `wosr` no longer works.
- If we wanted to work on the API from scratch (even the lite version), we have to get in contact with Clarivate and see if we can get a key (not free)
- I tried making queries to match the doc in the google drive Lesley linked to us (https://drive.google.com/drive/folders/1ItdWmm83GI9sYnASZxdqFJEOngcDjN0W), and am consistently getting fewer results (only ~20 per practice).
- I just asked Lesley for an example query for one of the practices, so I'll be able to see if I'm missing something.
- We only have access to the "lite" version of the API, which is missing a few Indexes when compared to the website query. If my queries are correct, then this might explain the discrepancy
- After querying, I combined the results into a giant data frame (see the sketch at the end of this comment)
- columns are: uid, journal, issue, volume, pages, date, year, authors, keywords, doi, article_no, isi_id, issn, isbn
- doi has a lot of missingness, and I'm not really sure why, since the doi is available on the website for most of these.
- The google drive also has `references_for_app.csv`, so I wanted to make sure that all the papers in here were present in our giant data frame.
- This csv only has the citation (which isn't available in the big data frame), so I thought the best thing to do would be to pull out the DOI from the citation and try to match the DOIs.
- Only about half of the DOIs in `references_for_app` were in the giant data frame, which is pretty bad. The reason for a lot of it is DOI missingness in the results data frame.
- 2nd attempt was to pull out the title, first author, and year of publication from the citation. The idea was to first try matching by DOI, and if that doesn't work, try to match by these three columns. This is where I'm stopping off for now, because there are a few observations in `references_for_app.csv` that have the same title, first author, and year.
Question: Is there a reason these duplicates are in `references_for_app.csv`? See the Paper_id pairs (129, 204), (45, 310), (142, 95) for exact matches, and (27, 315), (55, 189), (231, 128) for almost exact matches.
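For reference, here's a minimal sketch of the querying/combining step described at the top of this comment. It assumes `rwos`'s `wos_authenticate()`/`wos_search()`/`wos_retrieve_all()` interface; the query strings below are placeholders, not the real per-practice queries from the google doc.

```r
library(rwos)
library(dplyr)

sid <- wos_authenticate()   # session ID for the (lite) WoS API

# Placeholder queries -- the real ones come from the google doc
queries <- c(
  cover_crops = 'TS=("cover crop*")',
  tillage     = 'TS=("conservation till*")'
)

results <- lapply(names(queries), function(practice) {
  res <- wos_search(sid, queries[[practice]])
  wos_retrieve_all(res) %>% mutate(practice = practice)
})

# One giant data frame with uid, journal, doi, etc. for every paper returned
giant_df <- bind_rows(results)
```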
@LesleyAtwood do you have thoughts on why there are duplicates?
@nathanhwangbo it sounds like automatically searching the API -- either with `rwos` or another R tool -- might not be feasible for free? It seems like it's pretty straightforward in Python: https://pypi.org/project/wos/. Do you think that's a more robust way to go? We'll only need this to pull a .bib file to match against the initial reference list, so we could in theory run the search with Python, export to R, and do filtering, etc. in R if that made more sense.
@swood-ecology & @nathanhwangbo , the 'duplicates' are a product of the four separate reviews. These papers satisfied the criteria for more than one review meaning they had both tillage treatments and cover crop treatments, per se.
Should we manage that by updating the literature search for each review separately, or as one big search?
For the reviewers' sake it makes more sense to update each review separately, but we will need to figure out a way to match papers across reviews so we don't double count the papers.
@nathanhwangbo , to help match references I added a refs_all_expanded.csv to google drive. Paper titles may be the best way to match provided the matching process isn't case sensitive. Also, we had to manually add a majority of the DOIs because they weren't included in the references.
Thanks for adding `refs_all_expanded`. I added it to the repo for easier use.
Sorry if my original post wasn't clear. We will be able to automatically search the API using the R package `rwos` -- the limitation is that we are only able to access the "lite" version, which only contains 4 out of the 10 editions that Web of Science indexes. The same limitation would apply to the Python package `wos`. However, this doesn't seem to be too much of an issue:
My current process to match the query with the doc is as follows:
- query Web of Science using `rwos` (using the queries from the doc in the google drive)
- match DOIs between `refs_all_expanded.csv` and the query
- for those without a DOI match, I match exact titles
- for those without a DOI match or a title match, I fuzzy-match the titles using Levenshtein distance (see the sketch after this list)
- note: a lot of papers had small typos in the titles, so fuzzy-matching did a lot of work. I used a cutoff of distance < 30, meaning two titles count as a match if they differ by fewer than 30 single-character edits. 30 was chosen by increasing the distance until I didn't see any more matches. Alternatively, we can try a higher threshold, but also require that the two matches have the same year.
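A minimal sketch of the DOI-then-title matching, assuming two data frames `refs` (from `refs_all_expanded.csv`) and `query` (the WoS results), each with `doi` and `title` columns -- those object names are just for illustration:

```r
library(dplyr)

refs  <- refs  %>% mutate(title_lower = tolower(title))
query <- query %>% mutate(title_lower = tolower(title))

# 1. papers that match on DOI
doi_match <- inner_join(refs, query, by = "doi", suffix = c("_ref", "_wos"))

# 2. for the rest, match lowercased titles by Levenshtein distance
#    (exact title matches show up here with distance 0)
unmatched <- anti_join(refs, query, by = "doi")
dists     <- adist(unmatched$title_lower, query$title_lower)  # edit-distance matrix

fuzzy_match <- tibble(
  original_title = unmatched$title,
  matched_title  = query$title[apply(dists, 1, which.min)],
  distance       = apply(dists, 1, min)
) %>%
  filter(distance < 30)   # cutoff chosen by inspection
```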
After this process, there are only 3 papers in `refs_all_expanded.csv` that aren't in our query.
Only one of these papers is excluded because we're using the "lite" version of the API. (it's this paper: http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=17&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=1 )
The other two are indexed by the API, so the problem has something to do with the query, not the fact that we're using the lite API. These are the two papers:
http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=12&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=2
http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=7&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=7.
To confirm that the query is the issue, I tried passing the query into the web of science website, and was unable to find either of these two papers. I'm wondering if maybe these papers have changed keywords since the last time you guys ran the query.
Questions:
- Can you double check that my process for searching the website matches what you guys did? I'm copy/pasting the text from the google doc into the fields. This example is for tillage, where I was looking for this paper: http://apps.webofknowledge.com/full_record.do?product=WOS&search_mode=GeneralSearch&qid=7&SID=8FpzdxFs27mqsXgFSaU&page=1&doc=7
- What's the process for narrowing the query results down to what you guys use in the report? We can expand the query to get these two papers, but it might not be worth it if it makes the filtering step more painful
@nathanhwangbo , will you please send me a list of the titles with typos? I'd like to fix those before the database is freely accessible.
@nathanhwangbo , I can't access any of the three links you sent because I'm off campus. Can you send me the paper titles?
From the image above, it looks like you searched WoS like we did. I'm not sure why there are two papers excluded from the query list. Once you send me those papers I can investigate.
We narrowed the query results in two stages. First, we reviewed all titles and abstracts and excluded ones that gave any indication that the paper would not match our criteria. If the paper passed the title and abstract review, then we downloaded the entire pdf and read the paper to determine if it matched our criteria. Data were extracted from the papers that met our criteria. It was a long process.
I don't think expanding the query to include the two rogue papers is worth it. There will already be quite a few papers to filter through, and we don't want to add to that part of the process.
Here's the file with the names I was able to fuzzy-match: https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/blob/master/title_fuzzymatch.csv.
I guess calling them typos is a little misleading -- most of them are just small formatting differences.
- `original_title` has the titles as they are in `refs_all_expanded.csv`
- `matched_title` has the titles as they originally come in the query
- The two columns that end in `_lower` are the same titles, but with everything lowercase. These were the actual columns used to do the match (so that the match would be case-insensitive)
Here are the paper titles for the three links, in the same order as above:
"Changes in water-extractable organic carbon with cover crop planting under continuous corn silage production" (doi: 10.4137/ASWR.S30708)
"Impact of corn residue removal on soil aggregates and particulate organic matter" (doi: 10.1007/s12155-014-9413-0)
"Site-specific nitrogen management of irrigated maize: yield and soil residual nitrate effects" (doi: 10.2136/sssaj2002.0544)
Based on the keywords, "Impact of corn residue removal on soil aggregates and particulate organic matter" wouldn't show up in the search. However, it does fit our criteria so we'll keep it.
I'm surprised "Site-specific nitrogen management of irrigated maize: yield and soil residual nitrate effects" doesn't show up in your search because it includes "variable rate application" as a keyword which is one of the Nutrient Management search terms. It also fits our criteria, so we'll keep it in the database.
When you run the fuzzy-match search do you get the same number of "Papers returned from initial search" as I report in the table I sent you?
No, they're slightly lower. I originally assumed that was just because we're not looking at the entire web of science library, but maybe something's going on with the query.
For reference, here's how the numbers compare:
practice | doc | query through R | query through website |
---|---|---|---|
cover crops | 354 | 351 | 363 |
tillage | 1130 | 1004 | 1029 |
pest mgmt | 134 | 139 | 140 |
nutrient mgmt | 738 | 722 | 738 |
I just found a new feature within Colandr that calculates the number of unique papers included in the search. While it didn't change much for 3 of the reviews, the # of papers included in the cover crop review matches your number (351).
I think the discrepancy comes down to how I initially searched the papers. If you recall, I accidentally excluded Illinois from my list of states and then had to run a search specifically for Illinois at a later date. When I merged the bib files in Colandr it didn't always remove duplicate papers (possibly due to slight differences in title cases or spaces). The numbers I report in the table are based off the paper counts in Colandr.
@Steve, do you think we're okay to move forward when there are a few papers missing in @nathanhwangbo's search? I honestly think it's something with Colandr.
On a slightly different note, Julien made a function in the past to find DOIs using article titles (it uses the package `rcrossref` to do it). Using this function, I was able to correctly find the DOI for 135 out of the 158 references in `refs_all_expanded.csv`.
- Among the remaining 23, a few of them are the same DOI, but with slightly different URLs (for example, you can add a 0 to certain parts of a DOI and have both versions link to the same place).
- 7 out of the 23 have typos in the original DOI, so that the link doesn't work. See the following table with the original DOI (as it's found in `refs_all_expanded.csv`) and what CrossRef found as the correct version.
Original Title | Original DOI | Corrected DOI |
---|---|---|
Crop rotation and tillage system effects on weed seedbanks | 10.1614/0043-1745(2002)050[0448:CRATSE]2.0CO;2 | 10.1614/0043-1745(2002)050[0448:cratse]2.0.co;2 |
Long-term tillage and drainage influences on soil organic carbon dynamics, aggregate stability and corn yield | 10.1080/0038768.2013.878643 | 10.1080/00380768.2013.878643 |
Soil microaggregate and macroaggregate decay over time and soil carbon change as influenced by different tillage systems | 10.2489/jswc.69.9.574 | 10.2489/jswc.69.6.574 |
Long-term tillage and drainage influences on greenhouse gas fluxes from a poorly drained soil of central Ohio | 10.2489/jswc.69.9.574 | 10.2489/jswc.69.6.553 |
Tillage and crop rotation impacts on greenhouse gas fluxes from soil at two long-term agronomic experimental sites in Ohio | 10.2489/jswc.69.9.543 | 10.2489/jswc.69.6.543 |
Tillage and cover cropping effects on soil properties and crop production in Illinois | 0.2134/agronj2016.10.0613 | 10.2134/agronj2016.10.0613 |
Soil organic carbon changes impacted by crop rotation diversity under no-till farming in South Dakota, USA | 10.2136/sssaj2016.14.0121 | 10.2136/sssaj2016.04.0121 |
(note: `refs_all_expanded.csv` has 272 unique DOIs. We were able to match 114 of them using the DOI from the automated query; the 158 number comes from 272 - 114.)
The only concern is the papers where the function links to a totally different paper. It'll be hard to catch these when we are looking at new papers. I modified Julien's function to include a tolerance parameter (i.e. letting us choose how close two titles have to match before we decide they're the same paper), and I've been playing around with trying to find a pretty "safe" parameter value without having to manually search for all the DOIs.
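A hedged sketch of what that title-to-DOI lookup with a tolerance parameter could look like with `rcrossref` -- the function name `find_doi` and its arguments are illustrative, not Julien's original implementation:

```r
library(rcrossref)

find_doi <- function(title, tolerance = 30) {
  # Ask CrossRef for its best match for this title
  hit <- cr_works(query = title, limit = 1)$data
  if (is.null(hit) || nrow(hit) == 0) return(NA_character_)

  # Only accept the hit if the titles are within `tolerance` edits of each other
  dist <- adist(tolower(title), tolower(hit$title[1]))
  if (dist <= tolerance) hit$doi[1] else NA_character_
}

# Example using a title from the table above
find_doi("Tillage and cover cropping effects on soil properties and crop production in Illinois")
```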
@LesleyAtwood and @nathanhwangbo I think moving forward is fine when it's just a few papers. The tillage search looked like it had a pretty big difference (1130 to 1004). Can that be chalked up to the Illinois search issue?
If you think Colandr is the issue, could we use the .bib file that came directly from WoS to match, rather than the .csv generated by Colandr?
Also, @nathanhwangbo I think what matters the most is getting the new search to match the old, whether that's by DOI or title. You shouldn't worry about correcting DOIs (unless it improves search matching) because it's not totally essential I be able to use `BibScan` to download papers. There will be a small enough number that I can do it manually if the DOI links are problematic.
I just ran the search through WoS for Tillage. Here are the results, which are much more similar to what I got back in 2018 than what @nathanhwangbo's table shows. I just copied and pasted the keywords I sent you into WoS and included or excluded Illinois.
@nathanhwangbo , maybe rerun your queries like I did and we can see if your results match mine. If they don't then double check that you included all the keywords for each topic.
What do you think makes for the difference between when you did it and when @nathanhwangbo did? Something to do with the automated vs manual search? That level of difference seems pretty good to me.
@swood-ecology I'm really not sure why our results differ. Clearly the automated search is dropping results, but I'm even more perplexed why his manual search results don't match mine. Hopefully we'll know more once @nathanhwangbo runs the queries again.
Found the culprit.
The difference between the two manual search results is a small difference in the queries. The first term of the Tillage Specific Query in the doc is `conservation till*`. The version in the doc (what was originally used) does NOT have quotes around it. My version has quotes (i.e. `"conservation till*"`).
Should I just stick with the old version? The difference is that the unquoted version is equivalent to `conservation AND till*`, while the quoted version looks specifically for the phrase `conservation till*` (see the sketch at the end of this comment).
- We have the same problem with the first term in Early Season Pest Management, which is `pesticide seed treatment*`
So that explains the difference between our WoS manual searches (1112 vs 1029). I double checked, and the difference between my WoS manual search and my WoS automated search (1029 -> 1004) is a direct result of querying using the Lite API (looking through fewer collections).
- note to self: The way the collections are abbreviated isn't straightforward. The Lite API only looks through SCI, ISTP, ISSHP, IC. These correspond to "Science Citation Index Expanded", "Conference Proceedings Citation Index - Science", "Conference Proceedings Citation Index - Social Sciences", "Index Chemicus" respectively
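To make the difference concrete, here's how the two variants would be passed to `rwos::wos_search()`. Only the disputed first term is shown, not the full Tillage query, and `sid` is a session from `wos_authenticate()`:

```r
library(rwos)
sid <- wos_authenticate()

# Unquoted: interpreted as conservation AND till*
res_unquoted <- wos_search(sid, 'TS=(conservation till*)')

# Quoted: matches the phrase "conservation till*" only
res_quoted <- wos_search(sid, 'TS=("conservation till*")')
```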
Nice catch! Is it hard to make a list of the papers that were included in the original but not the latter search? My hunch is that we want `"conservation till*"` and that any papers that weren't caught by `"conservation till*"` but were grabbed by `conservation till*` wouldn't have made it through Lesley's final screening process. But I guess it would be good to confirm that.
For tillage:
- Adding quotes to `conservation till*` decreases the number of papers in the automated search from 1085 to 1004.
- Out of these 81 removed papers, 2 of them are in our final reference list (i.e. made it past the screening process).

For Early Season Pest Management:
- Adding quotes to `pesticide seed treatment*` decreases the number of papers in the automated search from 141 to 139.
- Neither of those two papers made it past the screening process, so we're good here.
I think we're good on tillage because those two papers that made it into the final reference list are actually there because they're papers about `cover crops`, not `conservation tillage`. So they must have also come out in the `cover crops` search.
Ah, I didn't think about that! A quick check shows that you're right: both of the papers also show up in the cover crop query.
I'll go ahead and keep the quoted versions of the queries then
I just realized that the Web of Science API doesn't give us a full citation (let alone in BibTeX format), so I'm imagining the following workflow:
- Query the Web of Science API using the `rwos` package (automated, but the user specifies start/end dates)
- Pass titles into the `rcrossref` package to find a match in CrossRef. Separate out matches that aren't exact/super similar (automated)
- Do some kind of manual check to see if we can find a match for any of the non-exact matches
- Export the remaining titles to BibTeX format using `rcrossref` (see the sketch below)
- Do the rest in the BibScan package/Colandr? Can we automate any of the filtering process after getting the bib file?
The alternative to using `rcrossref` is building out a bib file manually from the information from the WoS API. But the WoS API is missing a ton of DOIs, which I think we need for BibScan.
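A minimal sketch of the export-to-BibTeX step, assuming we already have a vector of matched DOIs (the object name `matched_dois` and the output path are placeholders):

```r
library(rcrossref)

matched_dois <- c("10.2134/agronj2016.10.0613", "10.2136/sssaj2016.04.0121")

# cr_cn() asks CrossRef for a formatted citation; format = "bibtex" returns
# one BibTeX entry (as a character string) per DOI
bib_entries <- cr_cn(dois = matched_dois, format = "bibtex")

writeLines(unlist(bib_entries), "citations_new.bib")
```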
Started to implement this workflow in `wos_pubsearch_workflow.R`.
FYI, there are 289 different papers that show up if we run the queries for 2018-2019.
Question: What's the plan for filtering through the query results? Is Colandr going to be involved?
Colandr could be involved, but it's not necessary. The benefit of Colandr is that it helps keep the reviewer organized. I found, however, that the machine learning component of Colandr didn't really speed up the review process.
Because we don't have another option lined up, let's plan to use Colandr. Is there any way we can get both the bib files and pdfs to automatically upload into Colandr? They will have to load by review so that the filtering criteria don't become too overwhelming.
@nathanhwangbo those 289 papers are ones that didn't show up in the original search but do show up when you run the search up to the present day? That's a lot! Sounds like I'll have to carve out some time to go through those.
@LesleyAtwood I'm fine with using Colandr for this, but I agree that it really just gives us a home base for going through things, rather than a useful machine learning tool. I'm not sure how you'd auto-upload stuff to Colandr since you can't interact with it using code (as far as I know). I think we'd have to have the script auto-run in the background somewhere, ping us once there were a certain number of papers, then we'd have to take that .bib and upload it manually into Colandr.
I agree, 289 papers is a lot. I guess the topics are more popular than ever!
Yup, these 289 are completely new papers (it's actually 290 now 😄).
Out of these 290 papers, I wasn't able to find DOIs for 3 of them (the titles for these three are saved in the `matched_title_lower` column of `wos_cr_nomatch_20191002.csv`).
The rest are in `citations_20191001.bib`.
I'm starting to play with BibScan, but haven't had a high success rate for downloading pdfs yet.
Oh geez! Well I should get on reviewing those soon. @LesleyAtwood do you think we could load those into Colandr now and I could start reviewing?
@swood-ecology, Yes, the Colandr reviews are cleared and ready for the next batch of papers. It will probably be easiest to use the same framework I used because the selection criteria are already created and described. I can send you the selection criteria protocol by the end of the day. It's ready to go I just want to read over it again.
Great. Should we go over the first couple together just to make sure I have it right? Maybe on Monday? @nathanhwangbo do you have the searches saved to .bib files that we can load into Colandr? Also, do you think it would be possible to have a `cron` component to the script that would run it every month and let us know when there were 20 new papers?
@swood-ecology , Tuesday would be better for me. I'm free between 8-11 and after our SOC market meeting
I have the .bib files saved in the repo sorted by date of query (see here: https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/tree/master/auto_pubsearch/Bibfiles).
However, I tried to import one of them into Colandr and wasn't having any luck (I would try to import the file and nothing would happen). I'm totally new to Colandr, so it might just be that I'm missing a step. @brunj7 also tried with a different bibfile with the same result. He was able to get the import to work with an `.ris` file though.
A few other notes:
- I'm working on making `wos_pubsearch_workflow.R` a script that we can run as a cron job with an email alert system (see the sketch after this list). We'll see how that goes.
- I organized the file structure for this project in a folder called `auto_pubsearch`. All files should have a Year/Month/Day suffix.
  - `Bibfiles` contains the final `.bib` files for use.
  - `failed_matches` holds tables for all the papers in the query that didn't make it into the `.bib` file because I couldn't find enough information.
  - `wos_queries` holds data frames with the direct results from the Web of Science API.
- I'm not sure if you guys have any use for this, but I tried running the bib file through BibScan to see how well it performs. Out of 287 papers, 72 of them were successfully downloaded.
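One possible way to register the cron job from R is the `cronR` package. This is only a sketch under that assumption -- the path and schedule are placeholders, and the email-alert logic would live inside the script itself:

```r
library(cronR)

# Build the Rscript command for the workflow script
cmd <- cron_rscript("auto_pubsearch/wos_pubsearch_workflow.R")

# Run it once a month; the id makes the job easy to find and remove later
cron_add(cmd, frequency = "monthly", id = "wos_pubsearch",
         description = "Monthly Web of Science query + email alert")
```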
@nathanhwangbo this paper describes a pretty cool workflow for regularly updating data, like our .bib files.
@nathanhwangbo are the .bib files separated out by literature review? They didn't seem to be in the repo. You'll see that Lesley has 4 reviews, each of which corresponds to the search criteria for that review. I hope this isn't too complicated, but we'd need different .bib files for each review, rather than one overall .bib.
It's not a problem, but the reason I put all the reviews together was so that I could easily remove papers that are duplicated across reviews. Should I just leave duplicates in there?
I see what you mean. I would keep them separate, though, and stick with how Lesley has done it. We'd have to totally re-do Colandr to be able to read in only one file.
Ok, the files split by review are here (look for the 20191007 files): https://github.com/Science-for-Nature-and-People/Midwest-Agriculture-Synthesis/tree/master/auto_pubsearch/Bibfiles
Thanks @nathanhwangbo. I tried uploading the files to Colandr and was having the same problem with the references not being added. At first I wondered if it was a permissions issue and so tried to upload the .bib to a Colandr review for which I'm the owner (@LesleyAtwood owns the AgEvidence reviews). That didn't work.
Then I was wondering if there might be something different about the .bib files that you're writing from R vs what's downloaded directly from WoS. One thing I noticed is that the .bib files you generated don't have all of the information the WoS .bib has, including things like the full abstract, which is needed for the initial screening.
Do you think this is reconcilable, or do you think we should be thinking about a workflow where perhaps the `cron` process flags for us when there's a certain number of new papers, but then we do the actual search and .bib extraction manually within Web of Science?
Also, @LesleyAtwood identified some differences between the references your search generated vs the original approach. Here's your reference:
@article{Snapp_2018, doi = {10.1016/j.still.2018.02.018}, url = {https://doi.org/10.1016%2Fj.still.2018.02.018}, year = 2018, month = {aug}, publisher = {Elsevier {BV}}, volume = {180}, pages = {107--115}, author = {Sieglinde Snapp and Sowmya Surapur}, title = {Rye cover crop retains nitrogen and doesn't reduce corn yields}, journal = {Soil and Tillage Research} }
Here's Lesley's original:
@article{ ISI:000414816800024, Author = {Sivarajan, S. and Maharlooei, M. and Bajwa, S. G. and Nowatzki, J.}, Title = {{Impact of soil compaction due to wheel traffic on corn and soybean growth, development and yield}}, Journal = {{SOIL \& TILLAGE RESEARCH}}, Year = {{2018}}, Volume = {{175}}, Pages = {{234-243}}, Month = {{JAN}}, Abstract = {{As the size and weight of agricultural equipment have increased significantly in the past few decades, the severity and depth of compacted zone may have increased proportionately. Past research indicates that soil compaction affects crop growth and grain yield. Very few studies have been conducted in North Dakota (ND) to understand soil compaction under the current machinery, and its effect on crop growth and yield. The research was conducted on a no-till crop field at Jamestown, ND, USA for 2013 (corn) and 2014 (soybean) growing season. The objective of this study was to evaluate the effect of wheel traffic on soil strength indices and its impact on crop emergence, development and yield. The study also evaluated the effect of winter freezing thawing cycle on soil compaction in the study field. The experiment consisted of five soil transects and two traffic conditions based on machinery traffic in the field for both years such as most trafficked (MD rows and least trafficked (LT) rows, laid out in a randomized complete block design with three replicates in strip-plot with space for corn season in 2013, and for soybean season in 2014. Data collected included soil resistance or cone index (CI), soil bulk density, soil moisture content, plant emergence, plant height and grain yield. The results showed that CI values followed a similar pattern for different soil transects up to 37.5 cm depth and then increased sharply. An average CI of 1.19 MPa was noted over the whole profile at 0-45 cm depth for the study area and not significantly different between MT and LT rows for both years. Moderate compaction resulted in early emergence of corn plants in MT rows by 175\% compared to LT rows. The plant height didn't show any significant difference between MT and LT rows for both years. The yield data showed significant difference between the soil transects, but no difference was observed between MT and LT rows in both 2013 and 2014 season. The interactions between soil transects and traffic conditions were not significantly different for all soil and plant related dependent variables. The freeze-thaw cycle occurred during winter from 2013 to 2014 and 2014 to 2015 alleviated soil resistance over the whole soil profile at 0-45 cm depth. Results show that different crops grown in a no till field are not very much influenced by wheel traffic. The study also suggests that moderate compaction occurred after harvest in a no till field could be alleviated by the effect of freeze thaw cycle.}}, DOI = {{10.1016/j.still.2017.09.001}}, ISSN = {{0167-1987}}, EISSN = {{1879-3444}}, Unique-ID = {{ISI:000414816800024}}, }
Thanks for doing that testing.
Can you do similar testing to show how you imported your Wisconsin test file (`savedrecs-2.bib`) into Colandr? I'm not having any luck, even with this file. Once I figure out how to import a file into Colandr, I'll be able to test what information Colandr minimally needs to accept a reference.
In general, though, we are able to get most of the information in the original references -- the only fields we can't get are Abstract and EISSN.
That being said... if abstracts are required in the `.bib` file, then it might be better to create a workflow where you guys manually do the query and get `.bib` files from Web of Science (as you suggested). The Web of Science API doesn't give us abstracts, so I'm trying to grab them from the CrossRef API. While this workflow worked well for DOIs (matching all but 3 papers), it's performing poorly for abstracts (out of the 288 papers in the 2018-2019 queries, CrossRef was only able to find 37 abstracts).
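For reference, a sketch of that abstract lookup, assuming `rcrossref`'s `cr_abstract()` and a `matched_dois` vector (both the vector name and the error-handling wrapper are illustrative -- `cr_abstract()` errors when no abstract is deposited with CrossRef, hence `possibly()`):

```r
library(rcrossref)
library(purrr)

# Return NA instead of an error when CrossRef has no abstract for a DOI
get_abstract <- possibly(cr_abstract, otherwise = NA_character_)

abstracts <- map_chr(matched_dois, get_abstract)
sum(!is.na(abstracts))   # only ~37 of the 288 DOIs came back with an abstract
```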
That's too bad you can't get abstracts, because those are definitely a must-have. They pop up in Colandr and allow us to screen the papers within Colandr. So maybe we should think about the manual workflow. Do you think there's anything we could automate? Like doing the search automatically through `cron` and pinging us as a reminder when we should do the manual search?
I'm not sure what's up with Colandr not importing those .bib files. Let me quickly email the creator and see if she has any idea.