Any link inputted gets summarized.
Aadityaa2606 opened this issue · 2 comments
bug Description
If we input any link (e.g. www.google.com), the summariser treats it as an article link and summarises it.
To Reproduce
Steps to reproduce the behavior:
- Go to https://aisummariser.oxlac.com/ or clone and run the local dev environment.
- Copy and paste an irrelevant link.
- Click on Go.
- See the irrelevant summary.
Expected behavior
Irrelevant links should not be accepted. If the user tries to submit an irrelevant link, show them an error similar to this
Possible approaches
- Make a list of all legitimate news article providers, and if the link doesn't start with one of those parent URLs, show the error toast.
- A better way is to make a web crawler that checks whether a URL is a legitimate article. For more insight, check this out: https://www.quora.com/Is-there-a-News-API-web-crawler-to-determine-if-URL-is-an-article-or-navigation-page
- Find any existing APIs that can return the type of content at the URL, and use that information to filter out the non-article URLs.
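The first approach (an allowlist of news providers) could look roughly like the sketch below. The domain list and the function name `is_allowed_news_url` are made up for illustration and are not part of the project; a real list would need to be curated.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the real set of providers is an assumption here.
ALLOWED_NEWS_DOMAINS = {"bbc.com", "reuters.com", "theguardian.com"}


def is_allowed_news_url(url: str, allowed=ALLOWED_NEWS_DOMAINS) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    # Users often paste links without a scheme (e.g. "www.google.com"),
    # and urlparse only finds the host when a scheme is present.
    if "://" not in url:
        url = "https://" + url
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Match the domain itself or any subdomain, but not look-alikes
    # such as "notbbc.com".
    return any(host == d or host.endswith("." + d) for d in allowed)
```

Comparing the parsed host rather than doing a plain string prefix check avoids accepting links that merely contain a provider's name somewhere in the URL.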
claim
The fix proposed by @rnavaneeth992 is a really good approach, but it does not catch every irrelevant link, so I am reopening the issue for other contributors to make additional improvements to the detection system on top of the existing approach!
Explanation of the Fix:
- The previous fix added a try-catch block after querying the URL. The block raises an HTTP error if the HTTP request returned an unsuccessful status code, which means that a link that returns no HTTP error is never detected as irrelevant.
- A second check was also added: if the summary content is empty (length 0), the URL is treated as invalid.
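To make the gap concrete, the two checks described above can be modelled as one small function. This is only a sketch of the logic as described in this comment, not the project's actual code; the name `rejected_by_current_checks` and its parameters are hypothetical.

```python
def rejected_by_current_checks(status_code: int, summary: str) -> bool:
    """Model of the existing checks: HTTP failure or an empty summary."""
    # Check 1: an unsuccessful HTTP status (4xx or 5xx) raises an error.
    if status_code >= 400:
        return True
    # Check 2: an empty summary marks the URL as invalid.
    if len(summary.strip()) == 0:
        return True
    # Anything else is accepted, whether or not it is an article.
    return False
```

Any page that loads successfully and yields some text passes both checks, which is why non-article pages can still get summarised.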
Additional improvements that can be made:
- Right now, the existing approach prevents a few links such as www.google.com from being summarised, but many other sites still get through, for example https://www.linkedin.com/feed/, https://github.com/ and https://www.udemy.com/.
- We need a concrete method that separates news articles from normal websites, to prevent irrelevant results and make the web application more reliable.
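One possible starting point, offered as a heuristic rather than a complete solution, is to look at the page's Open Graph metadata: many news sites declare `<meta property="og:type" content="article">`. The sketch below uses only the Python standard library and works on HTML that has already been fetched; the names `_OgTypeParser` and `looks_like_article` are hypothetical. Articles on sites that omit the tag would be wrongly rejected, so this would need to be combined with other signals.

```python
from html.parser import HTMLParser


class _OgTypeParser(HTMLParser):
    """Collects the value of the og:type meta tag, if the page has one."""

    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:type":
            self.og_type = (attr.get("content") or "").strip().lower()


def looks_like_article(html: str) -> bool:
    """Heuristic: True only if the page declares og:type as 'article'."""
    parser = _OgTypeParser()
    parser.feed(html)
    return parser.og_type == "article"
```

A page with no `og:type` tag, or one that declares `website`, would then trigger the error instead of being summarised.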