Exam schedules have too much data and bad logic

Question

Closed this issue 4 years ago · 0 comments

Describe the bug

Schedules are scraped from here. The site has greyed-out segments at the bottom that are kept for archive purposes. These are also populated in the app, which clutters it.
The parsing logic is also flawed: as soon as it finds a <p> node, it looks for a <ul> node, and whatever is inside the first <ul> that comes after the <p> tag is populated as the exams under that category.
Example (assume the scraped content looks like this):

<p>Arts & Science Supplementary Examinations April/ May - 2020</p>
<ul></ul> // Empty UL
<ul>
<li>Schedule for II Semester Int M Sc & BA Ell (2015 2016 & 2017 Batches) April 2020<a href="https://intranet.cb.amrita.edu/sites/default/files/II%20sem%20Int%20MSc%20BA%20English%282015%2016%20%26%2017%29.pdf">Download »</a></li>
<li>Schedule for II & IV Semester BA Mass Comm(2015 2016 & 2017 Batches) April 2020  <a href="https://intranet.cb.amrita.edu/sites/default/files/II%20sem%20Int%20MSc%20BA%20English%282015%2016%20%26%2017%29.pdf">Download »</a></li>
</ul>
<p> .......................... </p>
<ul>
.
.
</ul>

In this case, the app reads the empty <ul> and shows an empty list for that category.

Expected behavior

The archived entries should either be left out of the app or be kept in a separate section called Archives.
For the example above, the second (non-empty) <ul> is the one that should be read. Logically, everything between two <p> tags should be read and treated as part of the first <p> tag's category.
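
A rough sketch of the grouping I have in mind, in Python with only the standard library. This is not the app's actual code: the names ScheduleParser and parse_schedules and the sample markup are made up for illustration, and it assumes categories are always plain <p> tags followed by one or more <ul> lists.

```python
from html.parser import HTMLParser


class ScheduleParser(HTMLParser):
    """Collects every <li> between two <p> tags under the first <p>."""

    def __init__(self):
        super().__init__()
        self.categories = []  # list of (category title, [(exam title, href)])
        self._in_p = self._in_li = self._in_a = False
        self._text = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._text = True, []
        elif tag == "li":
            self._in_li, self._text, self._href = True, [], None
        elif tag == "a" and self._in_li:
            self._in_a = True
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # A closed <p> starts a new category with no exams yet.
            self._in_p = False
            self.categories.append(("".join(self._text).strip(), []))
        elif tag == "a":
            self._in_a = False
        elif tag == "li" and self._in_li:
            self._in_li = False
            if self.categories:  # skip items that appear before any <p>
                title = " ".join("".join(self._text).split())
                self.categories[-1][1].append((title, self._href))

    def handle_data(self, data):
        # Keep category text and exam text, but not the link label.
        if self._in_p or (self._in_li and not self._in_a):
            self._text.append(data)


def parse_schedules(html):
    parser = ScheduleParser()
    parser.feed(html)
    parser.close()
    return parser.categories


# Hypothetical markup with the same shape as the example above.
SAMPLE = """
<p>Category A</p>
<ul></ul>
<ul>
<li>Exam 1 <a href="https://example.com/a.pdf">Download</a></li>
<li>Exam 2 <a href="https://example.com/b.pdf">Download</a></li>
</ul>
<p>Category B</p>
<ul><li>Exam 3 <a href="https://example.com/c.pdf">Download</a></li></ul>
"""

if __name__ == "__main__":
    for category, exams in parse_schedules(SAMPLE):
        print(category, exams)
```

Because exams are attached to the most recent <p> rather than to the first <ul> after it, the empty <ul> simply contributes nothing and the following <ul> still lands under the right category.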

Additional context

<p> tags contain the exam categories, and <ul> tags contain the exams listed under each category.
Ping me here if you still have any doubts.