anufrievroman/calcure

Improve time of parsing larger .ics files

Closed this issue · 9 comments

The ICS file of my personal Google calendar, which includes entries dating back to 2012, is approximately 680 KB and contains nearly 1700 events. python-ics takes approximately 14 seconds to parse all entries, which considerably delays the startup of calcure.
It would be great if calcure gained the ability to optionally filter entries and discard those that are too far from today's date.

This is a very basic (and likely buggy) implementation that only considers entries from the current year:

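# loaders.py
import datetime

    def read_file(self, path):
    ...
        with open(path, 'r', encoding="utf-8") as file:
            lines = self.read_lines(file)
            # filter_lines() returns a single string, ready for the parser
            return self.filter_lines(lines.splitlines())

    def filter_lines(self, lines):
        """Drop every VEVENT whose DTSTART value is not in the current year."""
        current_year = str(datetime.date.today().year)
        event_lines = []  # lines of the event currently being read
        out_lines = []    # lines that will be kept
        in_event = False
        it = iter(lines)
        for line in it:
            # Compare with ==; "line in 'BEGIN:VEVENT'" is a substring test
            # that is also true for empty lines.
            if line.strip() == 'BEGIN:VEVENT':
                in_event = True
                event_lines.append(line)
                continue

            if not in_event:
                out_lines.append(line)
                continue

            # The date is the part after the last colon, e.g. 20230224T100000
            if (line.startswith('DTSTART')
                    and not line.rpartition(':')[2].startswith(current_year)):
                # Wrong year: discard the event and skip to its END:VEVENT
                event_lines = []
                in_event = False
                for skipped in it:
                    if skipped.strip() == 'END:VEVENT':
                        break
                continue

            event_lines.append(line)
            if line.strip() == 'END:VEVENT':
                out_lines.extend(event_lines)
                event_lines = []
                in_event = False
        return '\n'.join(out_lines)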

In my case it resulted in a greatly reduced file size and parse time:

# import ics; ics.Calendar(open('file').read())
Time to read 679 kB file: 12.71 s
Time to read 62 kB file: 0.91 s

Calcure loading time:

original file: 13.2 s
reduced file: 1.25 s
no external ics file read: 0.3 ms

Limiting my view to only events from the current year is a price I am willing to pay if the program starts in about a second.

It may also be worth mentioning that perl is able to process the ICS file significantly faster:

use iCal::Parser;

my $parser=iCal::Parser->new();
my $hash=$parser->parse(shift);
# original file: 5.4 s
# reduced file: 0.4 s

Hi, sorry for the delay. I think your idea looks pretty good, and I'll implement it with some user parameters to control the range. Feel free to make a PR with this snippet.

About perl, it's cool, but I'd prefer not to introduce an additional dependency to improve a niche feature, so let's keep it in python.

P.S. I do wish the range feature were implemented directly in the ics library, though.
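
A minimal sketch of what a user-controlled range could look like, working on the raw text before it reaches the parser. The names `filter_events`, `days_before`, and `days_after` are hypothetical, and the choice to always keep events that carry an RRULE (so that old recurring events are not lost) is an assumption for illustration, not calcure's actual behaviour.

```python
import datetime


def unfold(ics_text):
    """Join folded content lines (a continuation line starts with a space or tab)."""
    lines = []
    for raw in ics_text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def dtstart_date(line):
    """Return the date of a DTSTART content line, or None if it cannot be read."""
    value = line.rpartition(":")[2].strip()
    try:
        return datetime.datetime.strptime(value[:8], "%Y%m%d").date()
    except ValueError:
        return None


def filter_events(ics_text, today=None, days_before=365, days_after=365):
    """Keep only VEVENTs that start within the given window around `today`."""
    today = today or datetime.date.today()
    earliest = today - datetime.timedelta(days=days_before)
    latest = today + datetime.timedelta(days=days_after)

    kept = []
    event = None  # lines of the VEVENT being read, or None outside an event
    in_range = recurring = False
    for line in unfold(ics_text):
        name = line.strip()
        if name == "BEGIN:VEVENT":
            event, in_range, recurring = [line], True, False
        elif event is None:
            kept.append(line)  # calendar header, VTODO, VTIMEZONE, ...
        else:
            event.append(line)
            if line.startswith("DTSTART"):
                start = dtstart_date(line)
                # An unreadable start date keeps the event rather than dropping it
                in_range = start is None or earliest <= start <= latest
            elif line.startswith("RRULE"):
                recurring = True
            elif name == "END:VEVENT":
                if in_range or recurring:
                    kept.extend(event)
                event = None
    return "\n".join(kept)
```

Calling `filter_events(text, days_before=30, days_after=365)` would keep the last month and the coming year. The output is unfolded, so long lines are no longer wrapped.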

> About perl, it's cool, but I'd prefer not to introduce an additional dependency to improve a niche feature, so let's keep it in python.

The reason I mentioned perl is that it might be better to look for other tools that can filter events (if there are any), ideally by multiple criteria, and then optionally have calcure run them after load.

Today I made a (hopefully) interesting discovery. First, I configured vdirsyncer to sync the data from the Google Calendar .ics file to a local vdir. The original 680 kB .ics file was converted into 1681 individual .ics files stored in a single directory, and the combined size of all files grew to 2.7 MB (some extra lines/sections were added).

Then I installed khal and ran ikhal to start an interactive session. The first start took 4.67 seconds (which is already quite decent), and because khal uses sqlite for caching (?), every subsequent run started in under a second.

To me it seems there's definitely a lot of room for improvement.

As I was digging through the khal code, I noticed that, unlike calcure, it uses icalendar (not ics). I decided to make a simple comparison, and the differences are stunning.

First, the code that loads the same original (~600 kB) file from the previous examples:

import timeit

import icalendar
import ics


def ics_load():
    # Parse the whole calendar with ics
    cal = ics.Calendar(data)
    print(f'Number of events loaded: {len(cal.events)}')


def ical_load():
    # Parse with icalendar and collect the name and start of every VEVENT
    cal = icalendar.Calendar.from_ical(data)
    events = []
    for component in cal.walk():
        if component.name == 'VEVENT':
            event_name = component.get('summary')
            event_start = component.get('dtstart').dt
            events.append(f'Event: {event_name} {event_start}')
    print(f'Number of events loaded: {len(events)}')


# Read the file once so both parsers are timed on the same in-memory string
with open('basic.ics') as f:
    data = f.read()

ics_time = timeit.timeit("ics_load()", globals=globals(), number=1)
ical_time = timeit.timeit("ical_load()", globals=globals(), number=1)

print(f'ICS load: {ics_time}')
print(f'ICAL load: {ical_time}')

And here is the result of one run (times in seconds):

Number of events loaded: 1695
Number of events loaded: 1695
ICS load: 12.023302923000301
ICAL load: 0.4829574479999792

As a matter of fact, I found the numbers too good to be true, but on the other hand I do see the event names and times. I would very much welcome your opinion.

Wow, that's interesting! Initially I went with the ics library because I got a working version quicker and its syntax is cleaner, but this library does have its issues, and the loading time is clearly too long. Basically, we only need to parse the following fields in loaders.py:

event.name
event.all_day
event.begin.year
event.begin.month
event.begin.day
task.name
task.priority
task.due.year
task.due.month
task.due.day

So if this is possible with the icalendar library, we might switch to it. That would solve this issue without creating filters.

> So if this is possible with the icalendar library, we might switch to it. That would solve this issue without creating filters.

All that seems to be supported with icalendar:

# if component.name == 'VEVENT'..
event.name         # str(component.get('summary'))
event.all_day      # component.get('dtstart').params.get('VALUE') == 'DATE'
event.begin.year   # component.get('dtstart').dt.year
event.begin.month  # component.get('dtstart').dt.month
event.begin.day    # component.get('dtstart').dt.day

# if component.name == 'VTODO'..
task.name          # str(component.get('summary'))
task.priority      # component.get('priority')
task.due.year      # component.get('due').dt.year
task.due.month     # component.get('due').dt.month
task.due.day       # component.get('due').dt.day

Made an experimental branch with event parsing handled by icalendar - startup performance-wise it looks quite promising. I am not currently using any todo (.ics with VTODO) items. @anufrievroman Could you perhaps add a few to the repo itself? Thanks!

Here is a small example of a tasks.ics file with a few tasks following the Nextcloud standard:

BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Sabre//Sabre VObject 4.4.2//EN
PRODID:-//Nextcloud Tasks v0.14.2
PRODID:-//Nextcloud Tasks v0.14.5
BEGIN:VTODO
UID:6cb1fd92-2eb5-43a1-a9d2-a3bbf9dc7b7c
CREATED:20230219T100637
LAST-MODIFIED:20230219T100708
DTSTAMP:20230219T100708
SUMMARY:Task with deadline from nextcloud
DUE;VALUE=DATE:20230224
END:VTODO
BEGIN:VTODO
UID:7eb4a1e2-4dd3-4629-8be7-b1c84f5db465
CREATED:20220116T203728
LAST-MODIFIED:20230218T161231
DTSTAMP:20230218T161231
SUMMARY:Unimportant task from nextcloud
PRIORITY:6
PERCENT-COMPLETE:18
STATUS:IN-PROCESS
END:VTODO
BEGIN:VTODO
UID:bc2f8f98-44ba-4003-ae23-b27b77facf78
CREATED:20230218T154114
LAST-MODIFIED:20230218T161138
DTSTAMP:20230218T161138
SUMMARY:Normal task from nextcloud
STATUS:NEEDS-ACTION
END:VTODO
BEGIN:VTODO
UID:c4dcf921-e819-4a4b-b331-0f017c0df558
CREATED:20230218T154053
LAST-MODIFIED:20230218T161154
DTSTAMP:20230218T161154
SUMMARY:Cancelled task from nextcloud
STATUS:CANCELLED
END:VTODO
BEGIN:VTODO
UID:e5f32ad6-efe0-4e7c-8c8c-ea64afada253
CREATED:20220116T203724
LAST-MODIFIED:20230218T161147
DTSTAMP:20230218T161147
SUMMARY:Completed task from nextcloud
STATUS:COMPLETED
PERCENT-COMPLETE:100
PRIORITY:2
COMPLETED:20230218T154018
END:VTODO
BEGIN:VTODO
UID:e66dec63-f3d3-4f47-b0da-e7ba362195e6
CREATED:20230218T154154
LAST-MODIFIED:20230218T161111
DTSTAMP:20230218T161111
SUMMARY:Important task from nextcloud
PRIORITY:4
END:VTODO
END:VCALENDAR