Mercury-Language/mercury

mmc has segmentation violation on source file with very large list

Opened this issue · 3 comments

Hello all,

I'm trying to convert some Prolog code over to Mercury, and ran into a problem where the compiler crashed when trying to deal with a very large list (over 5MB of text). Maybe there simply isn't a way around an extreme case like this, but Prolog didn't have a problem, so I'm hoping there is a way to solve the issue in Mercury. To simplify things, I removed all extraneous code and tried to compile a module that had nothing but a single constant defined. The file starts like this:

:- module oils.
:- interface.
:- import_module list.
:- import_module char.

:- func oils = list(list(list(char)))

:- implementation.

oils = [[['O', 'O', 'I', 'L', '_', '_'], ['O', 'O', 'I', 'L', '_', '_'], ['_', '_', 'I', 'L', 'L', '_'], ['_', '_', 'I', 'S', 'S', '_'], ['_', '_', 'S', 'S', '_', '_']], [['O', 'O', 'I', 'L', '_', '_'], ['O', 'O', 'I', 'L', '_', '_'], ['_', '_', 'I', 'L', 'L', '_'], ['_', '_', 'I', '_', 'S', 'S'] ...

but the part at the end is an extremely long list that goes on and on. Then I try to compile and get this error:

$ mmc --make oils

*** Mercury runtime: caught strange segmentation violation ***
PC at signal: 107637906453004 (61e5672db20c)
This may have been caused by a stack overflow, due to unbounded recursion.

Incidentally, this is happening in Ubuntu (focal). Specifically, I'm using GitHub codespaces.

I'm wondering if there is some obscure compiler flag or other trick that can make this work. I'm also wondering what the exact nature of the problem is. Can Mercury not handle any file that is this big, or is the problem specifically that the list is so big. I'm wondering if breaking it up and concatenating the components would solve the problem. I also considered storing this in a text file, and then having Mercury read and parse the data (it is of type list(list(list(char))), but the parsing aspect of the problem seemed non-trivial, so I'm trying to avoid that.

Thank you in advance for any help (and sorry if I should actually be asking this somewhere else)

I could not reproduce your problem. I constructed a file along the lines of your example whose size is 5.3 Mb. The current version of the Mercury compiler could compile it to C code, though it took a while. Note that you can get "progress reports" from mmc by specifying the -v (verbose) option.

Which version of the compiler are you using? I made a change in May of 2022 that improved the compiler's behavior on very large inputs like these. If you are using a compiler from before that, then I think that could explain the stack overflow you got.

Note that despite the similarity of the surface syntax, Mercury is semantically very different from Prolog. Most Prolog systems do almost no semantic checking of their code, while Mercury (and a few Prologs such as Ciao)
do significant amounts of it. It is obviously much easier to avoid running out of stack space when doing no checks than when doing lots of checks :-)

I am not sure what you mean by "the parsing aspect". If this means that you think you would have to write a parser for your list yourself, then I have good news: you don't. If you have a well-typed Mercury term you can return e.g. as the result of a function, then you can put that term, followed by a period, into a file, and read that term from that file with a single call to io.read. That is what I would do in such cases, partly because a huge term in a source file
makes editing that file very inconvenient, partly because it avoids the hit to mmc compile time, but mostly because it avoids a much bigger hit to gcc compile time. Compiling my test module from Mercury to C took
a bit less than 40 seconds on my laptop, but took a bit more than 400 seconds to compile from .c to .o.

This is an ok forum to ask such questions, though since github imposed unwelcome, unnecessarily-complicated and very-poorly-documented two factor authentication requirements, I strongly prefer the Mercury users mailing list, users@lists.mercurylang.org.

I was able to create the compiler error this morning. Since I'm using GitHub codespaces, I'm using a fresh install of Mercury each time. I literally run this bash script:

sudo apt update
sudo apt install -y wget ca-certificates
wget https://paul.bone.id.au/paul.asc
sudo cp paul.asc /etc/apt/trusted.gpg.d/paulbone.asc
rm paul.asc
sudo cp mercury.list /etc/apt/sources.list.d/

sudo apt update
sudo apt install -y gcc-multilib mercury-recommended

mmc --make test
./test

This assumes the existence of a file mercury.list that contains

deb http://dl.mercurylang.org/deb/ focal main
deb-src http://dl.mercurylang.org/deb/ focal main

During compilation, it gives this error:

*** Mercury runtime: caught strange segmentation violation ***
PC at signal: 108323428776106 (6285037e38aa)
This may have been caused by a stack overflow, due to unbounded recursion.

The actual test.m file is attached, but renamed to test.m.txt because of GitHub's upload requirements.
test.m.txt

I think I'm installing the latest stable version of the compiler, but possibly not. The source of the problem could be a limitation in GitHub codespaces, but I need that since I'm actually making an assignment for students to complete, and I need a simple platform that they can install things on without having to troubleshoot different systems. Incidentally, this massive list is part of a unit test.

Incidentally, before receiving your reply I was able to "solve" the problem by breaking the big list down into several sublists and then combining them in the code with ++, though I wish I had known how easy it would be to use io.read. Perhaps I'll change the code to do that instead.

Thanks for your help!

You are welcome.

That install method will get a stable release version of the Mercury. There was supposed to be a stable release version between 2022 May and now, but life happened instead :-( This means the bug you report has already been fixed, and the fix has been available in all releases-of-the-day for the last two years.

As a former academic myself, I fully understand the need to simplify the incidental steps students need to complete get to their actual task as much as possible :-(. I also know that students often copy, both in later subjects and in industry, the coding styles they see in handout programs, including the quick-and-dirty parts that they are not supposed to copy. So if you would accept a piece of friendly advice, I would switch to using io.read. Besides being better style, it is also much more flexible, and allows you to give them data for more unit tests. The one downside is that its use results in much less useful diagnostics for any errors, since you get them from the runtime system and not the compiler. I don't know whether that is relevant in your context.