pisa-engine/pisa

token-filters porter2 issue with parse_collection

amallia opened this issue · 15 comments

Describe the bug
parse_collection crashes when --token-filters porter2 is used.

To Reproduce

:~/pisa/build$ cat collection.tsv | ./bin/parse_collection -f plaintext -b 10000 --token-filters porter2  -o ./bm25.msmarco.fwd
[2023-02-14 22:44:39.624] [info] Number of worker threads: 128
realloc(): invalid old size
Aborted (core dumped)

Huh, that's so strange. And works without it? What about krovetz, have you tried it?

I'll have a look sometimes this week.

@amallia I cannot reproduce this locally on test collection. Can you provide more details about your build? Is this debug or release, do you use sanitizers? Which compiler? Does this happen quickly or after a longer time?

If you could share the input file with me, that could also be helpful.

Here an example:

collection.tsv:

0	The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2	Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.
3	The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 � 2-1946 under the control of the U.S. Army Corps of Engineers, under the administration of General Leslie R. Groves.
4	versions of each volume as well as complementary websites. The first website�The Manhattan Project: An Interactive History�is available on the Office of History and Heritage Resources website, http://www.cfo. doe.gov/me70/history. The Office of History and Heritage Resources and the National Nuclear Security
5	The Manhattan Project. This once classified photograph features the first atomic bomb � a weapon that atomic scientists had nicknamed Gadget.. The nuclear age began on July 16, 1945, when it was detonated in the New Mexico desert.
6	Nor will it attempt to substitute for the extraordinarily rich literature on the atomic bombs and the end of World War II. This collection does not attempt to document the origins and development of the Manhattan Project.
7	Manhattan Project. The Manhattan Project was a research and development undertaking during World War II that produced the first nuclear weapons. It was led by the United States with the support of the United Kingdom and Canada. From 1942 to 1946, the project was under the direction of Major General Leslie Groves of the U.S. Army Corps of Engineers. Nuclear physicist Robert Oppenheimer was the director of the Los Alamos Laboratory that designed the actual bombs. The Army component of the project was designated the
8	In June 1942, the United States Army Corps of Engineersbegan the Manhattan Project- The secret name for the 2 atomic bombs.
9	One of the main reasons Hanford was selected as a site for the Manhattan Project's B Reactor was its proximity to the Columbia River, the largest river flowing into the Pacific Ocean from the North American coast.

Command:
cat collection.tsv | ./bin/parse_collection -f plaintext -b 2 --token-filters porter2 -o /dev/null

I just realized it does not depend on --token-filters. Idk why I had this impression last time.
It depends on -b, setting it to a value smaller than then the number of documents makes it fail.

Strange, it does not fail when I run this, compiled with GCC 12. I see some memory leaks from tbb, but that's it.

I'm recompiling with Clang to see if I can reproduce it.

Ok, I managed to reproduce it. Will debug, hopefully sometime this week. It's not so much about the batch size as it's to do with concurrency, and if you set a larger batch, then only one thread does the job, so the error doesn't occur.

Ok, seems like it's a tokenizer issue. The boost lib tokenizer is probably not thread safe...

I believe this should fix it: #528

Here an example:

collection.tsv:

0	The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2	Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.
3	The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 � 2-1946 under the control of the U.S. Army Corps of Engineers, under the administration of General Leslie R. Groves.
4	versions of each volume as well as complementary websites. The first website�The Manhattan Project: An Interactive History�is available on the Office of History and Heritage Resources website, http://www.cfo. doe.gov/me70/history. The Office of History and Heritage Resources and the National Nuclear Security
5	The Manhattan Project. This once classified photograph features the first atomic bomb � a weapon that atomic scientists had nicknamed Gadget.. The nuclear age began on July 16, 1945, when it was detonated in the New Mexico desert.
6	Nor will it attempt to substitute for the extraordinarily rich literature on the atomic bombs and the end of World War II. This collection does not attempt to document the origins and development of the Manhattan Project.
7	Manhattan Project. The Manhattan Project was a research and development undertaking during World War II that produced the first nuclear weapons. It was led by the United States with the support of the United Kingdom and Canada. From 1942 to 1946, the project was under the direction of Major General Leslie Groves of the U.S. Army Corps of Engineers. Nuclear physicist Robert Oppenheimer was the director of the Los Alamos Laboratory that designed the actual bombs. The Army component of the project was designated the
8	In June 1942, the United States Army Corps of Engineersbegan the Manhattan Project- The secret name for the 2 atomic bombs.
9	One of the main reasons Hanford was selected as a site for the Manhattan Project's B Reactor was its proximity to the Columbia River, the largest river flowing into the Pacific Ocean from the North American coast.

Command: cat collection.tsv | ./bin/parse_collection -f plaintext -b 2 --token-filters porter2 -o /dev/null

I just realized it does not depend on --token-filters. Idk why I had this impression last time. It depends on -b, setting it to a value smaller than then the number of documents makes it fail.

Hi!
I had the same problem trying to replicate the robust04 experiment.
During the various parsing run, I always had a "core dumped" error.
Using a batch size very big solved the problem.
PISA was compiled with GCC 9.4.0 on Ubuntu 20.04.
I comment here because maybe it is a good idea to reopen the issue. As soon as I can, I will look into the code to see if I am able to help you solve this bug.
Thanks!

Thanks @veneres I'll reopen. Would it be possible for you to provide the minimal data input and parameters that cause the failure?

Thanks @elshize for reopening the issue.
The exact command that I used is the following:

gzip -dc \
$(find path_to_robust_collection -type f -name '*.*z' \( -path '*/FR94/[0-9]*/*' -o -path '*/FT/FT*' -o -path '*/FBIS/FB*' -o -path '*/LATIMES/LA*' \)) \
| path_to_pisa_bin_folder/parse_collection -f trectext -b 1000 -F porter2 -H 1 -o robust04/fwd

Thus the input is the "robust collection" (TREC disks 4 & 5) and the parameters are the same as the ones proposed in the regression test, which are:

  • -f trectext
  • -b 1000
  • -F porter2
  • -H 1

However, I was able to finish the parsing-only changing the parameters in:

  • -f trectext
  • -b 1000 -b 1000000
  • -F porter2
  • -H 1

Thank you.
P.S. please notice that the parameters are slightly different from the ones in the regression test because some of them changed identifier, and if you copy and paste the command in the guide you get an error. I will do a pull request to update the docs.

@veneres thanks for providing the details. I'll do my best to look into it this coming week.

Looks to be related to thread-safety of porter2. Not sure what exactly, though, need more time to investigate.

@veneres I pushed a fix here #542
Do you think you could test that branch to see if you're still getting the error or not?

Sorry for the late reply @elshize, now everything works like a charm. Thank you!
I was going to start a pull request to fix the documentation when I saw that you are already updating it in the benchmark-docker branch (ab1d5c6).
Feel free to tell me if I can help you with something.

No worries, I'm actually working on the docs here #549 you could help by having a look at the docs/src folder, mostly docs/src/guide where I'm porting the old docs to.

I'm trying to make it easier to explore, divided guide and CLI reference and I'm generating help printouts so that they are consistent. Would appreciate if you could give those a look (mind that cli ref is not complete, so again, focus on the guide), and check if all looks good/understandable, etc.