Out of Memory Exception with large dataset
Closed this issue · 36 comments
Hello everyone,
I'm just starting to use SolrBundle 1.5 with Symfony 3.1.* and I have a problem with memory usage.
When I call the solr:index:populate command on a MySQL table with almost 560K rows, it triggers an OutOfMemoryException.
I already increased the memory_limit parameter in the php.ini file to 512M, without success.
Can somebody help me optimize the command code so it uses as little memory as possible?
Have a nice day,
best regards
Benjamin
Can you upgrade to version 1.6.*? It includes a couple of performance improvements.
I'm going to check with my peers, but I think it won't be a problem. I'll keep you posted.
First of all, thank you for your response. I was afraid I would have to wait much longer for an answer :-).
Secondly, I updated my bundle version and tried to populate the Solr index with the command.
Here is my error message:
Synchronize 559206 entities
Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 12288 bytes) in /var/www/vendor/doctrine/orm/lib/Doctrine/ORM/Internal/Hydration/AbstractHydrator.php on line 296
Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 32768 bytes) in /var/www/vendor/symfony/symfony/src/Symfony/Component/Debug/Exception/OutOfMemoryException.php on line 1
Can I ask: what size of dataset has this bundle been tested with?
- increase your memory limit
- run `./bin/console solr:index:populate --flushsize=200`
The bundle is tested with ~61k rows.
I already increased the memory limit to 512M. I think that's already high enough.
And I forgot to mention it, but I already tried reducing the flushsize parameter.
Here is the call I made:
php bin/console solr:index:populate --flushsize=50 Bundle:Entity
Anything else?
Can you increase the limit just for the import? Importing always needs a lot of memory.
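One way to do that without touching php.ini is to raise the limit for a single call with PHP's `-d` switch, for example (the value here is only an example):

```
php -d memory_limit=3G bin/console solr:index:populate --flushsize=200
```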
Do you have a specific amount to recommend?
You need a minimum of 2-3GB for your 520k rows.
I will try that then.
But don't you think that's too much?
If that situation can't be avoided, may I suggest adding a parameter to the command to control the memory_limit?
Don't worry. The main problem is that Symfony and Doctrine allocate a lot of memory. I had the same problem a few months ago.
You have to set the new memory_limit via `ini_set()` at the beginning of the script, in this case at the top of `bin/console`. So an optional command parameter will not work.
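For illustration, that change would look something like this near the top of `bin/console` (the value is just an example):

```php
#!/usr/bin/env php
<?php
// bin/console (excerpt) - raise the limit before the application boots,
// so every later allocation is covered. '3G' is only an example value.
ini_set('memory_limit', '3G');

// ... the rest of the stock bin/console script stays unchanged
```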
I will check with my infrastructure guy, because even when I use memory_limit = -1, I still get the same error about the 512M limit being exceeded.
If I specify 2048M in the php.ini file, it's not accepted and the system falls back to the 512M value.
Anyway, I'll continue to work on that and I'll keep you posted ;-)
Ok,
I tried with a memory_limit = 2048M and it took more than 1 hour to crash.
I will make another test with the limit increased to 4096M.
But I don't have the time right now. I'll keep you posted ;-)
Stupid question maybe: what php version are you on?
As far as I know, a single PHP process (lower than PHP 7) cannot allocate more than 2G, even on 64-bit compiled versions. So going bigger than 2G won't do much.
That affects post sizes as well as memory used. There are patches for this for PHP versions lower than 7. It is supposed to be fixed in PHP 7 (but I haven't tested that). Please correct me if I'm wrong.
Also: the Doctrine EntityManager does some weird stuff. If the populate command won't work, try writing your own command: get the entities in smaller batches and detach() them from the EntityManager after you put them into Solr.
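A rough sketch of that idea (`MyEntity`, `$em` and `$solrClient` are placeholders here, not an API from this bundle):

```php
$batchSize = 200;
$offset = 0;

do {
    // fetch a small slice instead of hydrating the whole table at once
    $entities = $em->getRepository(MyEntity::class)
        ->findBy([], ['id' => 'ASC'], $batchSize, $offset);

    foreach ($entities as $entity) {
        $solrClient->addDocument($entity); // placeholder for the actual Solr call
        $em->detach($entity);              // stop the EntityManager from tracking it
    }

    $em->clear();        // drop whatever the UnitOfWork still references
    $offset += $batchSize;
} while (count($entities) === $batchSize);
```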
I think the best way to import your data is to write a little PHP script (not a Symfony command) with some PDO magic to export your data as JSON. Then you can use the Solr post tool to import it.
@Koalabaerchen: this is not a stupid question at all ;-).
I should have mentioned that up front: I'm using PHP 7.
I'm trying the command with 4G. If it doesn't work, I'm going to develop my own logic.
@floriansemm Hello,
I tried the command after setting the memory_limit to 4G, and it takes so much time that I can't use it like this.
After some thought, I have a question: do you think it would be possible to implement an incremental version of the populate command? Right now it gives me the impression that each time it's called, it recreates the index from scratch. And my large dataset of 560k rows is not the only one of that size (I have another one just as large).
It could be a nice addition to the bundle. What do you think?
That said, I just remembered the SynchronizationFilter annotation, which could be the answer to my problem :-)
If I want an incremental command, I need to be able to get the id of the last Solr document inserted.
How can I query that? I can't find the documentation I need.
A command like `./bin/console solr:index:populate AppBundle:Entity --start-offset=100 --stop-offset=300` is possible.
To get the last indexed document: http://stackoverflow.com/questions/12346240/how-to-get-last-indexed-record-in-solr
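The linked approach boils down to a single select that sorts the index descending on a suitable field and returns one row, e.g. against a core named `core0` (core and field names are examples, and whether `id` sorts the way you expect depends on its field type):

```
http://localhost:8983/solr/core0/select?q=*:*&sort=id+desc&rows=1
```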
I have implemented a prototype command which can index a huge number of rows/documents (an entity with 3 fields, 5000000 rows). My local setup has 2GB RAM and it took 15min to export/import all rows.
You can ping me if you want some information about how it works.
@floriansemm thanks for your link, but this is not what I was looking for.
I need to send the same request, but with the PHP Solr query builder.
I can't figure out how to do that.
Your implementation sounds very interesting, and also very close to my own situation.
Yes, I would like to know how you did it, because the last time I tried to import my data, it still wasn't finished after three full hours.
The script selects 100000 rows at a time from the `person` table and writes them to a CSV file. My Vagrant box has 2GB RAM and 2 CPUs and needs ~15min for the export.
```php
$threshold = 100000;   // rows per CSV file
$totalRows = 5000000;
$pages = ceil($totalRows / $threshold);

for ($i = 0; $i < $pages; $i++) {
    $statement = $this->getContainer()->get('doctrine.dbal.default_connection')
        ->executeQuery(sprintf('SELECT id, name, email FROM person LIMIT %s, %s', $threshold * $i, $threshold));

    $rows = $statement->fetchAll();

    $data = array();
    foreach ($rows as $row) {
        $data[] = sprintf("\"person_%s\", \"%s\", \"%s\"\n", $row['id'], $row['name'], $row['email']);
    }

    // one CSV file per page, with a header line mapping to Solr dynamic fields
    $fileName = '/tmp/test/data'.$i.'.csv';
    file_put_contents($fileName, "id, name_s, email_s\n");
    file_put_contents($fileName, implode('', $data), FILE_APPEND);

    $output->writeln('write page '.$i);
    $statement->closeCursor();
}
```
When it's done, I call the Solr post tool. All files under `/tmp/test` will be indexed:
./post -c core0 /tmp/test/
I think there is a misunderstanding: I know how to get data with Doctrine.
I thought you were talking about Solr. I misread your message.
Yes, we are talking about Solr. The last code snippet shows how I add the documents to my Solr core. I use the Solr post tool for this.
OK, I think we're on a good track here.
Do you think it's possible to use the PHP Solr client to inject data into the index from anything other than entity classes?
This bundle is designed for Doctrine entities. Other data sources are not supported, but you can implement your own logic and use the underlying Solarium library to index documents.
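A minimal sketch of what that could look like with Solarium directly (client/endpoint configuration omitted, field names are only examples):

```php
// $client is a configured Solarium\Client instance (endpoint config omitted)
$update = $client->createUpdate();

$doc = $update->createDocument();
$doc->id      = 'person_1234';     // example values, from any data source
$doc->name_s  = 'John Doe';
$doc->email_s = 'john@example.com';

$update->addDocument($doc);
$update->addCommit();

$client->update($update);          // send the add + commit to Solr
```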
Hello,
Just to keep you posted: I first tried to implement my own solution using the Solr client from the Solarium library, but I'm struggling to understand how to use it to get what I want.
So, just to see whether it was a better solution, I implemented your idea of writing a bunch of CSV files and importing them with the post tool, as you suggested.
My conclusion: even your suggested solution is slow... so slow. I was optimistic, but my table is so large that even the process of creating the CSV files slows down as it goes.
That is unexpected, by the way: as the files are created, each file takes more time than the previous one. I don't understand why.
Just so you know: my table has 535,882 rows and each row has 179 columns, with some text fields that can be pretty long... my use case is just a nightmare...
I think every export script will crash, but you can try something like this. The export is implemented directly in mysql-land. It is really fast (`Query OK, 5000000 rows affected (14.40 sec)`)!
If you really want this export/import approach, then you have to clean up your table.
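An export of that kind can be done with `SELECT ... INTO OUTFILE`, for example (column names and file path are examples; the MySQL user needs the FILE privilege, and the resulting file has no header line):

```sql
SELECT id, name, email
INTO OUTFILE '/tmp/test/data.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM person;
```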
In order to implement my own command to populate the Solr index, I have a question about the already existing populate command.
I'm trying to get the maximum value of the 'id' field of a specific entity.
Something strange was returned: 999999. This is strange because it's not the highest value in the database, and I couldn't find an explanation for it anywhere.
Does somebody have an explanation, please?
If you use the repository method `findAll`, the number of rows is limited: https://github.com/floriansemm/SolrBundle/blob/master/Repository/Repository.php#L71
Yes, good point. Thanks for your answer, but this is not the problem.
Here is the query I wrote:
```php
$this->hydrationMode = HydrationModes::HYDRATE_INDEX;
$query = $this->solr->createQuery($this->entityName);
$query->setRows(1);
$query->addSort('id', Query::SORT_DESC);
$result = $query->getResult();
```
EDIT: I wrote my own Repository and this is a specific method I wrote in it ;-)
As you can see, I've set `$query->setRows(1)`.
If the limit is set at 1000000, how can I retrieve the max value of a specific field in a specific type of document?
Your query does not work because the id is a string like `documentname_1234`, and strings sort lexicographically rather than numerically. The easiest way to get the highest id is to use Doctrine and the underlying database.
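For example (entity class name is just a placeholder):

```php
// highest id straight from the database via Doctrine
$maxId = $em->createQuery('SELECT MAX(e.id) FROM AppBundle\Entity\Person e')
    ->getSingleScalarResult();
```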
The idea here is to get the last inserted id from Solr, not from Doctrine.
That said, I don't understand what you're saying: why does the fact that my id is a string influence the query's result? The field values are always numbers, so even if it's stored as a string, it shouldn't be limited by a number. Maybe by the length of the string?
By the way, why is the id, which is declared as an integer in the entity class, represented as a string in Lucene?
Hello @floriansemm ,
Good news: I finally made a script that handles memory correctly (it seems so, at least; my tests are not finished yet).
It's a bit complicated to explain all of the code, but basically I copied a lot of your code, just to keep the same logic. I may have made some changes, but only in the way the code is structured, not in the logic.
Then I added some calls to EntityManager::detach() where I could.
I also added calls to EntityManager::clear() at the end of each batch loop.
Finally, I even added calls to gc_collect_cycles(), just to be sure that memory is flushed at the end of each entity indexing pass.
But that wasn't the whole problem. I also had to deal with the fact that I'm going through a very large table (170 columns, more than 550K rows). To handle that, I wrote a specific DQL query to limit the number of rows managed by the EntityManager.
I found this webpage, which explains the problem clearly and proposes a solution: http://www.xarg.org/2011/10/optimized-pagination-using-mysql/
I hope it is going to help some people.
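To make that concrete, here is a stripped-down sketch of the batch loop (names are placeholders, and the Solr call stands in for the bundle's indexing code):

```php
$batchSize = 500;
$lastId = 0;

do {
    // keyset pagination: filter on the last processed id instead of using an
    // OFFSET, so MySQL does not re-scan and discard the earlier rows each time
    $entities = $em->createQuery(
            'SELECT e FROM AppBundle\Entity\Person e WHERE e.id > :lastId ORDER BY e.id ASC'
        )
        ->setParameter('lastId', $lastId)
        ->setMaxResults($batchSize)
        ->getResult();

    foreach ($entities as $entity) {
        $solr->addDocument($entity);   // placeholder for the actual indexing call
        $lastId = $entity->getId();
        $em->detach($entity);
    }

    $em->clear();          // release everything still held by the UnitOfWork
    gc_collect_cycles();   // flush freed memory between batches
} while (count($entities) === $batchSize);
```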
If my explanations are not clear (which is entirely possible), feel free to ask anything.
Your problem is very special but I learned a lot from it. Thanks for the link, it is a very interesting article!
I have started a little tutorial on how to index large sets of entities.
I also tried your suggestion about the MySQL export, but my server didn't have the rights to export those files, so it didn't work for me.
That said, the other problem was still there: my query took more time for each "page".
Don't forget to include the logic of sorting your query on a column (id, for example) and using the last handled value as a filter to skip already processed rows ;-)