SimpleBrowserDotNet/SimpleBrowser

Memory usage for long running polling browser

georgiuk opened this issue · 11 comments

Hi,
It is first time I am using SimpleBrowser and I am impressed. I use it to log in a website and then poll for data every x seconds. I saw that you don't recommend using it this way, but this feels right to me :-) . Anyway by doing this the memory keeps increasing. So by using ANTS memory profiler I saw that there are two lists in Browser object that keep getting larger: the _allActiveElements and the _history . So to use it this way I had to force clearing them using reflection. So before every navigation I call a method to clear these lists. I don't know if you have any plans of implementing such functionality as a mode? Anyway I post it here in case anybody finds it useful.

private void cleanupBeforeNavigation()
{
    var fieldElements = typeof(Browser).GetField("_allActiveElements", BindingFlags.Instance | BindingFlags.NonPublic);
    System.Collections.IList elements = fieldElements.GetValue(browser) as System.Collections.IList;
    elements.Clear();

    var fieldHistoryPos = typeof(Browser).GetField("_historyPosition", BindingFlags.Instance | BindingFlags.NonPublic);
    fieldHistoryPos.SetValue(browser, -1);

    var fieldHistory = typeof(Browser).GetField("_history", BindingFlags.Instance | BindingFlags.NonPublic);
    System.Collections.IList history = fieldHistory.GetValue(browser) as System.Collections.IList;
    history.Clear();
}

Kostas

@georgiuk _allActiveElements should not grow. It should only contain the elements in the current page. Whenever there is a navigation, this should clear and reset. I'll look into this. If this is really happening, it's a bug.

Similarly, the two _history collections should not grow infinitely. There is code in SimpleBrowser that limits the _histories stored to the last 20 navigations. (20 is arbitrary and hard coded.) That code is suspect. Management of histories is all internal to the Browser class. One or more developers have decided to directly manipulate the history collections rather than using the methods that limit the growth of histories. I don't know if that's what's happening here, but I would bet that it is. I can look into that too. I suspect an overhaul of the history management code is coming.

When I get issues that say "SimpleBrowser uses a lot of memory", I typically suggest a change to the algorithm. You are essentially doing this:

var browser = new SimpleBrowser();
while(true)
{
   browser.Navigate();
   Thread.Sleep(milliseconds);
}

If you don't care about the history, change your algorithm to this:

while(true)
{
   var browser = new SimpleBrowser();
   browser.Navigate();
   Thread.Sleep(milliseconds);
}

That way, your browser is destroyed when it goes out of scope. All of the memory it uses is released too.

Alternatively, you could close your current browser tab using the Close() method:

var browser = new SimpleBrowser();
while(true)
{
   browser.Navigate();
   browser.Close();
   Thread.Sleep(milliseconds);
}

Close will clear the _history.

Hi thank you for your answer,
the thing is that I have to login, so logging in in every polling will be slower and will put some pressure on the web server (it is an embedded device). Except if I can store the cookies and add the cookies back to the new browser instance. Is it possible?

Kostas

@georgiuk I started looking at this issue today. In my testing _history never grows beyond 20 items, by design. _allActiveElements, however, grows infinitely. I can see in the code where elements are added to this collection on every navigation, but the collection is never referenced after that. This research leads to the question, why does _allActiveElements exist? I've commented _allActiveElements in my fork and am running some tests. With RetainLogs set to false, my sample application which has made 500+ navigations so far has not grown beyond 31 MB. It grows to about that, then shrinks back to 21 MB. I'm going to keep testing. If my tests prove out, I'll just remove _allActiveElements.

Hi. Yes my findings where these also. Sure one thing is to remove _allActiveElements in case is not needed. The other thing would be to have the max _history to be configurable and not hardcoded to 20. In this way someone could set it to 1 or 0 so it doesn't grow at all.

@georgiuk The minimum is required to be 1. The "current" page is the 0th index in the collection of navigation histories. I was able to make the modification to eliminate _allActiveElements and set the NavigationHistoryCount to a minimum of 1. (It still defaults to 20.)

@georgiuk I put in a fix for this issue, tested it, and committed it to my local repository. I then ran the unit tests and a test failed. (Oops. I should have run the unit tests before committing.) In investigating why the test failed, I found that _allActiveElements is actually being implicitly used. I still believe that _allActiveElements is inefficient and there has to be a better way. I'm going to continue working this issue.

This bug was introduced 8 years ago (5/19/2012 12:05:59 PM).

Hi @Teun

It looks like you wrote the code in question many years ago. Do you recall what the _allActiveElements collection was supposed to be for? It looks like it was intended to hold the elements for navigation state, so that when the user used NavigateForward() or NavigateBack(), that the HTML elements would still be available for use without having to reparse the document. The problem with that is any time NavigateForward or NavigateBack is called, all of the elements in _allActiveElements are marked as invalid. Any attempt to use an invalid element results in an InvalidOperationException.

There are a few problems with this implementation:

  1. _allActiveElements grows infinitely. There is no mechanism to remove elements from this collection, except to close the Browser. This results in high memory usage over time. I think some of the high memory usage issues reported may be because of this.
  2. Navigating forward or back shouldn't necessarily invalidate the HTML elements. In Chrome, I can view a page, click on a link, go back, and click on a link from the cached page. SimpleBrowser should work the same way. (That is, why is the HtmlElement invalid? If it's correct that it is invalid and no longer able to be used, why are we keeping it in memory?)
  3. Is the collection of HTML elements really needed? The NavigationState holds the XDocument thats the result of being parsed. Isn't that enough? All of the query and find methods reference the XDocument, not the collection of HTML elements. If I go back or forward, can't the XDocument from the navigation at the current navigation position be used without needing the collection of HtmlElements?

For now, I've moved the collection _allActiveElements collection into the NavigationState object so that the elements are at least tied to the navigation history, which has a maximum size. (That is, each navigation state has its own collection of active elements rather than having one for the entire browser.) I really think, though that the entire collection of HTML elements can be removed, unless you see some reason why it shouldn't be.

I've also made the maximum size of the navigation history configurable rather than being hard-coded at 20.

No response from @Teun. Moving forward with the changes outlined above.

@georgiuk I will be committing these changes today. I have implemented code to trim the size of the navigation history and the _allActiveElements collection. I will address the issue with the _log separately, because that collection is part of two other issues - #151 and #242.

@georgiuk I will be committing these changes today. I have implemented code to trim the size of the navigation history and the _allActiveElements collection. I will address the issue with the _log separately, because that collection is part of two other issues