johanneszab/TumblThree

Didn't release memory

mizuikk opened this issue · 13 comments

resme
TumblThree does not release memory until it stalls.
When TumblThree has stalled and I click the Stop button, the main window closes automatically, but the process keeps running.
If I kill the process manually and start TumblThree again, all tasks revert to their state at the last startup.

What version number was that? What exactly did you do (download)? I.e., what kind of blogs? Do you see any errors at the top (e.g. connection timeouts)? How many concurrent downloads have you set (the number of connections as well as the number of concurrent blogs)?

I was pretty sure I had sorted out all the leaks a year ago (see the commit message). At least when I let it run for 2-3 days, it uses ~300-400 MB of memory (last tested with v1.0.8.63).

1.0.8.68. Concurrent blogs = 1 with 100 concurrent connections.
Some tasks also stall even though the corresponding Tumblr blog's status is normal.
That leads to the software no longer doing anything.

Maybe you answer all the questions?

  • what kind of blogs did you download? Also likes and/or (tag) searches? Or just (hidden) blogs?
  • Any (timeout) errors at the top?

Maybe you shouldn't use 100 connections? What happens if you use the defaults? If you have time, you could give v1.0.8.63 a try. I'll most likely not publish any new release this year.

I will try v1.0.8.63 and let it run for an hour.
I used the default settings earlier and got the same result.
Tasks are very likely to stall on empty blogs and less likely on normal and hidden blogs.
When the affected task is ready to start, it never starts downloading; the task just turns blue.
It has nothing to do with the blog type. The other settings are default; I only changed the connections.
I use a proxy, and it gets a few timeout errors.

I noticed that if a blog has a huge amount of content (like: hisame), the task does not release memory until it has finished.
In v1.0.8.63, tasks also stall on random blogs (even with all default settings). It may occur when a single task gets stuck on a network connection for a few seconds, or hits an API rate limit or something (I don't know how you wrote it). TumblThree then does not close the TCP connection or retry; the stuck task keeps occupying its slot, which stops the next task, and as this repeats indefinitely the whole software stops.
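
The stall described above (a task stuck on a connection that is never closed or retried) is the situation a per-attempt timeout with bounded retries is meant to prevent. A minimal Python sketch of that pattern follows; it is not TumblThree's actual code. `fetch_with_retry` and its parameters are hypothetical names, and the 60-second timeout and 3 attempts merely echo the `TimeOut` and `MaxNumberOfRetries` values from the Settings.json posted in this thread.

```python
import asyncio


async def fetch_with_retry(fetch, *, timeout=60.0, max_retries=3):
    """Await the coroutine function `fetch` with a per-attempt timeout.

    Hypothetical helper: an attempt that hangs longer than `timeout`
    seconds is cancelled and retried, at most `max_retries` times in total.
    """
    last_error = None
    for _attempt in range(max_retries):
        try:
            # wait_for cancels the pending attempt once the timeout expires,
            # so a hung connection cannot occupy the task slot forever.
            return await asyncio.wait_for(fetch(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

With this shape, a hung request costs at most `timeout * max_retries` seconds before the task fails and the next one can start.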

resme2

Please attach screenshots from all your settings in the Settings window, or attach the Settings.json from %LOCALAPPDATA%\TumblThree.

{
"AccessTokenUrl": "https://www.tumblr.com/oauth/access_token",
"ApiKey": "",
"AuthorizeUrl": "https://www.tumblr.com/oauth/authorize",
"AutoDownload": false,
"Bandwidth": 0,
"BlogType": "None",
"BlogTypes": [
"None",
"All",
"Once finished",
"Never finished"
],
"BufferSize": 512,
"CatBoxType": 0,
"CheckClipboard": true,
"CheckDirectoryForFiles": false,
"CheckOnlineStatusOnStartup": false,
"ColumnSettings": [
{
"Key": "Name",
"Value": {
"m_Item1": 0,
"m_Item2": 200,
"m_Item3": 0
}
},
{
"Key": "Downloaded Files",
"Value": {
"m_Item1": 1,
"m_Item2": 0,
"m_Item3": 0
}
},
{
"Key": "Number of Downloads",
"Value": {
"m_Item1": 2,
"m_Item2": 0,
"m_Item3": 0
}
},
{
"Key": "Url",
"Value": {
"m_Item1": 3,
"m_Item2": 250,
"m_Item3": 0
}
},
{
"Key": "Progress",
"Value": {
"m_Item1": 4,
"m_Item2": 200,
"m_Item3": 0
}
},
{
"Key": "Online",
"Value": {
"m_Item1": 5,
"m_Item2": 80,
"m_Item3": 0
}
},
{
"Key": "Type",
"Value": {
"m_Item1": 6,
"m_Item2": 120,
"m_Item3": 0
}
},
{
"Key": "Date Added",
"Value": {
"m_Item1": 7,
"m_Item2": 120,
"m_Item3": 0
}
},
{
"Key": "Last Complete Crawl",
"Value": {
"m_Item1": 8,
"m_Item2": 120,
"m_Item3": 0
}
},
{
"Key": "Rating",
"Value": {
"m_Item1": 9,
"m_Item2": 110,
"m_Item3": 0
}
},
{
"Key": "Personal Notes",
"Value": {
"m_Item1": 10,
"m_Item2": 120,
"m_Item3": 0
}
}
],
"ConcurrentBlogs": 1,
"ConcurrentConnections": 100,
"ConcurrentScans": 4,
"ConcurrentVideoConnections": 100,
"ConnectionTimeInterval": 60,
"CreateAudioMeta": false,
"CreateImageMeta": false,
"CreateVideoMeta": false,
"DeleteOnlyIndex": false,
"DisplayConfirmationDialog": false,
"DownloadAnswers": true,
"DownloadAudios": true,
"DownloadCatBox": false,
"DownloadConversations": true,
"DownloadFrom": null,
"DownloadGfycat": false,
"DownloadImages": true,
"DownloadImgur": false,
"DownloadLinks": true,
"DownloadLocation": "D:\\tumblr\\combine",
"DownloadLoliSafe": false,
"DownloadMixtape": false,
"DownloadPages": null,
"DownloadQuotes": true,
"DownloadRebloggedPosts": true,
"DownloadSafeMoe": false,
"DownloadTexts": true,
"DownloadTo": null,
"DownloadUguu": false,
"DownloadUrlList": false,
"DownloadVideos": true,
"DownloadWebmshare": false,
"DumpCrawlerData": false,
"EnablePreview": true,
"ExportLocation": "blogs.txt",
"ForceRescan": false,
"ForceSize": false,
"GfycatType": 0,
"GridSplitterPosition": 674,
"Height": 838.40000000000009,
"ImageSize": "raw",
"ImageSizes": [
"raw",
"1280",
"500",
"400",
"250",
"100",
"75"
],
"IsMaximized": true,
"Left": 40,
"LimitConnections": true,
"LimitScanBandwidth": false,
"LoadAllDatabases": false,
"LoliSafeType": 0,
"MaxConnections": 90,
"MaxNumberOfRetries": 3,
"MetadataFormat": 0,
"MixtapeType": 0,
"OAuthCallbackUrl": "https://github.com/johanneszab/TumblThree",
"OAuthToken": "",
"OAuthTokenSecret": "",
"PageSize": 50,
"PortableMode": false,
"ProgressUpdateInterval": 100,
"ProxyHost": "127.0.0.1",
"ProxyPassword": "",
"ProxyPort": "1000",
"ProxyUsername": "",
"RegExPhotos": false,
"RegExVideos": false,
"RemoveIndexAfterCrawl": false,
"RequestTokenUrl": "https://www.tumblr.com/oauth/request_token",
"SafeMoeType": 0,
"SecretKey": "",
"SettingsTabIndex": 1,
"ShowPicturePreview": true,
"SkipGif": false,
"Tags": null,
"TimeOut": 60,
"TimerInterval": "22:40:00",
"Top": 0,
"TumblrHosts": [
"data.tumblr.com"
],
"UguuType": 0,
"UserAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36",
"VideoSize": 1080,
"VideoSizes": [
"1080",
"480"
],
"WebmshareType": 0,
"Width": 1550.4
}
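
To spot changed connection settings in a file like the one above without reading it line by line, something like the following Python sketch could be used. The reference values (1 blog, 8 connections, 4 video connections) are an assumption based on the figures the reporter quotes later in this thread; whether they are the shipped defaults is not confirmed here. `non_default_settings` and `load_settings` are hypothetical helper names.

```python
import json

# Assumed reference values, NOT confirmed TumblThree defaults.
ASSUMED_DEFAULTS = {
    "ConcurrentBlogs": 1,
    "ConcurrentConnections": 8,
    "ConcurrentVideoConnections": 4,
}


def non_default_settings(settings, defaults=ASSUMED_DEFAULTS):
    """Return {key: (actual, expected)} for every key that differs."""
    return {
        key: (settings.get(key), expected)
        for key, expected in defaults.items()
        if settings.get(key) != expected
    }


def load_settings(path):
    """Read a Settings.json file, e.g. from %LOCALAPPDATA%\\TumblThree."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `non_default_settings(load_settings(path))` returns an empty dict when all three values match the assumed references.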

I will delete the settings and see what happens.

No issues here for me.

I started TumblThree yesterday and have been downloading a single blog with 370'000 posts for 12 hours straight until now; I was able to download 130'000 posts with my crappy connection. All default settings, and it uses 400 MB of memory while having written 32 GB to disk:

tumblthree_memory2

tumblthree_memory1

I won't have the time to look into this any further this year. But thanks already for your images and the Settings.json.

Some things I've noticed from that information:

  • In your Settings.json you still have the "proxy" option set, which indicates that you've used TumblThree for quite some time. Since we switched at some point from our own implementation to the Internet Explorer/Windows proxy settings, I've only removed this setting from the Settings window. But some portions of the implementation are still there (#291). Maybe that is what causes the leak, if you still run into some portions of that code, but I'm not sure if/how that could cause a leak. The previous proxy settings were unreliable for sure and didn't work for anyone. It's a good thing that you started your settings from scratch now.

  • I added the additional "concurrent video connection" option to the settings for a reason, as explained in the tooltips! I would not change this at all, certainly not to 100! It's also explained in more detail in the release notes or some issue.

  • I would also not set the "concurrent connection" option to 100. I never thought that someone would actually increase the connection count this high. I don't know if Tumblr allows you to establish 100 connections to them from a single host. They for sure have some load balancing, and usually at this scale, things are rate limited. Thus, I'm guessing(!) you cannot even open the 100 connections to them. But I've never actually tested this. And I've certainly not tested whether that has any impact on TumblThree other than the regular connection timeouts/issues.
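
For readers unfamiliar with what a "concurrent connection" limit does, a generic Python sketch of the idea follows. It caps the number of downloads in flight with a semaphore and is not how TumblThree implements its settings; `run_limited` is a hypothetical name, and the returned peak count is only there to make the cap observable.

```python
import asyncio


async def run_limited(jobs, limit):
    """Run coroutine functions with at most `limit` of them in flight.

    Returns (results in input order, peak number of simultaneous jobs).
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once `limit` jobs are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return list(results), peak
```

Raising `limit` only helps while the remote side accepts the extra connections; beyond that, additional jobs just wait or time out.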

Let me know if you still have this issue with the fresh/new settings.json.

I deleted all files in /local/tumblthree and used the defaults (I only changed the API keys and logged in to my Tumblr account; everything else is default, without a proxy).
Tasks still get stuck at random, even with 1 task, 4 video connections and 8 connections.
I'm also using other Python scripts; they can run the whole day at full speed, but they cannot crawl hidden blogs and make it hard to know which blogs I have already downloaded.

I think I may have found the problem.
If I crawl a brand-new blog with default settings, there is no error.
But if I recrawl an existing blog with default settings, the software will definitely get stuck at some point.

What exactly do you mean by an "existing" blog? When did you add this blog? Years ago (based on your Settings.json from above, it could be quite some time ago)? There were certainly changes in the blog data structure, which I usually mention in the release notes. For example, the split from one (database) file into one containing all the "meta" data fields and another one containing the downloaded files. I did this so that the data of every file ever downloaded for all added blogs doesn't have to be loaded into memory.

I'm asking because I still can reproduce neither the memory usage nor the hang. If I add a blog now, crawl all 8141 posts from it, then recrawl it, there is no issue. TumblThree neither uses excessive memory nor hangs.

Thus, if your blogs are too old, you may have to export all your blogs once (via Settings), re-add them, and recrawl them. Already downloaded files that are still in place will not be re-downloaded.
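
The "files still in place are not re-downloaded" behaviour can be pictured with the following Python sketch. It assumes files are matched purely by the file name at the end of the URL, which is an illustration only; how TumblThree actually decides is not stated in this thread, and `files_to_download` is a hypothetical name.

```python
from pathlib import Path


def files_to_download(urls, download_dir):
    """Return the URLs whose file name is not yet present in download_dir.

    Assumption: a file counts as downloaded if a file with the same name
    (the last path segment of the URL) already exists in the directory.
    """
    existing = {p.name for p in Path(download_dir).iterdir() if p.is_file()}
    return [url for url in urls if url.rsplit("/", 1)[-1] not in existing]
```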