janmg/logstash-input-azure_blob_storage

Plugin not creating registry file

sera4000 opened this issue · 17 comments

Hi Jan,

Any chance you could suggest a way to troubleshoot this? I have two Azure resource groups - one in a Dev environment and the other in a Production environment. Each one has a Storage Account with a container. The permissions and config 'seem' identical.

Running your plugin against the Dev environment works - blob log files (json lines) are ingested by your plugin and sent up to elasticsearch.

Running the plugin against the Production environment produces no visible errors (in DEBUG mode), however no data is ingested, and no data/registry.dat file is created. I have tried using resume, start_over, and start_fresh.

The input config looks like this:

input {
  azure_blob_storage {
    storageaccount => "some-storage-account"
    access_key => "some-key"
    container => "logcontainer"
    codec => "line"
    registry_create_policy => "start_fresh"
    path_filters => ['2020-05-14/11/*']
    interval => 60
  }
}

The blob files in each environment are the same. Size of each file is between 5kb and 5mb.

Any suggestions for how to troubleshoot / trace in more detail?

Many thanks,
Sera

janmg commented

In short: check without path_filters, check for ACLs and use debug_until.

If you start_fresh, you wouldn't need the path_filters; during startup the plugin learns which files exist, ignores those, and only processes newly discovered files in the next polling interval. The path_filters uses glob matching (Ruby File fnmatch) and although your path_filters looks good, it is rather fiddly and may not work as expected.

Can you verify with the storage explorer (application or in the portal) that your keys work and that you can create a directory "data" and that you don't have endpoint access control lists (ACLs) that block your client?

Can you set "debug_until" to a number of events? That will log at INFO level, for the first couple of files, which files are processed, how large they are and whether they have grown.
debug_until => 1000

That should give you, for the first 1000 events, information on which blobs were found.
1: list_blobs #{blob.name} #{offset} #{length}
2: adding offsets: #{name} #{off} #{file[:length]}
3: processing #{name} from #{file[:offset]} to #{file[:length]}

And if it doesn't work, can you then paste some of the logfiles? Even the absence of log entries helps to pinpoint the problem. I can add some debug log statements at lines 346, 370, 199, 257 ...

Thanks a lot for the assistance. I am following your troubleshooting procedures and will respond with the results soon. The storage account contains hundreds of thousands of files, which is why I used path_filters. If I remove the filter, what will happen if not all of the files have been processed by the time the interval repeats? Do I need to set a very long interval?

I'm assuming that the registry file only gets created after all files have been processed.

janmg commented

During register (init) the registry is loaded or created, but not saved. During the normal run loop, the clock starts, the files are listed, compared to the registry, and the files that have grown or are new are put on a worklist. Once all the files are processed the registry is saved and the clock is compared to see if the interval has passed. The first time, the plugin has a huge worklist and it takes a long time to process everything, unless you "start_fresh". The next time the worklist will be a lot smaller and the plugin will wait until the interval has passed before listing the files again. 60 seconds seems a good interval for this polling behaviour.
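Roughly, the compare step works like this; the hash layout with :offset and :length below is only an illustration, not the plugin's exact internals:

# previous state (what the registry remembered) versus a fresh listing;
# the :offset/:length fields mirror the debug lines mentioned above,
# but this is only a sketch, not the plugin's actual data structure
registry = { 'Requests/2020-05-15/07/a.blob' => { offset: 4096, length: 4096 } }
newreg   = { 'Requests/2020-05-15/07/a.blob' => { offset: 4096, length: 8192 },   # grew
             'Requests/2020-05-15/08/b.blob' => { offset: 0,    length: 2048 } }  # new

# new blobs, or blobs that have grown past the stored offset, go on the worklist
worklist = newreg.select do |name, file|
  seen = registry[name]
  seen.nil? || file[:length] > seen[:offset]
end

puts worklist.keys   # => both blobs: a.blob has grown, b.blob is new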

The debug_until feature is useful because logstash itself is very chatty in debug mode, and debugging mostly needs to happen just after plugin start.

You can also test the path_filters using irb; if you have directories, a single * will not be enough, you would need '**/*'

File.fnmatch?('2020-05-14/**/*', '2020-05-14/11/00/15/logfile.txt', File::FNM_PATHNAME | File::FNM_EXTGLOB)

So maybe for you a path_filters =>['202*/**/*'] will fix this for the next 10 years?
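If it helps, here is a small irb-style snippet to try candidate patterns against a few sample paths (the paths are just made up to match the directory layout described here):

flags = File::FNM_PATHNAME | File::FNM_EXTGLOB

samples = ['2020-05-14/11/00/15/logfile.txt', '2020-05-15/07/some.blob']

['2020-05-14/11/*', '2020-05-14/**/*', '202*/**/*'].each do |pattern|
  samples.each do |path|
    # true means path_filters would let this blob through
    puts format('%-18s %-35s %s', pattern, path, File.fnmatch?(pattern, path, flags))
  end
end

Note that with FNM_PATHNAME a single * does not cross a / separator, which is why '2020-05-14/11/*' misses blobs that are nested one level deeper.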

Using Storage Explorer app, connecting with key, I am able to create directory 'data'.

My Production account has millions of .blob files. The directory structure is 4 layers deep. The "root" directory is named 'prod-access-data_fced752da2e641d7bdf0f7124cfc7d34'. I need to pull everything under the 'Requests' subdirectory (and recursive). So path filter I am using is:

path_filters => 'prod-access-data_fced752da2e641d7bdf0f7124cfc7d34/Requests/**/*'

Below the 'Requests' directory are subdirectories labelled by date, for example '2020-05-15'. Below that are subdirectories labelled by hour, for example '07'. The .blob files are inside these.

I have re-created the exact same folder structure in a Dev account, but have only included a few .blob files. I then run the following against my Dev and against my Production accounts:

input {
  azure_blob_storage {
    storageaccount => "accountname"
    access_key => "key"
    container => "ds-container"
    codec => "line"
    registry_create_policy => "start_fresh"
    path_filters => 'prod-access-data_fced752da2e641d7bdf0f7124cfc7d34/Requests/**/*'
    interval => 60
    debug_until => 1000
  }
}

What is happening is that in the Dev account this works correctly, the registry file is created, and data is ingested. However, in the Production account, there are no error messages. It just displays:

[2020-05-15T05:14:18,710][INFO ][logstash.inputs.azureblobstorage][main] main starting fresh

... and stays like that for a long time. No registry file is created.

If I remove the path filter, in Production, it then does process the blob files (I see them listed from debug_until), but after 10 hours it is still processing ...

Any suggestions for getting the path filter to work in Production? Why would it work in Dev but not in Production?

Many thanks,
Sera

More detail - the Prod container has 32TB of blob files. I'm certain this is why nothing appears to be happening. I have tried creating a new container in the same Prod account, with just a few blob files, and it all works fine.

One thing I still don't understand, regarding start_fresh ... After letting it cycle through start_fresh a few times, no registry file is created. But if I then change the config to resume, the first time that runs a registry file is created. Would it not be simpler (and same speed) to just use resume?

I guess what is confusing me is that 'start_fresh' does not create a registry file. Only after running 'start_fresh', stopping logstash, changing to 'resume', does a registry file get created. But the behavior then follows 'resume' - and the pre-existing blob files get processed. This is not what I want - I want only new blobs to get processed. Any idea why 'start_fresh' is not creating a registry file?

janmg commented

prefix filters the file list on the azure side, path_filters does this in the plugin.

If you configure prefix to only fetch the blobs in the Requests directory, you won't have to keep a large registry and then filter it in the plugin. How many files would that be?

https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs

If you leave the value on resume without a registry, the plugin will start from absolute scratch, processing everything. If you want to only capture changed and new files, then start_fresh is the way to go. I can add a save_registry(@registry) in line 146, and then some of the debug statements I already mentioned. I don't have time in the next couple of days, but I'm really curious as to where in your case the processing grinds to a halt. 32TB shouldn't be a problem, but the number of files may be problematic. The Faraday component that my plugin depends on has been a bit flaky; a fix was released in March, and azure-storage-blob 2.0 also came out in March. So it's a good moment to release a new version of this plugin anyway ... I'm just super busy in the next couple of days.
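For illustration, this is roughly what a prefix-filtered listing looks like against the azure-storage-blob gem this plugin builds on; treat it as a sketch of the client API (account name, key, container and prefix are placeholders), not as plugin code:

require 'azure/storage/blob'

# placeholders -- fill in your own account details
client = Azure::Storage::Blob::BlobService.create(
  storage_account_name: 'accountname',
  storage_access_key:   'key'
)

total  = 0
marker = nil
loop do
  # the service returns at most 5000 blobs per call; the continuation
  # token (the nextMarker) is passed back to fetch the next batch
  batch  = client.list_blobs('ds-container',
                             prefix: 'prod-access-data_fced752da2e641d7bdf0f7124cfc7d34/Requests',
                             marker: marker)
  total += batch.count
  marker = batch.continuation_token
  break if marker.nil? || marker.empty?
end

puts "#{total} blobs under the prefix"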

I will try with prefix filter - that sounds much better. I assume the syntax is the same as path filter.

Yes, I think it would be good to add save_registry. I did test with smaller environments and found that, with start_fresh, the registry file is only created after the first interval completes and a new blob file has been created.

The directory I'm interested in is 7TB and 2.5 million blob files.

For the last 2 days I've tried to run the plugin against the full 32TB (17 million files) using start_fresh without any filter. My logstash is an EC2 instance with 2vCPU and 8GB mem, running Centos 7. I've increased java heap space to 4GB, but everything else is pretty much default. After about 4 million files are listed, I get a java heap memory error.

With prefix and start_fresh configured, it now lists all the files in the directory - takes about 10 hours. Unfortunately after listing them, it just starts the cycle again immediately, and no registry file gets created. I have let it cycle through 3 times, and no registry file. My interval is set to 120. Please advise on how to configure this ... Thanks.

Note: I have now added save_registry(@registry) in line 146, and changed the interval to 500000. Will let you know if this works.

Note after testing: This does not work for very large directories (2 million files). It works fine against smaller directories. For large directories the plugin does not create a registry file, and does not respect the interval configured.

janmg commented

I added the save_registry after listing and pushed it to GitHub.

I fixed the triple-listing problem. Originally I added this loop because the azure-storage-blob library the plugin depends on uses Faraday, and sometimes a connection error occurred. I intended this to be a retry and not something that always happens, but the break didn't work as I intended.

The Azure API returns a maximum of 5000 files in one go, and the nextMarker is used to retrieve the next batch. For you that is 3400 requests, which will take a considerable amount of time, and the plugin didn't provide any feedback unless you enabled debug_until. I have now added a log statement for every marker, so instead of listing 17,000,000 file names you only see the marker logged 3400 times.

The interval is used to determine how long to sleep before starting to list again, unless processing exceeds the interval. For you, sleeping is not going to happen: the listing is going to take over 3 hours, which far outstrips a 60 second sleep time. Then, because you use start_fresh, a new list is created so the plugin can compare which files are new and which ones have grown. With the bug this resulted in 6 listings instead of 2, which for a couple of thousand files wasn't noticeable, but at your scale it's a different story. For you I estimate 14400 is a good sleep time; that makes the plugin run in 4 hour intervals, so your server has some time to catch its breath. Even setting it to 0 would just mean no delay before looping again.

I'm still puzzled as to why the registry doesn't get created; it might have to do with memory allocation. I use -Xms1g -Xmx4g, but for you the file list may grow very large. If you have free memory, it would be good to allocate more than 4GB for Xmx and monitor what Java actually grabs for logstash.

I would appreciate it if you could build the 0.11.4 version and test whether it works better for you. The build.sh script contains the steps I use on Ubuntu 18.04 LTS. I can build a release once I've had some time to look through the changes I made a bit more thoroughly.

Thanks for your detailed feedback, it helps me a lot to visualize the actual problem and where to make improvements.

Hi - I am unable to install the gem due to dependencies. If you could build a release so that I could install via the plugin installer, I will be able to test. Thanks ...

janmg commented

No worries, today I worked on issue 2, storing the registry on a local file path, which in your case, with AWS-to-Azure traffic and lots of files, should massively offload the saving of the registry. I still want to improve the timers so the registry is saved at regular intervals, and then I'll release a new version that should scale better.
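The idea, very roughly (the Marshal serialization and the path are assumptions for this sketch, not necessarily what the released plugin will do):

require 'tmpdir'

# toy registry; the real one maps every blob name to its offsets
registry   = { 'Requests/2020-05-15/07/a.blob' => { offset: 4096, length: 4096 } }
local_path = File.join(Dir.tmpdir, 'registry.dat')   # stand-in for registry_local_path

# writing to local disk in a background thread keeps the processing loop
# from blocking on a round trip to the storage account
saver = Thread.new do
  File.open(local_path, 'wb') { |f| f.write(Marshal.dump(registry)) }
end
saver.join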

I have a couple of questions regarding a new release:

  1. How will we know when you have done this?
  2. How does one install a new release? Can it be done directly from the Logstash plugin install command (/usr/share/logstash/bin/logstash-plugin install ...)?

Thanks...

janmg commented

Good questions, I pushed 0.11.4 last week; didn't receive many complaints yet =)

I have updated the CHANGELOG.md file to explain what the changes are, then I tagged the commit that I used for 0.11.4, and then I pushed the gem to RubyGems. That's where logstash will pull the plugin from.
https://rubygems.org/gems/logstash-input-azure_blob_storage

Plugin management is explained here
https://www.elastic.co/guide/en/logstash/current/working-with-plugins.html

For all distributions/versions the commands are slightly different, but for Ubuntu installation, the commands would be
cd /usr/share/logstash
bin/logstash-plugin list --verbose
bin/logstash-plugin update logstash-input-azure_blob_storage

And then you need to reload the pipeline
service logstash restart

Logstash uses the HUP signal to reload the pipelines where the configuration has changed, so you don't have to disturb other pipelines.
https://www.elastic.co/guide/en/logstash/current/reloading-config.html
kill -SIGHUP $(pgrep logstash)

I think you could use these configuration parameters

prefix => 'prod-access-data_fced752da2e641d7bdf0f7124cfc7d34/Requests'
registry_create_policy => 'start_fresh'
interval => 14400
debug_timer => true
registry_local_path => '/usr/share/logstash/plugins'

Use 'resume' if you want to process everything. The interval of 14400 is 4 hours because your file listing will take long, debug_timer will log how many seconds each step takes, and saving the registry locally in a separate thread will be much faster than pushing it back to the storage account.

Thanks for all this detailed info. I have installed ver 0.11.4 and am testing now with your recommended config. Two things I still don't quite understand:

  1. What will happen if the first round of listing does not finish before 4 hours? Does it interrupt the process and start again?

  2. At what stage should I change from start_fresh to resume? Does it matter if I make this change in the middle of a listing interval?

(I do not want to ingest historical data. I only want to ingest new data from today.)

janmg commented

The first list_blobs is done at startup, and only if you start_fresh (for ingesting new data). This fills the registry with the known files and offsets. Then the eternal processing loop starts by creating a new list to see whether any files are new or have grown. Then the files are read and the lines are added as messages to the logstash queue, where they go through the filter block. After each processed file, the time since the start of the worklist is checked, so the registry is saved once every interval. With the registry you don't lose work in case logstash crashes. Saving of the registry happens in a separate thread, and since 0.11.4 only one at a time. If the whole worklist is processed before an interval has passed, the plugin goes into an interruptible sleep, so if you stop the plugin it can stop immediately. If the interval has already passed, there is no sleep and the loop starts again with a list_blobs.
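The sleep at the end looks roughly like this; Stud.stoppable_sleep is the helper commonly used in Logstash inputs, and whether the plugin uses exactly this form is an assumption, so read it as a sketch:

require 'stud/interval'   # the stud gem ships with Logstash

interval = 5              # seconds; tiny value only for this demo
start    = Time.now

# ... imagine the whole worklist was processed here ...

stopping = false          # in a real input this would be the stop? flag
elapsed  = Time.now - start
# sleep whatever is left of the interval, polling the block so the
# plugin can wake up immediately when Logstash shuts down
Stud.stoppable_sleep(interval - elapsed) { stopping } if elapsed < interval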

You can change to "resume" right after the plugin has started, or better when you know the registry was created, so that at the next start it will resume from the registry.

Hi - just to let you know ... the plugin is working fine (with local registry) now. Thanks for all your support!

The only issue I am still seeing, is high jvm memory use. My Logstash server has 16GB memory, so I have configured Java mem as follows:

-Xms12g
-Xmx12g

Logstash is running fine with your plugin, however if I try and run any other Java command (e.g. list plugins) at the same time, I get the following error:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000514cc0000, 12535988224, 0) failed; error='Not enough space' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 12535988224 bytes for committing reserved memory.

Looking at 'top' while Logstash is running shows the logstash java process is using 84% mem and 40-50% cpu.

Thanks again for your assistance.
Sera