janmg/logstash-input-azure_blob_storage

Index -1 out of bounds for length

ThreatLentes opened this issue · 20 comments

I'm giving the plugin a try but have no idea why I'm getting the error below.

Running Logstash 8.10.4 fresh install

Using bundled JDK: /usr/share/logstash/jdk
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true
logstash 8.10.4

Logstash Config:

input {
    azure_blob_storage {
        codec => "json"
        storageaccount => "nsgflowsiemtest"
        access_key => "base64=="
        container => "insights-logs-networksecuritygroupflowevent"
        logtype => "nsgflowlog"
        prefix => "resourceId=/"
        path_filters => ['**/NSG-SIEM-POC/**/*.json']
        interval => 30
    }
}
filter {
    json {
        source => "message"
    }
}
output {
    stdout{codec => rubydebug}
}

For reference, the full location of the json in the storage account is below:
resourceId=/SUBSCRIPTIONS/7123871293721379/RESOURCEGROUPS/SIEM-POC/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/NSG-SIEM-POC/y=2023/m=10/d=27/h=15/m=00/macAddress=7812738HD/PT1H.json

Getting the following error:

[INFO ] 2023-10-27 13:57:38.074 [[main]-pipeline-manager] azureblobstorage - === azure_blob_storage 0.12.9 / main / 791cae / ruby 3.1.0p0 ===
[INFO ] 2023-10-27 13:57:38.074 [[main]-pipeline-manager] azureblobstorage - If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage
[INFO ] 2023-10-27 13:57:38.084 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2023-10-27 13:57:38.098 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[ERROR] 2023-10-27 13:57:38.379 [[main]<azure_blob_storage] azureblobstorage - caught: undefined local variable or method `path' for #<LogStash::Inputs::AzureBlobStorage:0x6e1391e9>
[ERROR] 2023-10-27 13:57:38.380 [[main]<azure_blob_storage] azureblobstorage - loading registry failed for attempt 1 of 3
[ERROR] 2023-10-27 13:57:38.451 [[main]<azure_blob_storage] azureblobstorage - caught: undefined local variable or method `path' for #<LogStash::Inputs::AzureBlobStorage:0x6e1391e9>
[ERROR] 2023-10-27 13:57:38.452 [[main]<azure_blob_storage] azureblobstorage - loading registry failed for attempt 2 of 3
[ERROR] 2023-10-27 13:57:38.485 [[main]<azure_blob_storage] azureblobstorage - caught: undefined local variable or method `path' for #<LogStash::Inputs::AzureBlobStorage:0x6e1391e9>
[ERROR] 2023-10-27 13:57:38.485 [[main]<azure_blob_storage] azureblobstorage - loading registry failed for attempt 3 of 3
[INFO ] 2023-10-27 13:57:38.486 [[main]<azure_blob_storage] azureblobstorage - learn_encapsulation, this can be skipped by setting skip_learning => true. Or set both head_file and tail_file
[INFO ] 2023-10-27 13:57:39.295 [[main]<azure_blob_storage] azureblobstorage - learn json header and footer failed because Index -1 out of bounds for length 4299
[INFO ] 2023-10-27 13:57:39.295 [[main]<azure_blob_storage] azureblobstorage - head will be: '' and tail is set to: ''
[ERROR] 2023-10-27 13:57:41.159 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 13:58:10.738 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 13:58:41.230 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 13:59:11.240 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 13:59:40.411 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:00:11.547 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:00:40.893 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:01:13.502 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:01:41.194 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:02:11.273 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:02:43.040 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs
[ERROR] 2023-10-27 14:03:11.053 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464374 while trying to list blobs

janmg commented

I think this is related to path_filters. The error happens while listing all the blobs, which is also where the path filtering happens, using File::FNM_PATHNAME and File::FNM_EXTGLOB. I don't know if **/NSG-SIEM-POC/**/*.json will be able to find the files.

In a couple of days I can test with a debug program to see which files the filter would find, but for now I recommend removing path_filters. If you have more files in the blob container, you can use the simple configuration option "prefix" as a filter.
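For what it's worth, the matching can be checked standalone in plain Ruby. Below is a sketch using the same fnmatch flags the plugin relies on; the blob name is the one from this issue:

# Standalone check of the glob matching the plugin relies on:
# File.fnmatch? with FNM_PATHNAME and FNM_EXTGLOB.
blob_name = "resourceId=/SUBSCRIPTIONS/7123871293721379/RESOURCEGROUPS/SIEM-POC/" \
            "PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/NSG-SIEM-POC/" \
            "y=2023/m=10/d=27/h=15/m=00/macAddress=7812738HD/PT1H.json"
filter = "**/NSG-SIEM-POC/**/*.json"

flags = File::FNM_PATHNAME | File::FNM_EXTGLOB
# Prints true if the filter would select this blob.
puts File.fnmatch?(filter, blob_name, flags)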

Thank you @janmg

I made some changes to the config to avoid some of the errors, like the registry one. I also removed path_filters.

input {
    azure_blob_storage {
        codec => "json"
        storageaccount => "nsgflowsiemtest"
        access_key => "base64=="
        container => "insights-logs-networksecuritygroupflowevent"
        logtype => "nsgflowlog"
        prefix => "resourceId=/SUBSCRIPTIONS/7123871293721379/RESOURCEGROUPS/SIEM-POC/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/NSG-SIEM-POC/"
        file_head => "{"
        file_tail => "}"
        skip_learning => true
        registry_local_path => "/usr/share/logstash/blobplugin"
        interval => 30
        registry_create_policy => "start_over"
    }
}
filter {
    json {
        source => "message"
    }
}
output {
    stdout{codec => rubydebug}
}

Below are the errors. At some point it just throws an error without a message.

[INFO ] 2023-10-27 16:45:59.773 [[main]-pipeline-manager] javapipeline - Pipeline Java execution initialization time {"seconds"=>0.63}
[INFO ] 2023-10-27 16:45:59.777 [[main]-pipeline-manager] azureblobstorage - === azure_blob_storage 0.12.9 / main / b0bf1d / ruby 3.1.0p0 ===
[INFO ] 2023-10-27 16:45:59.778 [[main]-pipeline-manager] azureblobstorage - If this plugin doesn't work, please raise an issue in https://github.com/janmg/logstash-input-azure_blob_storage
[INFO ] 2023-10-27 16:45:59.778 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2023-10-27 16:45:59.790 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2023-10-27 16:45:59.998 [[main]<azure_blob_storage] azureblobstorage - head will be: '{' and tail is set to: '}'
[ERROR] 2023-10-27 16:46:02.460 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464363 while trying to list blobs
[ERROR] 2023-10-27 16:46:31.440 [[main]<azure_blob_storage] azureblobstorage - caught: Index -1 out of bounds for length 5464363 while trying to list blobs
[ERROR] 2023-10-27 16:47:00.943 [[main]<azure_blob_storage] azureblobstorage - caught:  while trying to list blobs
[ERROR] 2023-10-27 16:47:31.462 [[main]<azure_blob_storage] azureblobstorage - caught:  while trying to list blobs
[ERROR] 2023-10-27 16:48:00.922 [[main]<azure_blob_storage] azureblobstorage - caught:  while trying to list blobs
[ERROR] 2023-10-27 16:48:31.371 [[main]<azure_blob_storage] azureblobstorage - caught:  while trying to list blobs
[ERROR] 2023-10-27 16:49:00.792 [[main]<azure_blob_storage] azureblobstorage - caught:  while trying to list blobs
[ERROR] 2023-10-27 16:49:31.227 [[main]<azure_blob_storage] azureblobstorage - caught:  while trying to list blobs
janmg commented

I don't have much time to set up a test environment, but in your case the plugin doesn't seem to be able to read the blobs, and I think you either filter too much or the access key is not healthy.

prefix is an option that is passed directly to the Ruby BlobClient to list the blobs. You don't need to set it if you use the storage account only for NSG flow logs. You don't have to set a full path; it's enough to set prefix => "resourceId=/", or if you have multiple resource groups,
prefix => "resourceId=/SUBSCRIPTIONS/7123871293721379/RESOURCEGROUPS/SIEM-POC"

The plugin reads nsgflowlogs as JSON, and the learning is used to figure out what the first and last blocks contain, so that it can read the other blocks as valid JSON, and also read partial blocks as valid JSON. In the example below, the plugin can read time:1 as valid JSON.

{"records":[
{ "time": "1", "subscription": "abcde", "resourcegroup": "rg", "nsg": "testrule" }
{ "time": "2", "subscription": "abcde", "resourcegroup": "rg", "nsg": "testrule" }
]}

And if time:2 gets added to the same file, the plugin will do a partial read and add the head and tail, so the JSON is still valid:
{"records":[
{ "time": "2", "subscription": "abcde", "resourcegroup": "rg", "nsg": "testrule" }
]}

Your head '{' and tail '}' will not result in valid JSON. Set them to '{"records":[' and ']}' for NSG flow logs. I do this by default if the logtype is nsgflowlog, but you can override it by setting file_head and file_tail if yours differ. The plugin will still try to read a blob to check; this can be skipped by setting skip_learning => true.
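To illustrate the reconstruction (a sketch, not the plugin's actual code):

require 'json'

# Hypothetical partial read: only the newly appended record was fetched.
partial = '{ "time": "2", "subscription": "abcde", "resourcegroup": "rg", "nsg": "testrule" }'
head = '{"records":['   # file_head for nsgflowlog
tail = ']}'             # file_tail for nsgflowlog

# Wrapping the delta in head and tail makes it parse as complete JSON again.
event = JSON.parse(head + partial + tail)
puts event["records"].first["time"]   # => 2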

[2023-11-09T15:36:26,491][INFO ][logstash.inputs.azureblobstorage][main][b3a491c2e8d924a069dbcde96cf93021a2d0e3b327eba5aa2b3842c5a9a7c58c] learn json header and footer failed because Index -1 out of bounds for length 4281
[2023-11-09T15:36:26,491][INFO ][logstash.inputs.azureblobstorage][main][b3a491c2e8d924a069dbcde96cf93021a2d0e3b327eba5aa2b3842c5a9a7c58c] head will be: '' and tail is set to: ''
[2023-11-09T15:36:27,444][ERROR][logstash.inputs.azureblobstorage][main][b3a491c2e8d924a069dbcde96cf93021a2d0e3b327eba5aa2b3842c5a9a7c58c] caught: Index -1 out of bounds for length 5464332 while trying to list blobs
[2023-11-09T15:36:30,649][DEBUG][org.logstash.execution.PeriodicFlush][main] Pushing flush onto pipeline.

These are the logs I am seeing. I am using version 8.10.4 of Elasticsearch and Logstash.

FROM docker.elastic.co/logstash/logstash:8.10.4

RUN bin/logstash-plugin install logstash-input-azure_blob_storage

This is my Dockerfile.

janmg commented

I don't really understand where the Index -1 out of bounds comes from; at least the file has a length, because the plugin tries to list all the files in the blob storage with their file lengths so it can read them one at a time, and if it detects a file has grown, it will read the delta. I don't understand the second error either, "while trying to list blobs", because there was supposed to be an exception message, but instead there are just spaces.
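The grow-detection is roughly this (a sketch, not the actual plugin code; the registry maps blob name to the size seen in the previous round):

# Yield the byte range that was appended since the last listing.
def process_deltas(listed_blobs, registry)
  listed_blobs.each do |name, size|
    seen = registry.fetch(name, 0)
    next unless size > seen
    yield name, seen, size - seen   # offset and length of the new bytes
    registry[name] = size
  end
end

registry = { "PT1H.json" => 4299 }
process_deltas({ "PT1H.json" => 5464374 }, registry) do |name, offset, length|
  puts "reading #{length} bytes of #{name} from offset #{offset}"
end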

Is there something special with these accounts? I created a fresh account for testing and didn't see these errors. In the repo there is a blob_debug.rb that can iterate through the account. A look with Storage Explorer may also show what is so special about these accounts.

@janmg

Here another trace

It says: caught: undefined local variable or method `path'

[2023-11-10T22:05:19,386][INFO ][logstash.javapipeline ][main] Pipeline started {"pipeline.id"=>"main"}
[2023-11-10T22:05:19,400][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2023-11-10T22:05:19,500][ERROR][logstash.inputs.azureblobstorage][main][483c55d274dabd287ceabf0602317051ad3522c4df0276fb7482b9d6e98985a5] caught: undefined local variable or method path' for #<LogStash::Inputs::AzureBlobStorage:0x6556358d> [2023-11-10T22:05:19,501][ERROR][logstash.inputs.azureblobstorage][main][483c55d274dabd287ceabf0602317051ad3522c4df0276fb7482b9d6e98985a5] loading registry failed for attempt 1 of 3 [2023-11-10T22:05:19,519][ERROR][logstash.inputs.azureblobstorage][main][483c55d274dabd287ceabf0602317051ad3522c4df0276fb7482b9d6e98985a5] caught: undefined local variable or method path' for #LogStash::Inputs::AzureBlobStorage:0x6556358d
[2023-11-10T22:05:19,519][ERROR][logstash.inputs.azureblobstorage][main][483c55d274dabd287ceabf0602317051ad3522c4df0276fb7482b9d6e98985a5] loading registry failed for attempt 2 of 3
[2023-11-10T22:05:19,584][ERROR][logstash.inputs.azureblobstorage][main][483c55d274dabd287ceabf0602317051ad3522c4df0276fb7482b9d6e98985a5] caught: undefined local variable or method `path' for #LogStash::Inputs::AzureBlobStorage:0x6556358d
[2023-11-10T22:05:19,584][ERROR][logstash.inputs.azureblobstorage][main][483c55d274dabd287ceabf0602317051ad3522c4df0276fb7482b9d6e98985a5] loading registry failed for attempt 3 of 3

Is it coming from here? https://github.com/janmg/logstash-input-azure_blob_storage/blob/master/lib/logstash/inputs/azure_blob_storage.rb#L435

I am not fluent in Ruby, but I could not find where the reference to `path` comes from.

I tried the blob_debug.rb you mentioned, but I keep getting

Faraday::ConnectionFailed: Connection reset

I'm dealing with the same error.

[2023-11-11T12:15:16,857][ERROR][logstash.inputs.azureblobstorage][main][cd330872858e585d21c29db5cdbdd5d28094025289d883cf2088cbbb878b43f0] caught: while trying to list blobs

I've investigated all the points made, such as changing the paths, changing the file head and tail, and verifying the storage account permissions. I think in all cases the key is not the problem, as it has sufficient permissions, but as stated by @janmg, there could be a minor inconsistency in the configuration. I would assume it's due to the fact that when an NSG flow log is created, it forces you to create a new storage account. However, looking in Storage Explorer, no settings seem off about the connected account.

@janmg, do you mind specifying the steps you went through to create your storage account and configure the logs? Did you create the storage account separately and then link it to an NSG flow log, or create them both together?

Here are the storage account settings:

[screenshot: storage account settings]

Here are the blob settings:

[screenshot: blob settings]

This is what is linked with the plugin. Nothing seems special per se. Can you share the settings you have so we can compare?

janmg commented

Connecting to the storage account itself goes fine, but I don't know why a list blobs returns an index out of bounds. The registry keeps a list of files and their sizes. My test storage account is really small because I only set up one VM for 6 hours and let it attract some unwanted traffic to test my logstash pipeline, and it works.

If this is happening more often, I will release a version which prints out more debugging information to put my finger on the problem. Below is my test pipeline.

input {
    azure_blob_storage {
        storageaccount => "janmg"
        access_key => "lmHqbCLSgD1UVB3r2+...deZQ=="
        container => "insights-logs-networksecuritygroupflowevent"
        codec => json
        # below options are optional
        logtype => "nsgflowlog"
        prefix => "resourceId=/SUBSCRIPTIONS/F5DD6E2D-1F42-4F54-B3BD-DBF595138C59/RESOURCEGROUPS/VM/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/OCTOBER-NSG/"
        file_head => '{"records":['
        file_tail => "]}"
        skip_learning => true
        registry_local_path => "/usr/share/logstash/data/plugin"
        registry_create_policy => "start_over"
        interval => 60
    }
}
	
output {
    stdout { codec => rubydebug }
}

[screenshot]

I also tried upgrading my storage account to v2 just to rule out that possibility. No luck.

I have been playing with blob_debug.rb. It fails 9 out of 10 times with the error message Faraday::ConnectionFailed: Connection reset.
I have tried different storage accounts of different versions as well. No luck.
Not sure what to make of it.

janmg commented

I have modified the plugin to add more debugging, and I now also receive the index out of bounds, including from my blob_debug.rb, which only does a blob list. This points to a problem in the dependency azure-storage-ruby, which hasn't changed in the last 2 years, which in turn must point to a problem in its dependency on faraday or nokogiri. I haven't determined the exact reason, but I'm no longer clueless.

https://github.com/Azure/azure-storage-ruby/blob/master/blob/lib/azure/storage/blob/container.rb#L602
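A bare list call through the gem, roughly what blob_debug.rb does, should reproduce it outside logstash (the account name and key below are placeholders):

require 'azure/storage/blob'

# Minimal reproduction of the listing the plugin performs.
client = Azure::Storage::Blob::BlobService.create(
  storage_account_name: 'nsgflowsiemtest',   # placeholder
  storage_access_key:   'base64=='           # placeholder
)

begin
  client.list_blobs('insights-logs-networksecuritygroupflowevent',
                    prefix: 'resourceId=/').each do |blob|
    puts "#{blob.name} #{blob.properties[:content_length]}"
  end
rescue => e
  # If the gem is at fault, the index out of bounds should surface here too.
  puts "caught: #{e.message}"
end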

Thanks. If you are able to get the plugin working again, that would be great.

janmg commented

I think I now understand: the azure-storage-ruby gem uses Faraday to connect to the storage account and somehow that is no longer working, and upgrading to Faraday 2 isn't really possible because of the Faraday version Logstash itself uses.
Azure/azure-storage-ruby#227

I don't see a quick fix. I also don't see an easy alternative. I'm most tempted to rewrite the file handling in Go, to make it available to any log system out there. But it's an awful lot of coding and I don't have much free time to spare.

@janmg Would it be possible to use this https://github.com/honeyankit/azure-storage-ruby instead of the one managed by MS? I tried doing it locally, but it just spins my head.

janmg commented

I cloned the version from muxcmux for common and blob and pushed it to rubygems as version 3.0.0:
https://rubygems.org/gems/azure-janmg-common
https://rubygems.org/gems/azure-janmg-blob
The problem, however, is that while building the plugin, bundler compares against the versions that logstash uses and reports a version conflict. I haven't figured out how to use multiple versions of a gem.
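For reference, the dependency lines in the plugin gemspec would then look roughly like this (a hypothetical excerpt, not the published gemspec):

# logstash-input-azure_blob_storage.gemspec (hypothetical excerpt)
Gem::Specification.new do |s|
  s.name    = 'logstash-input-azure_blob_storage'
  s.version = '0.13.0'   # hypothetical
  s.summary = 'Logstash input for Azure Blob Storage'
  s.authors = ['janmg']
  # Renamed forks avoid a name clash with azure-storage-common/blob, but
  # bundler still resolves faraday against the version logstash ships.
  s.add_runtime_dependency 'azure-janmg-common', '~> 3.0'
  s.add_runtime_dependency 'azure-janmg-blob',   '~> 3.0'
end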

My head is spinning too. I don't know which update killed the list blobs, and if I figure it out it should be possible to fix somehow; however, Ruby is not my strongest programming language and I choke up on the dependencies.

I'm looking into migrating to Java, but it wouldn't improve performance. I have previously considered making a fluentd plugin, but it's also in Ruby. I also studied using Filebeat, but I don't see how to easily use it for nsgflowlogs.
https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-azure-blob-storage.html

Yesterday I started looking at storage-blobs-go-quickstart to see if I can split the Azure file handling out of the logstash plugin event queue. That way a golang application would connect to the storage account and deal with listing and retrieving the files, while the logstash plugin would only have to pick the data up and process it further. That seems more future proof, but it will cost me some weekends to get it into working shape.

Any other suggestions are very welcome

I was just looking into using this plugin. I'm not of much help here as I don't know these languages, but I can at least test for you. I just did a fresh install and have the exact same error.

janmg commented

The blob storage is not easily accessible from a logstash plugin anymore because of the conflicting Faraday dependencies. I thought moving the file handling to a golang helper program would simplify the flow, but at least API keys don't seem to be supported there. When I looked at my storage account, I saw a feature named "blob change feed", grayed out. It looks like it's intended for Apache Spark.

I always felt that writing nsgflowlogs to a blob just for something else to read them back was wrong. But if it's the only way to get the nsgflowlogs, it's what had to be done. Now I feel more strongly that we should just turn to Microsoft Azure and politely ask for the flowlogs to be sent to an Event Hub instead; then we can just do a Kafka read with whatever analytics tool we please.

I'll continue the golang route, to see if it's viable, but pretty please Microsoft, provide an alternative flow.

Hello ThreatLentes & Janmg,

On Ubuntu, I faced the same issue when I tried to install the 8.x versions. Then I found out it works fine with version 7.10.1. I have faced this only on Ubuntu; on Windows all 8.x versions are working.

janmg commented

Thanks for the update. I think the plugin should work on Ubuntu up to Logstash 8.9, but I can't put my finger on why it started failing. I have started a golang version to push the events into a queue; a proof of concept is working, but it will take some time to finish it with proper file listing and partial reading.