gsxryan/storj_telegraf_mon

How to use HOST variable

aeleos opened this issue · 20 comments

I see that there is a new host variable to avoid two datasources, and I am wondering how I should set this.

I can't seem to get any kind of output into Grafana. I know that all my scripts are set up correctly and working, and they are seemingly configured correctly in Telegraf. Is there any way I can look into InfluxDB to see what kind of data, if any, is being put into the database?

Here's an example:
image

Asking Jerome in #11

> I see that there is a new host variable to avoid two datasources, and I am wondering how I should set this.

The host variable is used at import time, to avoid hardcoding a value in several charts within the dashboard.
It is not here to avoid two datasources.
A datasource must be configured by you before importing the dashboard. Depending on your Telegraf setup, it might be named telegraf.autogen, for instance.

I've just deleted and imported the dashboard successfully. Your database must be empty or datasource misconfigured.

> I can't seem to get any kind of output into Grafana. I know that all my scripts are set up correctly and working, and they are seemingly configured correctly in Telegraf. Is there any way I can look into InfluxDB to see what kind of data, if any, is being put into the database?

You should take a look at Chronograf to inspect your InfluxDB content.

At the same time, you can check whether your Telegraf is working fine by inspecting the [[inputs.exec]] output:

  • Enter container by running bash: docker exec -i -t telegraf /bin/bash
  • Test input plugins: telegraf --debug --config /etc/telegraf/telegraf.conf --input-filter exec --test
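For the earlier question about looking into InfluxDB directly, one option besides Chronograf is the InfluxDB 1.x HTTP `/query` endpoint. The sketch below is a minimal example, assuming curl is installed, InfluxDB 1.x is reachable at the URL you pass in, and authentication is disabled. The function name `influx_query` is made up for illustration.

```shell
#!/bin/sh
# influx_query BASE_URL DATABASE QUERY
# Hypothetical helper: sends an InfluxQL query to the InfluxDB 1.x HTTP API
# and prints the JSON response. Assumes curl is installed and auth is off.
influx_query() {
    curl -sS --fail --max-time 10 -G "$1/query" \
        --data-urlencode "db=$2" \
        --data-urlencode "q=$3"
}

# Example calls (adjust URL and database name to your setup):
#   influx_query http://localhost:8086 telegraf 'SHOW MEASUREMENTS'
#   influx_query http://localhost:8086 telegraf 'SELECT * FROM "StorJHealth" ORDER BY time DESC LIMIT 5'
```

If `SHOW MEASUREMENTS` lists StorJHealth but the `SELECT` returns nothing recent, the points may be stored under unexpected timestamps rather than missing.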

Thanks for your help, I have other grafana dashboards working from other telegraf inputs.

Here is the output I get from the commands you gave me

StorJHealth,NodeId=Default,host=Tower Deleted=0,FailedCrit=0,FailedWarn=0,Ratio=100,Success=0 2000000000
StorJHealth,NodeId=Default,host=Tower DLFailed=0,DLRatio=100,DLStarted=0,DLSuccess=0,PUTAcceptRatio=100,PUTFailed=0,PUTLimit=0,PUTRatio=100,PUTStarted=0,PUTSuccess=0 2000000000
StorJHealth,NodeId=Default,host=Tower GETRepairFail=0,GETRepairRatio=100,GETRepairSuccess=0,PUTRepairFailed=0,PUTRepairRatio=100,PUTRepairSuccess=0 2000000000
StorJHealth,NodeId=Default,host=Tower InfoDBcheck=0,Reboots=0,VoucherCheck=0 2000000000
StorJToken,WalletAddress="0x7afEb8d3aF76a9E9EE35c5404cCe55de43C8BcCa",host=Tower,stat=tokens BalanceEUR=0,BalanceSTORJ=0,BalanceUSD=0 2000000000
StorJToken,host=Tower,stat=prices STORJPriceEUR=0.1616,STORJPriceUSD=0.1796 2000000000
StorJHealth,host=Tower,path="/rootfs/mnt/user/storj/storage/storagenode2/" directory_size_kilobytes=1559449936 2000000000

Two things are strange in your logs:

  1. all timestamps are equal to 2000000000. Could you please run date +'%s%N' at your command line and inside your Telegraf container? A normal output should give something similar to 1564150298286036137
  2. All figures are equal to zero. Is your Storj node container properly named storagenode? The script tries to parse the output of the command docker logs --since 24h storagenode. Does this command output something? Edit line 20 with your container name if needed.
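To make the first check easier to read, here is a small sketch that classifies a timestamp by its digit count. It assumes present-day epoch values (10 digits in seconds, 19 digits in nanoseconds). `classify_epoch` is a hypothetical helper name, not part of the repository scripts.

```shell
#!/bin/sh
# classify_epoch VALUE
# Prints "nanoseconds", "seconds", or "unexpected" based on digit count.
# Assumption: current-era epoch values, i.e. 10 digits in seconds and
# 19 digits in nanoseconds.
classify_epoch() {
    case "$1" in
        ''|*[!0-9]*) echo "unexpected"; return 1 ;;
    esac
    case "${#1}" in
        19) echo "nanoseconds" ;;
        10) echo "seconds" ;;
        *)  echo "unexpected"; return 1 ;;
    esac
}

# Usage: classify_epoch "$(date +'%s%N')"
```

A "seconds" result for `date +'%s%N'` would suggest that the `date` implementation in that environment did not expand `%N`.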
  1. Date outputs 1564151904 which seems correct

  2. When I try to run docker logs --since 24h StorjNode2-V3 from inside my Telegraf container, I get an error saying that it cannot find docker. That could definitely be the problem. Maybe I could edit the script to grab the logs from a text file instead.

Do you have any ideas?

I also had to edit my tokens script a bunch to use wget instead of curl, and to fix some grep parameters that weren't supported by my container's grep, as they are GNU features rather than POSIX (I think).
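The thread does not show the actual edits, but a curl-to-wget fallback along these lines is one way to do it. This is a sketch under assumptions: the helper names `pick_fetcher` and `fetch` are invented here, and the chosen options are only one reasonable set.

```shell
#!/bin/sh
# pick_fetcher
# Prints the command prefix of the first available HTTP client, or fails.
pick_fetcher() {
    if command -v curl >/dev/null 2>&1; then
        echo "curl -fsSL"
    elif command -v wget >/dev/null 2>&1; then
        echo "wget -qO-"
    else
        echo "no curl or wget found" >&2
        return 1
    fi
}

# fetch URL
# Downloads URL to stdout with whichever client is present.
fetch() {
    fetcher=$(pick_fetcher) || return 1
    # Intentionally unquoted so the options split into separate words.
    $fetcher "$1"
}

# Usage: fetch "https://example.com/some.json"
```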

Yeah, we totally forgot to add a requirement to the README.
To be allowed to use the docker command line, your Telegraf container must be run with:

    -e HOST_PROC=/host/proc \
    -v /proc:/host/proc:ro \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /usr/bin/docker:/usr/bin/docker \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \

I'll update the README in my next PR.
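A quick way to confirm that these mounts took effect is to check, from inside the Telegraf container, that both the docker binary and the socket are visible. The sketch below assumes the socket path from the options above. `check_docker_access` is a hypothetical helper, not something shipped with the scripts.

```shell
#!/bin/sh
# check_docker_access [SOCKET_PATH]
# Returns 0 only if a docker binary is on PATH and the socket file exists.
# The default socket path is taken from the -v option shown above.
check_docker_access() {
    sock="${1:-/var/run/docker.sock}"
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker binary not found on PATH"
        return 1
    fi
    if [ ! -S "$sock" ]; then
        echo "docker socket not found at $sock"
        return 1
    fi
    echo "docker CLI and socket look available"
}

# Usage (inside the container): check_docker_access
```

This only checks that the files exist. Whether the Telegraf user may actually talk to the socket still needs a real call such as the `docker logs` command mentioned earlier.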

Awesome, now my successrate.sh has a much better output

StorJHealth,NodeId=Default FailedCrit=0,FailedWarn=0,Success=66,Ratio=100.000,Deleted=4759 1564153278
StorJHealth,NodeId=Default DLFailed=3,DLSuccess=2733,DLRatio=99.890,PUTFailed=1262,PUTSuccess=44659,PUTRatio=97.252,PUTLimit=13416,PUTAcceptRatio=76.899,DLStarted=2802,PUTStarted=45925 1564153278
StorJHealth,NodeId=Default GETRepairFail=0,GETRepairSuccess=0,GETRepairRatio=100,PUTRepairFailed=0,PUTRepairSuccess=0,PUTRepairRatio=100 1564153278
StorJHealth,NodeId=Default InfoDBcheck=0,VoucherCheck=4,Reboots=1 1564153278

Still unable to get anything in my grafana dashboard. I will try a few more things like chronograf and see if I can get anything to happen.

OK, so Chronograf is not showing any kind of data under StorJHealth or StorJToken. All of the fields exist (presumably populated by the Grafana dashboard), but all the queries show no data.

Depending on your config, you may need to wait 30m for the first dataset.

Great, thanks for your comments. I've updated the README file for future installations.

I noticed Telegraf does not poll on the first start. So if the interval is 30m, you'll have to wait 1 hour to get your first two data points. You can temporarily set it to a low interval to get data more often, then set it back to our recommended default.

I set the interval to 1m for all 3 exec inputs and I still don't get any data. Are you guys only using one InfluxDB database for both Storj data and other system data?

I previously had it set up for two data stores, with 2 separate Telegraf polling services. I changed my config to a single data source to test Jerome's commit, and it seems to be working fine. Can you post your telegraf.conf (omitting passwords)?

Here is the main part:

    [[inputs.exec]]
    commands = ["sh /rootfs/mnt/user/appdata/telegraf/successrate.sh" ] #some configs may need "sh " before /
    timeout = "180s" #If you want to run faster than 180s be sure to change this
    interval = "30m" #Comment this out if you already declare it earlier in the config.
    #name_suffix = "_foo" # Will result as "StorjHealth_foo" Uploaded dashboard will not use
    data_format = "influx"

    [[inputs.exec]]
    commands = ["sh /rootfs/mnt/user/appdata/telegraf/tokens.sh" ] #some configs may need "sh " before /
    timeout = "60s"
    interval = "1h" # if you don't care to track STORJ price, you can increase it to 24h
    data_format = "influx"

    [[inputs.exec]]
    commands = ["sh /rootfs/mnt/user/appdata/telegraf/folder_size.sh /rootfs/mnt/user/storj/storage/storagenode2/"]
    timeout = "60s"
    interval = "30m"
    data_format = "influx"

and the output:

    [[outputs.influxdb]]
    urls = ["http://10.0.1.13:8086"]
    database = "telegraf"

When importing the dashboard, did you select 'telegraf' as the StorJ datasource?
If so, is your database getting any data at all?

Is your telegraf running inside a container also? Can you nmap -p 8086 10.0.1.13 from inside the container successfully?

Do you require basic authentication to write to your telegraf database? I think it's disabled by default, but I changed mine long ago. You can use Chronograf to set the database properties and view database stores.

If authentication is enabled, add this to [[outputs.influxdb]]:

username = "username"
password = "password"

https://docs.influxdata.com/influxdb/v1.7/administration/authentication_and_authorization/

Yes, I selected telegraf as my database, and yes I am getting data. I have other dashboards running with other data and interestingly enough on the storj dashboard I can see some things like the docker0 network throughput.

My telegraf is inside of a container and the ports are properly mapped. I am using unraid so I can pretty easily see all my port mappings and such.

I am not using any authentication, and my setup is definitely working on a basic level of Telegraf -> InfluxDB -> Grafana, at least for system data. I am pretty stumped as to why this isn't working, because at this point I'm 99% sure that each system is working on its own, from the scripts to Telegraf to Grafana, but something is causing issues along the way.

It sounds like you're doing everything right.

Have you tried re-selecting the datasource on a panel in the Grafana dashboard? I've had to reselect my storj database on a few panels to get the metrics to show up:

image

Ok, I think I have found something. When I run the script manually, inside the container, I get a reasonable output like

StorJToken,stat=tokens,WalletAddress="" BalanceSTORJ=0,BalanceUSD=0,BalanceEUR=0 1564324775
StorJToken,stat=prices STORJPriceUSD=0.1682,STORJPriceEUR=0.1518 1564324775

However, when I run telegraf --config-directory=/etc/telegraf --test --input-filter=exec

I get a bad output with a weird datetime:

StorJHealth,NodeId=Default,host=Tower Deleted=16,FailedCrit=0,FailedWarn=0,Ratio=100,Success=35 2000000000
StorJHealth,NodeId=Default,host=Tower DLFailed=0,DLRatio=100,DLStarted=320,DLSuccess=283,PUTAcceptRatio=95.371,PUTFailed=603,PUTLimit=213,PUTRatio=87.918,PUTStarted=4992,PUTSuccess=4388 2000000000
StorJHealth,NodeId=Default,host=Tower GETRepairFail=0,GETRepairRatio=100,GETRepairSuccess=0,PUTRepairFailed=0,PUTRepairRatio=100,PUTRepairSuccess=0 2000000000
StorJHealth,NodeId=Default,host=Tower InfoDBcheck=0,Reboots=0,VoucherCheck=0 2000000000
StorJToken,WalletAddress="0x7afEb8d3aF76a9E9EE35c5404cCe55de43C8BcCa",host=Tower,stat=tokens BalanceEUR=0,BalanceSTORJ=0,BalanceUSD=0 2000000000
StorJToken,host=Tower,stat=prices STORJPriceEUR=0.1517,STORJPriceUSD=0.1687 2000000000
StorJHealth,host=Tower,path="/rootfs/mnt/user/storj/storage/storagenode2/" directory_size_kilobytes=1712216188 2000000000
StorJHealth,NodeId=Default,host=Tower Deleted=16,FailedCrit=0,FailedWarn=0,Ratio=100,Success=35 2000000000
StorJHealth,NodeId=Default,host=Tower DLFailed=0,DLRatio=100,DLStarted=320,DLSuccess=283,PUTAcceptRatio=95.372,PUTFailed=603,PUTLimit=213,PUTRatio=87.921,PUTStarted=4993,PUTSuccess=4389 2000000000
StorJHealth,NodeId=Default,host=Tower GETRepairFail=0,GETRepairRatio=100,GETRepairSuccess=0,PUTRepairFailed=0,PUTRepairRatio=100,PUTRepairSuccess=0 2000000000
StorJHealth,NodeId=Default,host=Tower InfoDBcheck=0,Reboots=0,VoucherCheck=0 2000000000
StorJToken,WalletAddress="",host=Tower,stat=tokens BalanceEUR=0,BalanceSTORJ=0,BalanceUSD=0 2000000000
StorJToken,host=Tower,stat=prices STORJPriceEUR=0.1517,STORJPriceUSD=0.1687 2000000000
StorJHealth,host=Tower,path="/rootfs/mnt/user/storj/storage/storagenode2/" directory_size_kilobytes=1712218480 2000000000

Update: It works!

I ended up deleting the $(date +'%s%N') after each line and letting Telegraf generate its own timestamps for the data, and that fixed it. Not sure why, but Telegraf refused to properly execute $(date +'%s%N'). It's working now.
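One possible explanation, offered as an assumption rather than something confirmed in this thread: the `date` inside the container may not support `%N`, so the script emitted a 10-digit seconds value (as in the `1564153278` output earlier), while InfluxDB line protocol timestamps default to nanoseconds. Read as nanoseconds, that value is about 1.56 seconds after the epoch, and rounding it to the nearest second gives the `2000000000` seen in the `--test` output. The sketch below only reproduces that arithmetic. `round_ns_to_second` is a hypothetical helper, and the rounding step is an assumption about how Telegraf treated the value.

```shell
#!/bin/sh
# round_ns_to_second NANOSECONDS
# Rounds a nanosecond timestamp to the nearest whole second (result in ns).
# Assumption: this mimics the rounding that produced 2000000000 above.
round_ns_to_second() {
    echo $(( ($1 + 500000000) / 1000000000 * 1000000000 ))
}

# A seconds value mistaken for nanoseconds:
round_ns_to_second 1564153278   # prints 2000000000
```

A point stamped two seconds after 1970-01-01 would fall outside any recent Grafana time range, which would match the empty panels.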

Thanks for everyone's help!

If you are interested, I can pull together a PR of some of the other changes I made like adding a fallback to wget and improving compatibility with POSIX grep rather than GNU grep.
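As an illustration of the kind of grep change described here (the thread does not show the actual edits, so the pattern and the JSON shape below are assumptions), a GNU-only `grep -oP` extraction can often be replaced with POSIX `sed`. `json_number` is a hypothetical helper name.

```shell
#!/bin/sh
# json_number KEY
# Reads single-line JSON on stdin and prints the numeric value of KEY.
# Hypothetical helper; uses only POSIX sed with basic regular expressions.
# Rough equivalent of the GNU-only: grep -oP '"KEY":\K[0-9.]+'
json_number() {
    sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9.][0-9.]*\).*/\1/p'
}

# Usage: printf '%s\n' '{"USD":0.1796,"EUR":0.1616}' | json_number USD
```

This is a narrow text match, not a JSON parser. It is only meant for flat, single-line responses.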

That's great! I didn't realize Telegraf would automatically insert the datetime. Any contribs are appreciated. Please document why the fallbacks were needed (environments / dependencies).
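The fix described above relies on line protocol allowing the timestamp to be omitted, in which case the receiving side assigns one. A minimal sketch of both options follows, assuming `date +%s` works in the container (it did in this thread, where `date` printed 1564151904). `emit_point` and `emit_point_ns` are illustrative names, not functions from the repository scripts.

```shell
#!/bin/sh
# emit_point MEASUREMENT_AND_TAGS FIELDS
# Prints a line protocol point with no timestamp, so the consumer assigns one.
emit_point() {
    printf '%s %s\n' "$1" "$2"
}

# emit_point_ns MEASUREMENT_AND_TAGS FIELDS
# Prints a point with a nanosecond timestamp built from whole seconds,
# avoiding the %N format that some date implementations lack.
emit_point_ns() {
    printf '%s %s %s000000000\n' "$1" "$2" "$(date +%s)"
}

emit_point "StorJHealth,NodeId=Default" "InfoDBcheck=0,VoucherCheck=4,Reboots=1"
```

Omitting the timestamp is the simpler choice when the script runs at collection time anyway. The explicit variant only matters if the measurement time must come from the script.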