Tendrl/documentation

Improve and clarify documentation related to Tendrl installation

mbukatov opened this issue · 3 comments

This issue refers to https://github.com/Tendrl/documentation/wiki/Tendrl-Package-Installation-Reference/0f43d3700bdad962d1fdeb4757810d16c307ef60

Based on email threads from tendrl-devel and chat with Ken.

List of pending improvements and issues in Package Installation Reference:

  • add an initial paragraph with a description of Tendrl machine roles in general: there is a single Tendrl Server machine (we don't use an HA setup yet), one Performance Monitoring machine (or it could be co-located with the Tendrl Server), one Alerting machine (a good default is co-location with the Tendrl Server) and multiple Storage Server machines (where ceph/gluster are/will be installed).
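For illustration, this split could look like the hypothetical Ansible-style inventory sketched here; the group and host names are made up for the example and are not something tendrl-ansible or the installation reference prescribes:

```yaml
# Hypothetical inventory illustrating the machine roles above;
# group and host names are illustrative only.
all:
  children:
    tendrl_server:           # single machine: api, web, etcd, monitoring, alerting
      hosts:
        tendrl.example.com:
    storage_servers:         # multiple machines: gluster/ceph plus tendrl-node-agent
      hosts:
        storage1.example.com:
        storage2.example.com:
        storage3.example.com:
```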

  • clarification of empty password for etcd:

In the Tendrl Package Installation Reference document, I see that during
configuration of tendrl-api, both password and username values are
changed to '' (empty string), while there are some default config
values shipped in /etc/tendrl/etcd.yml.

I have 2 questions: is this username/password pair used to connect
to etcd (in other words, is it an etcd user)? And if so, how come
it works when empty string values are specified?

This is an option provided in the API config for the case when etcd is set up
with authentication. An empty string means there is no auth.
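For illustration only, the relevant part of /etc/tendrl/etcd.yml could then look roughly like the sketch below; the key names here are an assumption, not copied from the shipped file:

```yaml
# Hypothetical excerpt of /etc/tendrl/etcd.yml -- key names are illustrative,
# check the file shipped by the tendrl-api package for the real ones.
etcd_connection: 127.0.0.1   # address of the etcd endpoint
etcd_port: 2379              # default etcd client port
etcd_user: ''                # empty string: etcd runs without authentication
etcd_password: ''            # empty string: etcd runs without authentication
```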

  • specify which repository of ceph-installer to use (moved out, so marking as done):

the installation reference states:

configure the ceph-installer repo (as made available by the project)

Could we be more precise and add a link to ceph-installer repositories
we expect people to use?

The original rationale was that the Tendrl project would expect the
admin-user persona to be aware of the repository to add. Upon further
thought on your email, I see the valid point you raise.

In usmqe-setup, we are using the following repositories for downstream
testing right now:

  • https://shaman.ceph.com/api/repos/ceph-installer/master/latest/centos/7/noarch/noarch
  • https://shaman.ceph.com/api/repos/ceph-ansible/master/latest/centos/7/noarch/noarch
  • http://copr-be.cloud.fedoraproject.org/results/ktdreyer/ceph-installer/epel-7-x86_64/

These look good. Can we confirm with Ken that these are good to go?
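As an illustration of how one of these could be wired into the setup, here is a minimal Ansible sketch using the yum_repository module; the repo id and description are arbitrary choices for the example, and gpgcheck is disabled only because this targets test/POC machines:

```yaml
# Hypothetical task configuring one of the repositories listed above;
# the repo id and description are illustrative only.
- name: Configure ceph-installer copr repository
  yum_repository:
    name: ktdreyer-ceph-installer
    description: ceph-installer copr repo (epel-7-x86_64)
    baseurl: http://copr-be.cloud.fedoraproject.org/results/ktdreyer/ceph-installer/epel-7-x86_64/
    gpgcheck: no
```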

  • improve reasoning behind deployment decisions for monitoring and alerting (there is a single tendrl server role now, so there is no such deployment decision to make, marking as done)

On 05/31/2017 12:07 PM, Anmol Babu wrote:

  1. The performance-monitoring installation brings in (installs) and configures Graphite.

  2. All nodes push stats to the Graphite instance installed by the performance-monitoring application.

  3. Writes to Graphite induce read/write disk operations, as the stats are maintained in files by the Whisper database (part of the Graphite stack).

    Note: Metrics get fed into the Graphite stack via the Carbon service, which writes data out to Whisper databases for long-term storage.

Ok, so the performance monitoring machine is a target for performance data
pushes from all other Tendrl machines.

  1. What are the pros and cons of each installation configuration? Based on
    what should one decide which machine to install the role on, in both
    cases?

I think the answer to this is mainly governed by:
a. Points 1 to 3 above under "Important points to note..."
b. The performance-monitoring application does lots of etcd interactions and also has interactions with tendrl/api, which makes me say that having them co-resident
would be beneficial from a network utilization perspective; it may not be a major gain, though...
c. To avoid having to dedicate a node for this purpose, one can as well use the same node for installing etcd, api, dashboard, the performance-monitoring and alerting applications, along with their dependent service node-agent.
d. Point c in essence also suggests that this can be deployed on a storage node as well.
Note: Having etcd and api along with performance-monitoring, node-monitoring and alerting might bring in resource utilization related issues,
but this would need to be confirmed via tests...

And at the same time, performance monitoring communicates a lot with
etcd and tendrl api. Ok.

So for this reason, it may make sense to deploy it on the Tendrl Server
(along with etcd and tendrl api/web). But you are concerned with the
resource requirements, especially with respect to scaling.

But I don't get why it would be a good idea to deploy performance
monitoring on a storage node. There are even more issues with resource
utilization in that case...

  2. What is the suggested safe default?

The answer to this varies in accordance with the considerations above...

I will reply below the summary.

  1. Does "New node" for Performance Monitoring mean a dedicated machine
    just for this role?

Yes; based on the considerations above, if this is inevitable, there is nothing in the code that stops such a deployment...

Ok.

  4. Why is the alternative place for monitoring a "new node", while alerting
    could be placed on a storage node? Is it possible to install monitoring
    and alerting on a single dedicated machine? And would it make sense?

From the perspective of the code, the alternatives for both applications are the same.
It's just a fix required in the documentation if it suggests otherwise.
Having performance-monitoring and alerting on a dedicated machine might make sense, but yes, as said above, it all depends on the considerations above...

You mentioned the Performance Monitoring constraints in a clear way, but not
for Alerting. I'm interested especially in:

  • having performance monitoring and alerting on the same machine
  • having alerting on a storage machine

To summarise, I would like to say that definitive and/or quantitative answers to most of these questions can be given after:

  1. We test each of these deployment scenarios
  2. We perform scale tests to see how Tendrl scales...

So at this point, we don't have enough experience with the behavior of
the system to suggest exactly how to deploy it. That's fine, but
I would suggest writing down the gist of what we know right now directly
in the docs.

Would it make sense to write something like (here I'm proposing what
we could add into the docs and verifying that I understand you at the
same time - feel free to correct me and fix/expand the text):

For Performance Monitoring:

--- suggested update ---
This role could be deployed either on the Tendrl Server machine (along with
tendrl api, etcd, ... as described above) or on a dedicated machine.

The performance monitoring machine is a target for performance data pushes
from all other Tendrl machines and communicates a lot with services on
the Tendrl Server (etcd and tendrl-api) - which affects network utilization
and produces lots of tiny I/O operations.

For this reason, it makes sense to deploy it on the Tendrl Server, but
when you see problems with resource utilization on the Tendrl Server,
it may be better to go with deployment on a dedicated machine.

We do not yet have enough experience to provide final and exact guidelines
here.
--- suggested update ---

The same should be done for Alerting.

For testing, we deploy both alerting and performance monitoring on the
tendrl server, as can be seen here:

https://github.com/Tendrl/usmqe-setup/blob/master/tendrl_server.yml#L18
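For illustration, such a co-located test deployment could be expressed roughly as the Ansible play sketched below; the group and role names are illustrative and not necessarily the ones the linked usmqe-setup playbook actually uses:

```yaml
# Hypothetical play co-locating monitoring and alerting on the Tendrl server;
# group and role names are illustrative -- see the usmqe-setup playbook
# linked above for the real ones.
- hosts: tendrl_server
  become: true
  roles:
    - tendrl-node-agent              # base agent, required on every Tendrl machine
    - tendrl-api                     # api (etcd and web live on this machine as well)
    - tendrl-performance-monitoring  # graphite-backed performance monitoring
    - tendrl-alerting                # alert handling / notifications
```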

I understand your reply as a request for us to start tracking resource
utilization per service on this type of machine, to get better data.

  • clarification of provisioner/gluster tag: should it be used on every node or only on a single one?

Based on Tendrl/tendrl-ansible#14

Resolved by wiki commit 00a3649
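For context, a minimal sketch of how the tag could be set in the node-agent configuration on the one node that should act as the gluster provisioner; the file path and key name are assumptions here, not taken from the installation reference:

```yaml
# Hypothetical excerpt of /etc/tendrl/node-agent/node-agent.conf.yaml;
# path and key name are assumptions -- verify against the node-agent docs.
tags:
  - provisioner/gluster
```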

  • small clarification related to dedicated monitoring setup

see https://www.redhat.com/archives/tendrl-devel/2017-June/msg00106.html

Additional Details

Information which should be documented, but not necessarily in the Installation Reference document linked above:

  • minimal requirements, both for production and PoC demo clusters; rohan writes:

Storage nodes should follow hardware guidelines as given by the storage
system documentation (ceph, gluster docs).

As for the Tendrl Server, it hosts the tendrl-api, tendrl-monitor
(the tendrl-node-agent which monitors all other tendrl-node-agents) and the
tendrl central store (etcd), which contains master data for tendrl managed
nodes, tendrl managed clusters and all the error/warning/notice (alerts)
logs coming out of tendrl.

Given the responsibilities of tendrl-server, it is good practice to run
it with 12 GB of memory and 4 vCPUs, but for POC clusters, I would say you can
get away with 8 GB of memory and 4 vCPUs for tendrl-server.

The reason it is not documented globally and is done per release is that we are
yet to finish performance optimizations and don't really have lots of precise
data about how many resources Tendrl-server would require. Perhaps you can
generate some numbers while testing Tendrl?

This is now tracked by:

@r0h4n can someone assist with this one? thanks

I'm going to review this and make the changes to the Installation Guide. If a piece of information required for the update is still missing, I'm going to create a separate issue on the component responsible for the given area.