jaegertracing/jaeger

SURVEY: Who is using Jaeger

badiib opened this issue · 41 comments

Hi, you are in a group of individuals who have created or commented on issues in the Jaeger repository, and we are doing a simple informal survey about Jaeger usage. If you could answer the following questions, it would be very valuable for gauging interest in the project:

  • If applicable, what company/organization do you represent? How many software engineers?
  • How are you using Jaeger? E.g. full production deployment, considering, experimenting, or "I am not using Jaeger" etc.
    • How long have you been using Jaeger?
    • If you are not using Jaeger but chose another tracing system, what were the reasons?
  • How many services (or microservices) exist in your system layout?
    • How many of them are traced?
  • Can you describe your tracing setup and volumes? I.e. which storage you use, how many traces/spans you store, etc.
  • What types of problems are you solving with tracing?

Also consider adding your organization to ADOPTERS.md.

@jkandasa @Sunfaces @jbdalido @princeop @pavolloffay @mabn @jpkrohling @nlamirault @JodeZer @prestonprice57 @jrbury @objectiser @sloev @hwinkel @Madhu1512 @yuekui2 @valichek @dianvaltodorov @ZhouZiHe @LoungeFlyZ @jeluard @diegofernandes @d-ulyanov @jyothepro @yqf3139 @tomersimis @ruinanchen @szdavid92 @anuptalwalkar @hekike @sul4bh @Strandedpirate @julianste @awhiteside @nklmish @sweatybridge @kevinearls @felixbarny @hzariv @nlamirault @longXboy @drzero42 @xdralex @philipgian @bharat-p

  • A financial system service provider from Shanghai
  • experimenting
  • To be honest, we are considering Zipkin more. My team members are more familiar with ES and MySQL than Cassandra, and our Java coders like Zipkin, though I, as a pure Go coder, prefer Jaeger.
  • We are in the middle of a microservice transformation, and our system is half Java and half Golang.
  • BTW, we selected a Kafka + ES solution with Zipkin, which Jaeger does not provide.

If applicable, what company/organization do you represent?

I am a contributor to fission, which is a FaaS solution on top of Kubernetes.

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.

We need to integrate a distributed tracer for two uses:

  • Help troubleshoot performance problems with fission itself.
  • Provide a tracer handler to users so that they can instrument their code easily and trace their functions as part of a bigger solution.

Currently I am doing some experiments on the integration.

If you are not using Jaeger, why not?

Will find myself some time to try Jaeger. It seems Jaeger has better client library support.

How many services (or microservices) exist in your system layout?

Around 10 microservices. Excluding user functions, which are also services evolving over time.

  • If applicable, what company/organization do you represent?

I work for a subsidiary company of Orange.

  • How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.

We are experimenting with OpenTracing in a future API gateway service.

  • If you are not using Jaeger, why not?

We run Jaeger as a Kubernetes deployment.

  • How many services (or microservices) exist in your system layout?

Around 10 services.

  • Which storage are you using?

Cassandra.

  • Zenly, a live location sharing social network (https://github.com/znly)
  • Full production deployment on top of scylladb
  • Around 10 services using jaeger in production, more coming

@jbdalido glad to see scylladb !

  • Elastica (part of Symantec)
  • Experimenting in QA/Dev systems
  • Around 10-15 services/microservices
  • If applicable, what company/organization do you represent?
    --- eBay, Inc

  • How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.
    --- We are currently evaluating Jaeger and Open Zipkin for OpenTracing

  • If you are not using Jaeger, why not?
    --- We have not ruled out Jaeger yet. We used the Jaeger Java client first and are now evaluating the backend that was recently open sourced. Lack of streaming support for the collector is an issue for Jaeger.

  • How many services (or microservices) exist in your system layout?
    --- 500+

Also, integration with service mesh proxies such as Envoy or Linkerd is important to us.

If applicable, what company/organization do you represent?
Stitch Fix

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.
Considering/experimenting

If you are not using Jaeger, why not?
The environment for which we are considering Jaeger is mostly Python 3, so waiting either for this pull request to be merged or an alternative implementation :)

mabn commented

If applicable, what company/organization do you represent?
Base CRM

How are you using Jaeger?
Experimenting in production: there's a process which listens on Kafka for our custom traces, converts them, and publishes them to Jaeger.

If you are not using Jaeger, why not?
Traces with ~1M spans make jaeger hard to use, have to deal with it first.
As for instrumenting services with opentracing - this will take time, only 1 service has it so far.

How many services (or microservices) exist in your system layout?
100+

Storage
We're using AWS managed Elasticsearch, mainly because it's managed, but also because we have experience with ES and not with Cassandra. I'm still trying to make it work properly though. Right now (2017-09-22) it performs poorly and drops a lot of spans because indexing does not use the bulk API, indices are created without index.translog.durability=async, and AWS ES requires signing of each index, so there's an additional proxy to go through.
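To illustrate the two Elasticsearch points above, here is a minimal sketch of what batching spans for the `_bulk` endpoint and asking for an async translog could look like. It only builds the request bodies and sends nothing. The index name, the span fields, and the helper name `build_bulk_body` are made up for illustration and are not taken from Jaeger's code.

```python
import json


def build_bulk_body(index_name, spans):
    """Serialize spans into the newline-delimited body used by the _bulk endpoint.

    Each document is preceded by an action line, and every line ends with "\n".
    """
    lines = []
    for span in spans:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(span))
    return "".join(line + "\n" for line in lines)


# Settings fragment that could be supplied when an index is created, so the
# translog is fsynced in the background instead of on every request.
ASYNC_TRANSLOG_SETTINGS = {"settings": {"index.translog.durability": "async"}}

# Hypothetical spans; the field names are illustrative only.
spans = [
    {"traceID": "a1", "spanID": "s1", "operationName": "GET /users", "duration": 1200},
    {"traceID": "a1", "spanID": "s2", "operationName": "SELECT users", "duration": 800},
]

body = build_bulk_body("example-span-index", spans)
print(body)
print(json.dumps(ASYNC_TRANSLOG_SETTINGS))
```

One bulk request carrying many spans replaces many single-document index requests, which is usually where the throughput gain comes from.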

If applicable, what company/organization do you represent?
RisingStack

How are you using Jaeger?
Experimenting with automatic instrumentation for Node.js: https://github.com/RisingStack/jaeger-node

If you are not using Jaeger, why not?
Node.js async_hooks is still in experimental phase.
Currently, our own tracing is more feature complete: http://trace.risingstack.com

How many services (or microservices) exist in your system layout?
50+ (our product's backend)

Lightbend has OpenTracing integration for Akka (and this is being extended to more Lightbend technologies, such as Akka HTTP, Play, and Lagom). Many of our customers are interested in tracing for distributed systems or microservices. The Jaeger client is used as the default OpenTracing client to report to Jaeger or Zipkin, giving our customers the option of using Jaeger.

If applicable, what company/organization do you represent?

GrafanaLabs

How are you using Jaeger?

currently prototyping an implementation for our tsdb with the goal of validating performance and suitability and then taking to production.
potentially we may add opentracing to our other software (like Grafana) as well.
our most urgent need was just getting rich, context-specific distributed logging in place so we can diagnose performance trouble and jaeger looks like a good fit. In particular compared to "just distributed logging" systems like ELK/crate or oklog, we realized we want tracing not just logging.

How many services (or microservices) exist in your system layout?

We have about 20 different projects that we run, but we run many of them multiple times (many of our customers have dedicated single-tenant deployments in Kubernetes)

UPDATE Sept 22
We're now using Jaeger in prod for 2 different projects (each running hundreds of times due to multi-tenancy), and we're also working on adding Jaeger support to Grafana itself.

backend: cassandra

If applicable, what company/organization do you represent?

Northwestern Mutual

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.

I developed Kanali, which we use to proxy all production traffic in our Kubernetes clusters. Kanali integrates with OpenTracing to provide end-to-end distributed tracing. I love the Jaeger project, as it has the most robust and clean UI for OpenTracing IMHO.

How many services (or microservices) exist in your system layout?

We currently use Jaeger to visualize tracing for 100s of microservices. These traces are used by 1000s of developers every day.

otisg commented

Interesting to hear folks say jaeger has a better client library, especially as Jaeger is OpenTracing which is supposed to make that point moot between systems.

@adriancole I think people say this because OpenZipkin doesn't seem to have an OpenTracing-compatible Python or Node tracer, only Java and Go, or, if it has, it's not immediately obvious.

If applicable, what company/organization do you represent?

Under Armour

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.

Limited production deployment, expanding.

How many services (or microservices) exist in your system layout?

100s.

#396 @black-adder

  1. If applicable, what company/organization do you represent?

Weave

  2. How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.

Full production deployment across both Kubernetes and virtual machines. Using OpenTracing+Jaeger with Cassandra for storage.

  3. How many services (or microservices) exist in your system layout?

100s of microservices

Am I the only one who finds "How many services (or microservices) exist in your system layout?" an ambiguous question? I don't understand if this means the number of unique software projects, or the number of daemons running (where you count all copies of the same service running).

@Dieterbe I take service to be a unique microservice. A good analogy would be a Kubernetes service.

Hi all, @jnewmano @ejwood79 @otisg @frankgreco @Dieterbe @pvlugter @hekike @mabn @xdralex @hzariv @bharat-p @jbdalido @nlamirault @yqf3139 @JodeZer

Could you also please mention which storage you are using, Cassandra or Elasticsearch? Edit your comment or just comment below.

Thanks

otisg commented

Elasticsearch here at Sematext

We're in the experimentation phase at Ticketmaster. Hundreds of microservices will need to be instrumented, but now that a few teams have started tracing, interest is growing.

B0go commented

If applicable, what company/organization do you represent?

https://github.com/ContaAzul | http://contaazul.com

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.

We just deployed it to production in our Kubernetes cluster saving data to ElasticSearch on AWS (AWS Elastic Search Service)

How many services (or microservices) exist in your system layout?

~100 instances of ~ 50 services

  • If applicable, what company/organization do you represent?
  • How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.
    • Tracing workloads in all environments
  • If you are not using Jaeger, why not?
    • N/A - We use Jaeger and love it
  • How many services (or microservices) exist in your system layout?
    • In the tens and growing
  • Storage
    • ElasticSearch 6 via AWS, in a private VPC
  • If applicable, what company/organization do you represent?
    local.ch
    localsearch

  • How are you using Jaeger?
    Production deployment inside k8s using ES backend and zipkin reporters. Java (spring-sleuth), Ruby and pending Go instrumentations.

  • How many services (or microservices) exist in your system layout?
    40+ services/batches and expanding

Q: How are you using Jaeger?
A: I'm evaluating several OpenTracing frameworks for C++. I had success with both OpenTracing-cpp and OpenCensus-cpp. I still haven't evaluated Jaeger's C++ client (todo). While doing this, I realized I needed a viewer and started with Zipkin's UI for the first few hours, though I found some limitations, or maybe I'm putting too much information into the app (several thousand traces). At first I avoided Jaeger's UI, since I thought it was specific to Jaeger itself (I had to do an evaluation over a day, and would continue through the week), only to find that it supports a Zipkin mode. I was excited after seeing the screenshots, the nice timeline, and the folding/unfolding, and visually it was a pleasure to use (Zipkin's UI also looks nice; maybe there are things Zipkin's UI can do that Jaeger's can't). At any rate, I'll continue using it and keep looking until I finalize my choice.

Q: How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.
A: As I said, not yet; I'm still evaluating. The plan is to have metrics in our desktop app, which talks to a few servers, and eventually to have these servers, and whatever fronts them, monitored as well, and to find out what collection scheme/mechanism would be appropriate. It's an in-house game level editor, used by hundreds of people from a few different studios. We already collect logs (Elastic), but perf metrics are done on demand: we ask users to run XPerf and then analyze the result in WPA (Microsoft tools). Additionally we collect crash dumps, but we are not using exceptions yet (C++). So we want something that unifies this or provides alternative information (and also exports aggregated metrics to Prometheus/Grafana, which OpenCensus can do, and maybe Jaeger too; I need to start looking into it soon). All in all, I'm just trying to get an idea of what's available right now.

Q: If you are not using Jaeger, why not?
A: Still evaluating.

Q: How many services (or microservices) exist in your system layout?
A: Our app has to talk to one or more (edge) Perforce servers and a custom caching solution, spawn SN-DBS fxc.exe (shader compiler), eventually talk to a local Windows Service serving/processing assets, etc. But we also have a heavily multi-threaded case using ConCRT (Microsoft's "lite" version of Intel's TBB, in a way), plus std::thread, and even WinAPI-style CreateThread() calls. I'm looking for ways to safely hook this (propagate my context across threads), and there are some gotchas, like green threads/coroutines, and possibly using thread-local storage with push/pop style to keep the "active" thread. How easily I can achieve this may dictate which of the APIs I use (I've also noted that OpenCensus may have some extra locks and hidden allocations per span creation, though this should not be a big deal and seems fixable). So I'm very excited to go ahead and evaluate Jaeger, and report back.

--- I worked for Google for some time and had to use Dapper, and occasionally look at rpcz, tracez, etc. during my on-call duties (I wasn't a regular SRE; it was a medium-sized Java team with mixed responsibilities). Since then I've loved the approach and the genuine idea of distributed tracing, and I'm trying to see whether it's going to benefit us. I'm glad that the industry is moving in the right direction, though the information is a bit sparse, and I don't yet know all the players :)

@malkia:

If applicable, what company/organization do you represent?

I speak only for my team, I don't know whether it's used in other teams/projects across the company, but my team is part of Activision's Central Tech.

@malkia very cool. Thanks for answering that.

If applicable, what company/organization do you represent?
RiksTV, Norwegian broadcast distributor

How are you using Jaeger?
Early days: we're using Jaeger in some backend Python and .NET Core apps.

If you are not using Jaeger, why not?
The majority of our code is still on "legacy .NET", which is apparently difficult to Jaeger-enable. Usage will broaden as we transition to .NET Core.

How many services (or microservices) exist in your system layout?
60+

Storage
Self-managed Elasticsearch running in AWS.

@trondhindenes thanks, added you here: #1121

If applicable, what company/organization do you represent?
Vistar Media

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or I am not using Jaeger etc.
We are using Jaeger in an AWS-based stack for performance analysis and debugging in all envs. We annotate traces with business logic metadata as well.

We have the Jaeger infrastructure running in ECS and deployed via CloudFormation, the agents are deployed both in ECS and paired with ElasticBeanstalk applications.

How many services (or microservices) exist in your system layout?
Fewer than 10, but this is increasing. We trace some services that are isolated and are also experimenting with tracing our builds (we use Bazel).

Storage
AWS hosted ElasticSearch

If applicable, what company/organization do you represent?
@comtravo

How are you using Jaeger?
Production on a subset of microservices.

If you are not using Jaeger, why not?
Currently we are using Jaeger but considering OpenCensus as it matures, because what we really miss is good auto-instrumentation support for Node.js. We forked the auto-instrumentation from RisingStack and fixed some small issues.

DataDog ships their own opentracing-api compatible tracer along with auto instrumentation which is cool.

How many services (or microservices) exist in your system layout?
26

Storage
AWS ES

If applicable, what company/organization do you represent?

Candide
@candide-eu

How are you using Jaeger?

Full production.

Running on GKE with an Elasticsearch backend hosted on Elastic Cloud.

We have the client library integrated into our NodeJS service shell library to automatically trace inter-service requests.

How many services (or microservices) exist in your system layout?

>30 k8s services in our prod environment. Most of them Jaeger-enabled.

elasticsearch in IBM Cloud Private with tls enabled

Q: If applicable, what company/organization do you represent? How many software engineers?
A: Tencent TEG Infosec Department, about 300+ engineers.

Q: How are you using Jaeger?
A: Full production deployment.

Q: How long have you been using Jaeger?
A: Since May 2019, so about 4 months.

Q: If you are not using Jaeger but chose another tracing system, what were the reasons?
A: We are using Jaeger.

Q: How many services (or microservices) exist in your system layout?
A: At least 100 services.

Q: How many of them are traced?
A: At least 10 services are traced, and this number should reach 80+ by the end of this year.

Q: Can you describe your tracing setup and volumes? I.e. which storage you use, how many traces/spans you store, etc.
A: Kafka + ES, currently about 600 million spans each day.

Q: What types of problems are you solving with tracing?
A: We use Jaeger for monitoring the health of RPC servers, analyzing root causes, and drawing the service topology.

Q: If applicable, what company/organization do you represent? How many software engineers?
A: Ozon (e-commerce, marketplace), about 500 engineers.

Q: How are you using Jaeger?
A: Full production deployment (for both Kubernetes and legacy non-Kubernetes services).
Our Jaeger setup is heavily modified and most of the components have been rewritten (except for the UI, see details below).

Q: How long have you been using Jaeger?
A: ~1 year

Q: If you are not using Jaeger but chose another tracing system, what were the reasons?
A: After several months of using Jaeger, our developers asked us to add more advanced sampling policies to get more insight: priority sampling for traces with errors, long traces, etc. Probabilistic sampling was cool at the start, but it offers too few options when you're troubleshooting in production. There was also a question about logs: how to use span logs but avoid writing logs to 2 places.
In the end, we replaced the Jaeger agent and collector with our own implementation.
Main features: tail-based sampling (traces with errors, traces with anomalously high duration, etc.), keeping ALL traces in memory for 30m (searchable from Jaeger UI), a Jaeger UI backend integrated with our logging system (it attaches logs to spans on the fly, so we're not writing span logs to Jaeger's ElasticSearch), and building a near-realtime dependency graph with RPS/RT for each edge.
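For readers unfamiliar with tail-based sampling, a minimal sketch of the idea follows: the keep/drop decision is made only after a whole trace has been buffered, so it can look at errors and durations. This is not the implementation described above; the `Span` fields, the 500 ms threshold, and the function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False


def keep_trace(spans, slow_threshold_ms):
    """Keep a buffered trace if any of its spans failed or was unusually slow."""
    return any(s.error or s.duration_ms >= slow_threshold_ms for s in spans)


def tail_sample(buffered, slow_threshold_ms=500.0):
    """buffered maps trace_id -> list of spans; returns the trace IDs to persist."""
    return sorted(
        trace_id
        for trace_id, spans in buffered.items()
        if keep_trace(spans, slow_threshold_ms)
    )


# Three buffered traces: a healthy one, one with an error, one that is slow.
buffered = {
    "t1": [Span("t1", 12.0), Span("t1", 30.0)],
    "t2": [Span("t2", 15.0, error=True)],
    "t3": [Span("t3", 900.0)],
}
kept = tail_sample(buffered)
print(kept)  # ['t2', 't3']
```

The cost of this approach is the buffer itself: every span has to be held until its trace is complete, which is why the in-memory window matters.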

Q: How many services (or microservices) exist in your system layout?
A: >500 services.

Q: How many of them are traced?
A: We've built a "scratch" framework as the base for every microservice; it comes instrumented with metrics and tracing out of the box, so most of the services are well instrumented (~95% of services are covered).

Q: Can you describe your tracing setup and volumes? I.e. which storage you use, how many traces/spans you store, etc.
A:
Setup:

  • Our own implementation of Jaeger agents
  • Our own implementation of collectors (3 instances with 12 CPU + 90GB)
  • ElasticSearch (6 instances with 8CPU + 64GB + 3 master nodes with 4CPU + 8GB).

Stats:

  • On collectors: 600k spans / s (~20k traces / s)
  • Collectors keep ALL traces in memory for 30m (that's why they have so much memory). Collectors provide search by trace ID and by service + tags, so it's fully integrated with Jaeger UI.
  • Sampled to storage: 20k spans / s (~1k traces/s)
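As a quick back-of-the-envelope check of the numbers above, the ratios they imply can be worked out directly. The per-span memory figure is only a rough upper bound: it assumes all collector memory is available for spans and treats 1 GB as 10^9 bytes, neither of which is stated in the comment.

```python
# Figures quoted in the setup and stats above.
incoming_spans_per_s = 600_000
incoming_traces_per_s = 20_000
stored_spans_per_s = 20_000
stored_traces_per_s = 1_000
window_s = 30 * 60                       # traces are kept in memory for 30 minutes
collector_memory_bytes = 3 * 90 * 10**9  # 3 collectors with 90 GB each (assumes 1 GB = 10^9 bytes)

spans_per_trace_in = incoming_spans_per_s / incoming_traces_per_s
spans_per_trace_stored = stored_spans_per_s / stored_traces_per_s
trace_keep_ratio = stored_traces_per_s / incoming_traces_per_s
span_keep_ratio = stored_spans_per_s / incoming_spans_per_s
spans_in_window = incoming_spans_per_s * window_s
bytes_per_span_budget = collector_memory_bytes / spans_in_window

print(f"average spans per incoming trace: {spans_per_trace_in:.0f}")
print(f"average spans per stored trace:   {spans_per_trace_stored:.0f}")
print(f"traces kept: {trace_keep_ratio:.1%}, spans kept: {span_keep_ratio:.1%}")
print(f"spans held in the 30m window: {spans_in_window:,}")
print(f"memory budget per span (upper bound): {bytes_per_span_budget:.0f} bytes")
```

So roughly 1 in 20 traces reaches storage, and about a billion spans sit in the in-memory window at any time.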

Q: What types of problems are you solving with tracing?
A:
We're using tracing in 2 main directions:

  • Fast troubleshooting in production and root cause analysis
  • Building and analyzing service topology (our custom near real-time implementation). Here we also have several directions: a) simply understanding service dependencies; b) analyzing the service graph of a particular web page (we usually don't use the full topology because there are tons of services); c) finding bad design practices, like recursive calls between services.
  • We're also planning to use dynamic service topology for smart alerting (printing the root cause right in the alert, smart alert inhibition, etc.)
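Finding recursive calls between services, as mentioned in item c), amounts to detecting a cycle in the caller-to-callee graph. Below is a minimal sketch using depth-first search; the service names and the `calls` mapping are invented for illustration and say nothing about how the topology above is actually built.

```python
def find_call_cycle(calls):
    """Return one cycle in the caller -> callees graph as a list, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(service):
        color[service] = GREY
        path.append(service)
        for callee in calls.get(service, ()):
            state = color.get(callee, WHITE)
            if state == GREY:
                # The callee is already on the current path: a recursive call chain.
                return path[path.index(callee):] + [callee]
            if state == WHITE:
                cycle = visit(callee)
                if cycle:
                    return cycle
        path.pop()
        color[service] = BLACK
        return None

    for service in calls:
        if color.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# Hypothetical dependency graph derived from traces.
calls = {
    "web": ["cart", "catalog"],
    "cart": ["pricing"],
    "pricing": ["catalog", "cart"],
    "catalog": [],
}
cycle = find_call_cycle(calls)
print(cycle)  # ['cart', 'pricing', 'cart']
```

The recursive `visit` keeps the sketch short; for a graph with a very deep call chain an explicit stack would avoid Python's recursion limit.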

Thanks for Jaeger!
And ask me if you're interested in any details :)

If applicable, what company/organization do you represent? How many software engineers?

Redbox; ~50 Software Engineers, ~5 DevOps/Delivery Engineers

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or "I am not using Jaeger" etc.

We are using Jaeger in production for all of our applications on Kubernetes, as well as a select set of non-Kubernetes cloud applications. All services are ASP.NET Core (C#). We use a managed ElasticSearch cluster with collectors across our cloud infrastructure to ensure we can perform end to end spans across multiple regions/cloud providers. For Kubernetes we are using the Jaeger Operator and Istio as a service mesh. All services being traced are using the Jaeger C# Client with our own wrapper library to add some additional features like logging the JaegerSpanId and adding Prometheus metrics for the internal Jaeger metrics. Most services are using the remote sampling configuration from the collector.

How long have you been using Jaeger?

Around 6 months, 3 months in production.

How many services (or microservices) exist in your system layout?

70+ Services/Microservices using various cloud providers and k8s.

How many of them are traced?

Around 30 services in both Kubernetes and non-Kubernetes cloud environments.

Can you describe your tracing setup and volumes? I.e. which storage you use, how many traces/spans you store, etc.

  • ElasticSearch
  • One Managed Elasticsearch cluster per environment (Production and Staging)
  • Production environment handles ~10-15 million spans every 3 days (we keep 3 days of history)
  • Remote sampling, probabilistic; 0.3 is our default, with some services occasionally at 100% sampling to debug specific issues
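To make the 0.3 probabilistic rate concrete, here is a sketch of a deterministic sampling decision derived from the trace ID, so that every participant reaches the same answer for the same trace. Hashing the ID with SHA-256 is an assumption made for this example; it is not a description of how the Jaeger clients compute the decision.

```python
import hashlib


def is_sampled(trace_id: str, rate: float) -> bool:
    """Deterministic probabilistic decision: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # pseudo-uniform in [0, 2**64)
    boundary = int(rate * 2**64)               # sample everything below the boundary
    return value < boundary


trace_ids = [f"trace-{i}" for i in range(100_000)]
sampled = sum(is_sampled(t, 0.3) for t in trace_ids)
fraction = sampled / len(trace_ids)
print(f"sampled fraction: {fraction:.3f}")  # expected to land close to 0.3
```

Setting the rate to 1.0 for a single service, as described above for debugging, makes every trace ID fall below the boundary.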

What types of problems are you solving with tracing?

We use Jaeger to observe and troubleshoot performance issues and to understand what service-to-service dependencies we have.

If applicable, what company/organization do you represent? How many software engineers?

bilibili;

How are you using Jaeger? E.g. full production deployment, considering, experimenting, or "I am not using Jaeger" etc.

We are using Jaeger in production for most of our applications on Kubernetes, as well as a few applications deployed on machines.
We use Jaeger Agent and Jaeger Collector with minor modifications. These two provide enough features in production.

However, we rewrote the Jaeger SDK and Jaeger Job completely. In our experience, almost all Golang applications can easily use Jaeger for tracing, but applications in other languages, e.g. Java and Python, cannot. The SkyWalking agent may be a better choice for trace collection, because applications can import a jar more easily than an SDK. Maintaining a tracing SDK for thousands of applications in different languages is a really painful job, especially for Python. We hope to find a painless way to manage that in the future.

How long have you been using Jaeger?

Around 1 year in production.

How many of them are traced?

1000+ Services/Microservices using various cloud providers and k8s.

Can you describe your tracing setup and volumes? I.e. which storage you use, how many traces/spans you store, etc.

We use ClickHouse now and used ScyllaDB before. Elasticsearch scales poorly, and Cassandra/ScyllaDB makes complex queries hard in many situations.

We ingest 1 million spans/s and keep them for 7 days, for troubleshooting performance issues and maintaining dynamic service-to-service dependencies.

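If applicable, what company/organization do you represent? How many software engineers?
mihoyo;
How are you using Jaeger? E.g. full production deployment, considering, experimenting, or "I am not using Jaeger" etc.
We use agent -> collector -> Kafka -> Flink and ingester -> ClickHouse.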

We reimplemented jaeger-agent:
1. Use WebSocket to forward []byte directly from client to collector.
2. Use a Unix domain socket in place of UDP.

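How long have you been using Jaeger?
I have been in charge of it for half a year.
We have used it for at least 3 years.
If you are not using Jaeger but chose another tracing system, what were the reasons?
How many services (or microservices) exist in your system layout?
Thousands.
How many of them are traced?
ALL.
Can you describe your tracing setup and volumes? I.e. which storage you use, how many traces/spans you store, etc.
We use ClickHouse to store at least millions of spans per second for 30 days.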
What types of problems are you solving with tracing?
1. Service dependency graph.
We use Google's pprof to make displaying thousands of service relations possible, and it looks very good.

Searching by service shows only that service and its upstream and downstream services.
Searching by group shows only the group's services.
We connect the service dependency graph with metrics: a service node in the graph has not only its name but also the average latency, span count, and error percent in the time range.
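The node metrics and the search-by-service view described above can be sketched as two small aggregations over spans and call edges. The span fields (`service`, `duration_ms`, `error`), the edge tuples, and both function names are assumptions for illustration, not the actual schema used here.

```python
from collections import defaultdict


def node_stats(spans):
    """Aggregate spans into per-service span count, average latency and error percent."""
    totals = defaultdict(lambda: {"count": 0, "latency_ms": 0.0, "errors": 0})
    for span in spans:
        t = totals[span["service"]]
        t["count"] += 1
        t["latency_ms"] += span["duration_ms"]
        t["errors"] += 1 if span["error"] else 0
    return {
        service: {
            "span_count": t["count"],
            "avg_latency_ms": t["latency_ms"] / t["count"],
            "error_percent": 100.0 * t["errors"] / t["count"],
        }
        for service, t in totals.items()
    }


def neighbourhood(edges, service):
    """Return the direct upstream callers and downstream callees of one service."""
    upstream = sorted({src for src, dst in edges if dst == service})
    downstream = sorted({dst for src, dst in edges if src == service})
    return upstream, downstream


# Hypothetical spans within one time range, plus caller -> callee edges.
spans = [
    {"service": "api", "duration_ms": 10.0, "error": False},
    {"service": "api", "duration_ms": 30.0, "error": True},
    {"service": "db", "duration_ms": 5.0, "error": False},
]
edges = [("gateway", "api"), ("api", "db"), ("api", "cache")]

stats = node_stats(spans)
print(stats["api"])
print(neighbourhood(edges, "api"))  # (['gateway'], ['cache', 'db'])
```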

And just like Google's pprof, our UI also has:

  • a service with a high error percent has a redder node
  • a service with more accumulated relations has a bigger node
  • the line between two services is more vertical if the relation is bigger

2. Full sampling.
After reimplementing the Jaeger agent and replacing the agent's Thrift marshal/unmarshal protocol with a more efficient protocol,
we can sample all traces.

3. High-accuracy histogram.
We use ClickHouse as the metric store, which makes it possible to store a histogram with a hundred time buckets for each service/operation;
that would cost hundreds of GB of memory if we used Prometheus.
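As an illustration of what a latency histogram with around a hundred buckets per service/operation buys, here is a small in-memory sketch with exponentially growing bucket bounds and a bucket-level quantile estimate. The bucket layout (start at 1 ms, 10% growth) and the class are assumptions for this example and do not describe the ClickHouse schema.

```python
import bisect


def make_bounds(start_ms=1.0, factor=1.1, count=100):
    """Exponentially growing upper bounds, one per bucket."""
    return [start_ms * factor**i for i in range(count)]


class Histogram:
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # the last slot is the overflow bucket

    def observe(self, value_ms):
        # bisect_left puts a value into the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value_ms)] += 1

    def quantile(self, q):
        """Return the upper bound of the bucket that contains the q-th quantile."""
        target = q * sum(self.counts)
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if count > 0 and seen >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("nan")  # empty histogram


hist = Histogram(make_bounds())
for _ in range(99):
    hist.observe(10.0)   # typical requests
hist.observe(1000.0)     # one slow outlier

print(hist.quantile(0.5))
print(hist.quantile(1.0))
```

With 10% bucket growth the estimate overshoots the true value by at most about 10%, which is the accuracy gain that many narrow buckets provide over a handful of coarse ones.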

4. Critical path.
Shows each span's true execution time.
We have two different UIs to display it:

  • pprof, grouped by operation (but this has a problem: when one operation corresponds to multiple spans, it is harder for users to understand)
  • Jaeger UI (we changed the Jaeger UI to show the execution time as a black duration bar)
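One plausible reading of a span's "true execution time" is its self time: the part of the span's interval not covered by any child span. The sketch below computes that under this assumption; intervals are `(start, end)` tuples in arbitrary time units, and the function name is invented for illustration.

```python
def self_time(span, children):
    """Time inside the span's interval that is not covered by any child interval.

    Child intervals are clipped to the parent and may overlap each other.
    """
    start, end = span
    clipped = sorted((max(s, start), min(e, end)) for s, e in children)
    covered = 0
    cursor = start  # everything before the cursor has already been accounted for
    for s, e in clipped:
        if e <= max(s, cursor):
            continue  # empty after clipping, or fully inside an earlier child
        covered += e - max(s, cursor)
        cursor = e
    return (end - start) - covered


parent = (0, 100)
children = [(10, 40), (30, 60), (80, 120)]  # overlapping, and one runs past the parent
print(self_time(parent, children))  # 30
```

Here the children cover 10 to 60 and 80 to 100 of the parent, leaving 30 units in which the parent itself was doing the work.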

5. Connecting traces with runtime/pprof.
We can connect a trace with runtime pprof and show a request's flame graph,
i.e. which functions the request spent CPU in.

6. Tail-based sampling.
Sampling spans by p99 latency and error tag.

7. Package instrumentation.
Elasticsearch, Kafka, net/httptrace, MongoDB, Redis, gRPC, SQL.

8. Explore.
A UI that lets us:

  • search with multiple services/operations
  • group by tag
  • get recommended tagKey/service/operation (ordered by span count) in search.