apache/dolphinscheduler

[DSIP] support other business systems to run in DS

Opened this issue · 10 comments

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Current situation: Enterprises that use Apache DolphinScheduler mostly use it as the core scheduling system for data processing. However, an enterprise often has other business systems as well, and most of them also need scheduled task and job management. At the same time, a business system may itself play a part in some stage of data processing.

Planned changes:

1. Add a DS-plugin Java plug-in
(1) The business system references the plug-in through Maven. It implements fixed interfaces and uses fixed annotations to create scheduled task executors. The plug-in's core functions include registration with the Apache DolphinScheduler api-server, workflow and node creation, and scheduled task creation.
(2) Business system processing can be referenced through the parent-child process, making it part of the data flow.
2. Add the corresponding processing logic to the Apache DolphinScheduler api-server and master-server
(1) The api-server adds registration and task creation processors and logic.
(2) The master-server adds task triggering and result processing logic.
(3) Add external system nodes.

The purpose of the SDK is to let third-party systems register in DS by integrating the DS SDK plug-in. The SDK covers third-party system registration, task list functions, task parameter functions, and task execution functions. By integrating the SDK and providing the corresponding implementations, a third-party system can delegate the triggering of its tasks to DS. For example, suppose a company wants to run data quality audit tasks in its own data quality system while the data processing tasks run in DS: today the tasks on the two platforms cannot trigger each other through dependencies. This feature aims to solve that integration problem between DS and third-party systems. The introduction above describes the changes to DS, including the main flow of task definition development and task execution. The DSIP has two main parts: 1. Provide a dependency (the DS SDK) for third-party systems; 2. Add a new node (the Other System node) to DS.
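To make the four SDK functions concrete, here is a minimal sketch of what the contract for a third-party system might look like. Everything in it is an assumption for illustration: `ExternalTaskProvider`, `DataQualitySystem`, and all method names are hypothetical and are not part of DolphinScheduler or of the proposed SDK.

```java
import java.util.List;
import java.util.Map;

// Hypothetical SDK contract; none of these names come from DolphinScheduler.
interface ExternalTaskProvider {
    /** Registration: the identifier this system registers under in DS. */
    String systemId();

    /** Task list function: names of the tasks DS may trigger. */
    List<String> listTasks();

    /** Task parameter function: parameter names accepted by a task. */
    List<String> taskParameters(String taskName);

    /** Task execution function: runs the task and reports success or failure. */
    boolean execute(String taskName, Map<String, String> params);
}

// Example implementation: a data quality system exposing one audit task.
class DataQualitySystem implements ExternalTaskProvider {
    public String systemId() {
        return "data-quality";
    }

    public List<String> listTasks() {
        return List.of("auditTable");
    }

    public List<String> taskParameters(String taskName) {
        return "auditTable".equals(taskName) ? List.of("table") : List.of();
    }

    public boolean execute(String taskName, Map<String, String> params) {
        // A real system would run its audit here; this sketch only checks the inputs.
        return "auditTable".equals(taskName) && params.containsKey("table");
    }
}
```

Under these assumptions, DS would call `listTasks` and `taskParameters` when a user configures the Other System node, and `execute` when the node runs.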

mail: https://lists.apache.org/thread/6s4scjfvy8406q14thxj98js6bt1fvd9

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


(1) The business system is referenced through Maven. The business system implements fixed interfaces and uses fixed annotations to complete the creation of scheduled task executors. The plug-in includes registration of Apache dolphinscheduler api-server, workflow, node creation, scheduled task creation, etc. Core functions.

I understand that this is a Java SDK for third parties; correct me if I'm wrong.

I don't quite understand the other functions you listed. What do they have to do with an SDK?

I also can't understand the meaning. It seems that a DS SDK is needed?

I added a description on the issue


Please provide the needed information of a DSIP #14102.

The purpose of the SDK is to let third-party systems register in DS by integrating the DS SDK plug-in. The SDK covers third-party system registration, task list functions, task parameter functions, and task execution functions. By integrating the SDK and providing the corresponding implementations, a third-party system can delegate the triggering of its tasks to DS. For example, suppose a company wants to run data quality audit tasks in its own data quality system while the data processing tasks run in DS: today the tasks on the two platforms cannot trigger each other through dependencies. This feature aims to solve that integration problem between DS and third-party systems. The introduction above describes the changes to DS, including the main flow of task definition development and task execution. The DSIP has two main parts: 1. Provide a dependency (the DS SDK) for third-party systems; 2. Add a new node (the Other System node) to DS.

Let's ignore whether this feature makes sense or not. I didn't see any detailed information about this DSIP.

I have implemented a similar system before, and I am sure you are not aware of the key issues here.

Have you considered the SDK's dependencies? How does the SDK communicate with DS? How does DS find the third-party system instance? How can we collect the task log? How can we fail over when a third-party system instance crashes? How does DS protect itself from the third-party system?

You should think about the basic questions first.


1. I don't quite understand what "ignoring this feature makes sense" means.
2. Please provide detailed information about DSIP in the email. If there are any unclear descriptions, they will be added to the issue.
3. The SDK dependency is necessary for third-party systems to delegate task triggering to DS, and it is not very intrusive for third-party systems. A third-party system needs to implement three functions, and the task execution function should already exist.
4. During system registration (service startup), the Netty communication built into the SDK synchronizes the IP and port of the third-party system to DS. The first version does not consider load balancing, although it allows the same third-party system to register multiple services.
5. When creating the Other System node in DS, users select the identifier of the third-party system. The master then retrieves that system's registration information from the database by identifier and communicates with the third-party system to execute the corresponding function.
6. The first version does not implement log collection. Log viewing will be added later, with logs retained locally in the third-party system. When the page is queried, the SDK package is called to fetch the corresponding file content (the log storage directory is saved in DS, and reading the log file does not execute the file).
7. The first version does not implement load balancing across third-party system instances, but when communication with one instance fails, DS will communicate with other instances. The follow-up plan is to add resource monitoring for third-party systems, which will mark each system as available or unavailable.
8. The entire process is a call from DS to the third-party system, except for two stages, registration and task execution status synchronization, in which the third-party system communicates with DS through the SDK. DS will strictly control the information it receives during these stages.
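The registration, lookup, and retry behavior described in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions: `ExternalSystemRegistry` and its methods are hypothetical names, an in-memory map stands in for the DS database, and a `Predicate` stands in for the Netty call to a third-party instance.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Predicate;

// Hypothetical registry; in the proposal this data would live in the DS database.
class ExternalSystemRegistry {
    // system identifier -> registered "ip:port" addresses
    private final Map<String, List<String>> instances = new ConcurrentHashMap<>();

    /** Called at third-party service startup: records the address under the identifier. */
    void register(String systemId, String address) {
        instances.computeIfAbsent(systemId, k -> new CopyOnWriteArrayList<>()).add(address);
    }

    /**
     * Tries each registered instance in registration order (no load balancing)
     * and returns the address of the first instance that accepts the call.
     */
    String dispatch(String systemId, Predicate<String> call) {
        for (String address : instances.getOrDefault(systemId, List.of())) {
            try {
                if (call.test(address)) {
                    return address;
                }
            } catch (RuntimeException e) {
                // Communication failure: fall through and try the next instance.
            }
        }
        throw new IllegalStateException("No reachable instance for " + systemId);
    }
}
```

In this sketch a failed or rejected call simply moves on to the next registered instance, which matches the first-version behavior described above without adding any load balancing or health monitoring.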

@xdu-chenrj
I have some questions.

  1. The dependency of SDK is necessary for third-party systems to delegate tasks to DS triggering, and it does not have too much intrusion on third-party systems. Third party systems need to implement three functions, among which the task execution function should have already been implemented.

Does the SDK only support Java? Are there plans for other programming languages?

What's the difference between this plugin and the HTTP task plugin? Is it possible to enhance the HTTP task plugin to cover more use cases? For example, would a gRPC task plugin be a better fit, so that it is not limited to JVM-based business systems?

  1. The Netty communication built into the SDK synchronizes the IP and port of third-party systems to DS during the system registration (service startup) phase.

Which component of DS does the third-party system synchronize its IP and port to? The api-server, the master, or something else?

If it is synchronized to the master, the master and the other business systems must be able to reach each other over the network. I'm afraid there are some security risks. At the same time, this will reduce the stability of the master.

I think it's a good idea to add a DS-Job plugin to trigger third-party system methods or tasks.
But it should be isolated from third-party systems at the database level, and there seems to be no need to register other systems in DS.

I don't quite understand what "ignoring this feature makes sense" means.

"ignore whether this feature make sense or not".

  1. Please provide detailed information about DSIP in the email. If there are any unclear descriptions, they will be added to the issue.
    You should provide the detailed design, e.g. interfaces and an architecture diagram, rather than how to use it.
  1. The dependency of SDK is necessary for third-party systems to delegate tasks to DS triggering, and it does not have too much intrusion on third-party systems. Third party systems need to implement three functions, among which the task execution function should have already been implemented.

You mean the SDK will rely only on the JRE, with no third-party library dependencies? Are you sure?

  1. The Netty communication built into the SDK synchronizes the IP and port of third-party systems to DS during the system registration (service startup) phase. Currently, the first version does not consider load balancing functionality, although it allows the same third-party system to register multiple services. When creating the Other System node of DS, users need to select the identifier of the third-party system. The master then retrieves the registration information of the third-party system from the database based on the identifier, and communicates with the third-party system to execute the corresponding function.

Does it use long-lived connections? Have you tested the effect of long-lived connections on DS?

  1. The first version did not implement log collection function. In the future, log viewing function will be added, and logs will be retained locally in third-party systems. When querying the page, the SDK package will be called to obtain the corresponding file content (during this process, the log storage directory is saved in DS, and the log file reading process will not execute the file)
  2. The first version did not implement load balancing provided by third-party systems, but when instance communication fails, it will communicate with other instances. The follow-up plan is to add third-party system resource monitoring function, which will mark the available and unavailable status of the system during the monitoring process.

So the first version is just a demo? I didn't see any detailed design for the second version; it's more like a toy that can be changed at will.

  1. The entire process is a call from DS to the third-party system, with only two stages of registration and task execution status synchronization. The third-party system will communicate with DS through SDK, and DS will strictly control the received information during this process.

How can DS strictly control the received information? I didn't even see which information will be sent here. Is there any protocol schema?
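To make the schema question concrete, one possible shape for a status-synchronization message is sketched below. It is purely illustrative and not taken from the proposal: `TaskStatusReport`, its fields, the 64-character limit, and the allowed status values are all assumptions.

```java
import java.util.Set;

// Hypothetical status message sent from a third-party system to DS.
final class TaskStatusReport {
    // Assumed closed set of statuses; anything else is rejected.
    private static final Set<String> ALLOWED_STATUSES = Set.of("RUNNING", "SUCCESS", "FAILURE");

    final String systemId;
    final long taskInstanceId;
    final String status;

    TaskStatusReport(String systemId, long taskInstanceId, String status) {
        // Validate every field before the message is accepted.
        if (systemId == null || systemId.isEmpty() || systemId.length() > 64) {
            throw new IllegalArgumentException("invalid systemId");
        }
        if (taskInstanceId <= 0) {
            throw new IllegalArgumentException("invalid taskInstanceId");
        }
        // Check for null first: contains(null) on an immutable Set throws NullPointerException.
        if (status == null || !ALLOWED_STATUSES.contains(status)) {
            throw new IllegalArgumentException("invalid status");
        }
        this.systemId = systemId;
        this.taskInstanceId = taskInstanceId;
        this.status = status;
    }
}
```

A fixed, validated message type like this is one way "strict control" could be enforced on the DS side; the actual fields and limits would have to come from the DSIP design.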