[Feature][core] Flink sql task support customized test case input #3975
Your design is great! I agree with your idea because Alibaba Cloud Flink also has the function of uploading datasets for debugging.
@aiwenmo Thanks for your reply! What I ignored is that during FlinkCDC task debugging, change events are important, and they are hard to simulate in the scheme I designed. I think your idea about building a ValuesDataBase is excellent: a temporary database can not only simulate data change events for cdc tasks but also generate data for batch or streaming flink sql tasks. Based on all the discussion above, I will redesign the scheme details around ValuesDataBase and implement it in the draft PR.
@MactavishCui Thank you very much for your contribution! To facilitate contributions to the Apache incubator, Dinky's project positioning will be corrected to a debugging and optimization framework for stream computing, exploring the key components currently missing in the stream computing ecosystem. This positioning is more in line with Dinky's core capabilities, both completed and planned, and will give Dinky greater core advantages. Unfortunately, most users' understanding of Dinky is currently limited to the Flink platform, which does not match the project's situation and goals. Dinky laid the foundation for exploring innovation in the field of stream computing in its initial design, with the original intention of better integrating into enterprise big data architecture as middleware that supplements the capabilities of stream computing. Now we have an increasingly active community and a growing number of users, and we believe that in the future it can become a core component of big data architecture and essential textbook content.

Finally, your contribution to mock testing has greatly advanced the debugging direction of stream computing. My personal contribution focus will also be on this direction in the future, such as implementing mock testing of application mode through WebSocket and other methods, capturing and storing Schema Change Events in FlinkCDC for auditing and tracing, supporting mock testing of FlinkCDCPipeline, and tracing the performance of stream data changes through row-level lineage. Welcome to continue the communication~
@aiwenmo Hi~ I have studied the code of ValuesDatabase from flink-cdc and made some redesigns for this feature. After several attempts, there are still some points puzzling me. The first problem I want to figure out is what the most suitable position for our MockDatabase is: a util that can apply change events, or something more like a cache database? Here are my attempted schemes:
Because I'm not very familiar with cdc tasks and the dinky-cdc model yet (surely I'll learn them these days, lol), the second point I'm worried about is whether these schemes will be good for further expansion, like mocking the source for cdc tasks. Which scheme do you think is better? Or are there better schemes and suggestions? Looking forward to your reply.
Due to attending FFA 2024 in the past two days, I had no equipment available for design and response; I apologize for this. Your design ideas are great. Here are some new ideas I have based on your design. To design this content better, I suggest starting with a macro-level design. What we expect is that Dinky can provide a complete mock test, including Source and Sink, Data and Event. I call this MockDataBase, which simulates all behaviors in stream processing. MockDataBase includes two write behaviors: Append (Changelog) and Upsert. MockDataBase is an abstract design that cannot be instantiated; it is instantiated by defining persistent implementation classes, such as a memory schema or a MySQL schema. This may not be easy to understand, but an example expresses the idea well: we could deploy a Kafka cluster specifically for mock sources and sinks, but we would need to modify the connector configuration to match that Kafka cluster. So Dinky can act as a proxy, similar to its implementation of the RS file system protocol, with only one MockDataBase whose specific implementation is configured through the configuration center. This is a preliminary idea, and I will continue to update the details. What is your opinion? Looking forward to your reply.
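To make the two write behaviors concrete, here is a minimal sketch; the class and method names are hypothetical, not part of Dinky, and the event strings are just placeholders:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two write behaviors described above.
// Append keeps every changelog entry; Upsert keeps only the latest row per key.
class MockTable {
    private final List<String> changelog = new ArrayList<>();               // Append mode
    private final Map<String, String> latestByKey = new LinkedHashMap<>();  // Upsert mode

    // Append (Changelog): record every event, e.g. "+I(1,Alice)" or "-D(1,Alice)"
    void append(String changeEvent) {
        changelog.add(changeEvent);
    }

    // Upsert: a write with the same key overwrites the previous value
    void upsert(String key, String value) {
        latestByKey.put(key, value);
    }

    List<String> changelog() { return changelog; }
    Map<String, String> snapshot() { return latestByKey; }
}
```

A concrete MockDataBase implementation (memory, MySQL, ...) would then only have to persist these two structures.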
@aiwenmo Thanks very much for your reply! Your response really helped me out of my confusion! Now I understand that the MockDataBase design means Dinky provides 2x2 abilities (schema & data x query & write, like what ValuesDataBase of flink-cdc does) through interfaces, and the specific implementations are chosen by user configuration. As a result, Dinky can build an IO control bridge between users and tasks: from the users' perspective, task input can be customized and output can be previewed, while from the tasks' perspective, test cases can be read and execution results can be inserted. I'm very glad to update my code based on this new scheme! Based on this design, I think many optimizations beyond this issue can be made, like refactoring the mock data sink, making the mock database implementation extensible, supporting a mock source for cdcsource tasks, etc. Looking forward to your reply!
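A rough sketch of what such a 2x2 bridge (schema & data x query & write) might look like as an interface, with a trivial in-memory backend; all names here are illustrative assumptions, not Dinky's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "2x2" bridge: {schema, data} x {query, write},
// exposed as an interface so the concrete storage (memory, MySQL, ...) can be
// selected via configuration. Names are illustrative, not Dinky's actual API.
interface MockDataBase {
    void createTable(String table, List<String> columns);   // schema x write
    List<String> getColumns(String table);                  // schema x query
    void insert(String table, List<Object> row);            // data x write (test cases / task output)
    List<List<Object>> scan(String table);                  // data x query (mock source reads / result preview)
}

// Minimal in-memory implementation, for illustration only.
class InMemoryMockDataBase implements MockDataBase {
    private final Map<String, List<String>> schemas = new HashMap<>();
    private final Map<String, List<List<Object>>> rows = new HashMap<>();

    public void createTable(String table, List<String> columns) {
        schemas.put(table, columns);
        rows.put(table, new ArrayList<>());
    }
    public List<String> getColumns(String table) { return schemas.get(table); }
    public void insert(String table, List<Object> row) { rows.get(table).add(row); }
    public List<List<Object>> scan(String table) { return new ArrayList<>(rows.get(table)); }
}
```

With this shape, the same interface serves both sides of the bridge: users write test cases and read result previews, while tasks read test cases and write results.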
Of course, looking forward to your contribution. @MactavishCui We expect to complete this core feature in Dinky 1.3.0. To facilitate code review, this feature needs to be broken down into multiple small pull requests. For example:
This does not represent the final result; we can continue to update the content of MockDataBase in this issue. Next, I think we can discuss the design of MockDataBase. Looking forward to your reply.
@aiwenmo Thx for your reply! I agree that it would be better to implement a MockDataBase first. The interfaces of MockDatabase:
The models used by MockDatabase: the designs mentioned above are my preliminary designs for MockDatabase, looking forward to your reply and suggestions!
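Since the thread keeps coming back to applying change events, here is a hedged sketch of one possible record model: a table keyed by primary key that applies changelog events, loosely in the spirit of flink-cdc's ValuesDatabase. The +I/-U/+U/-D tags follow Flink's RowKind convention; everything else is a made-up illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record model: keyed state that applies changelog events.
// The +I/-U/+U/-D kinds follow Flink's RowKind convention; the rest is a sketch.
class ChangelogTable {
    private final Map<String, String> state = new LinkedHashMap<>();

    void apply(String kind, String key, String value) {
        switch (kind) {
            case "+I": // insert
            case "+U": // update-after: the new image of the row wins
                state.put(key, value);
                break;
            case "-U": // update-before: retraction of the old image; no-op for keyed state
                break;
            case "-D": // delete
                state.remove(key);
                break;
            default:
                throw new IllegalArgumentException("unknown RowKind: " + kind);
        }
    }

    Map<String, String> snapshot() { return state; }
}
```

Applying a full changelog to such a table yields the materialized view a mock sink could preview, while replaying the raw events yields the stream a mock cdc source could emit.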
@MactavishCui Your design is excellent! Today I have a bold idea: it would be more appropriate to define MockDataBase as a Sandbox. The Sandbox is a hybrid data storage system for simulating data. It has the following capabilities:
We can implement an MVP (Minimum Viable Product) version first, including the following capabilities:
In addition, for the questions you raised, I have the following responses:
This innovative new feature is really attractive. Details will be continuously updated.
@aiwenmo Thank you very much for your suggestions. About our MVP sandbox abilities, I made the following detailed designs, and here are some things I want to discuss with you:
Everything mentioned above is my design and my open questions; looking forward to your reply!
@MactavishCui Thank you for your reply! The following is my response:
Looking forward to your reply.
@aiwenmo
@MactavishCui Thank you for your participation! Regarding this, I am also very excited. This is the spark generated by the collision of two ideas, and both are indispensable.
Hi! @aiwenmo I have made some attempts these days about the sandbox and how to use it. According to our discussion last week, in the MVP version the sandbox has abilities including metadata, schema, and records management. Based on this new design, I think the following points can be worked on:
And here are my new designs for the 2nd point.
About the backend implementation
Besides, the task debug process is designed as follows:
All mentioned above is my new design for this feature. Is this scheme acceptable, and may I submit a PR for this feature based on the sandbox? I see that you will work on the sandbox construction, so if I can submit code for this feature before the sandbox construction is finished, I'll mock the usage of the sandbox temporarily, focus on the front-end and back-end code for the sandbox application of this feature, and make synchronous changes when the sandbox is ready. I also noticed that some code optimizations landed after my last new-feature PR, so I'll study those optimizations and pay more attention to the scheme and code style.
Thank you very much for your attempt and design. I agree with your scheme. Besides, there is a small issue below that needs correction:
Finally, I will notify you when the sandbox is ready. Thx
Description
Mock sink has been implemented in dinky-1.2.0. This issue is for discussing how to mock the source part of Flink SQL. Currently, Dinky only supports getting table inputs from the real production environment, but many kinds of test cases, especially edge-condition cases, cannot be caught in a short time. As a result, it would be more helpful for users to submit customized test cases and test their tasks as comprehensively as possible.
Use case
Provide a form for users to submit customized test cases. Flink SQL tasks can read these test cases as input when the mock-source option is set.
Based on MockStatementExplainer.class, the connector of source tables can also be changed to a customized one, which helps users submit customized test cases. Here is my scheme:
An open API is designed for getting the test cases submitted by users; all test cases are then collected and passed to the downstream operators.
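The connector-replacement step could look roughly like the sketch below. The mock connector name and option keys are made up for illustration, and a real implementation would hook into statement parsing (as the scheme above describes) rather than rewrite a plain options map:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: swap a source table's WITH options so the task reads
// user-submitted test cases instead of the production source. The connector
// name "dinky-mock-source" and the "task-id" key are illustrative only.
class MockSourceRewriter {
    static Map<String, String> rewrite(Map<String, String> withOptions, String taskId) {
        Map<String, String> mocked = new HashMap<>();
        // Drop all real connector options (topic, hostname, ...): the mock
        // connector would not understand them.
        mocked.put("connector", "dinky-mock-source");
        // Let the mock source look up the test cases submitted for this task.
        mocked.put("task-id", taskId);
        return mocked;
    }
}
```

The table schema itself stays untouched; only the WITH clause changes, so the task's SQL logic runs exactly as it would against the production source.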
I have submitted a draft PR, which contains most of the implementation code for this scheme.
If this scheme is acceptable, I will complete the development in the coming days and mark it ready for review. If there are better schemes for this feature, I'm also willing to implement them and contribute! Looking forward to your discussion and reply.
Related issues
#3893