[Feature][core] Flink sql task support customized test case input #3975

MactavishCui opened this issue Nov 27, 2024 · 15 comments · May be fixed by #3974

MactavishCui commented Nov 27, 2024

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Mock sink has been implemented in dinky-1.2.0. This issue is for discussing how to mock the source part of Flink SQL tasks. Currently, Dinky only supports reading table inputs from the real production environment, but many kinds of test cases, especially edge-condition cases, cannot be captured there in a short time. It would therefore be helpful to let users submit customized test cases and test their tasks as comprehensively as possible.

Use case

Provide a form for users to submit customized test cases. Flink SQL tasks can read these test cases as input when a mock-source option is set.
Based on MockStatementExplainer.class, the connector of source tables can also be changed to a customized one, which lets users feed their submitted test cases into the job. Here is my scheme:
[scheme diagram]
An open API is designed to fetch the test cases submitted by users; all test cases are then passed on to downstream operators.
I have submitted a draft PR, which contains most of the implementation code for this scheme.
If this scheme is acceptable, I will complete the development in the coming days and get it ready for review. If there are better schemes for this feature, I'm also willing to implement them and contribute!
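
To make the idea concrete, here is a rough sketch of how a source table's WITH options could be swapped to a mock connector before execution; the class name, connector name, option keys, and endpoint below are only illustrative placeholders, not the actual Dinky API:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: rewrites the WITH options of a source table so that
 * it reads from user-submitted test cases instead of the production connector.
 * Names such as "mock-source" and "test-case-endpoint" are assumptions.
 */
public class MockSourceRewriter {

    /** Replaces the real connector options with mock-source options. */
    public static Map<String, String> rewriteSourceOptions(Map<String, String> originalOptions,
                                                           String testCaseEndpoint) {
        Map<String, String> mocked = new HashMap<>();
        mocked.put("connector", "mock-source");             // hypothetical mock connector
        mocked.put("test-case-endpoint", testCaseEndpoint);  // open API that serves the submitted cases
        // keep the original connector around so it can be restored or logged
        mocked.put("original-connector", originalOptions.getOrDefault("connector", "unknown"));
        return mocked;
    }

    public static void main(String[] args) {
        Map<String, String> original = new HashMap<>();
        original.put("connector", "kafka");
        original.put("topic", "orders");
        System.out.println(rewriteSourceOptions(original, "http://dinky-host:8888/openapi/mock/testcase/1"));
    }
}
```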

Looking forward to your discussion and reply.

Related issues

#3893

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

MactavishCui added the New Feature and Waiting for reply labels on Nov 27, 2024
aiwenmo removed the Waiting for reply label on Nov 28, 2024

aiwenmo commented Nov 28, 2024

Your design is great! I agree with your idea because Alibaba Cloud Flink also has the function of uploading datasets for debugging.
This feature can effectively assist users in conducting edge condition testing.
In addition, the data for mock source and sink testing should have clear table structures.
In order to better simulate and manage this data in memory and facilitate further expansion of the functionality, my idea is to build a temporary in-memory database in dinky-1.3.0 to manage mock data, similar to the ValuesDatabase in FlinkCDC. Through configuration options, users can decide whether their data should be persisted to the file system or to MySQL tables. This ValuesDatabase can effectively replay SchemaChangeEvents and perform data upserts.
I believe that building mock raw data through Dinky's interface has more advantages than manually assembling CSV data.
What do you think?

@MactavishCui

@aiwenmo Thanks for your reply!
I agree that it's more convenient for users to build mock data through the interface rather than by assembling CSV data. In the current implementation submitted in the draft PR, an editable table modal on the front end is generated from the table structures fetched from the backend; users can submit their dataset through this editable table, which makes dataset editing clearer.

What I overlooked is that change events are important when debugging FlinkCDC tasks, and they are hard to simulate in the scheme I designed. I think your idea of building a ValuesDatabase is excellent: a temporary database can not only simulate data change events for CDC tasks but also generate data for batch or streaming Flink SQL tasks.

Based on the discussion above, I will redesign the scheme details around the ValuesDatabase idea, implement them in the draft PR, and mark it ready for review when all development and tests are done!

aiwenmo commented Nov 28, 2024

@MactavishCui Thank you very much for your contribution!

In order to facilitate contributions to the Apache incubator, Dinky's project positioning will be adjusted to a debugging and optimization framework for stream computing, exploring the key components currently missing in the stream computing ecosystem. This positioning is more in line with Dinky's core capabilities, both those already completed and those planned for the future, and will give Dinky greater core advantages.

Unfortunately, currently most users' understanding of Dinky is only limited to the Flink platform, which is not in line with the current situation and goals of the project. Dinky laid the foundation for exploring innovation in the field of stream computing in its initial design, with the original intention of better integrating into enterprise big data architecture as a middleware to supplement the capabilities of stream computing. Now we have an increasingly active community and a growing number of users, and we believe that in the future, it can become a core component and essential textbook content in big data architecture.

Finally, your contribution to mock testing has greatly advanced the debugging direction of stream computing. In the future, my personal contribution focus will also be on this direction, such as implementing mock testing for application mode through WebSocket and other methods, capturing and storing schema change events in FlinkCDC for auditing and tracing, supporting mock testing of FlinkCDC Pipeline, and tracing the performance of streaming data changes through row-level lineage.

Welcome to continue communication~

@MactavishCui

@aiwenmo Hi~ I have studied the ValuesDatabase code from flink-cdc and redesigned parts of this feature. After several attempts, there are still some points puzzling me.

The first problem I want to figure out is what the more suitable role for our MockDatabase is: a utility that can apply change events, or something more like a cache database? Here are the schemes I attempted:

[Scheme 1 diagram]
As shown above, change events are passed between the front end and the backend, and Flink tasks can generate results or simulate change events from this information.
The disadvantage of this scheme is that all information has to be persisted and a MockDatabase object has to be created for every request.

[Scheme 2 diagram]
The second scheme is based on a singleton database instance kept in memory; the database holds a map whose key is the task id and whose value is the tables and their records. Tasks can get their own mock database via MockDataBase.getInstance(taskId). However, I'm not very clear about how to manage the entries when there are too many of them. Would it be acceptable to evict entries LRU-style, with the capacity of the mock database configurable in the configuration center? (A rough sketch of this idea follows.)
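
A rough sketch of this second scheme; all class names, method names, and the capacity value are placeholders for illustration, not final code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch of scheme 2: one in-memory registry keyed by task id,
 * with LRU eviction once a configurable capacity is exceeded.
 * All names here are assumptions for the sake of the example.
 */
public final class MockDataBase {

    /** Capacity that would come from the configuration center in a real setup. */
    private static final int MAX_TASKS = 100;

    /** An access-order LinkedHashMap gives us LRU eviction for free. */
    private static final Map<Integer, MockDataBase> INSTANCES =
            new LinkedHashMap<Integer, MockDataBase>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, MockDataBase> eldest) {
                    return size() > MAX_TASKS;
                }
            };

    /** tableName -> rows belonging to this task's mock database. */
    private final Map<String, java.util.List<Map<String, String>>> tables = new java.util.HashMap<>();

    private MockDataBase() {
    }

    /** Returns (and lazily creates) the mock database bound to the given task. */
    public static synchronized MockDataBase getInstance(int taskId) {
        return INSTANCES.computeIfAbsent(taskId, id -> new MockDataBase());
    }

    public Map<String, java.util.List<Map<String, String>>> getTables() {
        return tables;
    }
}
```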

Because I'm not very familiar with CDC tasks and the dinky-cdc module yet (surely I'll learn them these days, lol), the second point I'm worried about is whether these schemes are good for further expansion, such as source mocking for CDC tasks.

Which scheme do you think is better? Or are there better schemes or suggestions?

Looking forward to your reply.

aiwenmo commented Dec 1, 2024

I have been attending FFA 2024 for the past two days and had no equipment available for design and responses; I apologize for the delay.

Your design ideas are great. Here are some new ideas I have based on your design.

In order to design this better, I suggest starting with a macro-level design. What we expect is that Dinky can provide a complete mock test covering source and sink, data and events. For now I call it MockDataBase; it simulates all behaviors in stream processing. MockDataBase supports two write behaviors: append (changelog) and upsert. MockDataBase is an abstract design that cannot be instantiated directly; it is instantiated by defining persistence implementation classes, such as a memory schema or a MySQL schema.

[MockDataBase design diagram]

This may not be easy to understand, so here is an example. We could deploy a Kafka cluster specifically for mock sources and sinks, but then we would need to modify the connector configuration to match that Kafka cluster. Instead, Dinky can act as a proxy, similar to its implementation of the RS file system protocol: there is only one MockDataBase, and its concrete implementation is configured through the configuration center.

This is a preliminary idea; I will continue to fill in the details. What is your opinion?

Looking forward to your reply.

MactavishCui commented Dec 2, 2024

@aiwenmo Thanks very much for your reply! Your response really clears up my confusion! Now I understand that the MockDataBase design means Dinky provides a 2x2 set of abilities through interfaces (schemas & data x query & write, like what the ValuesDatabase of flink-cdc does), with the specific implementations chosen by user configuration. With this scheme, Dinky can build an I/O control bridge between users and tasks: from the users' perspective, task input can be customized and output previewed, while from the tasks' perspective, test cases can be read and execution results can be inserted. I'm very glad to update my code based on this new scheme!

Based on this design, I think many optimizations can be made beyond this issue, such as refactoring the mock data sink, making the mock database implementation extensible, supporting mock sources for CDCSOURCE tasks, etc.
Following the KISS principle (keep it simple, stupid), may I submit only the code related to this issue in the linked PR and make the other optimizations in separate issues and PRs?

Looking forward to your reply!


aiwenmo commented Dec 3, 2024

Of course. Looking forward to your contribution, @MactavishCui.

We expect to complete this core feature in Dinky-1.3.0. In order to facilitate code review, this feature needs to be broken down into multiple small pull requests.

For example:

  1. Design the interface for MockDataBase and implement memory mode (instead of ResultPool).
  2. Modify MockSink and Select to be implemented based on MockDataBase.
  3. Add MockSource based on MockDataBase implementation.
  4. MockSink and Select implement socket mode to support application mode.
  5. Expand the implementation of MockDataBase, such as Kafka or Fluss.

This is not the final plan; we can continue to update the MockDataBase content in this issue. Next, I think we can discuss the design of MockDataBase.

Looking forward to your reply.

MactavishCui commented Dec 3, 2024

@aiwenmo Thanks for your reply! I agree that it's better to implement a MockDataBase first.
I made some attempts at implementing a memory database last night; they are as follows:

The interfaces of MockDatabase:
I designed an abstract class MockDatabase with the following members:

  • A static method build(Integer taskId, MockDatabase type): like the implementation of ResultBuilder.build, it calls a different subclass's instantiation method depending on the MockDatabase type; for example, the memory database can provide a get-singleton-instance method, JDBC databases can provide a packaged JDBC driver, etc.
  • A field taskId. Every task could have its own database instance, or implementation classes can suffix table identifiers with the task id (or use other methods) to distinguish the tables of different tasks, especially when different tasks have tables with the same name.
  • Methods for data
    • A public method applyDataChangeEvents(DataChangeEvent event): based on event.op (an enum), it dispatches to the specific data-change handler methods and throws an exception if the event is not supported. This method also calls appendDataChangeLog(event) to record change events.
      • Several protected abstract methods called by applyDataChangeEvents need to be implemented by subclasses, including insert(tableId, Map<String, String> data), update(tableId, Map<String, String> before, Map<String, String> after), delete(tableId, Map<String, String> data), and appendDataChangeLog(DataChangeEvent dataChangeEvent).
    • Public abstract data query methods, covering both result records and data change events.
    • Public abstract data clear methods, covering both results and data change events.
  • Methods for schema
    • Similar to the data-change design, a public method applySchemaChangeEvents(SchemaChangeEvent event) routes events, and related protected abstract methods called by it are defined.
    • Query and clear interfaces are also defined.

What is your opinion on these interfaces for MockDatabase? (A rough sketch of the idea follows.)
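
To make the proposed shape concrete, here is a rough sketch of the abstract class described above; the event model and method signatures are simplified placeholders rather than the final design:

```java
import java.util.List;
import java.util.Map;

/** Simplified event model used only for this sketch. */
class DataChangeEvent {
    enum Op { INSERT, UPDATE, DELETE }
    Op op;
    String tableId;
    Map<String, String> before;
    Map<String, String> after;
}

/**
 * Sketch of the proposed abstract MockDatabase: the public method routes events,
 * while the storage-specific behavior is left to subclasses (memory, JDBC, ...).
 */
public abstract class MockDatabase {

    protected final Integer taskId;

    protected MockDatabase(Integer taskId) {
        this.taskId = taskId;
    }

    /** Routes a data change event to the matching handler and records it. */
    public void applyDataChangeEvents(DataChangeEvent event) {
        switch (event.op) {
            case INSERT:
                insert(event.tableId, event.after);
                break;
            case UPDATE:
                update(event.tableId, event.before, event.after);
                break;
            case DELETE:
                delete(event.tableId, event.before);
                break;
            default:
                throw new UnsupportedOperationException("Unsupported op: " + event.op);
        }
        appendDataChangeLog(event);
    }

    protected abstract void insert(String tableId, Map<String, String> data);

    protected abstract void update(String tableId, Map<String, String> before, Map<String, String> after);

    protected abstract void delete(String tableId, Map<String, String> data);

    protected abstract void appendDataChangeLog(DataChangeEvent event);

    /** Query interfaces for result records and recorded change events. */
    public abstract List<Map<String, String>> queryResults(String tableId);

    public abstract List<DataChangeEvent> queryDataChangeEvents(String tableId);

    /** Clear interfaces for result records and recorded change events. */
    public abstract void clearResults(String tableId);

    public abstract void clearDataChangeEvents(String tableId);
}
```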

The Models used by MockDatabase:
Currently I reuse the models defined in dinky-common for the mock database, such as Table.class and Columns.class; of course, some models that don't exist yet are created, such as DataChangeEvents and SchemaChangeEvents. These models are mainly used by the dinky-metadata module, and since I think MockDatabase is also a kind of data source, I reuse them for MockDatabase, which also makes front-end/backend interaction more convenient. Besides, by defining a new field on Table.class we can easily tell whether a table is mocked as a source or a sink, which helps when selecting which tables to show sink results for, or which schemas users should submit test cases against. But something I hesitated about for a long time is whether it would be more suitable to reuse the models of flink-cdc, such as TableId.class and RecordData.class; although we would need to do some extra work to apply customized info, the flink-cdc models have more comprehensive fields and methods. This is another point I want to discuss with you.

The above are my preliminary designs for MockDatabase; looking forward to your reply and suggestions!

aiwenmo commented Dec 3, 2024

@MactavishCui Your design is excellent!

Today I have a bold idea: It would be more appropriate to define MockDataBase as Sandbox.

Sandbox is a hybrid data storage system for simulating data.

It has the following capabilities:

  1. Support and simulate data generation. Data formats include snapshot data, ChangeLog, stream data, vector data, etc.
  2. Support and simulate data writing. Writing methods include append, update, zipper, logical deletion, etc.
  3. Support unified metadata management.
  4. Support high-performance comparison of two data sets.
  5. Support real-time capture of end-to-end data latency.
  6. Support the mutual conversion of data in different database formats and forms.
  7. Support data rollback at a specified point in time.
  8. Support OLAP.
[Sandbox capabilities diagram]

We can implement an MVP version (Minimum Viable Product version) first, including the following capabilities:

  1. Support metadata management.
  2. Support high-performance writing of simulated data (separation of structure and data), and the writing methods include append, update, replacing ResultPool.
  3. Support single-table query.
  4. Support simulation data generation, including snapshot data and ChangeLog.
  5. Support writing in socket mode.

In addition, for the questions you raised, I have the following responses:

  1. The method parameters of Sandbox should be decoupled from Dinky's Task; replace taskId with a string boxName (even if they hold the same value).
  2. Map<String, String> data should carry the original data types, not just strings (see the small sketch after this list).
  3. Do not reference FlinkCDC's classes directly; they can be imitated instead.
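
For point 2, a tiny sketch of what carrying the original types could look like; the class names here are only illustrative, not part of any existing module:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a row whose values keep their original types instead of being strings. */
public class TypedRow {

    /** One column value plus the logical type it was produced with (e.g. "INT", "TIMESTAMP(3)"). */
    public static final class TypedValue {
        public final Object value;
        public final String logicalType;

        public TypedValue(Object value, String logicalType) {
            this.value = value;
            this.logicalType = logicalType;
        }

        @Override
        public String toString() {
            return value + " (" + logicalType + ")";
        }
    }

    private final Map<String, TypedValue> columns = new LinkedHashMap<>();

    public TypedRow set(String column, Object value, String logicalType) {
        columns.put(column, new TypedValue(value, logicalType));
        return this;
    }

    public Map<String, TypedValue> asMap() {
        return columns;
    }

    public static void main(String[] args) {
        TypedRow row = new TypedRow()
                .set("id", 1, "INT")
                .set("amount", new java.math.BigDecimal("9.99"), "DECIMAL(10,2)");
        System.out.println(row.asMap());
    }
}
```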

This innovative new feature is really attractive. Details will be continuously updated.
Looking forward to your reply.

MactavishCui commented Dec 4, 2024

@aiwenmo Thank you very much for your suggestions.
I agree with your idea about the sandbox! Different types of data have different characteristics, such as the event or processing time and the message order of stream data, and different characteristics lead to different task execution results; these characteristics can be simulated very well by a sandbox design! Also, with the help of the OLAP functions, we can build two versions of a task and compare the differences in their results before deployment, which really helps users analyze their tasks better! Mutual conversion and dataset comparison can also help a lot with sandbox transfer!

Regarding our MVP sandbox abilities, I made the following detailed designs, and here are some points I want to discuss with you:

  1. Support metadata management. An abstract method getMetadataManager needs to be implemented by subclasses; through it, an instance implementing the MetadataManager interface can be obtained for metadata management. The abilities of MetadataManager include managing sandbox metadata and table metadata. Sandbox metadata includes the sandBoxName, the data type (an enum covering stream data for Flink SQL tasks in streaming mode, snapshot data for Flink SQL tasks in batch mode, and CDC changelog data for CDC tasks; different data types also have different primary-key generation strategies, e.g. stream data can use the message order as the primary key while CDC changelog data can generate keys from key columns), and the supported data change events. Table metadata includes table info such as ids, namespaces, schemas, and whether a table is used for mock-sink or mock-source. (A rough sketch of this interface follows the list.)
  2. Support high-performance writing of simulated data (separating structure from data); writing methods include append and update, replacing ResultPool. Following the existing design, there is a default method applyChangeEvents plus several protected abstract change methods.
  3. Support single-table query: get results by tableId. (Maybe we could design a DSL (domain-specific language) or use other methods to support more complex queries later, in support of OLAP.)
  4. Support simulated data generation, including snapshot data and changelog. Generating data from a snapshot can be offered by a default method that produces several data insert events from the snapshot data.
  5. Support writing in socket mode. This is my point of confusion: in my understanding, the mock data sandbox runs inside the dinky-admin application. Mock data management abilities are provided via the web UI and open APIs, so mock sinks in application-mode jobs can insert results through these APIs; a WebSocket can be built between the dinky-admin application and Flink jobs, and insert operations can be forwarded by the application. So is it necessary to build a WebSocket between the sandbox and the Flink job directly? What is your opinion?
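
A rough sketch of the MetadataManager idea from point 1; the enum values and method names are my own placeholders for illustration:

```java
import java.util.List;

/** Illustrative sketch of the metadata side of the sandbox design; all names are assumptions. */
public interface MetadataManager {

    /** Kind of data a sandbox table holds; drives e.g. the primary-key generation strategy. */
    enum DataType {
        STREAM,        // stream data for Flink SQL tasks in streaming mode
        SNAPSHOT,      // snapshot data for Flink SQL tasks in batch mode
        CDC_CHANGELOG  // changelog data for CDC tasks
    }

    /** Sandbox-level metadata. */
    String getSandBoxName();

    DataType getDataType();

    /** Table-level metadata: fully qualified ids of the tables registered in this sandbox. */
    List<String> listTableIds();

    /** Registers a table with its schema columns and whether it is mocked as a source or a sink. */
    void registerTable(String tableId, List<String> columns, boolean isMockSource);
}
```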

All of the above is my design and my confusion; looking forward to your reply!

aiwenmo commented Dec 4, 2024

@MactavishCui Thank you for your reply! The following is my response:

Dinky Entity Relations: SQLTask -1:n-> History -1:1-> JobInstance

  1. The definition of the sandbox is closer to a collection of catalogs and does not include data types. The sandbox metadata includes the sandbox name and the table metadata, and a table is uniquely identified by boxName.catalogName.databaseName.tableName (a small sketch of such an id follows the list). Data types (stream data, snapshot data, Event, zipper data) should be bound to the table object, which provides the basis for future hybrid analysis. (For single-job debugging, the sandbox is like a message queue (there is no relationship between two topics): by default, a Dinky task will create a new exclusive sandbox instance through the jobInstanceId, and sandbox instances are completely isolated. For debugging multiple jobs in a data warehouse, the sandbox is more like a database, where more attention is paid to the result after the influence of multiple jobs; that is, the Dinky task may write data to an existing shared sandbox instance. As the functionality evolves, Sandbox may in the future manage all users' message queues and data lakehouses in an abstract way, provide read and write capabilities as a proxy, and even implement hybrid reading and hybrid writing. At that point, business users only know of the existence of one Sandbox, that is, HBox (Hybrid Box).)
  2. The high-performance structure of data is mainly the separation of schema and data. The data part can be compressed into binary. This design inspiration comes from the high-performance data structure of FlinkCDC YAML jobs, so a list of N records captured by the mock sink can be optimized into one schema plus N-1 data records. For data tables of the Event type (including SchemaChangeEvent and DataChangeEvent), the structure can refer to the Event in FlinkCDC.
  3. Single-table query is related to the storage implementation. Queries over memory, HDFS, Kafka, etc. can be achieved through Apache Calcite, and other storages can be handled by translating into the corresponding query statements.
  4. Data generation is an auxiliary function. When used as a data source, the data may come from script generation, CSV import, page editing, or existing storage tables. In addition, regarding the table in the first point of the metadata, I think it should not be distinguished whether a table is a simulated source or a target, because a table can be used as both a source and a target, for example when building a minute-level layered data warehouse in Apache Paimon. It is more appropriate to change that description to "supports serving as a data source for stream computing".
  5. When the job runs in YARN/K8s application mode, the mock sink supports writing the captured data to the sandbox in real time through WebSocket.
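
A small sketch of the fully qualified table id from point 1; this only illustrates the naming scheme, not an existing class:

```java
/** Illustrative value object for the qualified id boxName.catalogName.databaseName.tableName. */
public final class SandboxTableId {

    public final String boxName;
    public final String catalogName;
    public final String databaseName;
    public final String tableName;

    public SandboxTableId(String boxName, String catalogName, String databaseName, String tableName) {
        this.boxName = boxName;
        this.catalogName = catalogName;
        this.databaseName = databaseName;
        this.tableName = tableName;
    }

    /** Parses "box.catalog.database.table" into its four parts. */
    public static SandboxTableId parse(String qualified) {
        String[] parts = qualified.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected boxName.catalogName.databaseName.tableName: " + qualified);
        }
        return new SandboxTableId(parts[0], parts[1], parts[2], parts[3]);
    }

    @Override
    public String toString() {
        return boxName + "." + catalogName + "." + databaseName + "." + tableName;
    }
}
```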

Looking forward to your reply.

@MactavishCui

@aiwenmo
Thanks very much for your reply!
I'm excited that I now understand the sandbox design and how it works in the Dinky application much better! Previously I only focused on the debugging process of a single task, and you provide a much better perspective! Its optimization designs really attract me too!
In the next day or two I'll organize the conclusions of our discussion and upload a new, more detailed design scheme with diagrams based on the latest conclusions. Please review it then!
THANKS VERY MUCH again!

aiwenmo commented Dec 4, 2024

@MactavishCui Thank you for your participation! Regarding this, I am also very excited. This is the spark generated by the collision of two ideas, and both are indispensable.
Continuous communication provides more details for the concretization of the sandbox. The sandbox brings innovative and unique capabilities to Dinky, and I believe it will attract more people.
Currently, I am preparing for the official release of Dinky-1.2.0. After the release is completed, I will devote most of my energy to the construction of the sandbox.
I am looking forward to your new design scheme with more details, and I am willing to review it. thx

MactavishCui changed the title from "[Feature][core] Flink sql task support customized test case input" to "[Feature][core] Flink sql task support customized test case input and refactor mocksink" on Dec 4, 2024
MactavishCui changed the title back to "[Feature][core] Flink sql task support customized test case input" on Dec 4, 2024

MactavishCui commented Dec 9, 2024

Hi @aiwenmo! I have made some attempts these days regarding the sandbox and how to use it. According to our discussion last week, in the MVP version the sandbox has abilities covering metadata, schema, and record management. Based on this new design, I think the following points can be worked on:

  1. Refactor mock sink. Currently mock sink only supports Local, Standalone, Yarn Session, and Kubernetes Session modes, but now, with the help of the sandbox WebSocket sink, Flink jobs in all modes can be supported.
  2. Implement the new feature of this issue: support user-customized test case input.

Here are my new designs for the second point.
Firstly, a prototype design was carried out for the front-end page.
[Mock configuration page prototype]
I designed an independent mock configuration page. It has three options: a mock sink option, a customized mock source option, and sandbox selection.

  • Mock sink option: as it does currently, the mock sink operator will be used instead of the real sink operator when this setting is enabled.
  • Customized source option: when this setting is enabled, jobs will read data from the sandbox; when it is disabled, jobs will read data from the real source.
  • Sandbox selection: users can select an existing sandbox by sandBoxName. The input data from that sandbox can be previewed (only the data used by the current task is filtered and shown).
    [Sandbox selection screenshot]
    The left container is a sandbox management page. This page is similar to the datasource management page but has more functions: users can edit sandbox table schemas and records there. Schemas can be initialized from user input, from Flink SQL task statements, or from a datasource connection. Records can be edited on the web page or uploaded as files.
    [Sandbox management page screenshot]

About the backend implementation:
A SandBoxController and a SandBoxService are designed for sandbox management; the MVP version has the following interfaces (a rough sketch follows the list):

  • Init schema by taskId and sandBoxName
  • Insert a record into sandBoxTableId (sandBoxName.namespace.schema.tableName)
  • Update a record in sandBoxTableId
  • Delete a record from sandBoxTableId
  • Get all sandboxes
  • Get tables by sandBoxName
  • Get schema and all records by sandBoxName
  • An open API in ApiController to get records by sandBoxTableId
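
A rough sketch of the SandBoxService interface derived from the list above; the record type and method names are simplified placeholders, not the final code:

```java
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of the proposed SandBoxService, derived from the interface list above;
 * all method names and types here are assumptions.
 */
public interface SandBoxService {

    /** Initializes sandbox table schemas from a task's source table definitions. */
    void initSchema(Integer taskId, String sandBoxName);

    /** Record operations, keyed by sandBoxTableId = sandBoxName.namespace.schema.tableName. */
    void insertRecord(String sandBoxTableId, Map<String, String> record);

    void updateRecord(String sandBoxTableId, Map<String, String> before, Map<String, String> after);

    void deleteRecord(String sandBoxTableId, Map<String, String> record);

    /** Queries used by the management page and by the open API. */
    List<String> listSandBoxes();

    List<String> listTables(String sandBoxName);

    List<Map<String, String>> getRecords(String sandBoxTableId);
}
```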

Besides, the task debug process is designed as follows:
[Task debug process diagram]
The connector of the source table will be converted to sanbox-mock-connector; the connector params include (a small sketch of these options follows the list):

  • sandBoxTableId for this table.
  • data type: snapshot / changelog
  • host and port information to build websoct connection and visit openapi
  • Some optional settings, such as the delay of stream data
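
A small sketch of how these connector options could be assembled; the option keys and connector name are illustrative placeholders, not final names:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: assembles the WITH options the rewritten source table might carry. */
public class SandboxMockSourceOptions {

    public static Map<String, String> build(String sandBoxTableId, String dataType,
                                             String adminHost, int adminPort, long streamDelayMs) {
        Map<String, String> options = new HashMap<>();
        options.put("connector", "sandbox-mock-connector"); // hypothetical connector name
        options.put("sandbox.table-id", sandBoxTableId);     // sandBoxName.namespace.schema.tableName
        options.put("sandbox.data-type", dataType);          // "snapshot" or "changelog"
        options.put("sandbox.admin-host", adminHost);        // used for WebSocket / open API access
        options.put("sandbox.admin-port", String.valueOf(adminPort));
        options.put("sandbox.stream-delay-ms", String.valueOf(streamDelayMs));
        return options;
    }
}
```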

All of the above is my new design for this feature. Is this scheme acceptable, and may I submit a PR for it based on the sandbox? I see that you will be working on the sandbox construction, so if I can submit code for this feature before the sandbox is finished, I'll mock the usage of the sandbox temporarily, focus on the front-end and backend code that applies the sandbox to this feature, and make the corresponding changes once the sandbox is ready. I also noticed there were some code optimizations after my last new-feature PR, so I'll study those optimization commits and pay more attention to the scheme and the code style.
If I can submit a PR for this new feature, I'll be really grateful if you notify me when the sandbox is ready. Additionally, I'll continue working on other issues as well.
Looking forward to your reply.

aiwenmo commented Dec 10, 2024

Thank you very much for your attempt and design. I agree with your scheme. Besides, there is a small issue below that needs correction:

host and port information to build websocket connection and visit openapi.
The connector of source table will be converted to sandbox-mock-connector, not sanbox-mock-connector.

Finally, I will notify you when the sandbox is ready. Thanks!
