AMR调度系统性能优化/AMR Dispatch System Performance Optimization

背景

转眼工作两年了，优必选的AMR调度系统（后划为优奇）作为主要的开发项目，经历了这个系统从0到1的过程，也熟悉了软件开发中：调研-设计-开发-测试-迭代的过程。

但随着交付环境的变化和对系统性能及稳定性的要求，原始的系统框架设计以及开发过程中打补丁的方法无法满足新的需要，为了跟上业务的发展脚步，需要重构/优化现有系统，由于优化方向主要是个人的想法，不会照顾到现行的本体逻辑，而是对接之前开发的AMR单车仿真模型，这样自己可以决定上下游的设计，自由度更高。

The larger the project and the worse the code, the more difficult it is to clean it up and the more resources it may require

-- refactor-like-a-superhero

还有一点，作为面试内容，性能就是最好的面试题，它从算法到架构，既能考察候选人的潜力，也能考察候选人的工程能力

一个“耗时太长”的bug

首次在二维码大场景下测试时，出现了寻路耗时太久的bug，表现在较近的目标点能及时响应，很远的话就会出现下发路径耗时较长，触发任务失败。

最后定位bug原因是拼接json格式的路径信息时使用了eval()方法去除数据结构最外层的“”，不使用json.loads的原因是数据中存在不符合json规范的字段：None，该信息通过http协议从后台获取，而最早开发时并未完全遵守相关的协议规范，当时实际的环境为slam场景，路径点相对稀松，问题就没有暴露，加上联调时间紧，采用这种hack的方法去掉外层引号。

这说明在开发伊始，制定规范的协议标准是必须的，不同业务层面需要严格遵守，保证南向数据给出去符合协议要求。

A good system design requires us to think about everything, from infrastructure all the way down to the data and how it’s stored

-- System Design Course

分析

To refactor code faster, without regressions, and without “too much extra work,” we need to first study and prepare the code.

Define the refactoring scope and boundaries.

Cover the code under refactoring with tests.

Configure linters and, if needed, the compiler more strictly.

-- refactor-like-a-superhero

现行的系统架构是根据不同功能划分成不同的模块，当各自触发条件生效时，异步操作对象，使用同步锁确保数据的一致和有效；使用redis实现线程间通讯，通过单例的类和锁完成redis数据的读写；接受任务为响应式。

这样造成的问题有：

冷热数据没有隔离，IO读写压力大
与AMR本体通讯在逻辑上没有做握手，丢包后无响应
无任务池，无法在更高尺度优化任务
没有更深的利用OOP思想，部分数据结构冗余低效
代码/架构有“shit山”倾向，后续发展不便于维护，部分线程缺少退出逻辑
部分算法缺乏动态性设计，过度依赖解决冲突
部分数据没有善用缓存，读取频率较刷新频率更频繁
现行单例实现不稳定

在重构之前，为节约时间和有限度的修改代码便于跟踪，需要定义重构的范围和边界，制定测试用例。

同时重构不能改变原功能，最小化git提交

~~以任务模块为例，判断本体上报子任务~~

思路

针对出现的问题，主要优化思路如：

To refactor code faster, without regressions, and without “too much extra work,” we need to first study and prepare the code. In particular, we should:

Define the refactoring scope and boundaries.

Cover the code under refactoring with tests.

Configure linters and, if needed, the compiler more strictly.

-- refactor-like-a-superhero

设计并验证调度系统内基于线程锁的“发布-订阅”范式，通过设计的AMR的class()实现数据读写，部分共享信息通过较低刷新频率写入redis，实现冷热数据隔离；
设计消息队列和与本体的应答逻辑，实现本体失联后的消息重发，减少对本体状态的依赖
设计完善的线程池逻辑，针对不同地图及AMR在线情况决定线程存活，进一步引入aio生态
设计相关的数据类替代内部逻辑强关联的数据结构，封装处理逻辑
设计任务池及相关寻路/交管算法，在单一空间冲突上加入时序特征，减缓或降低冲突发生
优化全局路径规划算法，统一二维码/slam环境下不同拓扑关系下的寻路效果，优化长路径、转弯角度、空/负载情况下路径生成，保证耗时/能耗最优

未完待续

Background

I have now been working for two years. UBTECH's AMR dispatch system (later moved under 优奇) has been my main development project. I took it from 0 to 1, and along the way became familiar with the software development cycle of research, design, development, testing, and iteration.

However, as delivery environments changed and the requirements for performance and stability grew, the original framework design and the patch-as-you-go approach used during development could no longer meet the new needs. To keep pace with the business, the existing system has to be refactored and optimized. Because the optimization direction is mostly my own idea, it does not try to accommodate the logic currently running on the AMR itself. Instead it interfaces with the single-AMR simulation model I developed earlier, so that I can decide the upstream and downstream design myself, with more freedom.

The larger the project and the worse the code, the more difficult it is to clean it up and the more resources it may require

-- refactor-like-a-superhero

One more point: as interview material, performance is the best interview question. It spans everything from algorithms to architecture, and it tests both a candidate's potential and their engineering ability.

A bug that "takes too long"

The first time I tested in a large QR-code scenario, a bug appeared: pathfinding took too long. Nearby target points got a timely response, but for distant targets, dispatching the path took so long that the task failed.

The root cause turned out to be that, when assembling the path information in JSON format, eval() was used to strip the outermost quotes from the data structure. json.loads was not used because the data contained a value that is not valid JSON: None. This data is fetched from the backend over HTTP, and the earliest development work did not fully follow the relevant protocol specification. The real environment at that time was a SLAM scenario, where path points are relatively sparse, so the problem never surfaced. With integration time also tight, this hack was adopted to remove the outer quotes.
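A minimal sketch of one way to avoid eval() here. It assumes the backend can be changed to emit standard JSON (so None is serialized as null), and that the outer quotes come from the payload being JSON-encoded twice. Both are assumptions, since the actual payload is not shown in this post. The function name decode_path and the sample point fields are hypothetical.

```python
import json


def decode_path(payload: str):
    """Decode a path payload that may have been JSON-encoded twice."""
    data = json.loads(payload)
    # If the first pass yields a string, the payload was wrapped in an
    # extra layer of quotes: decode once more instead of calling eval().
    if isinstance(data, str):
        data = json.loads(data)
    return data


# Hypothetical path points; "payload" stands in for the backend response.
points = [{"id": 1, "x": 0.0, "y": 1.5, "action": None}]
payload = json.dumps(json.dumps(points))  # json.dumps writes None as null

decoded = decode_path(payload)

# A Python-style literal containing None is not valid JSON and is rejected.
try:
    json.loads('[{"id": 1, "action": None}]')
    rejected = False
except json.JSONDecodeError:
    rejected = True
```

json.loads only parses data and never executes it, so the payload cannot run arbitrary code the way a string passed to eval() can.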

This shows that defining a protocol standard at the very start of development is essential. Every business layer must follow it strictly, so that the southbound data it sends out conforms to the protocol.

A good system design requires us to think about everything, from the infrastructure all the way down to the data and how it’s stored

-- System Design Course

Analysis

To refactor code faster, without regressions, and without "too much extra work," we need to first study and prepare the code.

Define the refactoring scope and boundaries.

Cover the code under refactoring with tests.

Configure linters and, if needed, the compiler more strictly.

-- refactor-like-a-superhero

The current architecture is split into modules by function. When a module's trigger condition fires, it operates on objects asynchronously, with synchronization locks keeping the data consistent and valid. Redis is used for inter-thread communication, and all Redis reads and writes go through a singleton class and a lock. Task acceptance is reactive.

This causes the following problems:

Hot and cold data are not isolated, so I/O read/write pressure is high
Communication with the AMR has no logical handshake, so nothing responds after packet loss
There is no task pool, so tasks cannot be optimized at a higher level
OOP is not used in any depth, and some data structures are redundant and inefficient
The code/architecture is trending toward a "shit mountain": it will be hard to maintain going forward, and some threads lack exit logic
Some algorithms lack dynamic design and rely too heavily on conflict resolution
Some data does not make good use of caching, even though it is read more often than it is refreshed
The current singleton implementation is unstable
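The list does not say why the singleton is unstable. One common cause in threaded Python code is a race when the instance is first created, so the sketch below guards creation with a lock (double-checked locking). SharedStore and its in-memory dict are hypothetical stand-ins for the Redis wrapper, not the system's actual class.

```python
import threading


class SharedStore:
    """Hypothetical stand-in for the singleton that wraps Redis access."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self._data = {}                     # in-memory stand-in for Redis
        self._data_lock = threading.Lock()  # guards reads and writes

    @classmethod
    def instance(cls):
        # First check avoids taking the lock once the instance exists.
        if cls._instance is None:
            with cls._instance_lock:
                # Second check: another thread may have created it while
                # this one was waiting for the lock.
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def write(self, key, value):
        with self._data_lock:
            self._data[key] = value

    def read(self, key, default=None):
        with self._data_lock:
            return self._data.get(key, default)


# Several threads ask for the store at the same time.
results = []


def worker(n):
    store = SharedStore.instance()
    store.write(f"amr{n}", n)
    results.append(store)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All eight threads end up holding the same object, and each write goes through the data lock.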

Before refactoring, to save time and to keep code changes limited and easy to track, the scope and boundaries of the refactoring need to be defined and test cases written.

The refactoring must also leave the original functionality unchanged, and git commits should be kept minimal.

~~Take the task module as an example: determine the subtasks reported by the AMR~~

Approach

For the problems above, the main optimization ideas are as follows:

To refactor code faster, without regressions, and without "too much extra work," we need to first study and prepare the code. In particular, we should:

Define the refactoring scope and boundaries.

Cover the code under refactoring with tests.

Configure linters and, if needed, the compiler more strictly.

-- refactor-like-a-superhero

Design and validate a thread-lock-based "publish-subscribe" pattern inside the dispatch system, with data reads and writes going through a purpose-built AMR class; some shared information is written to Redis at a lower refresh rate, which isolates hot and cold data
Design a message queue and acknowledgment logic with the AMR, so that messages are resent after an AMR loses connection, reducing the dependence on the AMR's state
Design thorough thread pool logic that decides which threads stay alive based on the map and on which AMRs are online, and go further by introducing the aio ecosystem
Design data classes to replace data structures that are tightly coupled to internal logic, encapsulating the processing logic
Design a task pool and the related pathfinding/traffic-control algorithms, adding a time dimension to purely spatial conflicts in order to delay or reduce conflicts
Optimize the global path planning algorithm: unify pathfinding results across the different topologies of QR-code and SLAM environments, and optimize path generation for long paths, turning angles, and empty/loaded states, to ensure the best time/energy cost

To be continued

Author: Nemo
