Directions

Light weight formal approaches applicable to problems in V&V of distributed CPS with (some of) the following traits:

Evaluation and validation of our approaches using ROS Gazebo simulations

Decomposition of Communication and Continuous Dynamics

Extension/revision of Koord

Message passing style, Actor model
- Thespian, a Python package for programming in Actor model
Asynchrony, partial synchrony
Fault tolerant, failure recovery
- Reliable vs out of order delivery
Safety and Liveness guarantees for distributed coordination under local assumptions
- Precise safety and liveness for an agent can be very difficult to provide
- Empirical/statistical/probabilistic provided by local monitoring or local models

See previous posts.

Debug or verify ROS ActionLib

Predicate detection problem

Deterministic scheduling and replay in different abstraction levels and areas

Operating system level support
- Liblog https://static.usenix.org/events/usenix06/tech/geels/geels.pdf
- Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging
Distributed realtime systems
- Towards Lightweight Logging and Replay of Embedded, Distributed Systems, https://hal.archives-ouvertes.fr/hal-00848180
- Using deterministic replay for debugging of distributed real-time systems, Euromicro RTS 2000

Web

Software failure reproduction

A Search-Based Framework for Failure Reproduction, SSBSE 2012
ReCrash: Making Software Failures Reproducible by Preserving Object States, ECOOP 2008

ROS Gazebo discussion

https://answers.gazebosim.org//question/25010/repeatability-of-experiments-and-determinism-of-gazebo-simulation/
https://discourse.ros.org/t/deterministic-replay-and-debugging/1316
https://roscon.ros.org/2018/presentations/ROSCon2018_Mozilla_rr.pdf
- Video: https://vimeo.com/293623186
- Only for single ROS node with multiple threads

Based on our work “SkyTrakx: A Toolkit for Simulation and Verification of Unmanned Air-Traffic Management Systems”, ITSC 2021

UTM sets up a good narrative and problem scope for centralized approaches

It must be centralized because UTM is likely coordinating aircraft from different delivery companies for example.
- Distinguish us from Drona and many other decentralized approaches for distributed planning
- More related to TCAS and ACAS family, but deals with finer space-time partitions and low-altitude terrain and obstacles.
Using abstraction of continuous dynamics, path planning, and controllers is natural and necessary
- because UTM and its agent-side protocol have to support various kinds of unmanned aircraft.
- Besides distributed coordination, difficulties also come from less control and information on aircraft states.

Ideas and extensions from the submission

New protocols

Violation detection, safety under bounded violation of contract

Violation can be detected by agent side protocol
Bounded violation means violation is always resolved within some time period OR some distance traveled.
- Violation not caused by failure but by overly strict contracts are mostly resolved shortly

Failure report, safety under failure with or without recovery

Failure should be decided and reported by aircraft. Failure recovery is also done by the aircraft and can be abstracted.
- Failure can induce violation, but it is neither a necessary nor sufficient condition.
- UTM observes only violations and needs a report to know the failure.
Notification to other aircraft and reassign contracts
However, (hardware) failure for CPS is typically very specific to the dynamics and the model
- Decompose fail action (general comm.) from its message (specific datatype with semantics)

Centralized monitoring and prediction with limited state information

Using only observed positions and approved contracts instead of full dynamic model

Improvements on the submission

Measuring failure related behavior
- It is done in the last minute, and we didn’t discuss them thoroughly.
Unclear “formal” models
- We didn’t know whether the agent automaton is a hybrid I/O automaton or not.
- We should be able to move the abstraction of trajectory control out of the agent automaton, and make it a simple IOA only for communication.
- Locally monitoring the violation can be a separated HIOA with only clock and position as continuous variables
Lack of discussion in the following
- Static obstacles
- From a global clock to local clocks with skews
- Communication delays in liveness proof

Improvements and extensions on implementation and running experiments

Reliable and correct implementation/refinement of the protocol
It seems a fixed delta period is assumed between each iteration of actions, but the implementation does not
Introduce communication delays and message drops
Evaluate scalability over more metrics
- Flight paths
Ad-hoc logging which makes collecting extra data from experiment result not easy