Multi-Robot Home Surveillance
Table of Contents
- Problem Statement
- Challenges
- Video Demo
- Program Structure
- TF Tree
- RViz Interface
- Leader Election Algorithm
- Resiliency and Node Failure Detection
- Limitations
- Applications
- Team Members
- Special Thanks
Problem Statement
- The home is divided into four security zones.
- We can think of them as a sequence, ordered by their importance: [A, B, C, D].
- If we have n live robots, we want the first n zones in the list to be patrolled by at least one robot.
- n == 2 → patrolled(A, B)
- n == 3 → patrolled(A, B, C)
- 1 <= n <= 4
- The goal is for this condition to be an invariant of our program, despite robot failures (a minimal sketch in code follows this list).
- Let robotZ refer to the robot patrolling zone Z.
- Illustration:
- n == 4 → patrolled(A, B, C, D)
- robotA dies
- n == 3 → patrolled(A, B, C)
- robotB dies
- n == 2 → patrolled(A, B)
- robotA dies
- n == 1 → patrolled(A)
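A minimal, ROS-free sketch of the invariant as a zone-assignment rule. The assign_zones helper and the robot name strings are illustrative, not the project's actual identifiers:

```python
# Zones ordered by importance; with n live robots, only the first n zones
# need to be patrolled.
ZONES = ["A", "B", "C", "D"]

def assign_zones(live_robots):
    """Map the first n zones to the n live robots (1 <= n <= 4)."""
    n = len(live_robots)
    assert 1 <= n <= len(ZONES)
    return dict(zip(ZONES[:n], live_robots))

# Example: two live robots -> only zones A and B are patrolled.
print(assign_zones(["roba", "rafael"]))   # {'A': 'roba', 'B': 'rafael'}
```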
Challenges
- We need to:
- map the home
- SLAM (Simultaneous Localization and Mapping) with the gmapping package
- have multiple robots patrol the security zones in parallel
- AMCL (Adaptive Monte Carlo Localization) and the move_base package
- One thread per robot with a SimpleActionClient sending move_base goals (see the sketch after this list)
- have the system be resilient to robot failures, and adjust for the invariant appropriately
- Rely on Raft’s Leader Election Algorithm (‘LEA’) to elect a leader robot.
- The remaining robots will be followers.
- The leader is responsible for maintaining the invariant among followers.
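A hedged sketch of one such patrol thread. The namespaced move_base action server, the waypoint list, and the stop_event interrupt mechanism are assumptions, not the project's exact interfaces:

```python
import threading

import actionlib
import rospy
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def patrol_zone(robot_ns, waypoints, stop_event):
    """Cycle through a zone's waypoints until the leader sets stop_event."""
    client = actionlib.SimpleActionClient(robot_ns + "/move_base", MoveBaseAction)
    client.wait_for_server()
    while not rospy.is_shutdown() and not stop_event.is_set():
        for x, y in waypoints:
            if stop_event.is_set():
                client.cancel_all_goals()   # the 'interrupt' from the leader
                return
            goal = MoveBaseGoal()
            goal.target_pose.header.frame_id = "map"
            goal.target_pose.header.stamp = rospy.Time.now()
            goal.target_pose.pose.position.x = x
            goal.target_pose.pose.position.y = y
            goal.target_pose.pose.orientation.w = 1.0
            client.send_goal(goal)
            client.wait_for_result()

# One thread per robot, e.g.:
# stop = threading.Event()
# threading.Thread(target=patrol_zone, args=("/roba", [(1.0, 0.5), (2.0, 1.5)], stop)).start()
```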
- More on system resiliency:
- The leader always patrols zone A, which has the highest priority.
- On election, the leader assigns the remaining zones to followers to maintain the invariant.
- On leader failure:
- global interrupts issued to all patrol threads (all robots stop).
- the LEA elects a new leader
- On follower failure:
- local interrupts issued by leader to relevant patrol threads (some robots stop).
- Example (sketched in code after this list):
- n == 4 → patrolled(A, B, C, D)
- robotB dies
- leader interrupts robotD
- leader assigns robotD to zone B
- n == 3 → patrolled(A, B, C)
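A ROS-free sketch of the reassignment rule illustrated in the example above. The function name and the robot strings are illustrative; the real leader would also interrupt and restart the affected patrol threads:

```python
def handle_follower_death(dead_robot, assignment):
    """assignment maps zone -> robot; zone letters double as priorities (A highest)."""
    vacated = next(z for z, r in assignment.items() if r == dead_robot)
    del assignment[vacated]
    # The robot patrolling the lowest-priority occupied zone moves up to fill the gap.
    lowest = max(assignment)   # 'D' sorts after 'C' after 'B' after 'A'
    if lowest > vacated:
        mover = assignment.pop(lowest)
        print("leader interrupts %s and assigns it to zone %s" % (mover, vacated))
        assignment[vacated] = mover
    return assignment

# Example matching the scenario above: robot in zone B dies, robot in zone D moves to B.
# handle_follower_death("r2", {"A": "r1", "B": "r2", "C": "r3", "D": "r4"})
# -> {'A': 'r1', 'C': 'r3', 'B': 'r4'}   (zones A, B, C remain patrolled)
```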
Video Demo
Program Structure
- Two points of entry: real_main.launch and sim_main.launch.
- real_main.launch:
- used for real-world surveillance of our home environment
- runs the map_server with the home.yaml and home.png map files obtained by SLAM gmapping.
- All robots will share this map_server
- launches robots via the real_robots.launch file
- launches rviz with the real_nav.rviz configuration
- sim_main.launch:
- used for simulated surveillance of gazebo turtlebot3_stage_4.world
- works similarly to real_main.launch, except it uses a map of stage_4 and sim_robots.launch to run Gazebo models of the robots via tailored URDF files.
TF Tree
RViz Interface
Leader Election Algorithm
- Based on Diego Ongaro and John Ousterhout, 2014, “In Search of an Understandable Consensus Algorithm” (the Raft paper)
- The full Raft algorithm uses a replicated state machine approach to maintain a set of nodes in the same state. Roughly:
- Every node (process) starts in the same state, and keeps a log of the commands it is going to execute.
- The algorithm elects a leader node, which ensures that the logs are properly replicated so that they stay identical across nodes.
- Raft’s LEA and its log replication algorithm are fairly separable. We will only use the LEA.
- Background:
- Each of the robots is represented as an mp (member of parliament) node.
- mp_roba, mp_rafael, etc.
- Each node is in one of three states: follower, candidate, or leader.
- If a node is a leader, it starts a homage_request_thread that periodically publishes messages to the other nodes to maintain its status and prevent new elections.
- Every mp node has a term number, and the term numbers of mp nodes are exchanged every time they communicate via ROS topics.
- If an mp node’s term number is less than a received term number, it updates its term number to the received term number.
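A hedged sketch of the homage_request_thread and the term-exchange rule just described, assuming a shared /homage_request topic that carries only the sender's term number (topic name, message type, and rate are assumptions):

```python
import rospy
from std_msgs.msg import Int32

def homage_request_thread(my_term):
    """Run by the leader only: a periodic heartbeat that suppresses new elections."""
    pub = rospy.Publisher("/homage_request", Int32, queue_size=10)
    rate = rospy.Rate(5)   # 5 Hz, comfortably inside the 2-3 s election timeout
    while not rospy.is_shutdown():
        pub.publish(Int32(data=my_term))
        rate.sleep()

def on_homage_request(msg, node):
    """Follower-side callback: adopt any higher term and reset the election timer."""
    if msg.data > node.term:
        node.term = msg.data
    node.last_homage = rospy.Time.now()
```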
- LEA (a condensed code sketch follows this list):
- On startup, every mp node is in the follower state, and has:
- a term number; and
- a duration of time called the election timeout.
- When a follower node receives no homage request messages from a leader for the duration of its election timeout, it:
- increments its term number;
- becomes a candidate;
- votes for itself; and
- publishes vote request messages to all the other mp nodes in parallel.
- When a node receives a vote request message, it will vote for the candidate iff:
- the term number of the candidate is at least as great as its own; and
- the node hasn’t already voted for another candidate.
- Once a candidate mp node receives a majority of votes it will:
- become a leader;
- send homage request messages to other nodes to declare its status and prevent new elections.
- The election timeout of each mp node is a random duration between 2 and 3 seconds.
- This helps prevent split votes, since it’s likely that some follower will time out first, acquire the necessary votes, and become the leader.
- Still, in the event of a split vote, candidates will:
- reset their election timeouts to a new random duration within the same interval; and
- wait for those timeouts to expire before starting a new election.
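A condensed, ROS-free sketch of this election flow. In the real system the vote requests and votes travel over ROS topics and in parallel; here they are plain method calls so the state transitions are easy to follow, and all names are illustrative:

```python
import random
import time

ELECTION_TIMEOUT_RANGE = (2.0, 3.0)   # seconds, as above

class MP(object):
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers              # the other mp nodes
        self.state = "follower"
        self.term = 0
        self.voted_for = None
        self.last_homage = time.time()  # time of the last homage request received

    def maybe_start_election(self):
        timeout = random.uniform(*ELECTION_TIMEOUT_RANGE)
        if time.time() - self.last_homage < timeout:
            return                       # the leader is alive; stay a follower
        self.term += 1                   # no homage requests for a full timeout:
        self.state = "candidate"         # become a candidate,
        self.voted_for = self.name       # vote for itself,
        votes = 1 + sum(p.handle_vote_request(self.term, self.name) for p in self.peers)
        if votes > (len(self.peers) + 1) // 2:   # majority of all mp nodes
            self.state = "leader"        # would now start the homage_request_thread

    def handle_vote_request(self, cand_term, cand_name):
        """Grant the vote iff the candidate's term is at least ours and we
        have not already voted for another candidate."""
        if cand_term > self.term:
            self.term, self.voted_for = cand_term, None
        if cand_term >= self.term and self.voted_for in (None, cand_name):
            self.voted_for = cand_name
            return True
        return False
```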
Resiliency and Node Failure Detection
- The election timeout feature is part of what enables leader resiliency in our code.
- When a leader node is killed, one of the follower nodes times out and becomes the new leader.
- roscore keeps a list of currently active nodes. So:
- followers, before transitioning to candidate state, update their list of live mp nodes by consulting roscore.
- leaders do the same via a follower_death_handler thread, which runs concurrently with the main thread and the homage_request_thread, and maintains the invariant (sketched below).
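A hedged sketch of the roscore consultation, using rosnode.get_node_names() to list the currently active nodes; the surrounding loop and the reassign_zones_for call are illustrative:

```python
import rosnode
import rospy

def update_live_mp_nodes(known_mp_nodes):
    """Ask the ROS master which of the known mp nodes are still alive."""
    active = set(rosnode.get_node_names())
    alive = [n for n in known_mp_nodes if n in active]
    dead = [n for n in known_mp_nodes if n not in active]
    return alive, dead

# Simplified follower_death_handler loop on the leader:
# while not rospy.is_shutdown():
#     known_mp_nodes, dead = update_live_mp_nodes(known_mp_nodes)
#     for robot in dead:
#         reassign_zones_for(robot)   # interrupt and reassign patrols (illustrative)
#     rospy.sleep(1.0)
```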
Limitations
- The mp nodes are all running on the remote VNC computer, which also runs the roscore.
- So killing the mp node is just a simulation of robot failure, not true robot failure.
- We’re just killing a process on the VNC computer, not the robot itself.
- We can remedy this by binding the mp node’s life to any other nodes that are crucial to our robot fulfilling its task.
- E.g., for our project the /roba/amcl and /roba/move_base nodes are critical.
- So we can consult roscore periodically to check whether any of these nodes has died and, if so, kill mp_roba (sketched below).
- We can also define and run more fine-grained custom nodes that keep track of any functional status of the robot we’re interested in, and bind the life of those custom nodes to the mp nodes.
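A hedged sketch of this remedy: a watchdog inside the mp_roba process that polls roscore and shuts the node down if a critical node disappears. The node names match those above; the polling period and function name are assumptions:

```python
import rosnode
import rospy

CRITICAL_NODES = ["/roba/amcl", "/roba/move_base"]

def critical_node_watchdog():
    """Meant to run as a thread inside mp_roba (rospy already initialized)."""
    rate = rospy.Rate(1.0)   # check once per second
    while not rospy.is_shutdown():
        active = set(rosnode.get_node_names())
        missing = [n for n in CRITICAL_NODES if n not in active]
        if missing:
            rospy.logwarn("critical node(s) %s died; shutting down mp_roba", missing)
            rospy.signal_shutdown("critical node failure")   # mp_roba dies too
        rate.sleep()
```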
- Running each mp node on its corresponding robot is a bad idea!
- The algorithm would register a failure only if the robot were totally destroyed or the mp process itself crashed; subtler failures that leave the mp process running would go undetected.
- Performance will take a hit:
- bandwidth must be consumed for messages between mp nodes
- each robot’s computational load will increase
- Our system has a single point of failure: the VNC computer running roscore.
- So we have to keep the VNC safe from whatever hostile environment the robots are exposed to.
- This could likely be remedied in ROS2, which uses decentralized node discovery and does not rely on a single roscore.
- Our patrol algorithm is very brittle and non-adversarial
- Depends on AMCL and move_base, which do not tolerate even slight changes to the environment, and which do not deal well with moving obstacles.
- There are no consequences to disturbing the patrol of the robots, or the robots themselves (e.g. no alarm or ‘arrests’)
Applications
- The LEA is fairly modular, and the algorithm can be adapted to tasks other than patrolling.
- It would work best for tasks that are clearly ordered by priority and that are hazardous or likely to cause robot failures.
- Completely fanciful scenarios:
- Drone strikes: certain military targets might be ordered by priority, and drones could be assigned to them so that higher-priority targets receive stubborn attention.
- Planetary exploration: Perhaps some areas of a dangerous planet have a higher exploration priority than others. Our system could be adapted so that the highest-priority areas get patrolled first.
Team Members
- James Lee (leejs8128@brandeis.edu)
Special Thanks
- A special thanks to Adam Ring, the TA for the course, who helped with the project at crucial junctures.