
Draft - Bloom Filter Probabilistic Routing #5697

Closed · wants to merge 77 commits

Conversation

@medentem (Contributor) commented Dec 30, 2024

This is a draft for discussion and review:

Proposal

Expand relay_node to 4 bytes and include a 13-byte (104-bit) Bloom filter in the packet header. Each hop adds up to N uint32_t node IDs. The filter uses 2 hash functions.
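
To make the mechanics concrete, here is a minimal sketch of such a filter in C++. The bit-mixing function and the two seeds are illustrative stand-ins, not the draft's actual hash functions:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t FILTER_BYTES = 13;
constexpr std::size_t FILTER_BITS = FILTER_BYTES * 8; // 104 bits

// Illustrative 32-bit mixer (Murmur3-style finalizer); the draft's
// actual hash functions may differ.
static uint32_t mix(uint32_t x, uint32_t seed) {
    x ^= seed;
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

struct CoverageFilter {
    uint8_t bits[FILTER_BYTES] = {0};

    // k = 2: set one bit per hash function.
    void add(uint32_t nodeId) {
        for (uint32_t seed : {0x9E3779B9U, 0x85EBCA6BU}) {
            uint32_t bit = mix(nodeId, seed) % FILTER_BITS;
            bits[bit / 8] |= 1u << (bit % 8);
        }
    }

    // May return true for a node that was never added (false positive),
    // but never returns false for one that was (no false negatives).
    bool possiblyContains(uint32_t nodeId) const {
        for (uint32_t seed : {0x9E3779B9U, 0x85EBCA6BU}) {
            uint32_t bit = mix(nodeId, seed) % FILTER_BITS;
            if ((bits[bit / 8] & (1u << (bit % 8))) == 0)
                return false;
        }
        return true;
    }
};
```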

Why a Bloom Filter?

  • Space Efficiency: Instead of listing 60 node IDs (240 bytes), a 13-byte filter is far smaller.
  • Reduce Redundant Rebroadcasts: We use the filter to decide whether a new hop adds sufficient new coverage to justify forwarding.
  • Acceptable False Positives: The ~37% false-positive rate (worst case, assuming each hop adds 20 recently seen nodes) is fine for our probabilistic approach; it still drastically reduces traffic.

False Positives - A Known Limitation

Bloom filters can give false positives (claiming “node X is in the set” when it actually isn’t), but they never give false negatives (if it says “not in the set,” then it’s definitely not in the set).

For our routing use case, these occasional false positives are acceptable because we only need a rough idea of which nodes are covered to make probabilistic rebroadcast decisions.
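
As a rough sanity check (using the standard Bloom filter approximation, not numbers from the draft itself), the false-positive probability after inserting n node IDs into an m-bit filter with k hash functions is:

```latex
p_{\mathrm{fp}} \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad m = 104,\ k = 2
```

With n = 20 inserted IDs this gives roughly 10%; around n ≈ 50 (several hops each contributing entries) it reaches roughly 38%, which is in line with the ~37% worst case quoted above.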

Impact

  • Fewer Duplicates: Less congestion and lower power consumption. Simulations suggest ~30% fewer transmissions, depending on topology and node density. Results below.

@GUVWAF (Member) commented Dec 30, 2024

First thing: this can only be done for version 3.0, because extending the header would be a breaking change.

Sounds interesting; I’m curious about the results of the simulations. Note that the interactive simulator is hard to use at scale, though, as you’ll be simulating all nodes on one machine.

One thing I would change is the number of neighbors per hop. 20 is really a lot, while it’s much more common that you exceed 3 hops (which is effectively 2 hops, because you’ll need to add the first neighbors while the hop limit is still 3).
Also, while the Bloom filter doesn’t give false negatives, if you can’t hear a node directly it doesn’t necessarily mean it cannot hear you.

I’m also not sure if this overhead (doubling the header size) is worth it if you only base the probability of rebroadcasting on it. With this information you could also implement rebroadcasting until you have no unique receivers left to serve after hearing other nodes’ rebroadcasts.

@medentem (Contributor, Author) commented Dec 30, 2024

First thing: this can only be done for version 3.0, because extending the header would be a breaking change.

Makes sense.

Sounds interesting; I’m curious about the results of the simulations. Note that the interactive simulator is hard to use at scale, though, as you’ll be simulating all nodes on one machine.

For sure. This will need some real-world testing after simulating the effects. By the way, I'm having trouble with the simulator. Is there anything off the top of your head that might need to change given the expanded header? I changed the config value to 32 bytes and looked through the code but nothing jumped out at me. But for some reason the packet is losing the header value when flowing through the simulator.

One thing I would change is the number of neighbors per hop. 20 is really a lot, while it’s much more common that you exceed 3 hops (which is effectively 2 hops, because you’ll need to add the first neighbors while the hop limit is still 3). Also, while the Bloom filter doesn’t give false negatives, if you can’t hear a node directly it doesn’t necessarily mean it cannot hear you.

This will be the interesting part. In an ideal state, all of a node's current neighbors should be in the filter. But the best we can do is approximate current neighbors, because we don't have frequent heartbeats to say "I'm still here". Right now I'm filtering down to nodes heard directly within the last hour, because that is the minimum interval at which a nodeinfo packet goes out. Open to thoughts on this. I don't think we can be sure that 20 is too much. For example, the Chicago mesh is very dense; I think during their scheduled NETs, 20 is low.

Also, we can't pack too many nodes into the fixed-size Bloom filter or the false-positive rate will increase: a node will more often think "my neighbor, node X, was already covered and is in the filter" when it actually is not. A non-zero false-positive rate should be fine, but we wouldn't want it to climb too high.

I’m also not sure if this overhead (doubling the header size) is worth it if you only base the probability of rebroadcasting on it. With this information you could also implement rebroadcasting until you have no unique receivers left to serve after hearing other nodes’ rebroadcasts.

Probabilistic forwarding can be aggressive but still leave room for some level of potential duplication. I view it as being almost inversely proportional to the false-positive rate: for as often as we may accidentally think we cover a node, we allow some potential for rebroadcast. But your idea is super interesting; I'd have to think about that.

@GUVWAF (Member) commented Dec 30, 2024

Is there anything off the top of your head that might need to change given the expanded header? I changed the config value to 32 bytes and looked through the code but nothing jumped out at me. But for some reason the packet is losing the header value when flowing through the simulator.

I also responded to your question here: #5629 (comment)

The header is not sent in "raw" form to the simulator (or any client app); the fields are parsed and then added to the MeshPacket protobuf message. So you either have to add your field to that protobuf message and forward it in the simulator, or, as a workaround, send it with the payload and parse it from there.

@garthvh requested review from thebentern, caveman99 and GUVWAF and removed the request for thebentern and caveman99 on December 31, 2024
Member commented:

This change should not be a part of this feature; it was already closed.

Contributor Author commented:

I will remove. This is not ready to merge right now.

@medentem (Contributor, Author) commented Jan 9, 2025

UPDATED APPROACH

This has been a saga of changes based on excellent input from @GUVWAF. I wanted to capture the latest approach because it has evolved quite a bit since the beginning. Here is how it works:

Packet Header

  • The packet header was expanded by 16 bytes to accommodate two changes:
    • relay_node was expanded from 1 byte to 4 bytes, which ensures that when a packet arrives, you know exactly who relayed it to you. This is critical to a node's knowledge of its immediate neighbors.
    • A 13-byte Bloom filter (coverage_filter) was added; when a node receives a packet, it tells the receiving node "here are the nodes that have already been covered". A schematic of the expanded header follows this list.
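
The schematic below shows the expanded over-the-air header. The surrounding fields follow the existing 16-byte layout as I understand it; names and ordering are illustrative rather than the exact firmware definition:

```cpp
#include <cstdint>

// Proposed 32-byte header: the original 16-byte layout with relay_node
// widened from 1 to 4 bytes and a 13-byte coverage_filter appended.
#pragma pack(push, 1)
struct ProposedPacketHeader {
    uint32_t to;                  // destination node ID
    uint32_t from;                // originating node ID
    uint32_t id;                  // packet ID
    uint8_t  flags;               // hop limit, want-ack, etc.
    uint8_t  channel_hash;
    uint8_t  next_hop;            // unchanged
    uint32_t relay_node;          // widened: full ID of the last transmitter
    uint8_t  coverage_filter[13]; // 104-bit Bloom filter of covered nodes
};
#pragma pack(pop)

static_assert(sizeof(ProposedPacketHeader) == 32,
              "matches the 32-byte header discussed below");
```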

How a Node's Coverage Is Maintained:

  • Every packet that arrives now carries relay_node, which may or may not differ from the from node. The important thing is that relay_node is always a direct neighbor, since it is the actual transmitter of the packet a node receives.
  • These relay node IDs are added to a data structure along with the last time they were heard. This is effectively the node's uncompressed coverage list; a sketch of this structure follows this list.
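
A minimal sketch of that structure, assuming a simple map from node ID to last-heard time; the one-hour window comes from the discussion above, and the names are illustrative:

```cpp
#include <cstdint>
#include <map>

constexpr uint32_t RECENCY_THRESHOLD_SECS = 60 * 60; // e.g. 1 hour, per the discussion

class NeighborCoverage {
    std::map<uint32_t, uint32_t> lastHeard; // node ID -> last-heard timestamp (secs)

  public:
    // relay_node is by definition a direct neighbor: it physically
    // transmitted the packet we just received.
    void onPacketReceived(uint32_t relayNode, uint32_t nowSecs) {
        lastHeard[relayNode] = nowSecs;
    }

    // Drop neighbors we haven't heard within the recency window.
    void pruneStale(uint32_t nowSecs) {
        for (auto it = lastHeard.begin(); it != lastHeard.end();) {
            if (nowSecs - it->second > RECENCY_THRESHOLD_SECS)
                it = lastHeard.erase(it);
            else
                ++it;
        }
    }
};
```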

How a Node's Coverage is Applied:

  • When a packet is received, the node determines how much new coverage it can add by comparing its local coverage knowledge against the Bloom filter (the coverage_filter) in the packet header.
  • Because the mesh is low-bandwidth (periodic automatic packets are infrequent) and coverage knowledge can go stale, particularly in sparse, highly dynamic meshes, we need additional mechanisms to mitigate degrading coverage knowledge. Summary of those mechanisms:
    • When a node has no knowledge of any direct neighbors, it falls back to a high probability of rebroadcasting. This mitigates first startup; when we haven't heard from others in a long time, there is effectively no harm in rebroadcasting.
    • Coverage is calculated as a weighted ratio: the new coverage we offer (when checked against the Bloom filter) divided by the total coverage we're aware of, i.e., our direct neighbors.
      • The weighting is a decay mechanism whereby the longer it's been since we've heard a node, the less it counts toward the new coverage we offer. Effectively, our confidence in it being nearby shrinks.
  • After computing our weighted additional coverage, a scaled probability of rebroadcasting is determined from that additional coverage; one way to formalize this is sketched below.
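
One plausible formalization of the weighted ratio and decay just described (my reading of the description, not the PR's exact formula), with t_i the time neighbor i was last heard, T the recency threshold, S the scale factor, and "new" the neighbors not matched by the packet's coverage_filter:

```latex
w_i = \max\!\left(0,\ 1 - \frac{t_{\mathrm{now}} - t_i}{T}\right),
\qquad
R = \frac{\sum_{i \in \mathrm{new}} w_i}{\sum_{j \in \mathrm{neighbors}} w_j},
\qquad
p = \min\!\left(1,\ S \cdot R\right)
```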

What hasn't changed:

  • All rebroadcasts are still subject to the SNR-based delay mechanism. In other words, retransmissions are still cancelled if a farther node rebroadcasts before our delay expires. The coverage filter is a layer on top of that.

Main Benefits

  • hop_limit is no longer needed with the coverage filter in place, which means the mesh is more adaptable to changes over time. Hops stop when coverage stops.
  • Airtime utilization drops because wasteful rebroadcasts are clamped by coverage.
  • Reachability increases because hops that would otherwise never have happened can now happen without the risk of unnecessary chatter.
  • Device roles may become less important, though more testing is necessary.

@GUVWAF (Member) commented Jan 9, 2025

I think this would majorly improve this routing algorithm because our knowledge of coverage would increase a lot in these other cases. Thoughts?

Yes, I think extending the relay_node to its full ID is something to consider for 3.0. It also has the benefit that you know who sent the “implicit ACK” on a broadcast.

While this might not be so important now that the full relay_node ID is added, I don’t think we can assume that every moving node uses smart position broadcast. Many nodes don’t have GPS or don’t use it because it uses too much power.

Also, have you ever changed the hop limit for MANAGED_FLOOD? I’m wondering whether the low reachability when there are a lot of nodes is just because the hop limit is too low (see also the figure in the README of the simulator). Strategically placed ROUTERs may also help, but it's a bit hard to decide whether it's realistic that people assign these correctly.

Apart from this, I’m a bit skeptical about the hard-coded parameters like COVERAGE_RATIO_SCALE_FACTOR, UNKNOWN_COVERAGE_REBROADCAST_PROBABILITY and RECENCY_THRESHOLD. Are these still valid for different LoRa modem presets, a different PERIOD or PACKET_LENGTH, etc.?

Looking forward to the new results.

@medentem (Contributor, Author) commented

While this might not be so important now that the full relay_node ID is added, I don’t think we can assume that every moving node uses smart position broadcast. Many nodes don’t have GPS or don’t use it because it uses too much power.

I can vary this as well.

Also, have you ever changed the hop limit for MANAGED_FLOOD? I’m wondering whether the low reachability when there are a lot of nodes is just because the hop limit is too low (see also the figure in the README of the simulator). Strategically placed ROUTERs may also help, but it's a bit hard to decide whether it's realistic that people assign these correctly.

No, I have not. I will try with a higher hop limit on managed flooding, but I don't think the purpose of this test is to see whether we can craft a more perfect mesh using hop limits and node roles. The reality is that most meshes lack the organization and knowledge to achieve this.

Apart from this, I’m a bit skeptical about the hard-coded parameters like COVERAGE_RATIO_SCALE_FACTOR, UNKNOWN_COVERAGE_REBROADCAST_PROBABILITY and RECENCY_THRESHOLD.

These are necessary to make testing more convenient, and their values have a material impact on the coverage-based algorithm. Each represents a careful mitigation of an edge case where coverage-based routing is weak due to the low-bandwidth, infrequent updates from neighboring nodes.

The effect of COVERAGE_RATIO_SCALE_FACTOR can be seen in the red linear line in the figure below. It is used to vary how quickly a 100% probability of rebroadcast is reached. For example, if 20% of my neighbors are new for this packet, we want that to trigger a rebroadcast. If the scale factor were 1, only 100% new coverage would guarantee a rebroadcast.

[figure: rebroadcast probability vs. share of new coverage; the red line shows the linear scaling]
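
A sketch of that mapping, using an illustrative scale factor of 5 so that 20% new coverage already guarantees a rebroadcast:

```cpp
#include <algorithm>

constexpr float COVERAGE_RATIO_SCALE_FACTOR = 5.0f; // illustrative value

// Scaled rebroadcast probability: p = min(1, S * R), clamped at certainty.
float rebroadcastProbability(float newCoverageRatio) {
    return std::min(1.0f, COVERAGE_RATIO_SCALE_FACTOR * newCoverageRatio);
}
// rebroadcastProbability(0.20f) == 1.0f  (20% new coverage -> always rebroadcast)
// rebroadcastProbability(0.10f) == 0.5f  (10% new coverage -> coin flip)
```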

UNKNOWN_COVERAGE_REBROADCAST_PROBABILITY is used when a node first boots, or when all of its coverage has fully aged out. Effectively, the node has zero confidence about the other nodes around it, so we prefer that it rebroadcasts while it is, for the time being, an ignorant node.

RECENCY_THRESHOLD is the amount of time a neighboring node remains in a node's coverage list. As a neighbor nears the threshold, its contribution to the probability decays, because our confidence in it still being nearby decays. Again, this is a mitigation tactic based on our imperfect means of tracking coverage.

Are these still valid for different LoRa modem presets, different PERIOD or PACKET_LENGTH, etc.?

I will need to run more tests, but I don't think the LoRa modem presets or packet length should impact this, because all of these parameters deal with coverage, which doesn't change based on those two things. PERIOD may matter more because it is directly tied to establishing coverage knowledge.

@medentem (Contributor, Author) commented Jan 10, 2025

Here is an interesting test that shows the effect of asymmetric links (50 nodes). The percentage of "I don't cover this node, but I think I can" cases is directly tied to the percentage of links that are asymmetric: 8.41 / (8.41 + 13.31) = 38.7%.

UPDATE / CORRECTION: the language in the log output is not right. The actual impact of asymmetric links is that I receive a packet and think I can't reach a node when in fact I could have. That causes a loss of coverage.

[screenshots: simulation log output]

@medentem (Contributor, Author) commented

Sorry for the delay. Simulations are coming, but running them is becoming very time-consuming. I am almost done with some changes to the batch simulator so I can run these tests with a little more scale and less manual work.

@garthvh (Member) commented Jan 11, 2025

It would be good to add more mobile nodes; in many meshes a high (and increasing) percentage of nodes are mobile, without position knowledge.

@medentem (Contributor, Author) commented Jan 13, 2025

So I've been deep in this for 2+ weeks, and I think I need to reset my brain a bit to make sense of the results: so many iterations, so many tests.

What I think I've learned so far is that anything that degrades moment-in-time knowledge of direct neighbors degrades the performance of the coverage router. I am working to find the inflection point where the mesh is so small that the sensitivity to coverage is too high; it seems to be below roughly 35 nodes.

In general, results degrade with each new layer of more "real world" simulation, which is intuitive, because things like mobile nodes without GPS and asymmetric links degrade any given node's ability to sense its direct neighbors accurately over time.

Here are the baseline results comparing managed flood (3 hops, 16-byte header) with the Bloom router (15 hops, 32-byte header), along with a few other tunable constants that affect rebroadcast probability.

NEW IMAGES COMING....

@medentem (Contributor, Author) commented

Well, friends: after a lot of additional testing, I have come to the conclusion that the coverage-based router is not viable. Here are the main reasons:

  1. It is not possible to reliably detect the conditions under which a given node's knowledge of its coverage will degrade.
  2. When coverage knowledge degrades, rebroadcasting decisions are often wrong.
  3. The coverage filter appears to offer roughly the same performance as managed flood routing, without requiring a hop limit. That said, it adds 16 bytes and offers no real net benefit.
  4. At extremely high mesh density (500 nodes), coverage knowledge degrades because too many nodes are added to the compressed data space, which in turn increases the false-positive rate.

In short, there is a narrow sweet spot where this works, but we can't detect when the network falls outside the efficacy range of the implementation.

@GUVWAF - I hope the modifications to the simulator are still useful.

@GUVWAF (Member) commented Jan 20, 2025

That’s really a shame. It looked really promising at some point, but if it turns out not to give significant improvements in realistic scenarios, then indeed I don’t think it’s worth investigating further. In any case, it was good to have considered the option, and I really want to thank you for doing this so extensively.

The additions to the simulator are definitely useful. If you have time to separate those additions from the bloom filter implementation, that would be great, otherwise I will try to do it sometime.

@medentem (Contributor, Author) commented

The additions to the simulator are definitely useful. If you have time to separate those additions from the bloom filter implementation, that would be great, otherwise I will try to do it sometime.

I will do that. Thank you!

@medentem (Contributor, Author) commented

Farewell bloom.

@medentem medentem closed this Jan 20, 2025
@medentem medentem deleted the feature-bloomrouter branch January 20, 2025 21:57