
Draft - Bloom Filter Probabilistic Routing #5697

Closed · wants to merge 77 commits

Conversation

@medentem (Contributor) commented Dec 30, 2024

This is a draft for discussion and review:

Proposal

Expand relay_node to 4 bytes and include a 13-byte (104-bit) Bloom filter in the packet header. Each hop adds up to N uint32_t node IDs. The filter uses 2 hash functions.
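
To make the mechanics concrete, here is a minimal sketch of such a filter in C++. The bit-mixing function and the two seeds are illustrative stand-ins, not the draft's actual hash functions:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t FILTER_BYTES = 13;
constexpr std::size_t FILTER_BITS = FILTER_BYTES * 8; // 104 bits

// Illustrative 32-bit mixer (Murmur3-style finalizer); the draft's
// actual hash functions may differ.
static uint32_t mix(uint32_t x, uint32_t seed) {
    x ^= seed;
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

struct CoverageFilter {
    uint8_t bits[FILTER_BYTES] = {0};

    // k = 2: set one bit per hash function.
    void add(uint32_t nodeId) {
        for (uint32_t seed : {0x9E3779B9U, 0x85EBCA6BU}) {
            uint32_t bit = mix(nodeId, seed) % FILTER_BITS;
            bits[bit / 8] |= 1u << (bit % 8);
        }
    }

    // May return true for a node that was never added (false positive),
    // but never returns false for one that was (no false negatives).
    bool possiblyContains(uint32_t nodeId) const {
        for (uint32_t seed : {0x9E3779B9U, 0x85EBCA6BU}) {
            uint32_t bit = mix(nodeId, seed) % FILTER_BITS;
            if ((bits[bit / 8] & (1u << (bit % 8))) == 0)
                return false;
        }
        return true;
    }
};
```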

Why a Bloom Filter?

  • Space Efficiency: Instead of listing 60 node IDs (240 bytes), a 13-byte filter is far smaller.
  • Reduce Redundant Rebroadcasts: We use the filter to decide whether a new hop adds sufficient new coverage to justify forwarding.
  • Acceptable False Positives: The ~37% false-positive rate (worst case, assuming each hop adds 20 recently seen nodes) is fine for our probabilistic approach; it still drastically reduces traffic.

False Positives - A Known Limitation

Bloom filters can give false positives (claiming “node X is in the set” when it actually isn’t), but they never give false negatives (if it says “not in the set,” then it’s definitely not in the set).

For our routing use case, these occasional false positives are acceptable because we only need a rough idea of which nodes are covered to make probabilistic rebroadcast decisions.
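
As a rough sanity check (using the standard Bloom filter approximation, not numbers from the draft itself), the false-positive probability after inserting n node IDs into an m-bit filter with k hash functions is:

```latex
p_{\mathrm{fp}} \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad m = 104,\ k = 2
```

With n = 20 inserted IDs this gives roughly 10%; around n ≈ 50 (several hops each contributing entries) it reaches roughly 38%, which is in line with the ~37% worst case quoted above.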

Impact

  • Fewer Duplicates: Less congestion and lower power consumption. Simulations suggest ~30% fewer transmissions, depending on topology and node density. Results below.

@GUVWAF (Member) commented Dec 30, 2024

First thing: this can only be done for version 3.0, because extending the header would be a breaking change.

Sounds interesting; I’m curious about the results of the simulations. Note that the interactive simulator is hard to use at scale, though, as you’ll be simulating all nodes on one machine.

One thing I would change is the number of neighbors per hop. 20 is really a lot, while it’s much more common that you exceed 3 hops (which is effectively 2 hops, because you’ll need to add the first neighbors while the hop limit is still 3).
Also, while the Bloom filter doesn’t give false negatives, if you can’t hear a node directly it doesn’t necessarily mean it cannot hear you.

I’m also not sure if this overhead (doubling the header size) is worth it if you only base the probability of rebroadcasting on it. With this information you could also implement rebroadcasting until you have no unique receivers left to serve after hearing other nodes’ rebroadcasts.

@medentem (Contributor, Author) commented Dec 30, 2024

First thing: this can only be done for version 3.0, because extending the header would be a breaking change.

Makes sense.

Sounds interesting; I’m curious about the results of the simulations. Note that the interactive simulator is hard to use at scale, though, as you’ll be simulating all nodes on one machine.

For sure. This will need some real-world testing after simulating the effects. By the way, I'm having trouble with the simulator. Is there anything off the top of your head that might need to change given the expanded header? I changed the config value to 32 bytes and looked through the code but nothing jumped out at me. But for some reason the packet is losing the header value when flowing through the simulator.

One thing I would change is the number of neighbors per hop. 20 is really a lot, while it’s much more common that you exceed 3 hops (which is effectively 2 hops, because you’ll need to add the first neighbors while the hop limit is still 3). Also, while the Bloom filter doesn’t give false negatives, if you can’t hear a node directly it doesn’t necessarily mean it cannot hear you.

This will be the interesting part. In an ideal state, all of a node's current neighbors should be in the filter. But the best we can do is approximate current neighbors, because we don't have frequent heartbeats to say "I'm still here". Right now I'm filtering down to nodes heard directly within the last hour, because that is the minimum interval at which a nodeinfo packet goes out. Open to thoughts on this. I don't think we can be sure that 20 is too much. For example, the Chicago mesh is very dense; I think during their scheduled NETs, 20 is low.

Also, we can't pack too many nodes into the fixed-size Bloom filter or the false-positive rate will increase: a node will more often think "my neighbor, node X, was already covered and is in the filter" when it actually is not. A non-zero false-positive rate should be fine, but we wouldn't want it to climb too high.

I’m also not sure if this overhead (doubling the header size) is worth it if you only base the probability of rebroadcasting on it. With this information you could also implement rebroadcasting until you have no unique receivers left to serve after hearing other nodes’ rebroadcasts.

Probabilistic forwarding can be aggressive but still leave room for some level of potential duplication. I view it as being almost inversely proportional to the false-positive rate: for as often as we may accidentally think we cover a node, we allow some potential for rebroadcast. But your idea is super interesting; I'd have to think about that.

@GUVWAF (Member) commented Dec 30, 2024

Is there anything off the top of your head that might need to change given the expanded header? I changed the config value to 32 bytes and looked through the code but nothing jumped out at me. But for some reason the packet is losing the header value when flowing through the simulator.

I also responded to your question here: #5629 (comment)

The header is not sent in "raw" form to the simulator (or any client app); the fields are parsed and then added to the MeshPacket protobuf message. So you either have to add your field to that protobuf message and forward it in the simulator, or, as a workaround, send it with the payload and parse it from there.

@garthvh requested review from thebentern, caveman99 and GUVWAF and removed the request for thebentern and caveman99 on December 31, 2024
Member commented:

This change should not be a part of this feature; it was already closed.

Contributor Author commented:

I will remove. This is not ready to merge right now.

@medentem (Contributor, Author) commented Jan 9, 2025

UPDATED APPROACH

This has been a saga of changes based on excellent input from @GUVWAF. I wanted to capture the latest approach because it has evolved quite a bit since the beginning. Here is how it works:

Packet Header

  • The packet header was expanded by 16 bytes to accommodate two changes:
    • relay_node was expanded from 1 byte to 4 bytes, which ensures that when a packet arrives, you know exactly who relayed it to you. This is critical to a node's knowledge of its immediate neighbors.
    • A 13-byte Bloom filter (coverage_filter) was added; when a node receives a packet, it tells the receiving node "here are the nodes that have already been covered". A schematic of the expanded header follows this list.
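
The schematic below shows the expanded over-the-air header. The surrounding fields follow the existing 16-byte layout as I understand it; names and ordering are illustrative rather than the exact firmware definition:

```cpp
#include <cstdint>

// Proposed 32-byte header: the original 16-byte layout with relay_node
// widened from 1 to 4 bytes and a 13-byte coverage_filter appended.
#pragma pack(push, 1)
struct ProposedPacketHeader {
    uint32_t to;                  // destination node ID
    uint32_t from;                // originating node ID
    uint32_t id;                  // packet ID
    uint8_t  flags;               // hop limit, want-ack, etc.
    uint8_t  channel_hash;
    uint8_t  next_hop;            // unchanged
    uint32_t relay_node;          // widened: full ID of the last transmitter
    uint8_t  coverage_filter[13]; // 104-bit Bloom filter of covered nodes
};
#pragma pack(pop)

static_assert(sizeof(ProposedPacketHeader) == 32,
              "matches the 32-byte header discussed below");
```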

How a Node's Coverage Is Maintained:

  • Every packet that arrives now carries relay_node, which may or may not differ from the from node. The important thing is that relay_node is always a direct neighbor, since it is the actual transmitter of the packet a node receives.
  • These relay node IDs are added to a data structure along with the last time they were heard. This is effectively the node's uncompressed coverage list; a sketch of this structure follows this list.
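
A minimal sketch of that structure, assuming a simple map from node ID to last-heard time; the one-hour window comes from the discussion above, and the names are illustrative:

```cpp
#include <cstdint>
#include <map>

constexpr uint32_t RECENCY_THRESHOLD_SECS = 60 * 60; // e.g. 1 hour, per the discussion

class NeighborCoverage {
    std::map<uint32_t, uint32_t> lastHeard; // node ID -> last-heard timestamp (secs)

  public:
    // relay_node is by definition a direct neighbor: it physically
    // transmitted the packet we just received.
    void onPacketReceived(uint32_t relayNode, uint32_t nowSecs) {
        lastHeard[relayNode] = nowSecs;
    }

    // Drop neighbors we haven't heard within the recency window.
    void pruneStale(uint32_t nowSecs) {
        for (auto it = lastHeard.begin(); it != lastHeard.end();) {
            if (nowSecs - it->second > RECENCY_THRESHOLD_SECS)
                it = lastHeard.erase(it);
            else
                ++it;
        }
    }
};
```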

How a Node's Coverage is Applied:

  • When a packet is received, the node determines how much new coverage it can add by comparing its local coverage knowledge against the Bloom filter (the coverage_filter) in the packet header.
  • Because the mesh is low-bandwidth (periodic automatic packets are infrequent) and coverage knowledge can go stale, particularly in sparse, highly dynamic meshes, we need additional mechanisms to mitigate degrading coverage knowledge. Summary of those mechanisms:
    • When a node has no knowledge of any direct neighbors, it falls back to a high probability of rebroadcasting. This mitigates first startup; when we haven't heard from others in a long time, there is effectively no harm in rebroadcasting.
    • Coverage is calculated as a weighted ratio: the new coverage we offer (when checked against the Bloom filter) divided by the total coverage we're aware of, i.e., our direct neighbors.
      • The weighting is a decay mechanism whereby the longer it's been since we've heard a node, the less it counts toward the new coverage we offer. Effectively, our confidence in it being nearby shrinks.
  • After computing our weighted additional coverage, a scaled probability of rebroadcasting is determined from that additional coverage; one way to formalize this is sketched below.
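
One plausible formalization of the weighted ratio and decay just described (my reading of the description, not the PR's exact formula), with t_i the time neighbor i was last heard, T the recency threshold, S the scale factor, and "new" the neighbors not matched by the packet's coverage_filter:

```latex
w_i = \max\!\left(0,\ 1 - \frac{t_{\mathrm{now}} - t_i}{T}\right),
\qquad
R = \frac{\sum_{i \in \mathrm{new}} w_i}{\sum_{j \in \mathrm{neighbors}} w_j},
\qquad
p = \min\!\left(1,\ S \cdot R\right)
```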

What hasn't changed:

  • All rebroadcasts are still subject to the SNR-based delay mechanism. In other words, retransmissions are still cancelled if a farther node rebroadcasts before our delay expires. The coverage filter is a layer on top of that.

Main Benefits

  • hop_limit is no longer needed with the coverage filter in place, which means the mesh is more adaptable to changes over time. Hops stop when coverage stops.
  • Airtime utilization drops because wasteful rebroadcasts are clamped by coverage.
  • Reachability increases because hops that would otherwise never have happened can now happen without the risk of unnecessary chatter.
  • Device roles may become less important, though more testing is necessary.

@GUVWAF (Member) commented Jan 9, 2025

I think this would majorly improve this routing algorithm because our knowledge of coverage would increase a lot in these other cases. Thoughts?

Yes, I think extending the relay_node to its full ID is something to consider for 3.0. It also has the benefit that you know who sent the “implicit ACK” on a broadcast.

While this might not be so important now that the full relay_node ID is added, I don’t think we can assume that every moving node uses smart position broadcast. Many nodes don’t have GPS or don’t use it because it uses too much power.

Also, have you ever changed the hop limit for MANAGED_FLOOD? I’m wondering whether the low reachability when there are a lot of nodes is just because the hop limit is too low (see also the figure in the README of the simulator). Strategically placed ROUTERs may also help, but it's a bit hard to decide whether it's realistic that people assign these correctly.

Apart from this, I’m a bit skeptical about the hard-coded parameters like COVERAGE_RATIO_SCALE_FACTOR, UNKNOWN_COVERAGE_REBROADCAST_PROBABILITY and RECENCY_THRESHOLD. Are these still valid for different LoRa modem presets, a different PERIOD or PACKET_LENGTH, etc.?

Looking forward to the new results.

@medentem (Contributor, Author) commented

While this might not be so important now that the full relay_node ID is added, I don’t think we can assume that every moving node uses smart position broadcast. Many nodes don’t have GPS or don’t use it because it uses too much power.

I can vary this as well.

Also, have you ever changed the hop limit for MANAGED_FLOOD? I’m wondering whether the low reachability when there are a lot of nodes is just because the hop limit is too low (see also the figure in the README of the simulator). Strategically placed ROUTERs may also help, but it's a bit hard to decide whether it's realistic that people assign these correctly.

No, I have not. I will try with a higher hop limit on managed flooding, but I don't think the purpose of this test is to see whether we can craft a more perfect mesh using hop limits and node roles. The reality is that most meshes lack the organization and knowledge to achieve this.

Apart from this, I’m a bit skeptical about the hard-coded parameters like COVERAGE_RATIO_SCALE_FACTOR, UNKNOWN_COVERAGE_REBROADCAST_PROBABILITY and RECENCY_THRESHOLD.

These are necessary to make testing more convenient, and their values have a material impact on the coverage-based algorithm. Each represents a careful mitigation of an edge case where coverage-based routing is weak due to the low-bandwidth, infrequent updates from neighboring nodes.

The effect of COVERAGE_RATIO_SCALE_FACTOR can be seen in the red linear line in the figure below. It is used to vary how quickly a 100% probability of rebroadcast is reached. For example, if 20% of my neighbors are new for this packet, we want that to trigger a rebroadcast. If the scale factor were 1, only 100% new coverage would guarantee a rebroadcast.

[figure: rebroadcast probability vs. share of new coverage; the red line shows the linear scaling]
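
A sketch of that mapping, using an illustrative scale factor of 5 so that 20% new coverage already guarantees a rebroadcast:

```cpp
#include <algorithm>

constexpr float COVERAGE_RATIO_SCALE_FACTOR = 5.0f; // illustrative value

// Scaled rebroadcast probability: p = min(1, S * R), clamped at certainty.
float rebroadcastProbability(float newCoverageRatio) {
    return std::min(1.0f, COVERAGE_RATIO_SCALE_FACTOR * newCoverageRatio);
}
// rebroadcastProbability(0.20f) == 1.0f  (20% new coverage -> always rebroadcast)
// rebroadcastProbability(0.10f) == 0.5f  (10% new coverage -> coin flip)
```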

UNKNOWN_COVERAGE_REBROADCAST_PROBABILITY is used when a node first boots, or when all of its coverage has fully aged out. Effectively, the node has zero confidence about the other nodes around it, so we prefer that it rebroadcasts while it is, for the time being, an ignorant node.

RECENCY_THRESHOLD is the amount of time a neighboring node remains in a node's coverage list. As a neighbor nears the threshold, its contribution to the probability decays, because our confidence in it still being nearby decays. Again, this is a mitigation tactic based on our imperfect means of tracking coverage.

Are these still valid for different LoRa modem presets, different PERIOD or PACKET_LENGTH, etc.?

I will need to run more tests, but I don't think the LoRa modem presets or packet length should impact this, because all of these parameters deal with coverage, which doesn't change based on those two things. PERIOD may matter more because it is directly tied to establishing coverage knowledge.

@medentem (Contributor, Author) commented Jan 10, 2025

Here is an interesting test that shows the effect of asymmetric links (50 nodes). The percentage of "I don't cover this node, but I think I can" cases is directly tied to the percentage of links that are asymmetric: 8.41 / (8.41 + 13.31) = 38.7%.

UPDATE / CORRECTION: the language in the log output is not right. The actual impact of asymmetric links is that I receive a packet and think I can't reach a node when in fact I could have. That causes a loss of coverage.

[screenshots: simulation log output]

@medentem (Contributor, Author) commented

Sorry for the delay. Simulations are coming, but running them is becoming very time-consuming. I am almost done with some changes to the batch simulator so I can run these tests with a little more scale and less manual work.

@garthvh (Member) commented Jan 11, 2025

It would be good to add more mobile nodes; in many meshes a high (and increasing) percentage of nodes are mobile, without position knowledge.

@medentem (Contributor, Author) commented Jan 13, 2025

So I've been deep in this for 2+ weeks, and I think I need to reset my brain a bit to make sense of the results: so many iterations, so many tests.

What I think I've learned so far is that anything that degrades moment-in-time knowledge of direct neighbors degrades the performance of the coverage router. I am working to find the inflection point where the mesh is so small that the sensitivity to coverage is too high; it seems to be below roughly 35 nodes.

In general, results degrade with each new layer of more "real world" simulation, which is intuitive, because things like mobile nodes without GPS and asymmetric links degrade any given node's ability to sense its direct neighbors accurately over time.

Here are the baseline results comparing managed flood (3 hops, 16-byte header) with the Bloom router (15 hops, 32-byte header), along with a few other tunable constants that affect rebroadcast probability.

NEW IMAGES COMING....

@medentem (Contributor, Author) commented

Well, friends: after a lot of additional testing, I have come to the conclusion that the coverage-based router is not viable. Here are the main reasons:

  1. It is not possible to reliably detect the conditions under which a given node's knowledge of its coverage will degrade.
  2. When coverage knowledge degrades, rebroadcasting decisions are often wrong.
  3. The coverage filter appears to offer roughly the same performance as managed flood routing, without requiring a hop limit. That said, it adds 16 bytes and offers no real net benefit.
  4. At extremely high mesh density (500 nodes), coverage knowledge degrades because too many nodes are added to the compressed data space, which in turn increases the false-positive rate.

In short, there is a narrow sweet spot where this works, but we can't detect when the network falls outside the efficacy range of the implementation.

@GUVWAF - I hope the modifications to the simulator are still useful.

@GUVWAF (Member) commented Jan 20, 2025

That’s really a shame. It looked really promising at some point, but if it turns out not to give significant improvements in realistic scenarios, then indeed I don’t think it’s worth investigating further. In any case, it was good to have considered the option, and I really want to thank you for doing this so extensively.

The additions to the simulator are definitely useful. If you have time to separate those additions from the bloom filter implementation, that would be great, otherwise I will try to do it sometime.

@medentem (Contributor, Author) commented

The additions to the simulator are definitely useful. If you have time to separate those additions from the bloom filter implementation, that would be great, otherwise I will try to do it sometime.

I will do that. Thank you!

@medentem (Contributor, Author) commented

Farewell bloom.

@medentem medentem closed this Jan 20, 2025
@medentem medentem deleted the feature-bloomrouter branch January 20, 2025 21:57