Dealing with the scale of Lens #24

il3ven · 2023-05-03T11:05:39Z

il3ven
May 3, 2023
Maintainer

The scale of the Lens protocol is much greater than what we have seen with existing protocols such as Sound.

My proposed approach for crawling

We can listen for the PostCreated event. The event includes a content URI, and fetching that URI returns the post's metadata. The metadata indicates whether the post is audio or not, so we have to download the metadata for every post.

Downloading and checking the metadata for every post will test our ability to scale.
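
As a rough illustration of the per-post check described above, here is a minimal sketch. It assumes the metadata JSON exposes a `mainContentFocus` string and a `media` list whose entries carry a MIME `type`; those field names, the helper names, and the default gateway URLs are assumptions for illustration and should be checked against the real metadata before use.

```python
from typing import Any, Mapping


def to_gateway_url(
    content_uri: str,
    ipfs_gateway: str = "https://ipfs.io/ipfs/",    # placeholder default
    arweave_gateway: str = "https://arweave.net/",  # placeholder default
) -> str:
    """Turn an ipfs:// or ar:// content URI into an HTTP URL for a gateway."""
    if content_uri.startswith("ipfs://"):
        return ipfs_gateway + content_uri[len("ipfs://"):]
    if content_uri.startswith("ar://"):
        return arweave_gateway + content_uri[len("ar://"):]
    return content_uri  # already an HTTP(S) URL, or an unknown scheme


def is_audio_post(metadata: Mapping[str, Any]) -> bool:
    """Return True if the (assumed) metadata fields mark the post as audio."""
    # Assumed field: a content-focus marker such as "AUDIO".
    if str(metadata.get("mainContentFocus", "")).upper() == "AUDIO":
        return True
    # Assumed field: a list of media entries, each with a MIME type.
    for item in metadata.get("media") or []:
        if isinstance(item, Mapping):
            if str(item.get("type", "")).lower().startswith("audio/"):
                return True
    return False
```

The download step is left out on purpose: the point is that every post costs one gateway request before `is_audio_post` can even run, which is where the scaling pressure comes from.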

I haven't checked whether the Lens API provides a better way to sort through posts. I have only looked at on-chain events, as they are preferable to centralized APIs.

Potential solutions to reduce information

@neatonk suggested the following:

I wonder if one of the module addresses in that event would be relevant. E.g. this address is used for songs posted via riffapp and is likely music.

Failing that, it might be reasonable to initially limit the crawl to profile IDs of known musicians. Then address scaling challenges and work out how to detect posts containing music.
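
Both suggestions amount to a cheap pre-filter applied to events before any metadata is downloaded. A minimal sketch, assuming each decoded event is a dict with `profileId`, `collectModule` and `referenceModule` keys; the function name and the allowlist inputs are hypothetical:

```python
from typing import AbstractSet, Any, Iterable, Iterator, Mapping


def prefilter_events(
    events: Iterable[Mapping[str, Any]],
    known_profile_ids: AbstractSet[str] = frozenset(),
    music_modules: AbstractSet[str] = frozenset(),
) -> Iterator[Mapping[str, Any]]:
    """Yield only events from allowlisted profiles or music-related modules."""
    # Addresses are compared case-insensitively.
    modules = {address.lower() for address in music_modules}
    for event in events:
        if str(event.get("profileId")) in known_profile_ids:
            yield event
            continue
        used = {
            str(event.get("collectModule", "")).lower(),
            str(event.get("referenceModule", "")).lower(),
        }
        if used & modules:
            yield event
```

Only events that pass this filter would go on to the metadata download, so the allowlists directly bound the number of gateway requests.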

il3ven · 2023-05-03T11:12:15Z

il3ven
May 3, 2023
Maintainer Author

I wonder if one of the module addresses in that event would be relevant. E.g. this address is used for songs posted via riffapp and is likely music.

We have two fields in the event that could potentially be used for this: collectModule and referenceModule. As far as I am aware, we can provide our own modules when creating a post, but usually a default module created by Lens is used. So this option is not very promising.

Failing that, it might be reasonable to initially limit the crawl to profile IDs of known musicians. Then address scaling challenges and work out how to detect posts containing music.

Yes, this can be done.

My initial thought is to crawl for all of Lens and see the performance. If we hit significant problems then we can limit to only known musicians.

CC: @Watcher-eth

0 replies

neatonk · 2023-05-03T18:31:55Z

neatonk
May 3, 2023
Maintainer

My initial thought is to crawl for all of Lens and see the performance. If we hit significant problems then we can limit to only known musicians.

Good call. No need to fix it if it isn't broken.

0 replies

il3ven · 2023-05-12T10:42:22Z

il3ven
May 12, 2023
Maintainer Author

I have started stress testing the rough Lens implementation I have written. In around 5 hours, the Lens protocol sees 7000 posts. I feel that our crawler is robust now that we have concurrency and retry functionality in extraction-worker. So, given enough time, we should be able to crawl Lens completely.

Currently, it is taking 4 minutes to crawl 7000 posts. Therefore, it will take 12 mins to crawl all posts in a day and 6 hours to crawl all posts created in a month.

This is slow. If someone starts their crawl from scratch, it will take 4-6 days, assuming we have to crawl a year's worth of posts. It is also worth noting that anyone who uses neither their own Arweave/IPFS gateway nor a public gateway will incur significant costs.
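
A back-of-the-envelope check of these estimates, assuming the two measured rates (7000 posts per 5 hours, 4 minutes to crawl 7000 posts) stay constant. The 30-day month and 365-day year are assumptions made for the calculation:

```python
POSTS_PER_WINDOW = 7000        # posts observed on Lens ...
WINDOW_HOURS = 5               # ... in roughly this many hours
CRAWL_MINUTES_PER_WINDOW = 4   # time taken to crawl that many posts


def crawl_minutes(days: float) -> float:
    """Minutes needed to crawl all posts created over `days` days."""
    posts = POSTS_PER_WINDOW * (24 / WINDOW_HOURS) * days
    return posts / POSTS_PER_WINDOW * CRAWL_MINUTES_PER_WINDOW


per_day_minutes = crawl_minutes(1)             # about 19.2 minutes
per_month_hours = crawl_minutes(30) / 60       # about 9.6 hours
per_year_days = crawl_minutes(365) / 60 / 24   # about 4.9 days
```

The result is sensitive to the assumed posting rate. With these inputs the yearly figure falls inside the 4-6 day range, while a 12-minute day would correspond to roughly 21,000 posts per day rather than the 33,600 implied by 7000 posts every 5 hours.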

Right now, I am leaning towards the approach @neatonk proposed, which is to limit the crawl to known musicians. What do you all think? (@djfnd)

14 replies

neatonk May 16, 2023
Maintainer

This might be what I had in mind. It's a bit weird because the results are sorted by LATEST. Probably good enough to run manually with a high limit and dump uniq ids to a file.

curl 'https://api.lens.dev/' \
  -H 'Accept-Encoding: gzip, deflate, br' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H 'Connection: keep-alive' \
  -H 'DNT: 1' \
  -H 'Origin: https://api.lens.dev' \
  --data-binary '{"query":"query Publications {\n  explorePublications(request: {\n    publicationTypes: [POST],\n    sortCriteria: LATEST,\n    limit: 10,\n    sources: [\"beats\", \"riff\"],\n  }) {\n    items {\n      ... on Post {\n        profile {\n          id\n        }\n      }\n    }\n    pageInfo {\n      prev\n      next\n      totalCount\n    }\n  }\n}"}' \
  --compressed
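
For scripted runs, the same request body can be built programmatically. This is a sketch only: the function name is hypothetical, and the `cursor` request field and the `createdAt` selection are assumptions that are not part of the curl command above.

```python
import json
from typing import Iterable, Optional


def build_explore_payload(
    sources: Iterable[str],
    limit: int = 10,
    cursor: Optional[str] = None,
) -> str:
    """Return the JSON body for an explorePublications query as a string."""
    request_fields = [
        "publicationTypes: [POST]",
        "sortCriteria: LATEST",
        f"limit: {int(limit)}",
        # json.dumps renders a list of strings as ["beats", "riff"],
        # which is also a valid GraphQL list literal.
        f"sources: {json.dumps(list(sources))}",
    ]
    if cursor is not None:
        # Assumed field name for paging through results.
        request_fields.append(f"cursor: {json.dumps(cursor)}")
    query = (
        "query Publications {\n"
        "  explorePublications(request: { " + ", ".join(request_fields) + " }) {\n"
        "    items { ... on Post { createdAt profile { id } } }\n"
        "    pageInfo { prev next totalCount }\n"
        "  }\n"
        "}"
    )
    return json.dumps({"query": query})
```

The returned string would be sent as the POST body with `Content-Type: application/json`, as in the curl command.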

You might have more luck with the BigQuery data set for this.

il3ven May 16, 2023
Maintainer Author

Do you think it would help to include a startBlock with each profileID to limit the impact of each recrawl?

I don't think this will help. Nothing stops a user from providing a startBlock of zero for a new profile ID and recrawling.

il3ven May 16, 2023
Maintainer Author

Probably good enough to run manually with a high limit and dump uniq ids to a file.

This query can help us get the initial profile IDs but how to get new profile IDs?

neatonk May 18, 2023
Maintainer

Probably good enough to run manually with a high limit and dump uniq ids to a file.

This query can help us get the initial profile IDs but how to get new profile IDs?

That query fetches the latest N posts. The first time you run it, you can set a high limit or use the cursor to page through all of the results. Record the createdAt date of the latest result. On subsequent runs, you can use a lower limit and page through the results until you reach a post with a createdAt date earlier than the one you recorded.
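
The paging logic described above might look roughly like this. It is an untested-against-the-API sketch: `fetch_page` is a hypothetical callable that returns one page of posts (newest first) plus the next cursor, and `createdAt` values are assumed to be ISO-8601 UTC strings in a single format, so that string comparison matches chronological order. The recorded post itself is treated as already seen.

```python
from typing import Callable, Dict, List, Optional, Tuple

Post = Dict[str, str]
Page = Tuple[List[Post], Optional[str]]  # (items, next_cursor)


def fetch_new_posts(
    fetch_page: Callable[[Optional[str]], Page],
    last_seen: Optional[str],
) -> Tuple[List[Post], Optional[str]]:
    """Collect posts newer than `last_seen`; return them with the new watermark."""
    new_posts: List[Post] = []
    cursor: Optional[str] = None
    done = False
    while not done:
        items, next_cursor = fetch_page(cursor)
        for post in items:
            if last_seen is not None and post["createdAt"] <= last_seen:
                done = True  # reached posts recorded on a previous run
                break
            new_posts.append(post)
        if not items or next_cursor is None:
            done = True  # no more pages to request
        cursor = next_cursor
    # Results are newest first, so the first collected post is the newest.
    watermark = new_posts[0]["createdAt"] if new_posts else last_seen
    return new_posts, watermark
```

On the first run `last_seen` is `None`, so every page is read; later runs stop at the first post that is not newer than the stored watermark.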

Should work, but I have not tried.

il3ven May 18, 2023
Maintainer Author

Yes, should work. With your method we are getting the posts anyway, so we might as well rely on the API completely instead of getting only the profile IDs.

Watcher-eth · 2023-05-12T19:52:43Z

Watcher-eth
May 12, 2023

Hey guys, great discussion. I agree, there is no need to crawl all of Lens; at the speed things are growing, this is going to get costly very fast! You can actually even index by app, as the API and schemas have a source or appID resolver that lets you filter by app. In our case it would be "beats", and our new one is "riff".

Any news on the feed strategy? Getting music chronologically based on an array of wallet addresses. Will talk with Dan later this week too.

Hope everyone is doing good. Let me know if I can be of any assistance.

Watcher

…

On Fri 12. May 2023 at 05:37, Dan Fowler wrote: That sounds like a good idea to me. I will spin up a thread on how to use the money when we have a clear idea about exactly what it'll be, but I reckon it'll be of the order of $2.5k. That said, as this is a Lens-specific thing, I think there's also good rationale to put some of next month's tranche of the Lens grant towards it.

1 reply

il3ven May 15, 2023
Maintainer Author

Using the Lens API, how can we get new posts? Suppose we run the API once to get all the posts by the Riff app. Can we then ask the API for only new posts? Or will we have to get all the posts again and save only the new ones? In the Get Publications endpoint I don't see an option to filter posts by block number or timestamp.

djfnd · 2023-05-16T09:49:53Z

djfnd
May 16, 2023
Maintainer

FYI, I have made a post in the Lens dev Telegram group and linked to this discussion to seek feedback and input.

1 reply

djfnd May 16, 2023
Maintainer

Some input from Stani - "specific app IDs, i.e. Riff produces a lot of music, but other fields of Metadata for example what is the content focus might be helpful and of course the file format type as audio or video (in case of Lenstube for example). Happy to see that you are pulling directly from on-chain!"

cc: @il3ven @neatonk
