Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

help request: traffic-split plugin occurs error #9831

Closed
jujiale opened this issue Jul 13, 2023 · 21 comments · Fixed by #11932
Closed

help request: traffic-split plugin occurs error #9831

jujiale opened this issue Jul 13, 2023 · 21 comments · Fixed by #11932

Comments

@jujiale
Copy link
Contributor

jujiale commented Jul 13, 2023

Description

hello.
as this issue methioned: #5607

I also suffered such a situation. but I can not reproduce it. the error log is :
`

/traffic-split.lua:209: attempt to index field 'upstream' (a string value)

`

according the source code. when the traffic-split plugin config "weighted_upstreams" . it's element has no property "upstream". but have property "upstream_id", it should not invode the follwing code (line 209, because elseif upstream_obj.upstream then return false )
image

so I confused why line 209 is invoked.

traffic_split config is below:
`

{
"traffic-split": {
	"disable": false,
	"rules": [{
		"weighted_upstreams": [{
			"weight": 0
		}, {
			"upstream_id": "466807194923303750",
			"weight": 100
		}, {
			"upstream_id": "466807194721978186",
			"weight": 0
		}]
	}]
}

}

`

Environment

  • APISIX version (run apisix version): v2.12.0
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):3.5.0
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@jujiale
Copy link
Contributor Author

jujiale commented Jul 13, 2023

what I concern is in the following code: traffic-split.lua

code A:
`

      local function new_rr_obj(weighted_upstreams)
      local server_list = {}
      for i, upstream_obj in ipairs(weighted_upstreams) do
          if upstream_obj.upstream_id then
              server_list[upstream_obj.upstream_id] = upstream_obj.weight
          elseif upstream_obj.upstream then
              -- Add a virtual id field to uniquely identify the upstream key.
              upstream_obj.upstream.vid = i
              -- Get the table id of the nodes as part of the upstream_key,
              -- avoid upstream_key duplicate because vid is the same in the loop
              -- when multiple rules with multiple weighted_upstreams under each rule.
              -- see https://github.com/apache/apisix/issues/5276
              local node_tid = tostring(upstream_obj.upstream.nodes):sub(#"table: " + 1)
              upstream_obj.upstream.node_tid = node_tid
              server_list[upstream_obj.upstream] = upstream_obj.weight
          else
              -- If the upstream object has only the weight value, it means
              -- that the upstream weight value on the default route has been reached.
              -- Mark empty upstream services in the plugin.
              upstream_obj.upstream = "plugin#upstream#is#empty"
              server_list[upstream_obj.upstream] = upstream_obj.weight
  
          end
      end
  
      return roundrobin:new(server_list)
  end

`

it also have a lru cache
codeB:
`

 local rr_up, err = lrucache(weighted_upstreams, nil, new_rr_obj, weighted_upstreams)

`

when traffic-split config is the following:
`

          {
          "traffic-split": {
	          "disable": false,
	          "rules": [{
		          "weighted_upstreams": [{
			          "weight": 0
		          }, {
			          "upstream_id": "466807194923303750",
			          "weight": 100
		          }, {
			          "upstream_id": "466807194721978186",
			          "weight": 0
		          }]
	          }]
          }
      }

`

if the request hit the following config:
`

     {
			          "weight": 0
		          }

because upstream_obj.upstream is nil .so it enters:

        upstream_obj.upstream = "plugin#upstream#is#empty"
        server_list[upstream_obj.upstream] = upstream_obj.weight

`

so this time. the upstream_obj has upstream_obj.upstream. it is "plugin#upstream#is#empty". and it's type is string.

so next time. when another request comes. the lrucache also exists. so it hit :
`

 elseif upstream_obj.upstream then
        -- Add a virtual id field to uniquely identify the upstream key.
        upstream_obj.upstream.vid = i

`
because the upstream_obj.upstream is "plugin#upstream#is#empty". so error: /traffic-split.lua:209: attempt to index field 'upstream' (a string value) occurs.

I don't know if it is true, so please evalute it. thanks a lot.

@jujiale jujiale changed the title help request: traffic-split plugin ocuurs error help request: traffic-split plugin occurs error Jul 13, 2023
@Revolyssup
Copy link
Contributor

@lingsamuel Please assign this to me.

@Revolyssup
Copy link
Contributor

Revolyssup commented Jul 14, 2023

@jujiale I am investigating this. Although what you said made sense at first but the weighted_upstreams table is created on each request so even if it is manipulated, on the next request we will have the weighted_upstreams table exactly the same as provided in your config. So it cannot be a caching issue.

@Revolyssup
Copy link
Contributor

@jujiale I am not able repro this with your configuration. Can you send the exact route configuration present in running APISIX on the route where you get this error?

@jujiale
Copy link
Contributor Author

jujiale commented Jul 17, 2023

@jujiale I am not able repro this with your configuration. Can you send the exact route configuration present in running APISIX on the route where you get this error?

thanks a lot. I will try my best to reproduce it. once get it, I will present it.

@jujiale
Copy link
Contributor Author

jujiale commented Jul 17, 2023

@jujiale I am investigating this. Although what you said made sense at first but the weighted_upstreams table is created on each request so even if it is manipulated, on the next request we will have the weighted_upstreams table exactly the same as provided in your config. So it cannot be a caching issue.

@Revolyssup hello, as you said. though the weighted_upstreams table is created on each request, why apisix use lrucache to cache it. the lrucache in traffic-split plugin, it's cache "key" is table weighted_upstreams,though each request has its own weighted_upstreams, why should we cache it.

hope for your reply, thanks

@Revolyssup
Copy link
Contributor

@jujiale The caching is done on the whole plugin config and not just the weighted upstreams field, so that if the config passed is the same then we can use the result in the cache.

@jujiale
Copy link
Contributor Author

jujiale commented Jul 17, 2023

@Revolyssup this problem occurs in our prod env, and it is not occurs always. after several days. it will have some error like this. I try my best to reproduce it in my local env. but failed. I will do some research on this. it seems this could occurs in some occasional situation. just as I methioned in #5607 , there is other man who also suffered such problem.

@leslie-tsang leslie-tsang moved this to 🏗 In progress in Apache APISIX backlog Jul 20, 2023
@leslie-tsang leslie-tsang added the bug Something isn't working label Jul 20, 2023
@Sn0rt
Copy link
Contributor

Sn0rt commented Jul 20, 2023

@Revolyssup this problem occurs in our prod env, and it is not occurs always. after several days. it will have some error like this. I try my best to reproduce it in my local env. but failed. I will do some research on this. it seems this could occurs in some occasional situation. just as I methioned in #5607 , there is other man who also suffered such problem.

what you want to do ?

@moonming moonming added can't reproduce and removed bug Something isn't working labels Jul 26, 2023
@moonming moonming moved this from 🏗 In progress to 📋 Backlog in Apache APISIX backlog Jul 26, 2023
@jujiale
Copy link
Contributor Author

jujiale commented Jul 28, 2023

currently,we guess perhaps it is the apisix and etcd , sync data occurs error, because we have a lot of configs in etcd. and there is occurs some error like below:
image

@jujiale
Copy link
Contributor Author

jujiale commented Jul 28, 2023

when the error occurs:
/traffic-split.lua:209: attempt to index field 'upstream' (a string value)

if we reput the config in etcd again(config is not changed, such as just replace A with A) through apisix-dashboard, the error immediately vanish, and everything is ok, but after several time, the error occurs yet.

@Revolyssup
Copy link
Contributor

@jujiale Before reputting the config in etcd, can you check what is the config value for traffic split plugin in etcd at the time of error?

@Revolyssup Revolyssup added the wait for update wait for the author's response in this issue/PR label Aug 9, 2023
@Revolyssup Revolyssup moved this from 📋 Backlog to 🏗 In progress in Apache APISIX backlog Aug 9, 2023
@Revolyssup
Copy link
Contributor

Though I created a PR here #9995 to make sure that upstream is an object for the logic to go there, it still needs to be debugged on etcd's end on why and how the config is being manipulated.

@Revolyssup Revolyssup moved this from 🏗 In progress to 👀 In review in Apache APISIX backlog Aug 10, 2023
@Revolyssup
Copy link
Contributor

Summarising the issue and fix:
The upstream field is of type object so if normally user enters a string, it will give error on schema validation and will not be stored in etcd. But if, after storing in etcd, the config is manipulated such that upstream is set as string then this will give error here[1] as the code tries to access a field on type "string". To avoid this scenario, the above mentioned PR adds a check which makes sure that the field is accessed only if upstream is an object(table in lua).

  1. upstream_obj.upstream.vid = i

@jujiale
Copy link
Contributor Author

jujiale commented Aug 14, 2023

@jujiale Before reputting the config in etcd, can you check what is the config value for traffic split plugin in etcd at the time of error?

no, because I donnot have the prod env permission. in my dev env, I cannot reproduce it. so I also add some logic in order the upstream is obj , is not string

@github-actions github-actions bot added user responded and removed wait for update wait for the author's response in this issue/PR labels Aug 14, 2023
@monkeyDluffy6017
Copy link
Contributor

@jujiale Did you set the upstream via the dashboard on the prod env?

@jujiale
Copy link
Contributor Author

jujiale commented Aug 22, 2023

@jujiale Did you set the upstream via the dashboard on the prod env?

as we know, dashboard is composed of dashboard-portal and dashboard-backend-api, we use our own portal and dashboard-api set the config.

@monkeyDluffy6017
Copy link
Contributor

Did you add the schema check for the property upstream in your dashboard?
https://github.com/apache/apisix/blob/master/apisix/plugins/traffic-split.lua#L56
As you see, it cannot be a string

@monkeyDluffy6017 monkeyDluffy6017 added wait for update wait for the author's response in this issue/PR and removed user responded labels Aug 22, 2023
@jujiale
Copy link
Contributor Author

jujiale commented Aug 22, 2023

Did you add the schema check for the property upstream in your dashboard? https://github.com/apache/apisix/blob/master/apisix/plugins/traffic-split.lua#L56 As you see, it cannot be a string

if this problem reason is your thought. I think it should be occur frequently, but when we reput the config, the problem is vanish, after several days or sever week, it suddenly occurs.

@github-actions github-actions bot added user responded and removed wait for update wait for the author's response in this issue/PR labels Aug 22, 2023
@Revolyssup Revolyssup moved this from 👀 In review to 📋 Backlog in Apache APISIX backlog Aug 25, 2023
@shreemaan-abhishek
Copy link
Contributor

I will close this issue now considering that this is an issue related to data corruption in etcd. We can reopen this is other users experience the same issue.

@nic-chen
Copy link
Member

fixed by #11932

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment