A Program Misreading the CPU Count Inside Docker

We had a new project built with the Egg.js framework. It tested fine locally, but after we deployed it to production (2 Docker instances), the app came up in one container and failed to start in the other.

Checking the logs, we saw this error in the console:

2019-08-26 19:50:30,248 ERROR 13111 nodejs.AppWorkerDiedError: [master] app_worker#33:13466 died (code: null, signal: SIGKILL, suicide: false, state: dead), current workers: ["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","34","35","36","37","38","39","40"]
at Master.onAppExit (/var/www/bigscreen-quote/server/node_modules/egg-cluster/lib/master.js:426:21)
at emitOne (events.js:116:13)
at Master.emit (events.js:211:7)
at Messenger.sendToMaster (/var/www/bigscreen-quote/server/node_modules/egg-cluster/lib/utils/messenger.js:137:17)
at Messenger.send (/var/www/bigscreen-quote/server/node_modules/egg-cluster/lib/utils/messenger.js:102:12)
at EventEmitter.cluster.on (/var/www/bigscreen-quote/server/node_modules/egg-cluster/lib/master.js:295:22)
at emitThree (events.js:141:20)
at EventEmitter.emit (events.js:217:7)
at ChildProcess.worker.process.once (internal/cluster/master.js:185:13)
at Object.onceWrapper (events.js:317:30)
name: "AppWorkerDiedError"
pid: 13111
hostname: iwc-datav-bigscreen-quote-5fcdbf76d4-bhnj7

After some investigation: the app uses Redis, and of the three Redis IP:port pairs, two could not be reached via telnet. We suspected those two endpoints were behind an access restriction, so we emailed ops and asked them to open access:

web:
Hi, we have a project (XX) that needs to use a Redis cluster. Please open access from our Docker instances to the Redis cluster. Thanks.

Docker IPs:
10.201.73.114, 10.201.73.115, 10.201.73.116

Cluster endpoints we need to reach:
xx.xx.xx.xx:7007
xx.xx.xx.xx:7006
xx.xx.xx.xx:7008

Estimated amount of data to store:
under 1 MB


Many thanks.

Ops then replied that Redis had no access restrictions at all; everything was open to the internal network.

We asked ops to take another look, and it turned out that we had swapped the ports of two of the IPs in our own configuration.
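The post doesn't show the actual configuration; assuming the egg-redis plugin, a cluster is typically declared along these lines (hosts and ports below are placeholders, not the real endpoints), and a swapped host/port pair still looks perfectly valid until the client tries to connect:

// config/config.default.ts -- illustrative sketch only; hosts and ports are placeholders
export default {
  redis: {
    client: {
      cluster: true,
      nodes: [
        // Swapping the ports of two entries still parses fine;
        // the mistake only shows up when the client tries to connect.
        { host: 'xx.xx.xx.xx', port: 7006 },
        { host: 'xx.xx.xx.xx', port: 7007 },
        { host: 'xx.xx.xx.xx', port: 7008 },
      ],
    },
  },
};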

After the Redis issue was fixed, the app still would not start, and it kept reporting the same error.

Searching on Google, we found reports that this error can be caused by starting too many worker processes. We suspected that when the program auto-detects the number of CPUs, it reads the CPU count of the Docker host instead of the container's limit, spawns far too many workers, runs short of CPU, and therefore never comes up.
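Node itself gives no hint of the container limit here: os.cpus() reports the host's processors even inside a CPU-constrained container. A minimal check you can run in the container, assuming cgroup v1 (under cgroup v2 the quota lives in /sys/fs/cgroup/cpu.max instead):

import * as fs from 'fs';
import * as os from 'os';

// os.cpus() reflects the Docker host, not the container's CPU quota.
console.log('os.cpus().length =', os.cpus().length);

// Under cgroup v1, the container's effective CPU limit is quota / period.
function cgroupCpuLimit(): number | null {
  try {
    const quota = Number(fs.readFileSync('/sys/fs/cgroup/cpu/cpu.cfs_quota_us', 'utf8'));
    const period = Number(fs.readFileSync('/sys/fs/cgroup/cpu/cpu.cfs_period_us', 'utf8'));
    return quota > 0 && period > 0 ? quota / period : null; // a quota of -1 means "no limit"
  } catch {
    return null; // files missing: not a cgroup v1 container, or no CPU limit applied
  }
}

console.log('container CPU limit (cgroup v1) =', cgroupCpuLimit());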

We checked the other Docker instance, the one that did start, and found it was running more than 20 worker processes! So we explicitly set the worker count in the start command:

npm run tsc &&  egg-scripts start --daemon --title=bigscreen-quote --env=${THS_TIER} --workers=2
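The flag matters because egg-cluster only falls back to the CPU count when no worker count is given. Roughly (a paraphrase for illustration, not the exact egg-cluster source):

import * as os from 'os';

function resolveWorkerCount(workersOption?: number | string): number {
  // An explicit --workers value wins; otherwise one app worker per CPU
  // reported by os.cpus(), which inside Docker is the host's CPU count,
  // not the container's quota.
  return Number(workersOption) || os.cpus().length;
}

console.log(resolveWorkerCount(2));         // 2, when --workers=2 is passed
console.log(resolveWorkerCount(undefined)); // host CPU count: the bad default inside a container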

One embarrassing detail: I first mistyped --workers as --worker, so the limit had no effect after I added it, and for a while I thought the option simply didn't work.

After adding the flag and restarting, the worker count was back to normal, and the Docker instance that previously failed to start now runs fine.

With that, we had finally found the root cause.