VideoCache

RPC server freeze

by imriz on 14 Dec 2008

Hi,

Sometimes the RPC server stalls/freezes. During this time, the processes running with squid_part cannot query the server.

Strace looks something like this (notice the 8 seconds delay at the recvfrom):

Process 4515 attached with 4 threads - interrupt to quit
[pid  4515]      0.000000 recvfrom(7,  <unfinished ...>
[pid  4507]      0.000041 connect(8, {sa_family=AF_INET, sin_port=htons(9100), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
[pid  4506]      0.000047 recvfrom(6,  <unfinished ...>
[pid  4496]      0.000035 futex(0x117bc760, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid  4506]      8.027006 <... recvfrom resumed> "POST /RPC2 HTTP/1.0\\r\\nHost: 127.0"..., 8192, 0, NULL, NULL) = 146
[pid  4506]      0.000113 futex(0x1168ce70, FUTEX_WAKE, 1) = 0

27 Answers

by Kulbir Saini on 15 Dec 2008

Thanks Imriz for providing the trace. Right now I am busy with some other work. Will have a look at it soon (within a week).

Thanks for providing wonderful feedback :)

by imriz on 15 Dec 2008

No problems :)

I think that it might be a good idea to use forking instead of threads - It seems that this issue might be related to threads sync.

by Kulbir Saini on 16 Dec 2008

Previously I was forking the XMLRPC server, then switched to threading due to process sync problem. Anyways I'll retry it. Thanks again :)

by imriz on 18 Feb 2009

Hi,

We're still having these issues, even after upgrading to the latest version today...

Any ideas?

by Kulbir Saini on 18 Feb 2009

Imriz,

Well, not much can be done at programming level in this case, but at system level you can get rid of this problem very easily. Set the TCP timeout to a lower value so that sockets are freed early and your system doesnt run out of sockets.

Use the following command (don't use if you don't understand the impact)

echo 20 > /proc/sys/net/ipv4/tcp_fin_timeout

That will reduce the free time from 60 to 20 seconds and you'll have lot more sockets to work with.

Thank You!

by imriz on 19 Feb 2009

I already did that. It didn't help. I doubt this is a file handlers issue.

by Kulbir Saini on 19 Feb 2009

Imriz,

I tried to explore more and it turns out the tcp_fin_timeout is not the option we should be using. Setting that to 20 doesn't help at all. Please revert it back to 60.

But no need to worry. After exploring some more, I found out that tcp_tw_recycle is the option we need.

Use the following command (use it only if you understand what it does)

echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle

Then restart squid or use this command for a complete clean up

killall -9 squid python; service squid start

Now see if there are too many unused TIME_WAIT sockets in your system.

Please report how it goes.

Thank you for using videocache :)

by bellera on 19 Feb 2009

Hello!

I posted a suggestion about XMLRPC Server stability.

Regards,

Josep Pujadas

by Kulbir Saini on 19 Feb 2009

Josep,

Thank you very much for the suggestion. The thread is open for discussion. Please pen down your views about my comment. In the meantime, did you try the above hack. I am using it on my test system and I don't see any TIME_WAIT sockets. Even if some are there, they disappears in less than a second. May be this hack can work.

Thank you for the efforts :)

by imriz on 19 Feb 2009

No go - same result:

[root@videocache1 ~]# netstat -lnp | grep 9100
tcp        0      0 127.0.0.1:9100              0.0.0.0:*                   LISTEN      1018/(python)       
[root@videocache1 ~]# strace -Ffp 1018
Process 1018 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
futex(0x11a62280, FUTEX_WAIT_PRIVATE, 0, NULL
by bellera on 19 Feb 2009

Kulbir,

For FreeBSD the equivalent to tcp_tw_recycle I think it is ...

On the fly:

proxy# sysctl net.inet.tcp.msl=7500
net.inet.tcp.msl: 30000 -> 7500

At startup. Add to /etc/sysctl.conf the following line:

net.inet.tcp.msl=7500

For FreeBSD 6.2 or greater there is the possibility not create TIME_WAIT state for localhost connections ...

On the fly:

proxy# sysctl net.inet.tcp.nolocaltimewait=1
net.inet.tcp.nolocaltimewait: 0 -> 1

At startup. Add to /etc/sysctl.conf the following line:

net.inet.tcp.nolocaltimewait=1

I choosed the second way because I think is better. More information at:

http://rerepi.wordpress.com/2008/04/19/tuning-freebsd-sysoev-rit/

http://spatula.net:8000/blog/2007/04/freebsd-network-performance-tuning.html

Tomorrow I will see if it helps. Do you think can I increase url_rewrite_children from 10 to 20 at squid.conf?

Regards,

Josep Pujadas

by Kulbir Saini on 19 Feb 2009

Josep,

Thank you very very much for explaining things in FreeBSD context. I hope the explanation will be useful to other users.

Try increasing the url_rewrite_children. The only worry was that they'll increase the TIME_WAIT sockets and I think that is not a problem anymore.

Thanks again :)

by Kulbir Saini on 19 Feb 2009

Imriz,

At least the TIME_WAIT sockets are gone. So, we are now one step ahead in tackling this hang up problem. I don't have much knowledge about strace thing. Can you explain what that output says exactly.

Thank you for your efforts :)

by imriz on 19 Feb 2009

strace shows the syscalls a process is making. Basically what we see here is that the process is stuck on a certain syscall (http://linux.die.net/man/2/futex).

by Kulbir Saini on 19 Feb 2009

Imriz,

I think it must be os.stat (called by cache size calculator). I'll try to come up with some cache-the-cache-size thing.

Thank You!

by bellera on 19 Feb 2009

Kulbir,

Not create TIME_WAIT state for localhost connections with FreeBSD works. I allways see:

proxy# netstat -an | grep "127.0.0.1.9100"
tcp4  0  0  127.0.0.1.9100  *.*  LISTEN

But:

url_rewrite_children 20

doesn't work:

2009-02-19 21:00:53,961 22249 - - XMLRPCSERVER - Starting XMLRPCServer on port 9100.
2009-02-19 21:00:54,007 22275 - - STRAT_XMLRPC_SERVER_ERR - Cannot start XMLRPC Server - Exiting

I returned to:

url_rewrite_children 10

Regards,
Josep Pujadas

by Kulbir Saini on 19 Feb 2009

Josep,

That may be a false error reported by plugin. I'll request you to try this.

  1. Increase url_rewrite_children to 20.
  2. Restart squid.
  3. Don't look at videocache log yet.
  4. Use "netstat -an | grep 9100" and check if somebody is listening on port 9100. a. if videocache/xmlrpc listening on port 9100, its fine. b. if not, reduce the url_rewrite_children.
  5. Check videocache.log, it might have shown xmlrpc startup failure even when it started successfully. I know this issue. It occurs when you have a large number of rewrite children.

Thank You!

by bellera on 19 Feb 2009

Kulbir,

I had many consoles to see the logs (squid and redirectors) with:

tail -f [log_file]

I think this was my problem!!!!, as you suggested saying:

"3. Don't look at videocache log yet."

Everything ok! Thanks for your pacience!

Regards,

Josep Pujadas

by imriz on 19 Feb 2009

If you mean that the os.stat is hanging the XMLRPC, I think you're wrong, as it happens even if dir_size = 0

I'm thinking you should revert to using UDP (UDP will allow you to get a 'fire and forget' work flow) and a simpler "command" server. I also think that for robustness, the XMLRPC (or it's successor), must be split into an independent daemon.

What do you think?

by Kulbir Saini on 20 Feb 2009

Josep,

This XMLRPC is a ghost and troubling me more than it should. Anyways you got it working finally.

Thank you for using videocache :)

by Kulbir Saini on 20 Feb 2009

Imriz,

os.stat also needs to be optimized. But yeah, XMLRPC is a bigger problem than that. If you know of some mechanisms for inter process communication via sockets, do let me know.

Thank You!

by bellera on 20 Feb 2009

Kulbir,

"This XMLRPC is a ghost"

Really!

Next day ... I have a maintenance cron for squid at midnight (stop + statistics + rotate + squidGuard DB update + start). With url_rewrite_children 20 I had a new crash for XMLRPC:

2009-02-20 00:06:20,957 26903 - - XMLRPCSERVER - Starting XMLRPCServer on port 9100.
2009-02-20 00:06:20,958 26920 - - STRAT_XMLRPC_SERVER_ERR - Cannot start XMLRPC Server - Exiting

Now I put url_rewrite_children 15. It seems to work. I will conserve this value to see next days if it is good for me.

Thanks a new time!

Regards,
Josep Pujadas

by Kulbir Saini on 20 Feb 2009

Josep,

Again that warning error might have been false. Did you check "netstat -atpn |grep 9100" even after the failure. I would suggest you to revert back to 20 and check.

Thank You!

by imriz on 20 Feb 2009

I think that the download manager should be separated from the squidpart(). The squidpart() should use a fast and "cheap" mechanisim to let the download manager know if someone requested a video. I suggest UDP for this mission.

by Kulbir Saini on 20 Feb 2009

Imriz,

Decoupling XMLRPC server or moving to some UPD-based-ipc itself will make the squipart() function's job really easy. I think we need to explore more on that part.

Thank You!

by imriz on 21 Feb 2009

I have made a prototype of my suggested idea in perl. I'm using a 3 parts model. The first part is the squipart() - It sends a UDP message to a seperate daemon, which stores the request in a MySQL database.The third part is the download manager, which connects to the MySQL database, and downloads the video with the most requests. The downlad manager is using a multi process model, to allow X concurrent downloads.

If you want I can send you the scripts offline (contact me via email)..

by Kulbir Saini on 22 Feb 2009

Imriz,

We can't really use MySQL because that would be another dependency and people find it really hard to install softwares with tons of dependencies.

I am sending you a mail for the model.

Thank you!

You need to sign in. Please sign in to add answer to this question.