Hi,
Sometimes the RPC server stalls/freezes. During this time, the processes running with squid_part cannot query the server.
The strace output looks something like this (notice the 8-second delay at the recvfrom):
Process 4515 attached with 4 threads - interrupt to quit
[pid 4515] 0.000000 recvfrom(7, <unfinished ...>
[pid 4507] 0.000041 connect(8, {sa_family=AF_INET, sin_port=htons(9100), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
[pid 4506] 0.000047 recvfrom(6, <unfinished ...>
[pid 4496] 0.000035 futex(0x117bc760, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 4506] 8.027006 <... recvfrom resumed> "POST /RPC2 HTTP/1.0\r\nHost: 127.0"..., 8192, 0, NULL, NULL) = 146
[pid 4506] 0.000113 futex(0x1168ce70, FUTEX_WAKE, 1) = 0
Thanks, Imriz, for providing the trace. Right now I am busy with some other work; I'll have a look at it soon (within a week).
Thanks for providing wonderful feedback :)
No problems :)
I think it might be a good idea to use forking instead of threads - it seems this issue might be related to thread synchronization.
Previously I was forking the XMLRPC server, then switched to threading due to a process synchronization problem. Anyway, I'll retry it. Thanks again :)
Hi,
We're still having these issues, even after upgrading to the latest version today...
Any ideas?
Imriz,
Well, not much can be done at the programming level in this case, but at the system level you can get rid of this problem very easily. Set the TCP timeout to a lower value so that sockets are freed earlier and your system doesn't run out of sockets.
Use the following command (don't use if you don't understand the impact)
echo 20 > /proc/sys/net/ipv4/tcp_fin_timeout
That will reduce the free time from 60 to 20 seconds, and you'll have a lot more sockets to work with.
Thank You!
Imriz,
I tried to explore this further, and it turns out that tcp_fin_timeout is not the option we should be using. Setting it to 20 doesn't help at all. Please revert it back to 60.
But no need to worry. After exploring some more, I found out that tcp_tw_recycle is the option we need.
Use the following command (use it only if you understand what it does)
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
Then restart squid, or use this command for a complete cleanup:
killall -9 squid python; service squid start
Now check whether there are too many unused TIME_WAIT sockets on your system.
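One quick way to count them is to parse /proc/net/tcp on Linux, where state code 06 is TIME_WAIT. This is only a sketch and covers IPv4 sockets only; it is not part of videocache itself:

```python
# Count TIME_WAIT sockets by parsing /proc/net/tcp (Linux, IPv4 only).
# The fourth whitespace-separated column ("st") holds the socket state in
# hex; 06 is TIME_WAIT (see include/net/tcp_states.h in the kernel source).
def count_time_wait(path="/proc/net/tcp"):
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            if len(fields) > 3 and fields[3] == "06":
                count += 1
    return count
```

The same number can also be obtained with `netstat -atn | grep -c TIME_WAIT`.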
Please report how it goes.
Thank you for using videocache :)
Hello!
I posted a suggestion about XMLRPC Server stability.
Regards,
Josep Pujadas
Josep,
Thank you very much for the suggestion. The thread is open for discussion. Please pen down your views about my comment. In the meantime, did you try the above hack? I am using it on my test system and I don't see any TIME_WAIT sockets. Even if some appear, they disappear in less than a second. Maybe this hack will work.
Thank you for the efforts :)
No go - same result:
[root@videocache1 ~]# netstat -lnp | grep 9100
tcp 0 0 127.0.0.1:9100 0.0.0.0:* LISTEN 1018/(python)
[root@videocache1 ~]# strace -Ffp 1018
Process 1018 attached - interrupt to quit
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
--- SIGSTOP (Stopped (signal)) @ 0 (0) ---
futex(0x11a62280, FUTEX_WAIT_PRIVATE, 0, NULL
Kulbir,
For FreeBSD, I think the equivalent of tcp_tw_recycle is ...
On the fly:
proxy# sysctl net.inet.tcp.msl=7500
net.inet.tcp.msl: 30000 -> 7500
At startup: add the following line to /etc/sysctl.conf:
net.inet.tcp.msl=7500
For FreeBSD 6.2 or greater, there is the possibility of not creating TIME_WAIT state for localhost connections ...
On the fly:
proxy# sysctl net.inet.tcp.nolocaltimewait=1
net.inet.tcp.nolocaltimewait: 0 -> 1
At startup: add the following line to /etc/sysctl.conf:
net.inet.tcp.nolocaltimewait=1
I chose the second way because I think it is better. More information at:
http://rerepi.wordpress.com/2008/04/19/tuning-freebsd-sysoev-rit/
http://spatula.net:8000/blog/2007/04/freebsd-network-performance-tuning.html
Tomorrow I will see if it helps. Do you think I can increase url_rewrite_children from 10 to 20 in squid.conf?
Regards,
Josep Pujadas
Josep,
Thank you very very much for explaining things in FreeBSD context. I hope the explanation will be useful to other users.
Try increasing url_rewrite_children. The only worry was that the extra children would increase the number of TIME_WAIT sockets, and I think that is not a problem anymore.
Thanks again :)
Imriz,
At least the TIME_WAIT sockets are gone, so we are now one step ahead in tackling this hang-up problem. I don't have much knowledge about strace. Can you explain what that output says, exactly?
Thank you for your efforts :)
strace shows the syscalls a process is making. Basically what we see here is that the process is stuck on a certain syscall (http://linux.die.net/man/2/futex).
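For illustration, a Python thread waiting on a lock is parked in exactly such a futex call. A minimal sketch (not videocache code):

```python
# A thread blocked on a lock: while worker() waits for the lock, strace on
# this process would show that thread sitting in futex(..., FUTEX_WAIT, ...),
# much like pid 4496 in the trace above.
import threading
import time

lock = threading.Lock()

def worker():
    with lock:  # blocks until the main thread releases the lock
        pass

lock.acquire()
t = threading.Thread(target=worker)
t.start()
time.sleep(0.2)  # worker is now parked inside a futex wait
lock.release()   # the futex call returns and the thread finishes
t.join()
```

If the lock holder never releases (or is itself stuck in a slow syscall), every other thread stays parked in futex(), which is exactly what a stalled server looks like under strace.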
Imriz,
I think it must be os.stat (called by the cache size calculator). I'll try to come up with some cache-the-cache-size scheme.
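A hedged sketch of that cache-the-cache-size idea: recompute the directory size at most once per TTL instead of stat-ing every file on every query. The dir_size() walk below is my assumption about what the calculator does, not videocache's actual code:

```python
# Cache the cache size: dir_size() walks the tree with os.stat (expensive),
# cached_dir_size() reuses the last result until TTL seconds have passed.
import os
import time

_cache = {"size": None, "stamp": 0.0}
TTL = 300  # seconds between real recalculations

def dir_size(path):
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.stat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished between listing and stat
    return total

def cached_dir_size(path):
    now = time.time()
    if _cache["size"] is None or now - _cache["stamp"] > TTL:
        _cache["size"] = dir_size(path)
        _cache["stamp"] = now
    return _cache["size"]
```

The trade-off is that the reported size can lag behind reality by up to TTL seconds, which is usually fine for a cache-fullness check.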
Thank You!
Kulbir,
Not creating TIME_WAIT state for localhost connections on FreeBSD works. I always see:
proxy# netstat -an | grep "127.0.0.1.9100"
tcp4 0 0 127.0.0.1.9100 *.* LISTEN
But:
url_rewrite_children 20
doesn't work:
2009-02-19 21:00:53,961 22249 - - XMLRPCSERVER - Starting XMLRPCServer on port 9100.
2009-02-19 21:00:54,007 22275 - - STRAT_XMLRPC_SERVER_ERR - Cannot start XMLRPC Server - Exiting
I returned to:
url_rewrite_children 10
Regards,
Josep Pujadas
Josep,
That may be a false error reported by the plugin. I'd request you to try it again.
Thank You!
Kulbir,
I had many consoles to see the logs (squid and redirectors) with:
tail -f [log_file]
I think this was my problem, as you suggested when you said:
"3. Don't look at videocache log yet."
Everything is OK! Thanks for your patience!
Regards,
Josep Pujadas
If you mean that os.stat is hanging the XMLRPC server, I think you're wrong, as it happens even if dir_size = 0.
I'm thinking you should revert to using UDP (UDP will give you a 'fire and forget' workflow) and a simpler "command" server. I also think that, for robustness, the XMLRPC server (or its successor) must be split into an independent daemon.
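The fire-and-forget side of that idea could look roughly like this from squid_part (a sketch; port 9101 and the plain-text message format are my assumptions, not anything videocache defines):

```python
# Fire-and-forget notification over UDP: sendto() on a datagram socket
# queues the packet and returns immediately, so a stalled (or even dead)
# server can never block the redirector the way a blocking XMLRPC call can.
import socket

def notify(video_id, host="127.0.0.1", port=9101):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(video_id.encode("utf-8"), (host, port))
    finally:
        sock.close()
```

The cost of this design is that delivery is not guaranteed; a lost notification just means one missed download hint, which is acceptable here.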
What do you think?
Josep,
This XMLRPC is a ghost and is troubling me more than it should. Anyway, you got it working finally.
Thank you for using videocache :)
Imriz,
os.stat also needs to be optimized. But yeah, XMLRPC is a bigger problem than that. If you know of some mechanisms for inter-process communication via sockets, do let me know.
Thank You!
Kulbir,
"This XMLRPC is a ghost"
Really!
The next day ... I have a maintenance cron for squid at midnight (stop + statistics + rotate + squidGuard DB update + start). With url_rewrite_children 20, I had a new XMLRPC crash:
2009-02-20 00:06:20,957 26903 - - XMLRPCSERVER - Starting XMLRPCServer on port 9100.
2009-02-20 00:06:20,958 26920 - - STRAT_XMLRPC_SERVER_ERR - Cannot start XMLRPC Server - Exiting
Now I have set url_rewrite_children to 15. It seems to work. I will keep this value for the next few days to see if it is good for me.
Thanks once again!
Regards,
Josep Pujadas
Josep,
Again, that error might have been false. Did you check "netstat -atpn | grep 9100" even after the failure? I would suggest you revert back to 20 and check.
Thank You!
I think the download manager should be separated from squidpart(). squidpart() should use a fast and "cheap" mechanism to let the download manager know when someone requests a video. I suggest UDP for this mission.
Imriz,
Decoupling the XMLRPC server, or moving to some UDP-based IPC, will make the squidpart() function's job really easy. I think we need to explore that part more.
Thank You!
I have made a prototype of my suggested idea in Perl. I'm using a three-part model. The first part is squidpart(): it sends a UDP message to the second part, a separate daemon, which stores the request in a MySQL database. The third part is the download manager, which connects to the MySQL database and downloads the video with the most requests. The download manager uses a multi-process model to allow X concurrent downloads.
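In Python terms, the middle part of that three-part model might look like this sketch (an in-memory Counter stands in for the MySQL table, and port 9101 and the plain-text message format are assumptions):

```python
# The "second part": a daemon that receives UDP notifications from
# squidpart() and tallies request counts per video. The download manager
# would then periodically pick the entry with the highest count.
import socket
from collections import Counter

def serve(port=9101, max_messages=None):
    counts = Counter()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    seen = 0
    # max_messages lets a test stop the loop; a real daemon would run forever
    while max_messages is None or seen < max_messages:
        data, _ = sock.recvfrom(4096)
        counts[data.decode("utf-8")] += 1
        seen += 1
    sock.close()
    return counts
```

Because this runs as its own process, a crash or stall in it can never freeze squid's redirectors, which is the robustness argument made above.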
If you want, I can send you the scripts offline (contact me via email).