VideoCache

Script to delete files not requested for a long time

by lopan on 23 Jan 2009

Hello there,

When I created the script "Statistics from youtube_cache.log by day" (http://cachevideos.com/forum/post/statistics-youtubecachelog-day), I added a routine that updates a file's modification date whenever the file is requested that day. This makes it possible to delete files that have not been accessed for a long time.

To run it, use: ./script.sh A B
where:
A = select files whose data was last modified more than A*24 hours ago.
B = use Y to delete without being asked.

#!/bin/bash
# lopan dot eti at gmail dot com (Author: Lopan)
# GPL2

# Variables
VIDEO_CACHE_DIR=/var/spool/squid/video_cache
DAYS=$1
DL=$2
MBT=0

# Sum the sizes (in bytes) of regular files last modified more than $DAYS days ago
for MB in $(find "$VIDEO_CACHE_DIR" -type f -mtime +"$DAYS" -exec ls -l {} ';' | awk '{print $5}'); do
        MBT=$((MBT+MB))
done

# Print the total size of the selected files (bytes/1024/1024 = MB)
echo "You have selected *$(echo "$MBT/1024/1024" | bc)MB* of videos to delete!"

# Ask whether the files can be erased (read -p and -n need bash)
if [ -z "$DL" ]; then
        read -p "Can I delete the selected files? [Y/n]" -n 1 DL
        echo
fi

# Delete files? Are you sure?
if [ "$DL" = "Y" ]; then
        echo "Wait! Deleting the selected files..."
        echo "This process can take a long time!"
        find "$VIDEO_CACHE_DIR" -type f -mtime +"$DAYS" -exec rm -f {} \;
fi
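
For readers who prefer Python, the selection step of the script above can be sketched as follows. This is only an illustration under stated assumptions: the function name select_old_files is hypothetical, and the cutoff is a plain days*24 hours, which is not exactly how find -mtime rounds ages to whole days.

```python
import os
import time


def select_old_files(cache_dir, days, now=None):
    """Return (paths, total_bytes) for files last modified more than `days` days ago.

    Hypothetical helper: it only selects files and never deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    paths, total_bytes = [], 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)  # lstat so a broken symlink does not raise
            if st.st_mtime < cutoff:
                paths.append(path)
                total_bytes += st.st_size
    return paths, total_bytes
```

Calling select_old_files('/var/spool/squid/video_cache', 30) would return the candidate files and their total size in bytes, which can be reviewed before anything is removed.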

19 Answers

by salah on 23 Jan 2009

Hi lopan

Could you please explain more about your script for those who are new to this: how to use it, where to paste it, and how to run it?
Thank you in advance.

by lopan on 23 Jan 2009

Salah,

To use this script, you need to run this other script (http://cachevideos.com/forum/post/statistics-youtubecachelog-day) every day.

That script (http://cachevideos.com/forum/post/statistics-youtubecachelog-day) updates the modification date of a file whenever the file is requested.

Finally, this script (http://cachevideos.com/forum/post/script-delete-files-not-requested-long-time) selects the files that have not been requested in the last N days and deletes them.

To run this script (http://cachevideos.com/forum/post/script-delete-files-not-requested-long-time), use the syntax:

./script.sh A B
A = select files whose data was last modified more than A*24 hours ago.
B (optional) = use Y to delete without being asked.

I plan to merge the two scripts (http://cachevideos.com/forum/post/script-delete-files-not-requested-long-time) and (http://cachevideos.com/forum/post/statistics-youtubecachelog-day) soon.

by salah on 23 Jan 2009

Thanks man, these good scripts will help a lot, and they are what we were missing.

10x again and wish you more good ideas and scripts :)

by Kulbir Saini on 23 Jan 2009

Lopan,

Great work man!!! You offloaded a lot of my work :) Thanks again and keep up the good work !!!!

by lopan on 24 Jan 2009

Hey Kulbir,

I think that I can merge the two scripts, Statistic and Cleaner.

What do you think?

So, I'll work on it!

by Kulbir Saini on 24 Jan 2009

Hi!

I think that would be another feather in the cap :) Go ahead!

Thank you for the support!!!

by thiago on 5 Feb 2009

based on your script I've made one, but in python...

[SEE SCRIPT BELOW]

some parts could be redone... like improving the regexps and some code in the main class, but it's working...
for those who don't know how to run it, just type...

python script.py N

where N is the number of days for which a video has not been requested

Bye...

Edited by admin : Added script here.

#!/usr/bin/env python
# videocache cleaner

import re
import time
import os
import sys

__version__ = 0.01
log_dir = '/var/log/videocache/'
log_dir_files = os.listdir(log_dir)
cache_dir = '/var/spool/videocache/'
hit_pattern = r'(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}),\d{3} \w+ \d+\.\d+\.\d+\.\d+ ([a-zA-Z0-9\._-]+) CACHE_HIT (\w+)'
download_pattern = r'(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}),\d{3} \w+ \d+\.\d+\.\d+\.\d+ ([a-zA-Z0-9\._-]+) DOWNLOAD (\w+)'

class TouchNotCompleteError(Exception):
  pass

class GetHitsNotCompleteError(Exception):
  pass

class CacheCleaner(object):

  def __init__(self, delete_age=90):
    self.all_logs = ''
    self.touch_complete = False
    self.get_hits_complete = False
    self.delete_age = delete_age
    self.now = time.mktime(time.localtime())
    self.last_hits = {}
    self.last_downloads = {}

  def _get_downloads(self):
    self._check_get_hits()
    download_list = re.findall(download_pattern, self.all_logs)
    for download in download_list:
      # Date fields from the log, then wday, yday, isdst
      # (mktime ignores wday/yday; isdst=-1 lets it decide whether DST applies)
      t = tuple([int(i) for i in download[:6]]) + (0, 1, -1)
      # Timestamp in the YYMMDDhhmm.SS form taken by `touch -t`
      date = '%s%s%s%s%s.%s' %(str(download[0])[-2:], download[1],
                   download[2], download[3], download[4],
                   download[5])
      # Videos that already have a cache hit are covered by _get_hits
      if download[6] not in self.last_hits:
        if download[6] in self.last_downloads:
          delta = self.now - time.mktime(t)
          # Keep the most recent download (smallest age)
          if self.last_downloads[download[6]]['last_download'] > delta:
            self.last_downloads[download[6]]['last_download'] = delta
            self.last_downloads[download[6]]['last_download_date'] = date
        else:
          self.last_downloads[download[6]] = {
            'site':download[7].lower(),
            'last_download':self.now - time.mktime(t),
            'last_download_date':date
          }

  def _get_hits(self):
    hit_list = re.findall(hit_pattern, self.all_logs)
    for hit in hit_list:
      # isdst=-1 lets mktime decide whether DST applies (wday/yday are ignored)
      t = tuple([int(i) for i in hit[:6]]) + (0, 1, -1)
      # Timestamp in the YYMMDDhhmm.SS form taken by `touch -t`
      date = '%s%s%s%s%s.%s' %(str(hit[0])[-2:], hit[1],
                   hit[2],hit[3], hit[4], hit[5])
      if hit[6] in self.last_hits:
        delta = self.now - time.mktime(t)
        # Keep the most recent hit (smallest age)
        if self.last_hits[hit[6]]['last_hit'] > delta:
          self.last_hits[hit[6]]['last_hit'] = delta
          self.last_hits[hit[6]]['last_hit_date'] = date
      else:
        self.last_hits[hit[6]] = {
          'site':hit[7].lower(),
          'last_hit':self.now - time.mktime(t),
          'last_hit_date':date
        }
    self.get_hits_complete = True

  def _touch_files(self):
    for download in self.last_downloads:
      if self.last_downloads[download]['site'] == 'youtube':
        cmd = 'touch %s%s/%s -t %s' %(cache_dir, self.last_downloads[download]['site'],
                     download, self.last_downloads[download]['last_download_date'])
      else:
        cmd = 'touch %s%s/%s.flv -t %s' %(cache_dir, self.last_downloads[download]['site'],
                      download, self.last_downloads[download]['last_download_date'])
      os.system(cmd)
    for hit in self.last_hits:
      if self.last_hits[hit]['site'] == 'youtube':
        cmd = 'touch %s%s/%s -t %s' %(cache_dir, self.last_hits[hit]['site'],
                     hit, self.last_hits[hit]['last_hit_date'])
      else:
        cmd = 'touch %s%s/%s.flv -t %s' %(cache_dir, self.last_hits[hit]['site'],
                     hit, self.last_hits[hit]['last_hit_date'])
      os.system(cmd)
    self.touch_complete = True


  def run(self):
    self._get_logs()
    self._get_hits()
    self._get_downloads()
    self._touch_files()
    self._clear()

  def _check_get_hits(self):
    if not self.get_hits_complete:
      raise GetHitsNotCompleteError

  def _check_touch(self):
    if not self.touch_complete:
      raise TouchNotCompleteError

  def _clear(self):
    self._check_touch()
    # -type f limits the deletion to files, so old directories are left in place
    cmd = "find %s -type f -mtime +%s -exec rm -f {} ';'" %(cache_dir, self.delete_age)
    os.system(cmd)

  def _clear_logs(self):
    """
    TODO
    """
    pass

  def _get_logs(self):
    self.all_logs = ''
    for log in log_dir_files:
      self.all_logs += open(log_dir+log).read()

def main():
  if len(sys.argv) > 1:
    c = CacheCleaner(delete_age=sys.argv[1])
  else:
    c = CacheCleaner()
  c.run()

if __name__ == '__main__':
  main()

by Kulbir Saini on 5 Feb 2009

Thiago,

Cool script!!! Thank you very much for taking some time out to write this script. I hope this will be helpful for users :)

PS : If you register, you can nicely format the code snippets :)

by k1mbl3 on 7 Feb 2009

Hi, I'm that unregistered user...
I had another idea... if you execute a touch command when the video is downloaded and on every hit in your main script, these scripts could be replaced by a simple find command...

Bye...

EDIT: after googling a bit I found a Python solution for the touch that is also OS independent: you can use os.utime(file, None)
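
A minimal sketch of that idea, assuming a hypothetical helper named touch that the main script would call on every download and every hit. os.utime(path, None) sets both the access and the modification time to the current time; the optional when argument is an addition for illustration only.

```python
import os
import time


def touch(path, when=None):
    """Set the access and modification time of `path`.

    With when=None the current time is used, like `touch` without -t.
    Otherwise `when` is a Unix timestamp applied to both times.
    """
    if when is None:
        os.utime(path, None)
    else:
        os.utime(path, (when, when))
```

With files touched this way, a cleaner only has to compare each file's modification time against a cutoff.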

by Kulbir Saini on 7 Feb 2009

Kimble,

That's a nice idea indeed. We can update the last-modified time every time there is a cache hit, and then while cleaning we can remove videos based on their last-modified time instead of their last access time. The last access time may change for several reasons that have nothing to do with serving. For example, when you take a backup, the last access time gets updated even though the video was not served.
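
The difference can be illustrated with a small sketch. The helper names age_in_days and simulate_backup_read are hypothetical; the second one only imitates what a read by a backup tool may do, by moving the access time forward and leaving the modification time alone.

```python
import os
import time

DAY = 86400


def age_in_days(path, now=None, use_atime=False):
    """Age of `path` in whole days, measured from mtime (default) or atime."""
    now = time.time() if now is None else now
    st = os.stat(path)
    stamp = st.st_atime if use_atime else st.st_mtime
    return int((now - stamp) // DAY)


def simulate_backup_read(path, now=None):
    """Set only the access time to `now`; the modification time is kept."""
    now = time.time() if now is None else now
    os.utime(path, (now, os.stat(path).st_mtime))
```

After simulate_backup_read, the age measured from the access time drops to zero, while the age measured from the modification time is unchanged, so a cleaner based on the modification time still sees the video as old.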

I'll try to incorporate this in next version :)

Thank You!

by lopan on 8 Feb 2009

Thiago, Kimble and Kulbir,

Wonderful!
Now it is easy to clean old videos in the cache.
It's a very nice function! :P

os.utime(Code_This, NOW) lol

c u

by k1mbl3 on 9 Feb 2009

I'm thiago... I wasn't registered :P

by bellera on 10 Feb 2009

Hello!

Good idea to clean up the cache to save disk space!

However I think this job should be done without system commands, only with Python code. I'm not a Python expert but I tried this example:

#!/usr/local/bin/python2.6
import os,time
video_file = '/var/spool/videocache/youtube/0954c0554eb4d59f'
print time.ctime(os.stat(video_file).st_ctime)
print time.ctime(os.stat(video_file).st_mtime)
print time.ctime(os.stat(video_file).st_atime)

./test.py
Mon Feb  9 17:49:48 2009
Thu Jan  1 20:33:59 2009
Tue Feb 10 04:43:32 2009

The result shows:

- The video was downloaded at 2009-02-09_17:49:48
- The video has been on YouTube since 2009-01-01_20:33:59
- The video was last accessed at 2009-02-10_04:43:32
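
The same check can be sketched for Python 3, where print is a function. The helper name describe_times is hypothetical. One caveat on reading the result: on Unix, st_ctime is the time of the last inode (metadata) change rather than a creation time, so it matches the download time only as long as nothing changed the file's metadata afterwards.

```python
import os
import time


def describe_times(path):
    """Return the three stat timestamps of `path` as readable strings."""
    st = os.stat(path)
    return {
        "ctime": time.ctime(st.st_ctime),  # on Unix: last inode (metadata) change
        "mtime": time.ctime(st.st_mtime),  # last content modification, or as set by utime
        "atime": time.ctime(st.st_atime),  # last access
    }
```

Printing describe_times(video_file) gives the same three values as the test script, labelled by name.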

I hope this helps!

Regards,
Josep Pujadas

by Kulbir Saini on 10 Feb 2009

Josep and all,

I have completed a script (still in the testing stage) to remove unused videos from the cache. It'll be included in the next version. So I'd ask all of you to invest your time in the stats calculation part. I have not thought about it at all yet.

Thank you for your hard work guyz!!!

by imriz on 21 Feb 2009

Hi Kulbir,

There's a small bug in the script:
if cur_time - os.stat(video)[stat.ST_MTIME] > expire*86400:
age = int((cur_time - os.stat(video)[stat.ST_ATIME]) / 86400)

The if should check ATIME and not MTIME

by Kulbir Saini on 22 Feb 2009

Imriz,

Actually, that's a hack to get around accesses made by other agents, like the copy command (cp), when backing up the cached videos. If you look at videocache.py, you'll notice that we update both the access time and the modification time whenever there is a CACHE_HIT. So using MTIME in vccleaner does no harm at all.

Thank you for looking at the code :)

by Dale on 14 Mar 2009

Vccleaner Problem

Hi,

Been setting up a new squid server and using videocache to cache videos, great work thanks!

I found a problem in the vccleaner script. It uses the "base_dir" parameter from videocache.conf, which works fine. However, if base_dir is set up with a maximum cache size, i.e. base_dir = /videocache/:35000, then vccleaner tries to use /videocache/:35000/youtube etc. as the cache directory. This obviously doesn't work very well!

Thanks again
Dale

by Kulbir Saini on 14 Mar 2009

Dale,

Please apply the following patch to vccleaner file.

diff --git a/scripts/vccleaner b/scripts/vccleaner
index 8bae0be..2512b70 100644
--- a/scripts/vccleaner
+++ b/scripts/vccleaner
@@ -115,7 +115,7 @@ def main(root, etc_dir):
         return (None, None, None, None)
     else:
         video_lifetime = int(mainconf.video_lifetime)
-    base_dir = [apply_install_root(root, dir.strip()) for dir in mainconf.base_dir.split('|')]
+    base_dir = [apply_install_root(root, dir_tup.split(':')[0].strip()) for dir_tup in mainconf.base_dir.strip().split('|')]
     logdir = apply_install_root(root, mainconf.logdir)

     # Youtube specific options

It'll work fine after that.
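
The parsing that the patch performs can be sketched on its own. This is not the vccleaner code: the function name parse_base_dirs is hypothetical, and it assumes, based on the patch and on Dale's example, that entries are separated by '|' and that an optional ':size' suffix follows each directory. The unit of the size is not stated in this thread, so the value is returned as a plain integer.

```python
def parse_base_dirs(base_dir_value):
    """Split 'dir[:size]|dir[:size]...' into a list of (directory, size_or_None)."""
    result = []
    for entry in base_dir_value.strip().split('|'):
        parts = entry.strip().split(':')
        directory = parts[0].strip()
        # The size suffix is optional; None means no limit was given
        size = int(parts[1]) if len(parts) > 1 and parts[1].strip() else None
        result.append((directory, size))
    return result
```

For the value from the report, parse_base_dirs('/videocache/:35000') returns [('/videocache/', 35000)], so the directory no longer carries the ':35000' suffix.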

Thank you for reporting the problem.

by Dale on 18 Mar 2009

Great, thank you. I will give it a go when I have time!

Dale
