The Squid Proxy Cache

      Documentation version 0.3

  1. Introduction
    1. What is squid?
    2. The ideas behind internet caching
      1. Dynamic pages and how they are handled
      2. Cache hierarchies
    3. What systems does squid run on?
    4. Where do I get the source?
      1. How do I compile it?
        1. It didn't work!
      2. I really can't figure this "compile" thing out - can I get compiled versions for my operating system?
    5. Changing operating system settings to enhance squid
  2. Cache configuration
    1. Hardware configuration
      1. Recommended hardware for a given number of estimated hits.
    2. Server Configuration
      1. Directory layout
      2. The Configuration file
        1. Running squid as a cache
          1. A very basic configuration file
          2. A standard configuration file
          3. Parents and siblings
            1. ICP
            2. I want to use a parent, but they use this software, not squid
              1. Harvest
              2. Netscape
              3. Microsoft Cache
              4. Apache
              5. Cern
        2. Running squid as a web server accelerator
    3. Client software configuration
      1. Autoconfiguration with a JavaScript file
        1. Netscape
        2. Internet Explorer 3.01
      2. Manual configuration
        1. Netscape
        2. Internet Explorer
        3. Unix environment variables
  3. What next?
    1. Preloading web pages
    2. Performance measuring
    3. Stats
    4. Optimising heavily used caches
      1. malloc
      2. DNS load balancing
      3. Diagnosing slow caches
        1. vmstat / iostat
        2. strace / truss
  4. Related Software
    1. Harvest
    2. Cern
    3. Apache
    4. Netscape Cache
    5. Microsoft Cache
  5. More information
    1. The web page
    2. Related web pages
    3. The mailing list
      1. What to include when you mail the list
  6. Credits


Introduction

What is squid?

Squid is a server that caches the data returned by often accessed web pages, ftp files and other internet services. Whenever a client requests a web page (for example), the request, instead of going to the web server directly, is processed by a "local" server running squid. This server connects to the original site, requests the document, and passes it on to the client software. If the document is cacheable (ie it is not the output of a CGI-BIN program, or something similar) it then stores a local copy of the document. The next time the document is requested, the cache doesn't need to connect to the original site, and simply serves the copy from disk. Currently squid handles the "http", "ftp", "gopher" and "wais" protocols.

The ideas behind internet caching:

Dynamic pages and how they are handled:

Not all pages are cacheable. Some web sites, for example, will return a page that is dynamically changed based on the user's interests. There are certain standards (and exceptions to these) that servers use to inform caches that these pages shouldn't be cached, and that next time someone wants that page, the cache should re-request it. Some companies also use these headers to ensure that their "hit count" is accurate, for marketing purposes. Some do this even though it costs them (and their users) extra bandwidth.

Cache hierarchies:

Squid and harvest support advanced hierarchies. They both support a common protocol, called the "Internet Cache Protocol" (or ICP), which uses an efficient method for figuring out if a nearby cache has a copy of a document it is looking for. This is useful if you are running load balancing caches (see the section on this for more information) or if you have caches spread over a local network, where local access is faster than connecting to the original site. Squid also supports ICP over multicast, an efficient way of querying several caches at once. A nearby cache can be a "sibling" or a "parent". If you set a cache up as your parent, it means that your cache will always connect to it to get a page if the page can't be served locally. If a sibling cache doesn't have a page, though, your cache then ignores it and connects directly to the original server, or to its parent cache.

What systems does squid run on?

Squid runs on most unix systems without modification to the source, though sometimes it is a little more difficult to compile on some than on others. As far as we know it runs on (in alphabetical order): BSDI, Digital Unix, FreeBSD, Irix, Linux, SCO (with a few difficulties, unless you use GCC), Solaris and SunOS. It is developed using (mostly) GCC, so it may be best to try and compile it using this compiler.

Where do I get squid?

Have a look at http://squid.nlanr.net/ for a list of servers that mirror squid. They are stored with a consistent naming convention so that you can find the latest version. For example squid 1.1.5 is called "squid-1.1.5.tar.gz".

How do I compile it?

By default squid installs itself in /usr/local/squid. You need to create this directory, with the permissions set so that you (as a normal user) can write to it (I try not to compile anything as root). Then change to the directory where the squid archive is, and "gunzip" the file. You will then have an uncompressed file called "squid-1.1.5.tar", which you can extract into its directory structure with the "tar" command ("tar xvf squid-1.1.5.tar" should extract the file for you).

Once squid is extracted, you should change to the directory that it created and run "./configure". There are numerous options to the configure script that allow you to do things like change the directory it will install to, and so forth. Run "./configure --help" to get a list of them.

The configure script will then analyze your system to find your compiler and figure out how squid should compile. All should hopefully go well, and you should end up with a "Makefile" that allows you to compile squid itself. To do this, just type "make". If it compiles cleanly, you can type "make install" to install the cache software, which will make the required directories and copy the binaries and demo configuration file into them.
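The whole sequence, from unpacking to installing, looks something like this (the version number is just an example; substitute the one you downloaded):

```shell
# Unpack the source archive
gunzip squid-1.1.5.tar.gz
tar xvf squid-1.1.5.tar
cd squid-1.1.5

# Configure, build and install (see ./configure --help for options,
# e.g. changing the /usr/local/squid install directory)
./configure
make
make install
```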

It didn't work!

If squid didn't compile correctly, you can try the following: (Still need to do this)

I really can't figure this "compile" thing out - can I get compiled versions for my operating system?

There aren't many precompiled versions available... Sometimes, if you ask, people will compile a version for you, and some binary versions are available on the ftp site (and its mirrors). Quite often people would be happy to compile a version for you, but the version of their kernel or operating system is different to yours. You must therefore make sure that you give sufficient details when someone is trying to compile a version for you... Things like where squid is going to be installed are important.

Changing operating system settings to enhance squid

Many operating systems are oriented to handling large numbers of processes efficiently, and so limit the number of files a single process can open (or the number of network connections it can handle). They may also limit the amount of disk space or memory that a user can use. This section is specifically oriented to Linux, because it is the operating system that our company (The Internet Solution, in South Africa) runs. As people send me more information, I will add more to this section. (So get cracking, all you squid experts!)

Linux: Linux treats open files and open sockets (network connections) the same, and, unfortunately, it puts a rather low limit on the number of "filedescriptors" that a single process can open. The default in kernels 1.2.3 -> 2.0.x is 256. This effectively means that the most concurrent connections squid can handle is about 75 or so. A patch to the kernel allows you to increase this to 1024, allowing about 300 concurrent connections. This patch is available at http://www.linux.org.za/ Once you have applied the patch to the kernel, you will need to reboot the machine and recompile squid.
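To see what your system currently allows, you can check the per-process filedescriptor limit from the shell (this is a generic check, not specific to the patch mentioned above):

```shell
# Show the maximum number of filedescriptors a single process may open
ulimit -n
```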

FreeBSD: With FreeBSD it is significantly easier to increase the number of filehandles.
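On FreeBSD the limits can usually be inspected and raised with sysctl, without patching or rebuilding the kernel. The exact variable names may differ between FreeBSD versions; the ones below are assumptions, so check sysctl's output on your own machine:

```shell
# Inspect the current system-wide and per-process limits
sysctl kern.maxfiles
sysctl kern.maxfilesperproc

# Raise the system-wide limit (as root); repeat after a reboot,
# or arrange for it to be set at boot time
sysctl -w kern.maxfiles=4096
```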


Cache Configuration:

Hardware Configuration:

This part will be a bit of a touchy subject. People will recommend specific hardware because it is what they run, and it works, rather than based on a "side by side" comparative test.

The following is based on my personal experience. It may not work for you, it may... All I can suggest is that you go for a lowish end machine before you spend huge amounts of money. You can always upgrade later.

There are effectively two ways to go:

A small cache can be very effective. A 750M cache can (and does on our system) get a hit rate of 30% (at least with Squid 1.0.x). On the other hand, a 30 gig disk can get a hit rate of 60%. With some international links this difference in bandwidth usage will warrant the extra disk space. Many people overestimate the amount of disk space needed to get (for example) a 30% hit rate: the top 30% of accessed pages will normally fit in a 500 meg partition.

One of the most important things with the current version of squid is RAM. Squid-NOVM should sort this out, because it doesn't keep objects in RAM; it writes them straight to disk. This means that the type of disk will become more important in the near future (SCSI vs IDE). Currently IDE copes all right, as long as squid doesn't swap to disk.

Make sure that you have enough RAM! If someone downloads a 100M file through the cache, squid will get a virtual size of over 100M! It doesn't use temporary files, so make sure that you have sufficient swap space, even if you have only a smallish amount of RAM. Just because squid has an option to "use x megs of ram" doesn't mean that it will. See the manual section for more information on that option.

Recommended hardware for a given number of estimated hits

(Still need to do this)

Server Configuration

Directory layout

With the default configuration, Squid installs itself to "/usr/local/squid". Under this, there are normally 3 directories, "bin", "etc" and "cache". I normally keep the squid source in "/usr/local/squid/src/"

"bin" contains the executables "dnsserver", "ftpget", "squid" and a few other programs, such as scripts to restart squid if it dies.

"etc" contains the Squid configuration file, "squid.conf"

"cache" contains the actual data retrieved through the cache. All the actual files are kept in further directories where they can be retrieved with little system overhead.

The Configuration file

The squid configuration file is called "squid.conf", and is normally in the directory "/usr/local/squid/etc". Each of the possible options has a short (and hopefully clear) description of what the option does. Squid has essentially two modes of operation. The first is as a caching server. The second is as a program that sits in front of your web server (answering all connections on port 80) and answers queries from an "in memory" cache of pages that have been previously downloaded. This second mode is known as a "proxy accelerator", as it speeds the downloading of web pages. Some web servers are inherently slow, and holding their pages in the squid cache can be more efficient.

Running squid as a cache

With the default config file, squid runs as a caching proxy server, rather than a proxy accelerator.

A very basic configuration file

You can download a very basic config file here. This config will allow all people access to the cache, create a 100M cache area, use 8M of RAM, keep minimal logs, and store all files in the default places, which are normally beneath "/usr/local/squid", where all files will be stored as user nobody, group nogroup. All client requests will come in on port 3128, and all "inter-cache" traffic will go to port 3130, using the UDP protocol.
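A sketch of such a basic file is shown below. The directive names are taken from the squid 1.1 era squid.conf.orig; check the comments in your own squid.conf for the exact spelling and defaults in your version:

```shell
# squid.conf - minimal caching proxy
http_port 3128                          # where clients connect
icp_port 3130                           # UDP inter-cache (ICP) traffic
cache_mem 8                             # try to use about 8M of RAM
cache_swap 100                          # 100M of disk for cached objects
cache_dir /usr/local/squid/cache        # default cache directory
cache_effective_user nobody nogroup     # run as user nobody, group nogroup
http_access allow all                   # let everyone use the cache
```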

A standard configuration file

This is supposed to be a standard config file for an ISP customer, rather than for an ISP itself. An ISP setup would probably be more complex, with more acls, more cache_dirs and so on.

This config allows access only from your network, will set up a parent cache if your service provider has squid or harvest, uses 1G of disk space for the actual cached pages, tries to use only 4M of RAM, runs as user "nobody" with the group "nogroup", and keeps minimal logs. I have tried to comment the changes that I have made, but the useful info on the other options will still be in the default squid config. Have a look at the options I have chosen, read squid.conf.orig to see what the other options are, and if they suit you more, use them. You might want to try a "basic" config like this one first.
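A sketch of what the main differences from a basic config might look like. The network address, hostname and directive names here are examples (squid 1.1 era syntax); substitute your own network and your provider's cache:

```shell
# Only allow our own network to use the cache
acl mynetwork src 192.168.0.0/255.255.255.0
acl all src 0.0.0.0/0.0.0.0
http_access allow mynetwork
http_access deny all

# Fetch misses through the provider's cache (squid or harvest)
cache_host cache.myisp.example parent 3128 3130

# More disk, less RAM
cache_swap 1000                         # 1G of disk for cached pages
cache_mem 4                             # try to use only 4M of RAM
```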

Parents and siblings
ICP

ICP is an abbreviation for Internet Cache Protocol, a protocol specifically designed to allow caches to talk to one another efficiently. It allows all sorts of hierarchies of caches, with siblings and multiple parents, and is the basis for multicast cache systems. Currently only a few caches support ICP. Harvest was the base system for ICP. Squid can talk to both the commercial version of Harvest and to the Harvest shareware. Unfortunately the Netscape and Microsoft proxies don't talk ICP at the time of writing. The Microsoft Proxy doesn't even support any kind of parenting; Netscape at least supports parents. Sibling support can be very useful if you have (as a simple example) two cache machines which must load balance. If you set them up as siblings, each can check with the other when it doesn't have an object; if the other cache has it, the object is fetched from there instead of over the "long wire", and can be served almost as fast as if it were on the local machine.
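For the two-machine load balancing example, each cache would carry a cache_host line pointing at the other. On the first machine it might look like this (the hostname and ports are made-up examples):

```shell
# On cache1: treat cache2 as a sibling - ask it via ICP before
# fetching a missed object over the long wire
cache_host cache2.mydomain.example sibling 3128 3130
```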

I want to use a parent, but they use this software, not squid
Harvest

This shouldn't be a problem. Squid is largely based on Harvest. Make sure that you allow ICP queries on the harvest cache, and that you put the correct port numbers in the "cache_host" line in your squid.conf.

Netscape

If squid talks ICP to the echo port (UDP echo) on a server, the server seems to say "I have that" for every query. Thus if you want to use a parent that doesn't support ICP, you need them to enable the UDP echo port on their server. You then need to set up squid so that the "cache_host" line says that the ICP port is port 7. Squid then thinks that the netscape server is answering ICP queries for all requests, and squid will use it as a parent.
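Assuming the Netscape proxy listens on port 8080 and has its UDP echo port enabled, the squid side would look something like this (the hostname is an example):

```shell
# Port 7 (UDP echo) stands in for ICP; squid will always get an
# "I have that" answer, so the Netscape server acts as a parent
cache_host netscape-proxy.example parent 8080 7
```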

Microsoft Cache

Sorry. You cannot do this with the current Microsoft cache version, and I don't know if you ever will be able to, either.

Apache

See the Netscape note. The Apache proxy module has no concept of parents/ siblings etc.

Cern

Try to get them to move away from CERN :) Also see the Netscape note, since CERN doesn't support ICP either.

Running squid as a web server accelerator

Squid can act as a front end to your web server too, basically storing oft requested objects in RAM, doing all of its multi-threading stuff and possibly allowing you to use squid as a fast front end to your web server. It can also allow you to run a proxy server on the same port as your web server, which is a bit easier to remember, but I think it's a better idea to split your cache and web serving onto separate ports, with separate DNS entries (ie you want cache.mydomain.com, so that when/if you decide to run the cache on a dedicated machine, you don't have to get people to change their settings, but just do a DNS change and everything will work).

Set http_port to 80, so that squid will answer as if it were a web server. Set httpd_accel to the real web server's name and the port number on that server, so you have something like "httpd_accel hiddenwww.mydomain.com 80" (if it's the same machine, try "httpd_accel 127.0.0.1 8000"). If you want to run the web server and a cache on the same machine, set httpd_accel_with_proxy to "on". The rest can probably be left as defaults.
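Put together, the accelerator part of squid.conf might look like this (the hostname is the example from above; the directive names are the squid 1.1 era ones, so check your own squid.conf comments):

```shell
http_port 80                            # answer like a web server
httpd_accel hiddenwww.mydomain.com 80   # the real web server behind squid
httpd_accel_with_proxy on               # also act as a normal proxy
```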

Temporary End