I have a scraper that hits a search engine with ~1.5M requests per month.
It currently consumes around 13TB/month of outgoing bandwidth (inbound is free, so I don't care about it).
Each page request goes through a proxy server first; this is a hard requirement that can't be removed. The proxy is also third-party, so it can't be configured differently or provide anything other than a normal HTTP proxy (Squid or similar).
The program currently works with simple PHP cURL requests; each page is fetched separately by a new cURL request.
I am looking for ANY solution that can reduce the outgoing bandwidth by 30% or more.
Solutions could be code-based, server config, or anything else I can actually change.
Main request code:
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return body as a string
curl_setopt($ch, CURLOPT_PROXY, $proxy);            // the mandatory 3rd-party proxy
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);       // CONNECT tunnel through it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_HEADER, $header);          // include response headers in output
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, $referer);
$result['EXE'] = curl_exec($ch);
$result['INF'] = curl_getinfo($ch);
$result['ERR'] = curl_error($ch);
curl_close($ch);                                    // handle is created and closed per page
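Would something along these lines help? An untested sketch of the one change I can see on my side: reusing a single handle across pages instead of a new one per request, so the connection (and the CONNECT tunnel through the proxy) is kept alive and I don't pay a fresh TCP/TLS handshake in outgoing bytes on every page. The `build_opts` helper, the `$urls` loop, and the placeholder values are illustrative only, not my real code; `CURLOPT_ENCODING` mainly shrinks inbound, which is free anyway, so it costs nothing to add.

```php
<?php
// Placeholder values; substitute the real ones from the code above.
$proxy   = 'proxy.example:3128';
$timeout = 10;
$agent   = 'Mozilla/5.0';
$referer = 'http://example.com/';
$urls    = []; // fill with the pages to fetch

// Hypothetical helper: builds the same options as my current code,
// as an array suitable for curl_setopt_array().
function build_opts(string $url, string $referer, string $proxy,
                    int $timeout, string $agent): array {
    return [
        CURLOPT_RETURNTRANSFER  => 1,
        CURLOPT_PROXY           => $proxy,
        CURLOPT_HTTPPROXYTUNNEL => 1,
        CURLOPT_CONNECTTIMEOUT  => $timeout,
        CURLOPT_USERAGENT       => $agent,
        CURLOPT_URL             => $url,
        CURLOPT_REFERER         => $referer,
        // Send Accept-Encoding and let curl decode transparently.
        CURLOPT_ENCODING        => '',
    ];
}

$ch = curl_init(); // created ONCE, outside the per-page loop
foreach ($urls as $url) {
    curl_setopt_array($ch, build_opts($url, $referer, $proxy, $timeout, $agent));
    $result['EXE'] = curl_exec($ch);   // reused handle -> keep-alive where possible
    $result['INF'] = curl_getinfo($ch);
    $result['ERR'] = curl_error($ch);
}
curl_close($ch);
```

I don't know whether handshake overhead is a meaningful slice of 13TB at 1.5M requests/month, which is part of my question.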