In this short tutorial I will explain how to fetch and parse a web page in PHP using the cURL library.
First, send a text/plain Content-Type header. The browser will then show the script's output as plain text instead of rendering any HTML or CSS it contains, which keeps the results easy to read.
header('Content-Type: text/plain; charset=utf-8');
Now it's time to create the function which will fetch the content of a website.
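function getUrl($url) {
    // Use cURL when the extension is loaded.
    if (function_exists('curl_init')) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; Crawler version 1)');
        // Return the response as a string instead of printing it.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // Leave the HTTP response headers out of the returned string.
        curl_setopt($ch, CURLOPT_HEADER, false);
        // Follow redirects (Location headers).
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $site = curl_exec($ch);
        curl_close($ch);
    } else {
        // Fallback when the cURL extension is not available.
        $site = file_get_contents($url);
    }
    // Both calls return false when the request fails.
    return $site;
}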
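Notice the else branch: when the cURL extension is not installed, the function falls back to file_get_contents (which requires allow_url_fopen to be enabled), so you still get the content of the website. Note that this fallback covers a missing extension, not a failed request: both curl_exec and file_get_contents return false when the request itself fails.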
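Because a failed request only shows up as false, it can help to see why it failed. The following is a minimal sketch, not part of the original function: the name fetchPage, the returned array keys and the timeout values are all assumptions made for illustration, and it assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper (not part of the tutorial's getUrl): fetches a page
// with cURL and reports failures instead of silently returning false.
function fetchPage(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Assumed limits so a dead host cannot block the crawler; tune as needed.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    $body = curl_exec($ch);

    // Collect the body, the HTTP status code and the cURL error message.
    // The handle is released when $ch goes out of scope.
    return array(
        'body'   => $body === false ? null : $body,
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => $body === false ? curl_error($ch) : '',
    );
}
```

A caller could then skip a site when 'body' is null and log 'error', rather than running the regular expressions on an empty result.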
We can crawl several sites in one run by listing them in an array, so let's do that:
$url = array('http://pricop.net', 'http://YourWebSite.com');
With the array in place, we loop over each URL and scrape its content to obtain the title, meta description, meta keywords, URLs and, of course, the actual content.
Here is the last part of the code:
foreach ($url as $x) {
    $content = getUrl($x);
    if ($content === false) {
        echo "\nCould not fetch: $x\n";
        continue;
    }
    // Non-greedy matches, so only the first occurrence of each element is captured.
    preg_match('#<title[^>]*>(.*?)</title>#is', $content, $title);
    preg_match('/<head>.*?<meta name="description" content=.([^"\']+)/is', $content, $description);
    preg_match('/<head>.*?<meta name="keywords" content=.([^"\']+)/is', $content, $keywords);
    preg_match_all('/href=.([^"\' ]+)/i', $content, $anchor);
    preg_match('/<body.*?>(.*?)<\/body>/is', $content, $body);
    // A failed match leaves index 1 unset, so fall back to an empty string (PHP 7+).
    echo "\nTitle: "; print_r($title[1] ?? '');
    echo "\nDescription: "; print_r($description[1] ?? '');
    echo "\nKeywords: "; print_r($keywords[1] ?? '');
    echo "\nUrls: "; print_r($anchor[1]);
    echo "\nContent: "; print_r($body[1] ?? '');
    echo "\n************************************************************************************\n";
}
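The regular expressions above depend on details such as attribute order and quoting, so they can miss pages that are written differently. As an alternative, here is a rough sketch that pulls out the same fields with PHP's bundled DOM extension. The function name extractPageInfo and the array keys are made up for this example. It also differs from the regex version in two ways: it only collects href values from a tags, and it returns the body as text with the tags removed.

```php
<?php
// Hypothetical helper: extracts the same fields as the regex version,
// using PHP's DOM extension instead of pattern matching.
function extractPageInfo(string $html): array
{
    $info = array(
        'title' => '', 'description' => '', 'keywords' => '',
        'urls' => array(), 'content' => '',
    );
    if ($html === '') {
        return $info; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    // Meta names are compared case-insensitively.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $info[$name] = $meta->getAttribute('content');
        }
    }

    foreach ($doc->getElementsByTagName('a') as $link) {
        if ($link->hasAttribute('href')) {
            $info['urls'][] = $link->getAttribute('href');
        }
    }

    $bodies = $doc->getElementsByTagName('body');
    if ($bodies->length > 0) {
        $info['content'] = trim($bodies->item(0)->textContent);
    }
    return $info;
}
```

Inside the loop, this could be called as extractPageInfo($content) in place of the preg_match lines.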
You can even store this information in a database and build your own simple search engine on top of it.
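To give an idea of what that could look like, here is a small sketch using PDO with an in-memory SQLite database. Everything in it is an assumption for illustration: the pages table, its columns, and the savePage and searchPages functions are not part of the tutorial, and a real search engine would need proper indexing rather than LIKE queries. It assumes the pdo_sqlite extension is available.

```php
<?php
// Hypothetical storage sketch: table, column and function names are assumptions.
function savePage(PDO $db, string $url, string $title, string $description, string $content): void
{
    // Prepared statement, so scraped text cannot break the query.
    $stmt = $db->prepare(
        'INSERT OR REPLACE INTO pages (url, title, description, content) VALUES (?, ?, ?, ?)'
    );
    $stmt->execute(array($url, $title, $description, $content));
}

function searchPages(PDO $db, string $term): array
{
    // Naive substring search over the title and content columns.
    $stmt = $db->prepare(
        'SELECT url, title FROM pages WHERE title LIKE ? OR content LIKE ? ORDER BY url'
    );
    $like = '%' . $term . '%';
    $stmt->execute(array($like, $like));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// In-memory database for the example; use a file path to keep the data.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec(
    'CREATE TABLE IF NOT EXISTS pages (
        url TEXT PRIMARY KEY, title TEXT, description TEXT, content TEXT
    )'
);
```

After each page is scraped, savePage($db, $x, ...) would store its fields, and searchPages($db, 'some word') would return the matching URLs and titles.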