Extract Links From an HTML File With PHP

6 March, 2008 | PHP

Use the following function to extract all of the links (the href value and the link text of every anchor tag) from an HTML string.

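function linkExtractor($html){
 $linkArray = array();
 // Group 1 captures the href value (quoted or unquoted); group 2 captures everything between <a ...> and </a>.
 // PREG_SET_ORDER groups the results per link, so each $match holds the captures for one link.
 if(preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',$html,$matches,PREG_SET_ORDER)){
  foreach($matches as $match){
   $linkArray[] = array($match[1],$match[2]);
  }
 }
 return $linkArray;
}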
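
To see what the pattern captures, the same expression can be run on a small hand-written string. This is only a sketch: the sample HTML and the example.com address are made up for illustration, and the opening <?php tag is included so the snippet runs on its own.

```php
<?php
// Made-up sample input: one plain link and one link with an extra attribute and single quotes.
$sample = '<p><a href="http://example.com/">Example</a> and '
        . '<a class="x" href=\'/about\'>About us</a></p>';

// Same pattern as in linkExtractor(); PREG_SET_ORDER yields one entry per link.
preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i', $sample, $matches, PREG_SET_ORDER);

foreach($matches as $match){
 // $match[0] is the whole tag, $match[1] the href value, $match[2] the link text.
 echo $match[1].' => '.$match[2]."\n";
}
// Prints:
// http://example.com/ => Example
// /about => About us
```

Because the dot does not match newlines without the s modifier, a link whose tag or text is spread over several lines will not be picked up by this pattern.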

To use it, read a web page or file into a string and pass that string to the function. The following example fetches a web page with PHP's cURL functions and then passes the result to linkExtractor() to retrieve the links.

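$url = 'http://www.talkincode.com';
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12');
curl_setopt($ch,CURLOPT_HEADER,0);         // leave the response headers out of the result
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1); // return the page as a string instead of printing it
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,0); // do not follow redirects
curl_setopt($ch,CURLOPT_TIMEOUT,120);      // give up after 120 seconds
$html = curl_exec($ch);
curl_close($ch);
// curl_exec() returns false when the request fails.
if($html === false){
 die('Could not fetch '.$url);
}
// Escape the output so that markup inside the link text is displayed rather than rendered.
echo '<pre>'.htmlspecialchars(print_r(linkExtractor($html),true)).'</pre>';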

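The function returns an array in which each element is itself a two-element array: the link location (the href value) at index 0 and the text that the link contains at index 1. The link text is returned exactly as it appears in the source, so any tags nested inside the link are included.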
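
Regular expressions can miss links in unusual markup, so here is a rough sketch of the same idea built on PHP's DOM extension instead. It is not the function from this post: the name linkExtractorDom is hypothetical, it assumes the DOM extension is enabled, and it returns the plain text of each link rather than the raw markup inside it.

```php
<?php
// Hypothetical alternative to the regex version: same return shape,
// but the HTML is parsed with PHP's DOM extension.
function linkExtractorDom($html){
 $linkArray = array();
 if(trim((string)$html) === ''){
  return $linkArray; // loadHTML() does not accept an empty string
 }
 $dom = new DOMDocument();
 // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
 $previous = libxml_use_internal_errors(true);
 $dom->loadHTML($html);
 libxml_clear_errors();
 libxml_use_internal_errors($previous);
 foreach($dom->getElementsByTagName('a') as $a){
  // Skip anchors without an href, such as <a name="top">.
  if($a->hasAttribute('href')){
   $linkArray[] = array($a->getAttribute('href'), trim($a->textContent));
  }
 }
 return $linkArray;
}

// Made-up sample input for illustration.
$sample = '<p><a href="/one">One</a> <a name="top">anchor only</a> <a href="/two"><b>Two</b></a></p>';
$links = linkExtractorDom($sample);
print_r($links);
```

With this sample the result holds two entries, /one with the text One and /two with the text Two. The anchor without an href is skipped and the nested b tag is dropped from the link text, which is where this version differs from the regex one.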

Comments

Comment from Mark James
Date: September 4, 2008, 9:27 am

Cool Scripts.
