Posts Tagged ‘sitemap’

XML Sitemap Date Format In PHP

April 3rd, 2009

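To format the current timestamp in W3C Datetime encoding (as used in sitemap.xml files), pass the following format string to date(). The T separator must be escaped with a backslash, otherwise date() treats it as the timezone abbreviation format character.

echo date('Y-m-d\TH:i:sP', time());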

As of PHP 5 you can also use the c format character, which prints exactly the same string.

echo date('c',time());

These would both print out the following:

2009-04-03T11:49:00+01:00
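As a rough sketch of how this might be used when writing a sitemap entry, the helper below formats an arbitrary Unix timestamp rather than the current time. The function name w3cDatetime() and the UTC default are assumptions made for this example; DATE_W3C is PHP's built-in constant for the same 'Y-m-d\TH:i:sP' pattern.

```php
<?php
// Hypothetical helper: format a Unix timestamp as a W3C Datetime string.
// The function name and the UTC default are assumptions for this sketch.
function w3cDatetime($timestamp, $timezone = 'UTC') {
    $date = new DateTime('@' . $timestamp);          // '@' prefix means "Unix timestamp"
    $date->setTimezone(new DateTimeZone($timezone)); // shift to the requested zone
    return $date->format(DATE_W3C);                  // same pattern as 'Y-m-d\TH:i:sP'
}

// Example: a <lastmod> element for a page last changed at timestamp 1238755740.
echo '<lastmod>' . w3cDatetime(1238755740) . '</lastmod>' . "\n";
```

Passing a timezone name such as 'Europe/London' as the second argument gives the local offset instead of +00:00.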

Categories: PHP

Convert A sitemap.xml File To A HTML Sitemap With PHP

August 13th, 2008

I have already talked about converting a sitemap.xml file into a urllist.txt file, but what if you want to create an HTML sitemap? If you have a sitemap.xml file then you can use it to spider your site, scrape the contents of each page and populate the HTML file with this information.

The following code does this. For every page it looks for the <title> tag, the description meta tag and the first <h2> tag on the page. These items are then used to construct a segment of HTML for that page.

<?php
$header = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>HTML Sitemap</title>
</head>
<body>';
 
set_time_limit(400);
 
$currentElement = '';
$currentLoc = '';
 
$map = "<h1>HTML Sitemap</h1>"."\n";
 
function parsePage($data){
 global $map;
 /*
 if you want to trap a certain file extension then use the syntax below...
 stripos($data,".php")>0
 stripos($data,".htm")>0
 stripos($data,".asp")>0
 */
 if(stripos($data,".pdf")>0){
  // if the url is a pdf document.
  $map .= '<p><a href="'.$data.'">PDF document.</a></p>'."\n";
  $map .= '<p>A pdf document.</p>'."\n";
 }elseif(stripos($data,".txt")>0){
  // if the url is a text document
  $map .= '<p><a href="'.$data.'">Text document.</a></p>'."\n";
  $map .= '<p>A text document.</p>'."\n";
 }else{
  // try to open it anyway...
  // make sure that you can read the file
  if($urlh = @fopen($data, 'rb')){
   $contents = '';
   // check php version (stream_get_contents() needs PHP 5 or later)
   if(version_compare(PHP_VERSION,'5.0.0','>=')){
    $contents = stream_get_contents($urlh);
   }else{
    while(!feof($urlh)){
     $contents .= fread($urlh, 8192);
    };
   };
 
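   // find the title (i = case-insensitive, s = allow line breaks inside the tag)
   // note: no U modifier, as it would turn the lazy .*? back into a greedy match
   preg_match('/<title[^>]*>\s*(.*?)\s*<\/title>/is',$contents,$title);
   $title = isset($title[1]) ? $title[1] : '';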
 
   // find the first h2 tag
   $header = array();
   preg_match('/<h2[^>]*>(.*?)<\/h2>/is',$contents,$header);
   $header = isset($header[1]) ? strip_tags($header[1]) : '';
 
   if(strlen($title)>0 && strlen($header)>0){
    // print the title and h2 tag in combo
    $map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.(strlen($header)>0?trim($header):trim($title)).'">'.trim($title).(strlen($header)>0?" - ".trim($header):'').'</a></p>'."\n";
   }elseif(strlen($title)>0){
    $map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($title).'">'.trim($title).'</a></p>'."\n";
   }elseif(strlen($header)>0){
    $map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($header).'">'.trim($header).'</a></p>'."\n";
   };
 
   // find description (assumes name comes before content in the meta tag)
   preg_match('/<meta\s+name="description"\s+content="(.*?)"\s*\/?>/is',$contents,$description);
   $description = isset($description[1]) ? $description[1] : '';
 
   // print description
   if(strlen($description)>0){
    $map .= '<p class="desc">'.trim($description).'</p>'."\n";
   };
   // close the file
   fclose($urlh);
  };
 };
};
 
/////////// XML PARSE FUNCTIONS HERE /////////////
// the start element function
function startElement($xmlParser,$name,$attribs){
 global $currentElement;
 $currentElement = $name;
};
 
// the end element function
function endElement($parser,$name){
 global $currentElement,$currentLoc;
 if($currentElement == 'loc'){
  parsePage($currentLoc);
  $currentLoc = '';
 };
 $currentElement = '';
};
 
// the character data function
function characterData($parser,$data){
 global $currentElement,$currentLoc;
 // if the current element is loc then it will be a url
 if($currentElement == 'loc'){
  $currentLoc .= $data;
 };
};
 
// create parse object
$xml_parser = xml_parser_create();
// turn off case folding!
xml_parser_set_option($xml_parser,XML_OPTION_CASE_FOLDING, false);
// set start and end element functions
xml_set_element_handler($xml_parser,"startElement","endElement");
// set character data function
xml_set_character_data_handler($xml_parser,"characterData");
 
// open xml file
if(!($fp = fopen('sitemap.xml',"r"))){
 die("could not open XML input");
};
 
// read the file - print error if something went wrong.
while($data = fread($fp,4096)){
 if(!xml_parse($xml_parser,$data,feof($fp))){
  die(sprintf("XML error: %s at line %d",xml_error_string(xml_get_error_code($xml_parser)),xml_get_current_line_number($xml_parser)));
 };
};
 
// close file
fclose($fp);
 
$footer = '</body>
</html>';
 
// write output to a file
$fp = fopen('sitemap.html',"w+");
fwrite($fp,$header.$map.$footer);
fclose($fp);
 
// print output
echo $header.$map.$footer;
?>

This script prints out the sitemap and also saves it to sitemap.html for later use. Saving the output is essential because the script has to fetch every page listed in sitemap.xml, so it can take a long time to run.
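One way to take advantage of the saved file is to serve it while it is still recent and only run the full crawl when it is missing or old. The following is a minimal sketch of that idea; the helper name isCacheFresh() and the one day lifetime are assumptions for illustration and are not part of the script above.

```php
<?php
// Hypothetical helper: true when $path exists and was modified
// no more than $maxAge seconds ago.
function isCacheFresh($path, $maxAge) {
    clearstatcache(); // make sure file checks are not served from PHP's stat cache
    if (!is_file($path)) {
        return false;
    }
    return (time() - filemtime($path)) <= $maxAge;
}

$cacheFile = 'sitemap.html'; // file name written by the sitemap script
$maxAge = 86400;             // assumed lifetime: one day

if (isCacheFresh($cacheFile, $maxAge)) {
    // Serve the saved copy instead of crawling every page again.
    readfile($cacheFile);
} else {
    // Stale or missing: this is where the full script would be run.
    echo "Cache is stale; regenerate sitemap.html\n";
}
```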

This script is fairly complicated and has gone through several versions since I first created it, so if you find any bugs or improvements then let me know and I will incorporate them.
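The per-page lookups (title, description meta tag and first <h2> tag) can be tried in isolation before running the whole crawl. The sketch below is a simplified take on them, not a replacement for the script: the function name extractPageInfo() and the sample HTML are made up for the example, and it assumes the meta tag lists name before content.

```php
<?php
// Hypothetical helper: pull the title, meta description and first <h2> out of
// an HTML string. Anything that is not found comes back as an empty string.
function extractPageInfo($html) {
    $info = array('title' => '', 'description' => '', 'heading' => '');

    // i = case-insensitive, s = let "." span line breaks
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $info['title'] = trim($m[1]);
    }
    // Assumes the name attribute comes before content.
    if (preg_match('/<meta\s+name="description"\s+content="(.*?)"/is', $html, $m)) {
        $info['description'] = trim($m[1]);
    }
    // Lazy match, so only the first <h2> is captured.
    if (preg_match('/<h2[^>]*>(.*?)<\/h2>/is', $html, $m)) {
        $info['heading'] = trim(strip_tags($m[1]));
    }
    return $info;
}

$sample = '<html><head><TITLE> Home </TITLE>'
        . '<meta name="description" content="Welcome page" /></head>'
        . '<body><h2>First <em>heading</em></h2><h2>Second</h2></body></html>';

print_r(extractPageInfo($sample));
```

Regular expressions are only a rough fit for HTML; pages with unusual attribute order or quoting will need a more tolerant pattern or a real parser.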

Categories: PHP

Convert A sitemap.xml File To A urllist.txt File Using PHP

August 12th, 2008

If you already have a script that produces a sitemap.xml file, there is little point in adapting that script so that it also creates a urllist.txt file. A simpler solution is to use the sitemap.xml file itself to create the urllist.txt. The following script will do exactly this.

$lines = file('sitemap.xml');
$allMatches = array();
 
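foreach($lines as $line_number => $line){
 $line = trim($line);
 // no U modifier: it would make .*? greedy and merge several <loc> tags on one line
 preg_match_all('/(?<=<loc>)(.*?)(?=<\/loc>)/',$line,$matches,PREG_SET_ORDER);
 // keep every match on the line, not just the first one
 foreach($matches as $match){
  if($match[0] != ''){
   $allMatches[] = $match[0];
  };
 };
};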
 
$list = '';
foreach($allMatches as $url){
 $list .= $url."\n";
};
$fh = fopen('urllist.txt',"w+");
fwrite($fh,$list);
fclose($fh);
 
// print out list to provide some feedback...
echo $list;

The script works by first loading the sitemap.xml file into an array of lines using the file() function. It then goes through each line, picks out everything between the <loc> tags and collects these URLs in a second array. Finally it writes the URLs to a file called urllist.txt, one per line, and also prints them out to give some indication that the script has run. The final echo can be removed if you want to incorporate the code into a larger script.
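The same idea can be wrapped in a function that works on the whole file contents at once, which also copes with sitemaps written on a single line. This is only a sketch: the name extractLocs() and the sample data are hypothetical, and decoding &amp; back to & for the text list is an assumption about what the consumer of urllist.txt expects.

```php
<?php
// Hypothetical helper: return every <loc> value found in a sitemap string.
function extractLocs($xml) {
    // Lazy match, so several <loc> elements on one line stay separate.
    preg_match_all('/<loc>\s*(.*?)\s*<\/loc>/s', $xml, $matches);
    $urls = array();
    foreach ($matches[1] as $url) {
        if ($url !== '') {
            // Sitemaps escape "&" as "&amp;"; decode it for the plain text list.
            $urls[] = html_entity_decode($url, ENT_QUOTES);
        }
    }
    return $urls;
}

$sample = "<urlset>\n"
        . "<url><loc>http://www.example.com/</loc></url>"
        . "<url><loc>http://www.example.com/page.php?a=1&amp;b=2</loc></url>\n"
        . "</urlset>";

echo implode("\n", extractLocs($sample)) . "\n";
```

With a real file the call would be something like extractLocs(file_get_contents('sitemap.xml')), followed by the same fwrite() step as the script above.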

Categories: PHP