Archive for the ‘Zend Framework’ Category

Excel Document Scanning With Zend_Search_Lucene

May 11th, 2009 Tech 1 comment

Zend_Search_Lucene offers some powerful document scanning capabilities, and there are a few different formats that are useful for the search engine to index.

To allow the indexing and searching of Excel documents using Zend_Search_Lucene you need to use the Zend_Search_Lucene_Document_Xlsx class. However, to use this class you must have the Zip module installed with PHP. For Windows users this means editing your php.ini file and uncommenting the following line:

extension=php_zip.dll

Linux users will need to recompile PHP with the --enable-zip configure option.
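Before indexing anything it can be worth confirming that the Zip module is actually loaded. This is a minimal sketch using PHP's built-in extension_loaded() function; the $zipAvailable variable name and the messages are my own.

```php
<?php
// extension_loaded() is a core PHP function; 'zip' is the name the Zip module registers.
$zipAvailable = extension_loaded('zip');

echo $zipAvailable
    ? "zip extension is available\n"
    : "zip extension is missing, so .xlsx files cannot be read\n";
```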

Create and/or open the index in the normal way and you can index Excel documents using the following code.

$filename = 'C:\Book1.xlsx';
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
$index->addDocument($doc);

You can now set up a query and search for the document in the following way, although you would normally expect the input string to be some kind of user input.

$queryStr = 'wibble';
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
 
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery, true);
 
 
$hits = $index->find($query);
 
foreach ( $hits as $hit ) {
    echo $hit->score.'<br />';
    echo $hit->filename.'<br />';
}

The score is always available on a hit object. Other properties that can be displayed are filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified and created. However, some of these depend on the contents of the document. Keywords and subjects, for example, are optional in an Excel document, so you will need to check that the property exists before displaying it. The following code checks for the keywords property before trying to print it out.

if ( isset($hit->keywords) ) {
    echo $hit->keywords.'<br />';
}
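The same check can be applied to a whole list of optional properties at once. The following is only a sketch: presentFields() is a hypothetical helper, a stdClass object stands in for the hit so that the code runs without an index, and it assumes (as the snippet above does) that isset() reports whether a property exists on the hit.

```php
<?php
// Collect the optional properties that are actually set, escaped for HTML output.
function presentFields($hit, array $fieldNames)
{
    $found = array();
    foreach ($fieldNames as $name) {
        if (isset($hit->$name)) {
            $found[$name] = htmlspecialchars((string) $hit->$name, ENT_QUOTES, 'UTF-8');
        }
    }
    return $found;
}

// Stand-in for a hit object: only title and creator are present.
$hit = new stdClass();
$hit->title = 'Book1';
$hit->creator = 'A & B';

$optional = array('title', 'subject', 'creator', 'keywords', 'description');
foreach (presentFields($hit, $optional) as $name => $value) {
    echo $name . ': ' . $value . '<br />';
}
```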

By default, this function indexes the document metadata and tokenises the contents, storing the tokens in the index. The loadXlsxFile() function has a second optional parameter, which defaults to false. If this is set to true the contents of the Excel document are also stored in the index, and you can then use the following code to print them out.

echo $hit->body.'<br />';

Bear in mind that this output will not contain any row or column information and will therefore look like a dump of the data.

Using mod_rewrite And Zend Framework To Display Dynamic sitemap.xml

April 3rd, 2009 Tech 1 comment

Whilst creating a site the other day I thought about how I would manage the sitemap.xml file. This file is basically an XML file containing a list of URLs. Most major search engines understand (and look for) this file, so having it present on a site is a must.

I have been down the route before of having a sitemap.xml file created by the application every time a new record or something was added, but as this was a high traffic, multi-user site this approach just had too many problems. The main problem (aside from the potential performance hit) was that I would have to spend hours tying the calls to the sitemap.xml creation file into my application.

I then hit upon the idea of using a RewriteRule that would mask a controller as the sitemap.xml file. This would mean that the sitemap.xml controls could be kept away from all other parts of the application (so I could use the same template again), but I could also use Zend_Cache to cache the sitemap.xml file daily and therefore save on processing time.

First I needed to create a RewriteRule that would redirect a call to sitemap.xml to the Sitemap controller.

RewriteRule ^(.*)sitemap.xml$ /sitemap/index [L]
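One detail worth knowing about rewrite patterns is that a bare dot matches any character, so escaping it as \. restricts a rule to the literal file name. PHP's preg functions use the same PCRE syntax as mod_rewrite, so the difference can be checked with a short script. This is just an illustration; the test strings are made up.

```php
<?php
$loose  = '/^(.*)sitemap.xml$/';   // bare dot: matches any single character
$strict = '/^(.*)sitemap\.xml$/';  // escaped dot: matches a literal "." only

var_dump(preg_match($loose, 'sitemap.xml'));   // int(1)
var_dump(preg_match($loose, 'sitemapQxml'));   // int(1), probably not intended
var_dump(preg_match($strict, 'sitemap.xml'));  // int(1)
var_dump(preg_match($strict, 'sitemapQxml'));  // int(0)
```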

Next I created the Sitemap controller and made sure that the index action did not show the layout. The URLs are passed as an array to the view.

class SitemapController extends Zend_Controller_Action
{
    public function indexAction()
    {
        $this->_helper->layout()->disableLayout();
        $urls = array(array('loc'=>'http://www.talkincode.com/', 'lastmod'=>'2009-04-02T11:34:48+00:00', 'changefreq'=>'daily', 'priority'=>'1.0'));
        
        $this->view->urls = $urls;
    }
}

In order for the Sitemap controller to display anything it needs to have a view to render. This creates the basic outline of the file and uses the partialLoop() function to print out the array of URLs.

<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <?php echo $this->partialLoop('sitemap/_urlItem.phtml',$this->urls); ?>
</urlset>

Here is the file _urlItem.phtml, which gets rendered for every item in the $this->urls array.

<url> 
    <loc><?php echo $this->loc; ?></loc> 
    <lastmod><?php echo $this->lastmod; ?></lastmod> 
    <changefreq><?php echo $this->changefreq; ?></changefreq> 
    <priority><?php echo $this->priority; ?></priority> 
</url>
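The partial above prints each value as it is, which is fine for the single URL used here, but a URL containing an ampersand would produce invalid XML. As a sketch of the escaping involved, here is the same output built in plain PHP. The renderUrlItem() function is hypothetical and the query string on the sample URL is invented for the example.

```php
<?php
// Render one <url> entry, escaping every value for use in XML.
function renderUrlItem(array $url)
{
    $xml = "<url>\n";
    foreach (array('loc', 'lastmod', 'changefreq', 'priority') as $key) {
        if (isset($url[$key])) {
            $value = htmlspecialchars((string) $url[$key], ENT_QUOTES | ENT_XML1, 'UTF-8');
            $xml .= "    <$key>$value</$key>\n";
        }
    }
    return $xml . "</url>\n";
}

echo renderUrlItem(array(
    'loc'        => 'http://www.talkincode.com/?a=1&b=2',
    'lastmod'    => '2009-04-02T11:34:48+00:00',
    'changefreq' => 'daily',
    'priority'   => '1.0',
));
```

Inside a view script the same escaping could be applied to each value before it is echoed.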

I assumed that everything would be working nicely now, but when I went to a browser and tried to find sitemap.xml I was presented with a message that said sitemap.xml was an invalid controller.

As a test, I added a redirect flag to the end of the RewriteRule to make sure that the rule was matching.

RewriteRule ^(.*)sitemap.xml$ /sitemap/index [R=301,L]

This redirected to the correct place, so it must have been Zend Framework that was causing the error. After a bit of thinking I realised that I could create a route that would send the request for the missing sitemap.xml controller to the existing Sitemap controller. Here is the route I created; just add it to your bootstrap file.

$router = $frontController->getRouter();
$router->addRoute(
    'manageSitemap',
    new Zend_Controller_Router_Route('sitemap.xml', array('controller'=>'sitemap','action'=>'index'))
);

Navigating to sitemap.xml now shows me the output of the Sitemap controller.

A Simple Introduction To Zend_Cache

April 2nd, 2009 Tech No comments

The Zend_Cache class is part of the Zend Framework and is used (as its name suggests) to cache things. This can be anything from the front end browser output to the outcome of a complex calculation or even the results of database queries. Zend_Cache is an enormous topic, not just how the class works, but what the best practices are for caching.

The best place to start with caching is one of the simpler topics of caching database queries. Normally, a call to a database table in Zend Framework might be done like this.

$houses = new Houses();
$result = $houses->fetchAll();

The result would then be processed. To use Zend_Cache instead of calling the database we first need to set up our Zend_Cache object so that we can use it. To do this we need to call the Zend_Cache static function factory() with a few parameters, which will give us a cache object. Here is a typical example.

$query_cache = Zend_Cache::factory('Core', 'File', $frontendoptions, $backendoptions);

The parameters are as follows:

  • 'Core' - This can be one of a number of options that dictate what sort of things are cached on the frontend. The value is mapped to a class: in this case Zend_Cache_Core, while the other values map to Zend_Cache_Frontend_* classes. The Zend_Cache_Core class is best used for database calls because there is no specific frontend class that deals with database calls.
  • 'File' - This indicates where the cache is to be stored in the backend. Again this value maps to a class, in this case Zend_Cache_Backend_File. In most cases the Zend_Cache_Backend_File class is the simplest and easiest option to use.
  • $frontendoptions - This is an array of options that relates to the frontend class you have chosen.
  • $backendoptions - This is an array of options that relates to the backend class you have chosen.

The following code sets up an instance of Zend_Cache using some common parameters. Note that different frontend and backend classes have a different set of parameters, but the parameters used below are for the Core frontend and the File backend. The APPLICATION_PATH constant just points to our application folder.

$frontendoptions = array(
    'lifetime' => 60 * 5, // 5 minutes
    'automatic_serialization' => true
);
$backendoptions = array(
    'cache_dir' => APPLICATION_PATH . '/cache/',
    'file_name_prefix' => 'zend_cache_query',
    'hashed_directory_level' => 2
);
$query_cache = Zend_Cache::factory('Core', 'File', $frontendoptions, $backendoptions);

Here is an explanation of the frontend options used.

  • lifetime - The number of seconds for which a cache entry is valid. Once an entry is older than this it is treated as expired. This can be set to null if we want the cache to last forever.
  • automatic_serialization - If set to true this will automatically serialise the cache data. This allows you to store complex data like objects and arrays. If you are storing a numeric value or text string only then you can set this to false.
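To make the lifetime rule concrete, here is a small illustration in plain PHP. It is not Zend_Cache's own code, just a sketch of the comparison described above; isExpired() and the timestamps are made up for the example.

```php
<?php
// Decide whether an entry has outlived its lifetime.
// A null lifetime means the entry never expires.
function isExpired($createdAt, $lifetime, $now)
{
    if ($lifetime === null) {
        return false;
    }
    return ($now - $createdAt) > $lifetime;
}

$lifetime = 60 * 5; // 5 minutes

var_dump(isExpired(1000, $lifetime, 1200)); // bool(false): 200 seconds old
var_dump(isExpired(1000, $lifetime, 1400)); // bool(true): 400 seconds old
var_dump(isExpired(1000, null, 999999));    // bool(false): never expires
```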

Here is an explanation of the backend options used.

  • cache_dir - This is the directory that the cache is to be kept in. The default for this is /tmp/ but it is best to keep the cache within the application folder so that you can manage the files manually if need be.
  • file_name_prefix - This sets the start of the filename to be used. Because I want to cache database queries I have selected zend_cache_query as my prefix.
  • hashed_directory_level - Some file systems have great difficulty handling lots of files in a single directory. This option splits the cache into different levels or directories. The default is 0, but for this example I have selected 2. This means that our cache files will be stored inside 2 levels of directories.

To load a cache we use the load() function. This function takes a parameter that identifies the cache, but because we are getting all data from the houses table we don’t need to worry too much about this. If there is no cache with that name present then the function returns false. If this occurs we run our normal database query but in each case the $result variable will contain our data.

if ( !($result = $query_cache->load('allhouses')) ) {
    $houses = new Houses();
    $result = $houses->fetchAll();
    $query_cache->save($result, 'allhouses');
}

Once we have run the normal query we save the result to the cache using the save() function. This takes the data we want to save as the first parameter and the same cache name as the load() function as the second parameter. The next time the page is loaded the cache is loaded instead of calling the database.
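One thing to watch with this pattern is that load() returns false on a miss, so a cached value that is itself falsy (an empty array, for instance) would be treated as a miss and rebuilt every time. Comparing strictly against false avoids that. The sketch below wraps the idea in a hypothetical loadOrBuild() function. To keep it runnable without Zend Framework it uses ArrayCacheStandIn, a made-up in-memory class that only mimics the load() and save() calls used above.

```php
<?php
// Hypothetical stand-in for the cache object: same load()/save() call shapes,
// but it keeps everything in an array for the length of the script.
class ArrayCacheStandIn
{
    private $items = array();

    public function load($id)
    {
        return array_key_exists($id, $this->items) ? $this->items[$id] : false;
    }

    public function save($data, $id)
    {
        $this->items[$id] = $data;
        return true;
    }
}

// Return the cached value for $id, or build it with $builder and cache it.
function loadOrBuild($cache, $id, $builder)
{
    $result = $cache->load($id);
    if ($result === false) { // strict check: only a real miss triggers a rebuild
        $result = call_user_func($builder);
        $cache->save($result, $id);
    }
    return $result;
}

$cache = new ArrayCacheStandIn();
$calls = 0;
$builder = function () use (&$calls) {
    $calls++;        // counts how often the "query" runs
    return array();  // an empty result set
};

loadOrBuild($cache, 'allhouses', $builder);
loadOrBuild($cache, 'allhouses', $builder);
echo $calls; // 1
```

With the real cache object the builder would be the database query, for example a function that returns $houses->fetchAll().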

We can also cache single data rows in the same way by using a unique identifier for our cache name. Assuming that we have our house id we can do the following:

$cacheName = 'house'.$id;
if ( !($result = $query_cache->load($cacheName)) ) {
    $houses = new Houses();
    $result = $houses->fetchRow($houses->select()->where('id = ?', $id));
    $query_cache->save($result, $cacheName);
}
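Building the cache name from the id works well for numeric ids. As far as I know Zend_Cache only accepts names made of letters, digits and underscores, so if the identifier could contain anything else it is safer to hash it first. This is a sketch of one way to do that; houseCacheId() is a hypothetical helper and the sample ids are invented.

```php
<?php
// Build a cache name for a house row. md5() produces only hex characters,
// so the result is limited to letters, digits and the underscore.
function houseCacheId($id)
{
    return 'house_' . md5((string) $id);
}

echo houseCacheId(42) . "\n";
echo houseCacheId('north/12') . "\n";
```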

Note that if you want to do anything more than display the results of the query then you will need to access the database directly. It is not possible to interact with a database through the cached object.

Getting Started With Zend_Lucene

February 20th, 2009 Tech 1 comment

Zend_Lucene (the Zend_Search_Lucene component) is an implementation of the Lucene search engine in PHP5 and is included as part of the Zend Framework from version 1.6. Lucene implements all of the standard search engine query syntaxes (e.g. boolean and wildcard searches) and stores its index as files, so it doesn’t need a database server to run. Lucene can be used if you want to add search functionality to a site but don’t want to go down the route of building a querying syntax from scratch.

To get started with Lucene you need to create an index. The following code has the effect of creating a directory on your server that Lucene will use to store and retrieve documents.

$index = Zend_Search_Lucene::create('/data/my-index');

To open the index use the following code.

$index = Zend_Search_Lucene::open('/data/my-index');

Of course your index will not contain anything so the next step is to add some documents to it.

To create a new document you need to create a new document object. This is done using the Zend_Search_Lucene_Document class.

$doc = new Zend_Search_Lucene_Document();

You can then assign fields to this document using the static functions of the Zend_Search_Lucene_Field class.

$doc->addField(Zend_Search_Lucene_Field::Text('title', 'The title of the document'));
$doc->addField(Zend_Search_Lucene_Field::Text('contents', 'The contents of the document.'));

You can also use binary data, which is useful if you have used a document scanning service and want to be able to search the data at a later date.

$doc->addField(Zend_Search_Lucene_Field::Binary('originalfile', $filedata));

Any binary data you assign like this isn’t tokenized or indexed, but it is stored in the index, so you will need to add other fields so that the document can be searched for.

Once you have added your fields you can add the document using the addDocument() function of the opened index object.

$index->addDocument($doc);

If you are building a search index for a site then you might want to use the built-in HTML parsing functionality. This makes it easy for you to add either an HTML string or an HTML filename that Lucene will then index. You then add this document to the index using the addDocument() function of the opened index object. Note that when adding documents in this way you should also add the URL of the document as a field so that you can retrieve it later.

$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile('http://www.talkincode.com/');
$doc->addField(Zend_Search_Lucene_Field::Text('url','http://www.talkincode.com/'));
$index->addDocument($doc);

You can also index and search Word, Excel and PowerPoint documents in much the same way as this.

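Once you have the index you can search it using an opened index object. You can also find out how big your index is and how many documents it holds by using the count() and numDocs() functions respectively: count() returns the size of the index, which includes deleted documents, while numDocs() returns the number of documents that have not been deleted.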

$indexSize = $index->count();
$documents = $index->numDocs();

To construct a query and implement boolean and wildcard searching you need to use the Zend_Search_Lucene_Search_QueryParser class. The parsed query is then passed to a Zend_Search_Lucene_Search_Query_Boolean object using the addSubquery() function.

$queryStr = 'talk';
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
 
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery, true);
 
// do the search
$hits = $index->find($query);

The variable $hits now contains an array of Zend_Search_Lucene_Search_QueryHit objects. Each object has a property called score, which is the score of the hit result. The score is an indication (between 0 and 1) of how closely the query matched the index. The first item in the $hits array will have the highest score value. Every field that you defined for the document whilst indexing is now presented as a property of this object. So if you set a URL field for your document you can see a list of your documents using the following code:

$hits = $index->find($query);
foreach ($hits as $hit) {
    echo $hit->score.'<br />';
    echo $hit->url.'<br />';
}

Lucene can do a lot more than what I have briefly detailed here so I might write some posts in the future on how to refine updating, indexing and searching.

Using Redirection Outside Of A Controller In Zend Framework

February 2nd, 2009 Tech No comments

I had a situation the other day where I had an application in Zend Framework and I wanted to redirect a user to another page. This is fine if you are inside a controller as you can use the _redirect() controller function, but in this instance I was in a plugin that didn’t have direct access to the controller.

The solution is to use the getResponse() method, which is accessible to plugins and retrieves the response object. The response object has a function called setRedirect() that is used to redirect. Any headers that have been issued will be overwritten by this function. The following code can be run inside your Zend Framework plugins to redirect the user to a different page.

$this->getResponse()->setRedirect('http://another/page/', 301);

The setRedirect() function takes two parameters. The first is the URL to be redirected to and the second is the HTTP response code, which is 302 if omitted.