Archive

Archive for May, 2009

Extract Keywords From A Text String With PHP

May 21st, 2009 1 comment

A common issue I have come across in the past is that I have a CMS system, or an old copy of WordPress, and I need to create a set of keywords to be used in the meta keywords field. To solve this I put together a simple function that runs through a string and picks out the most commonly used words in that list as an array. This is currently set to be 10, but you can change that quite easily.

The first thing the function defines is a list of "stop" words. This is a list of words that occur quite a bit in English text and would therefore interfere with the outcome of the function. The function also uses a variant of the slug function to remove any odd characters that might be in the text.

function commonWords($string){
    $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
 
    $string = preg_replace('/ss+/i', '', $string);
    $string = trim($string); // trim the string
    $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
    $string = strtolower($string); // make it lowercase
 
    preg_match_all('/([a-z]*?)(?=s)/i', $string, $matchWords);
    $matchWords = $matchWords[0];
    foreach ( $matchWords as $key=>$item ) {
        if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
            unset($matchWords[$key]);
        }
    }
    $wordCountArr = array();
    if ( is_array($matchWords) ) {
        foreach ( $matchWords as $key => $val ) {
            $val = strtolower($val);
            if ( isset($wordCountArr[$val]) ) {
                $wordCountArr[$val]++;
            } else {
                $wordCountArr[$val] = 1;
            }
        }
    }
    arsort($wordCountArr);
    $wordCountArr = array_slice($wordCountArr, 0, 10);
    return $wordCountArr;                
}

Here is an example of the function in action.

$text = "This is some text. This is some text. Vending Machines are great.";
$words = commonWords($text);
echo implode(',', array_keys($words));

This produces the following output.

some,text,machines,vending

Excel Document Scanning With Zend_Search_Lucene

May 11th, 2009 2 comments

Zend_Search_Lucene offers some powerful document scanning capabilities, and there are a few different formats that are useful for the search engine to index.

To allow the indexing and searching of Excel documents using Zend_Search_Lucene you need to use the Zend_Search_Lucene_Document_Xlsx class. However, to use this class you must have the Zip module installed with PHP. For Windows users this means editing your php.ini file and uncommenting the following line:

extension=php_zip.dll

For Linux users you will need to recompile PHP with the –enable-zip configure option.

Create and/or open the index in the normal way and you can index Excel documents using the following code.

$filename = 'C:\Book1.xlsx';
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
$index->addDocument($doc);

You can now set up a query and search for the document in the following way, although you would normally expect the input string to be some kind of user input.

$queryStr = 'wibble';
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
 
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery, true);
 
 
$hits = $index->find($query);
 
foreach ( $hits as $hit ) {
    echo $hit->score.'<br />';
    echo $hit->filename.'<br />';
}

The score is always returned with a hit object. Other parameters available to display are filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, created. However, some of these depend on the contents of the document. It is possible to add keywords and subjects to an Excel document, so if they are not present then you will need to check for the existence of that parameter before displaying it. The following code looks for the existence of the keyword parameter before trying to print it out.

if ( isset($hit->keywords) ) {
    echo $hit->keywords.'<br />';
}

By default, this function indexes the document meta data and will tokenise and store the tokens as an index. The loadXlsxFile() function has a second optional parameter which is by default set to false. If this is set to true the contents of the Excel document will be included in the index. You can then use the following code to print out the contents of the document.

echo .$hit->body.'<br />';

Bear in mind that this output will not contain any row or column information and will therefore look like a dump of the data.

Multi Page Forms In PHP

May 5th, 2009 1 comment

Multi pages forms are just as they sound, a single form spread across multiple pages. These are useful in terms of usability as it can break up an otherwise dauntingly big form into smaller chunks. It can also be useful if you want to process some of the results in order to determine what forms the user sees on later steps.

There are two ways in which it is possible to do this using PHP.

The first (and simplest) is just to cycle through the items submitted on a previous form and print them out as hidden fields. Our first page source code will look like this:

<form action="form2.php" method="get">
Name: <input type="text" name="name" />
<br />
<input type="submit" value="Proceed">
</form>

On submitting the form we are taken to form2.php, which asks the user a different question and prints out the hidden fields. Because we used a get request for our first form we need to use the $_GET array.

<form action="end.php" method="get">
Colour: <input type="text" name="colour" />
<br />
<?php
foreach ( $_GET as $key=>$value ) {
    if ( $key!="submit" ) {
        $value = htmlentities(stripslashes(strip_tags($value)));
        echo "t<input type="hidden" name="$key" value="$value">\n";
    }
}
?>
<input type="submit" value="Proceed">
</form>

This same code can be used on the different pages of the form. This method is fine, and works quite well, but it doesn’t account for users going back through the form and resubmitting a previous item. The $_GET array will only contain information about the previous forms.

To make this more user friendly, and robust, we need to employ a session to store our form values as the user goes through the form. Using sessions means that the user can cycle back and forward through the forms with no ill effect on the form data. The following code will take the input of the previous form and save it as a PHP session.

session_start();
foreach ( $_GET as $key=>$value ) {
    if ( $key!="submit" ) {
        $value = htmlentities(stripslashes(strip_tags($value)));
        $_SESSION[$key] = $value;
    }
}

This must be included on every page of our multi page form. If it is not then the data will simply not be saved for that step. Your users can now move back and forward through the forms, saving the information as they go. You need to supply a back button for this to work as using the browser back and forward buttons will also not save the data.

Finally, when testing this code I found that using GET rather than POST was beneficial in terms of usability. This is mainly because if you use POST requests and the user clicks the back on their browser they will be asked if they want to resubmit the information for that form.

Does anyone else have any ideas about how to do this? If so then post a comment and suggest it. You can even put a post in the forum!

Categories: PHP Tags: , , , , , , ,