Wednesday, June 29th, 2005

Google Fame New Start

UPDATE: Although it might be interesting from a technical point of view, I had to abandon the programming effort I describe herein because I found out that it violates the Google Terms of Service.

Well, as I discussed earlier, I have explored using PHP to generate Google searches and parse the resulting HTML to find the information I want. Parsing means taking a big chunk of data and analyzing it to extract the information you need. In the case of my Google Fame plugin, I want to send search requests to Google and search through the results to find any links that go to my website. I also want to note Google’s estimate of how many results there are, and how far down the list the first reference to my website is.

Normally you would use a browser to connect to Google. Once connected, you can enter your search terms, maybe adjust a few parameters (like requesting only pages that are in English) and search. The Google homepage is actually a form that you fill in first. Alternatively, you might use the Google Toolbar in Internet Explorer or the Googlebar extension in Mozilla or Firefox to pre-enter the form information and skip the Google front page. Either way, you’re sending a formatted request to the Google site; Google analyzes that request, finds what you want, and sends the information back as HTML that your browser renders as a webpage.

I’m writing code that skips all the interactive, user-visible steps. I’m going to set up my own interface to determine what the search terms and parameters should be and convert them into a call to Google. For example, if I want to search using the terms “greg perry” and “san diego”, and I want 100 results each time, I could send the string

http://www.google.com/search?num=100&q=%22greg+perry%22+%22san+diego%22

to Google and it would send me the results back. There are actually many more variables I can put into the request I send to Google, but I’ll keep it simple here.
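
As a rough sketch of how a request string like the one above could be assembled (written in Python here rather than the PHP this post uses; the function name `build_search_url` is made up for illustration), the phrases are quoted, joined, and URL-encoded. Nothing is actually sent anywhere; the code only builds the string.

```python
from urllib.parse import urlencode


def build_search_url(phrases, num=100):
    """Build a search URL from a list of exact-match phrases."""
    # Wrap each phrase in double quotes and join the phrases with spaces.
    query = " ".join(f'"{phrase}"' for phrase in phrases)
    # urlencode percent-encodes the quotes as %22 and turns spaces into "+".
    return "http://www.google.com/search?" + urlencode({"num": num, "q": query})


url = build_search_url(["greg perry", "san diego"])
print(url)
```

Letting the library do the encoding avoids hand-typing `%22` and `+`, which gets error-prone once more parameters are added.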

But I don’t want to look at the results, I want my program to look at them for me and find what I want. So instead of letting my browser display the results, I capture the information that Google returns and put it into a string. It’s a long string – about 100,000 characters, or 100 KB – but nowadays that’s not a problem. Then I use other commands to search through the long string to find the information I want. At this point in time, Google will only give me up to 100 results per search, so if my website isn’t in the first batch, I have to generate another call asking for the next 100, and so on, until I find my site, or reach the end of the list and determine that my site didn’t make it.

The commands in PHP use a complex pattern-matching technique, popularized by the programming language Perl, called “regular expressions” or regex. I found regex to be pretty difficult to understand at first, but using tutorials and samples I found on the web, I was able to cobble together some workable code. It may not be elegant, and it might give errors if unexpected results are encountered, but it’s enough for now.
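
To give a feel for what searching the long string with regular expressions might look like, here is a minimal sketch in Python (not the PHP code described in this post). The sample markup, the two patterns, and the helper names are all invented for illustration; a real results page is far messier and its layout changes over time.

```python
import re

# Hypothetical, simplified results markup (an assumption, not Google's real HTML).
SAMPLE_HTML = (
    '<p>Results 1 - 3 of about 12,400 for "greg perry".</p>'
    '<a class="r" href="http://example.com/a">A</a>'
    '<a class="r" href="http://www.example.org/greg">B</a>'
    '<a class="r" href="http://example.net/c">C</a>'
)

# Capture the href of each result link, and the estimated result count.
LINK_RE = re.compile(r'<a class="r" href="([^"]+)"')
COUNT_RE = re.compile(r"of about ([\d,]+)")


def parse_results(html):
    """Return (list of result URLs, estimated total or None)."""
    links = LINK_RE.findall(html)
    match = COUNT_RE.search(html)
    estimate = int(match.group(1).replace(",", "")) if match else None
    return links, estimate


def first_position(links, domain):
    """Return the 1-based position of the first link containing domain, or 0."""
    for position, link in enumerate(links, start=1):
        if domain in link:
            return position
    return 0


links, estimate = parse_results(SAMPLE_HTML)
print(estimate, first_position(links, "example.org"))
```

The substring test in `first_position` is deliberately crude; it would also match a domain that merely appears somewhere in another site's URL.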

Now, the Google API is designed to allow programmers to do this sort of thing without parsing the HTML. You make program calls to the API and get preformatted results back from Google. The trouble that I found is that the results don’t match, and are sometimes seriously different from what you get if you go to Google in your browser and send in the same information. You also have to register with Google when you download the API and get a key that you have to include with your API calls. The key allows Google to associate those API calls with you, and the number of calls you can make in a day is limited. So for me, the API isn’t worth the programming steps that it saves.

What’s an API, you might ask? Here’s a good definition from Arizona State University:

Short for Application Program Interface, API is a set of routines, protocols, and tools for building software applications. A good API makes it easier to develop a program by providing all the building blocks. A programmer puts the blocks together. Most operating environments, such as MS-Windows, provide an API so that programmers can write applications consistent with the operating environment. Although APIs are designed for programmers, they are ultimately good for users because they guarantee that all programs using a common API will have similar interfaces. This makes it easier for users to learn new programs.

So far, I’ve created a program that uses a simple form to set up the Google search and go through the results. The program is set up to spit a lot of info back, including the contents of various variables I use so I can check that my code is functioning properly, and the complete listing of all the websites that Google found that match my search request. My program counts and numbers the matches and sets a flag when my site is found in the results. It stops sending calls to Google when it finds a reference to my site, or if I hit the limit of how many results Google is willing to provide. (While playing around with this, I found that Google doesn’t seem to give you more than 1000 results, even if it tells you that there are a lot more.) Then it tells me my website ranked X out of XX results (zero if I didn’t appear at all) which is all I’m really looking to know. You can play with my program so far if you want.
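
The control flow described above (count the matches, stop when the site is found or the results run out, report a rank of zero if the site never appears) can be sketched as follows. This is Python rather than the PHP of the actual program, and `fetch_page` is a hypothetical callable standing in for whatever retrieves and parses one batch of results; the demo uses a fake in-memory list so no requests are made.

```python
def find_rank(fetch_page, domain, page_size=100, max_results=1000):
    """Page through results until domain appears or the results run out.

    fetch_page(start, count) is a hypothetical callable that returns a list
    of result URLs beginning at offset start. Returns (rank, examined),
    where rank is 1-based and 0 means the domain was not found.
    """
    examined = 0
    for start in range(0, max_results, page_size):
        links = fetch_page(start, page_size)
        for link in links:
            examined += 1
            if domain in link:
                # Found: stop here, so no further pages are requested.
                return examined, examined
        if len(links) < page_size:
            # A short page means the end of the list was reached.
            break
    return 0, examined


# Fake data for the demo: 250 made-up results with "my" site at position 137.
fake_results = [f"http://site{i}.example/" for i in range(250)]
fake_results[136] = "http://www.mysite.example/blog/"


def fake_fetch(start, count):
    return fake_results[start:start + count]


rank, examined = find_rank(fake_fetch, "mysite.example")
print(f"ranked {rank} after examining {examined} results")
```

The `max_results=1000` default mirrors the ceiling observed above; it is an assumption about the limit, not a documented value.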

So now I’ve pretty much caught up to where I was when I started getting disgruntled with the Google API. Next, I want to expand the code I’ve written to interact with my online database so that it retrieves preset search terms from one place, gets the results, and stores them in another place in the database. Then I have to create the interfaces. I need two – one in my WordPress Administrator area (also called the “backend”) that allows me to put the search terms into the database; and one in my blog (the “frontend”) – I point again to the space I marked out in my right sidebar – so that people can see the results. Somewhere I have to have a way of telling my program when to run. My choices are to have it run once a day all by itself; to launch it from my backend; or to make my frontend interface figure out when it’s me looking at it, and offer me a way to run it while I’m there, without bothering to go into the backend.

The next step after that is to package up the various programs as a plugin and make them available to other WordPress users so that they can use it, too. This might be the hardest part – I also have to write programs to install the plugin and set up the required elements (such as the database tables I use) that I just manually set up on my own website. I also need to register the project with WordPress and create an area in my website to make the download available to the public. This area will probably have to include documentation of the plugin and a forum for users. I’ll have to offer technical support for people who have trouble getting it to install and work properly, and respond to feature requests from people that want it to do something else or do it differently. How much work that will require depends on how well I program it in the first place, how popular it gets, and how close I am to anticipating what other people want.

It could end up being a lot of work. So why bother? Here’s what I anticipate getting out of it:

I just spent a significant chunk of time documenting my efforts that I could have used for programming. Well, that’s ok, too. I enjoy writing as well.

Posted by Greg in My Website, Programming


This entry was posted on Wednesday, June 29th, 2005 at 12:41 PST and is filed under My Website, Programming.

One Response to “Google Fame New Start”

  1. Ramblings » Blog Archive » Contact from NACE says:

    […] So I looked carefully at the online database and saw a way to prise the information about all licensed Corrosion PE’s, past and present, from the interface; including names, addresses (which I’m assuming are business), and license status. All I need to do is write a script that will retrieve information pages for each individual for Corrosion PE license number 1 through 1087 (the last one apparently issued), and dump the results into a comma delimited file, filtered via regular expressions, which I can then import into a database such as OpenOffice Base. This is all very much like what I had already done with my Google automated search plugin for WordPress, before that project came to a screeching stop when I learned that automated searches of Google violated their Terms of Service. […]