Preserving the World's Knowledge - Available Anytime Anywhere

Part I – “Did You Know” you can make your searches smarter?

April 19th, 2012 by Juan J. Celaya

How many times have you wondered how to improve your search results? This is the first in a series of articles in which we will share some insight into what it takes to make your searches smarter.

Everyone wants a search to present the information they are looking for within the first 10 – 15 items of the hit list. For this to happen, at least two conditions must hold: (1) the search query must contain the information you are searching for and (2) the search engine must have the information that you are looking for. Many other things must happen as well, but if either of these two conditions fails you will never get the information back, much less in the top 10 – 15 hits of the hit list.

The canned answer to the question is, “you did not formulate your query in such a manner that it could find the relevant documents.” While this may ultimately be true, it is a cop-out. We really need to look at how the search engine treats each query and at how the data was indexed.

To help you answer the question posed by this article, we are going to share the things that we believe must happen to improve your results. We will also share technical and background information on Enterprise Search Engines.

The focus of our DYK series will be on:

  1. Search engine basics.
  2. The search engine’s three main types of searches that can be used to retrieve information.
  3. Things to know about the words you use in your query.
  4. The impact of metadata or document properties in improving your search results.
  5. Things you can do to enhance the data you are searching before it is indexed by your search engine.
  6. Using your company’s own vocabulary to enhance your search engine’s results.
  7. How classification and the use of taxonomies can help in retrieving and enhancing your data.
  8. How to ensure access control to your data.

We will have more on Search Engine Basics later but here is some information to get you started.

Search engines are somewhat like cars: we can all use them, but do we really know how they work? At this point you might say, “I really don’t care how it works, just give me what I am looking for when I enter a query.” Again, as with cars, we don’t all have to be mechanics, but some basic knowledge of how a car works helps us get the most out of it. We believe this to be true of searching and queries as well.

First of all, a search engine will look at (aka “index”) all the data to be searched and generate a set of indexes that it will use to answer your queries. There are various search engine technologies, methodologies and data structures that improve the results of a search based on different types of mathematical algorithms. However, regardless of what a search engine uses to respond to a query, the most important measurement for a search engine is its ability to balance the quantity of the results with the quality of the results. In the search engine world this is referred to as Precision and Relevancy.
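To make the idea of "indexing" concrete, here is a minimal sketch of one common data structure, an inverted index, which maps each word to the documents that contain it. It is an illustration only, not how any particular product works; the function names (`build_index`, `search`) and the sample documents are made up for this example.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample data, used only for illustration.
documents = {
    1: "quarterly payroll report",
    2: "payroll policy update",
    3: "parts inventory report",
}
index = build_index(documents)
print(search(index, "payroll report"))  # {1}
```

The query is answered from the index alone, without rereading the documents, which is why anything that never made it into the index can never appear in the hit list.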

It is also important to know that in the process of indexing the data most search engines will remove what are referred to as “stop words”. These are words that are used often but do not add any value to the retrieval process. So when forming queries there is really no need to enter the common words that we use in everyday text, since the search engine will drop “stop words” from the query as well. By dropping “stop words”, accuracy is increased and the size of the indexes is reduced. “Stop words” are words such as “the”, “a”, “an” and “and”, as well as contractions such as “can’t” and “couldn’t”.
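As a rough sketch of stop word removal, the snippet below applies the same small filter to any text, whether it is a document being indexed or a query. The stop word list contains only the examples named above; real search engines ship much longer lists, and the function name `remove_stop_words` is an assumption made for this illustration. Punctuation handling is deliberately left out.

```python
# Only the examples mentioned in the article; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "can't", "couldn't"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace and drop stop words."""
    # Normalize curly apostrophes so "can’t" matches "can't".
    words = text.lower().replace("’", "'").split()
    return [word for word in words if word not in STOP_WORDS]

print(remove_stop_words("The invoice and the receipt"))  # ['invoice', 'receipt']
```

Because the same filter runs on both sides, typing "the invoice and the receipt" or just "invoice receipt" leads to the same lookup.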

We leave you with the following as “food for thought”:

  1. There is more to successfully using a search engine to retrieve information than just indexing what you have.
  2. If you want to search for relevant information you need to ensure that all the possible relevant information is indexed.
  3. If you index all your information how do you ensure that access is restricted to only those who are allowed to get to the data?

We will share our answers to these and many other questions in future DYK articles as we move through this DYK series. If you have any suggestions or questions you would like us to address, we will be happy to hear from you.

Three Common Misconceptions Regarding Document & Data Capture

February 24th, 2012 by Bill LaPorte

It dawned on me the other day that in one way or another, I have been capturing data from myriad sources for almost 30 years. I’ve captured seismic data for global nuclear test monitoring, satellite and weapon system telemetry, and even ground test data from International Space Station components, which also had to be stored and retrieved. I’ve also dabbled with the capture of video and audio intercepts and surveillance data. You would think that capturing paper and digital content from a business environment would seem, well, rather mundane. It isn’t.

In this DYK, I will present CDI’s Top Three Misconceptions about Capture. This is similar to one of David Letterman’s Top Ten lists; however, the consequences of not considering these implications can mean the difference between a successful capture system and one that is virtually useless and sits idle. While certainly not as complex as designing the replacement for the Space Shuttle, designing and implementing a successful capture system is in most instances an extremely customized venture based on the business processes, available personnel, budget, document collection characteristics and a host of other issues. Consequently, very few capture solutions are the same, and each has a unique set of challenges. That is why my list is written in no particular order. Any item could have more or less of an impact depending on the individual organization.

So, let’s take a look at some of the misconceptions:

1. "A scanner is a scanner; the cheaper, the better."  Scanners come in so many varieties that you really should do some research before you buy. Consider the method of attachment. Most scanners now use Universal Serial Bus (USB) and will run on current computers. Be careful though, as some scanners use a SCSI interface and require a special card or adapter to work. Some scanners can also be connected over a wired or wireless network, but this typically adds to the cost. A BIG misconception is that the scanner comes with the necessary software when you buy it. With few exceptions, scanners come with software known as a ‘driver’. A scanner driver basically translates the communication between the scanner and the computer. While a driver set may allow some rudimentary testing of the scanner, it does NOT act as a capture application, which means you have the hardware configured but no software to make the scanner do what you need it to do. On the other hand, some scanners come bundled with software that is limited in functionality but does allow you to scan. For most folks, this will suffice.

Another primary issue is resolution. Resolution relates to the optical quality of a scanned image and is specified in dots per inch (dpi). For plain text, you can get away with 200 dpi, but 300 dpi is better. If you are looking to scan pictures or photographs, you should consider at least 600 dpi and perhaps up to 1200 dpi. The lower the resolution, the grainier the image, and the graininess becomes more visible as you enlarge the digital image.
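To get a feel for what these resolutions mean in practice, the short sketch below converts a page size and a dpi setting into pixel dimensions and a raw, uncompressed image size. The page size (8.5 x 11 inch letter) and the 24-bit color depth are assumptions chosen for illustration; actual file sizes depend heavily on the file format and compression used.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page scanned at the given dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)

def uncompressed_megabytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Raw image size in MB, assuming 3 bytes per pixel (24-bit color)."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel / 1_000_000

# Assumed page size: US letter, 8.5 x 11 inches.
for dpi in (200, 300, 600, 1200):
    width_px, height_px = pixel_dimensions(8.5, 11, dpi)
    size_mb = uncompressed_megabytes(8.5, 11, dpi)
    print(f"{dpi:>4} dpi: {width_px} x {height_px} pixels, about {size_mb:.1f} MB raw")
```

Doubling the dpi quadruples the number of pixels, so higher resolution buys image quality at a real cost in storage and processing time.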

Other major scanner features can be addressed by asking some of these questions:

Do I need a scanner with an automatic feeder or can I get along with just a flatbed?  Do I need both?

Do I need duplex scanning capability (scanning both sides of a page)?

What is the minimum and maximum page size that the scanner can accommodate? Do these dimensions refer to the feeder or the flatbed?

Do I need to scan in color? 

What maintenance, warranty and support options are available to me?

2. "If you get a really fast scanner, your process will go much quicker."  Many folks believe a fast scanner will solve their paper problems in a short time. In case you haven’t caught on to my sense of irony yet, in all but the simplest capture jobs, actual scanning is the least time and resource intensive task in the overall capture process. Other aspects of capture are much more time consuming. Document preparation is generally the most labor and time-intensive process when scanning documents. Let’s do some calculations. Suppose you have one scanner that does 60 pages per minute. OK, all things being equal, you could assume that in an 8 hour shift you could process 28,800 pages (60 pages x 60 minutes x 8 hours). Pretty impressive, right? Now let’s say that the documents are stapled and average about 15 pages each. Each staple you remove will take a typical operator about 20 seconds. So now, the 15 pages no longer take 15 seconds to scan, they take 35 seconds. You have just cut your daily production by more than half with one staple. And you haven’t addressed how you will separate the documents or index them, or put them back together, if necessary. The point here is that the true aggregate speed of the capture process is more dependent on factors other than the speed of the scanner. Remember, scanners jam, pages occasionally need to be rescanned and a scanner is only as fast as you can keep it fed!
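The arithmetic above can be sketched in a few lines. The function below is a simplified model that assumes a single operator who preps and scans one document at a time, counts only whole documents, and ignores jams and rescans; the name `pages_per_shift` and its parameters are made up for this illustration.

```python
def pages_per_shift(pages_per_minute, shift_hours, pages_per_document,
                    prep_seconds_per_document=0):
    """Pages scanned in one shift when prep and scanning happen one after the other."""
    shift_seconds = shift_hours * 3600
    scan_seconds = pages_per_document * 60 / pages_per_minute
    seconds_per_document = scan_seconds + prep_seconds_per_document
    whole_documents = int(shift_seconds // seconds_per_document)
    return whole_documents * pages_per_document

ideal = pages_per_shift(60, 8, 15)             # no preparation time
with_staples = pages_per_shift(60, 8, 15, 20)  # 20 seconds per staple

print(ideal)         # 28800
print(with_staples)  # 12330
print(f"{with_staples / ideal:.0%} of the ideal throughput")
```

Plugging in your own prep times for separator sheets, indexing or reassembly shows quickly how little of the shift the scanner itself accounts for.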

3. "If you scan it, you can find it."  I hear this quite a bit from folks: "We want to scan our documents so that we can find them out on our shared network drive." While the premise is basically sound, in practice it is akin to putting your documents into a black hole. This is particularly true as the volume of the document collection increases. Keep in mind that image files (TIF, JPG, etc.) have no external context or data to indicate what the file is. So the only potential clues you have at your disposal are the file name and the folder hierarchy that contains it. Even files with textual content (MS Office, PDFs, etc.) present challenges to being found. Many organizations want to replicate their paper file cabinets in terms of structure. This is understandable since it reflects the most familiar method that the users have had to get to their documents. However, if a document is misfiled, it may never be found unless by accident. The next step up is to convert what you can to text so that it can be searched. This is known as Full Text Search (FTS). Searching the full text to find a document has multiple problems associated with it. First, it is a SLOW process. And the more content you have, the slower it is. The main problem with FTS though lies in the nature of the organization’s business processes and the documents they utilize in their operation. Many of the documents that an individual organization uses contain similar or identical textual content. As an example, a bank employee searching for a document containing the word Smith will likely get hundreds or even thousands of search results. In addition, if the content was scanned and converted to text, there will be an inherent error rate in the spelling of the words in the document. In summary, I consider file naming and FTS to be insufficient for effective retrieval of your content.

After all, when a user does a search for a document, the most positive result is to retrieve just the document or documents that correspond to the search criteria. Doing this requires indexing each of your documents with meaningful metadata such that the search results will be culled down considerably. While this can be an overwhelming thought, there are many tricks and tools you can use to make the job much easier.
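To illustrate the difference, here is a small sketch that runs a full text search and a metadata search over the same three made-up bank documents. The field names (`doc_type`, `customer`), the sample records and both function names are hypothetical and exist only for this example.

```python
# Hypothetical sample records: extracted text plus metadata added at capture time.
documents = [
    {"id": 1, "text": "Loan application for John Smith",
     "metadata": {"doc_type": "loan application", "customer": "John Smith"}},
    {"id": 2, "text": "Statement mailed to Mary Smith",
     "metadata": {"doc_type": "statement", "customer": "Mary Smith"}},
    {"id": 3, "text": "Teller note: Smith called about a wire transfer",
     "metadata": {"doc_type": "note", "customer": "Alan Jones"}},
]

def full_text_search(docs, term):
    """Return ids of documents whose text contains the term (case-insensitive)."""
    term = term.lower()
    return [doc["id"] for doc in docs if term in doc["text"].lower()]

def metadata_search(docs, **criteria):
    """Return ids of documents whose metadata matches every given field exactly."""
    return [doc["id"] for doc in docs
            if all(doc["metadata"].get(field) == value
                   for field, value in criteria.items())]

print(full_text_search(documents, "smith"))  # [1, 2, 3]
print(metadata_search(documents, doc_type="statement", customer="Mary Smith"))  # [2]
```

Even in this tiny collection the text search returns every document that mentions Smith, while two metadata fields narrow the result to the one document the user wanted.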

Stay tuned!  I will address best practices for tagging your content in my next post.  If you can’t wait or have other questions, feel free to contact me to discuss.

Bill LaPorte – Director of Operations, blaporte@cdlac.com, 540-659-5157 or 540-842-0358 (cell)

Making Your Data More Valuable

September 12th, 2011 by Juan J. Celaya

Have you been challenged with finding information in your data sources (libraries/databases)? Why is it that in an accounting, payroll, product or similar system you can always find what you are looking for if it is there?

I believe the answer is simply that in those types of data sources you not only know what you are looking for, but your data has been captured and then processed in such a way as to allow you to find it. Data stored in these types of databases has been captured with metadata that is commonly used within the organization to retrieve that particular record or information. These data sources’ fields and corresponding values (a.k.a. metadata) were preselected at the time the data source was created, specifically to serve the function of retrieval, based on the content of the data source and how the data was going to be used. Not surprisingly, you will find this type of data source to be what is commonly referred to as “structured data”.

When it comes to a data source with “unstructured data” this task is more difficult because the use of the data is generally not specific to a single function like, for instance, a “Parts” database would be. Additionally, the data making up these data sources are electronic files containing anything from office documents and emails to images, videos and anything else captured during the normal communication and collaboration processes within an organization.

The solution to dealing with unstructured data is to provide a common structure to the data being captured by using all the data available from the source and adding any business knowledge directly from the business unit that will be using the data. Additionally, the methodology supporting the solution needs to be flexible: it should support changes to the current business need, accommodate future requirements, be able to address previously processed data and, last but not least, keep costs down by allowing changes to take place quickly without involving end users.



(C) 1997 - 2013 COMPU-DATA International, LLC