Problem Statement

People send emails to each other every day, and many of these emails, especially business ones, contain signatures. For example:

    
    Sheldon Cooper
    Senior Theoretical Physicist
    California Institute of Technology
    1200 E California Blvd,
    Pasadena, CA 91125, United States
    sheldon@example.com
    (626) 555-9157 - phone
    (626) 555-4717 - fax

Imagine you are on a sales or support team and receive tens or even hundreds of emails from new people, and you need to maintain an internal contact database (be it a CRM, an ERP or anything else). Manually splitting a signature into form fields takes time, is boring and is error-prone. In other words, it is a routine task, which makes it an excellent target for automation.

In some cases automation is needed on a daily basis to keep a contact database up to date; in other cases an organization may possess a huge legacy archive of emails that needs to be processed to mine contact data.

What can be automated? First, an extraction tool should be able to detect whether an email contains a signature. Second, the signature text must be identified and extracted. Third, it should be parsed into fields such as person name, organization name, job title, organization address, phone numbers, etc.

This task belongs to the field of text mining.

Solution Methods

Regular Expressions

Regular expressions are one of the most powerful and frequently applied methods for analyzing text and extracting information from it. Let's quickly look at the pros and cons of this method with regard to email signature extraction.

Pros

  • Work extremely well for phone numbers, email addresses and URLs.
  • Easy to implement.

Cons

  • Regular expressions alone cannot completely solve the problem of recognizing named entities: person names, organization names, job titles, etc. This is a limitation of scope rather than a real disadvantage. A regular expression can guess and extract a named entity using word shapes, but additional procedures are needed to determine the type of the entity. E.g. is Hewlett Packard a company or a person? Is Max Planck a person or a street name?
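
As a rough illustration of the "pros" above, the sketch below pulls the email address and phone numbers out of the sample signature from the problem statement. The two patterns and the `extract_contacts` helper are assumptions made for this example only; they cover North American style numbers and simple addresses, and are not the patterns used in any real product.

```python
import re

# Hypothetical patterns for illustration; real signatures need broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Matches numbers shaped like "(626) 555-9157" or "626-555-9157".
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

SIGNATURE = """Sheldon Cooper
Senior Theoretical Physicist
California Institute of Technology
1200 E California Blvd,
Pasadena, CA 91125, United States
sheldon@example.com
(626) 555-9157 - phone
(626) 555-4717 - fax"""


def extract_contacts(text):
    """Return all email addresses and phone numbers found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


contacts = extract_contacts(SIGNATURE)
print(contacts["emails"])  # ['sheldon@example.com']
print(contacts["phones"])  # ['(626) 555-9157', '(626) 555-4717']
```

Note that nothing here tells us that "Sheldon Cooper" is a person or that "California Institute of Technology" is an organization, which is exactly the limitation described above.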

Dictionary Search

To classify the lines of a signature and identify fields, one may use dictionaries. For example, if a signature line contains the words John and Smith, the first word is found in the FirstNames dictionary and the second word is found in the LastNames dictionary, then we are probably dealing with a person name.

Pros

  • It is possible to build dictionaries for many named entities: countries, cities, person names, job titles and so on.
  • Such dictionaries are relatively small, and searching through them is very fast.

Cons

  • Dictionaries may intersect, and this should be addressed in some way. This is more an issue to be aware of than a negative characteristic.
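
The dictionary idea can be sketched in a few lines. The dictionaries below are tiny and entirely hypothetical (real ones would hold thousands of entries), and `classify_line` is an illustrative helper, not part of any library.

```python
# Tiny hypothetical dictionaries, stored as sets for fast membership tests.
FIRST_NAMES = {"john", "sheldon", "max"}
LAST_NAMES = {"smith", "cooper", "planck", "packard"}
JOB_TITLE_WORDS = {"senior", "physicist", "manager", "engineer"}


def classify_line(line):
    """Return a guessed field type for one signature line, or None."""
    words = [w.strip(".,").lower() for w in line.split()]
    # Two words: a known first name followed by a known last name.
    if len(words) == 2 and words[0] in FIRST_NAMES and words[1] in LAST_NAMES:
        return "person_name"
    # Any word that commonly appears in job titles.
    if any(w in JOB_TITLE_WORDS for w in words):
        return "job_title"
    return None


print(classify_line("John Smith"))                    # person_name
print(classify_line("Senior Theoretical Physicist"))  # job_title
print(classify_line("Hewlett Packard"))               # None
```

In this sketch "Packard" is a known last name, yet "Hewlett Packard" is left unclassified because "Hewlett" is not a known first name. Once dictionaries grow and start to overlap, such cases need an explicit tie-breaking rule.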

Machine Learning

Machine learning deals with models that are trained before being applied to working data. For example, given a set of annotated emails (annotated means that someone has marked the signatures and their constituent parts), one can create a model that recognizes signatures and parses them into fields with good recall and precision.

Pros

  • With appropriate training, machine learning models perform really well in text analysis.

Cons

  • Not as easy to implement, since it requires specialized knowledge from developers.

Our Solution

At GrinMark we created a hybrid solution called TextMiner that is based on machine learning algorithms and also uses regular expressions and dictionary search for certain tasks.

We have a training corpus of a thousand emails, and it is growing.

Text analysis is a language-specific task. We support signature extraction for emails in English and, to some extent, in German, French, Spanish and Italian.

TextMiner is already built into some of our products:

  • TextMiner - Email Signature Extractor add-in for Outlook. Available in the Office Store.
  • GrinMark Outlook 365 Plugin for Sugar, distributed via the SugarOutfitters marketplace.

Here is an example of what an extracted signature looks like in the TextMiner add-in:

[Screenshots of the TextMiner add-in]

Future Plans

We are working hard to make our solution the best in class. It already has a 95% F1 score. We are working on increasing this measure and on adding support for languages besides English.

Terminology

This section serves as a quick reference for some terms used in this article.

  • Precision is the fraction of retrieved instances that are relevant.

  • Recall is the fraction of relevant instances that are retrieved.

Suppose a computer program for recognizing dogs in scenes from a video identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7 while its recall is 4/9.
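
The arithmetic of this example can be checked with a short sketch. It also computes the F1 score, which is the harmonic mean of precision and recall. The function name and its parameters are chosen for illustration only.

```python
from fractions import Fraction


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 as exact fractions."""
    precision = Fraction(true_positives, true_positives + false_positives)
    recall = Fraction(true_positives, true_positives + false_negatives)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# 7 identifications: 4 correct (true positives) and 3 cats (false positives).
# 9 dogs in the scene, so 9 - 4 = 5 dogs were missed (false negatives).
p, r, f1 = precision_recall_f1(4, 3, 5)
print(p, r, f1)  # 4/7 4/9 1/2
```

So the dog recognizer in this example has an F1 score of 0.5.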

  • Word shape means the pattern in which upper-case and lower-case letters are combined to form a word. E.g. Name, CamelCase, ACRONYM, word.
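
One common way to compute a word shape is sketched below: each character is mapped to a class symbol and runs of the same symbol are collapsed. The symbols ("X", "x", "d") and the function names are assumptions made for this illustration; other encodings are equally valid.

```python
from itertools import groupby


def char_class(ch):
    """Map a character to a shape symbol."""
    if ch.isupper():
        return "X"
    if ch.islower():
        return "x"
    if ch.isdigit():
        return "d"
    return ch  # keep punctuation and other characters as they are


def word_shape(word):
    """Collapse consecutive characters of the same class into one symbol."""
    return "".join(key for key, _ in groupby(char_class(c) for c in word))


for w in ["Name", "CamelCase", "ACRONYM", "word", "91125"]:
    print(w, "->", word_shape(w))
# Name -> Xx
# CamelCase -> XxXx
# ACRONYM -> X
# word -> x
# 91125 -> d
```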