People send emails to each other every day, and many of those emails, especially business ones, contain signatures. For example:
Sheldon Cooper
Senior Theoretical Physicist
California Institute of Technology
1200 E California Blvd, Pasadena, CA 91125, United States
firstname.lastname@example.org
(626) 555-9157 - phone
(626) 555-4717 - fax
Imagine you are on a sales or support team and receive tens or even hundreds of emails from new people, and you need to maintain an internal contact database (be it a CRM, an ERP, or anything else). Manually splitting a signature into form fields takes time, is boring, and is error-prone. In other words, it is a routine task, which makes it an excellent target for automation.
In some cases automation is needed daily to keep a contact database up to date; in other cases an organization may possess a huge legacy archive of emails that needs to be processed to mine contact data.
What can be automated? First, an extraction tool should be able to detect whether an email contains a signature. Second, the signature text must be identified and extracted. Third, it should be parsed into fields such as person name, organization name, job title, organization address, and phone numbers.
This task belongs to the field of text mining.
One of the most powerful and frequently applied methods for analyzing text and extracting information from it is the use of regular expressions. Let's quickly look at the pros and cons of this method with regard to email signature extraction.
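As a rough illustration of the regular-expression approach, the sketch below pulls the email address and phone numbers out of the sample signature shown earlier. The patterns and the function name `extract_with_regex` are assumptions made for this example, not TextMiner's actual rules; they only handle US-style phone numbers and a loose email shape, and the signature is written one field per line.

```python
import re

# Sample signature from the beginning of the article, one field per line.
SIGNATURE = """Sheldon Cooper
Senior Theoretical Physicist
California Institute of Technology
1200 E California Blvd, Pasadena, CA 91125, United States
firstname.lastname@example.org
(626) 555-9157 - phone
(626) 555-4717 - fax"""

# Deliberately simple, illustrative patterns (assumptions, not production rules).
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")       # e.g. (626) 555-9157
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # loose email shape


def extract_with_regex(text):
    """Return the email addresses and phone numbers found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


fields = extract_with_regex(SIGNATURE)
print(fields["emails"])  # ['firstname.lastname@example.org']
print(fields["phones"])  # ['(626) 555-9157', '(626) 555-4717']
```

Patterns like these work well for fields with a rigid shape, but they do not help with free-form fields such as a person's name or a job title.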
To classify the lines of a signature and identify fields, one may use dictionaries. For example, if a signature line contains two words (the second being, say, Smith), the first word is found in the FirstNames dictionary, and the second word is found in the LastNames dictionary, then we are probably dealing with a person's name.
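A minimal sketch of this dictionary check might look as follows. The two sets stand in for the FirstNames and LastNames dictionaries; their contents and the helper name `looks_like_person_name` are hypothetical and chosen only for illustration.

```python
# Tiny stand-ins for the FirstNames and LastNames dictionaries (assumed contents).
FIRST_NAMES = {"john", "mary", "sheldon"}
LAST_NAMES = {"smith", "cooper"}


def looks_like_person_name(line):
    """Return True if the line is two words: a known first name, then a known last name."""
    words = line.split()
    if len(words) != 2:
        return False
    # Normalize case and strip trailing punctuation before the lookup.
    first, last = (word.strip(",.").lower() for word in words)
    return first in FIRST_NAMES and last in LAST_NAMES


print(looks_like_person_name("Sheldon Cooper"))                # True
print(looks_like_person_name("Senior Theoretical Physicist"))  # False
```

A real dictionary would be far larger, and it would still miss rare names or misfire on words that are both names and ordinary vocabulary, which is why such a check gives only a probable classification.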
Machine learning deals with models that are trained before being applied to working data. For example, given a set of annotated emails (annotated means that someone has marked the signatures and their constituent parts), one can create a model that recognizes signatures and parses them into fields with good recall and precision.
At GrinMark we created a hybrid solution called TextMiner that is based on machine learning algorithms and also uses regular expressions and dictionary search for certain tasks.
We have a training corpus of a thousand emails, and it is growing.
Text analysis is a language-specific task. We support signature extraction for emails in English and, to some extent, in German, French, Spanish and Italian.
TextMiner is already built into some of our products:
Here is an example of what an extracted signature looks like in the TextMiner add-in:
We are working hard to make our solution the best in class. It already achieves a 95% F1 score. We are working on increasing this measure and on adding support for languages besides English.
This section serves as a quick reference for some terms used in this article.
Precision is the fraction of retrieved instances that are relevant.
Recall is the fraction of relevant instances that are retrieved.
Suppose a computer program for recognizing dogs in scenes from a video identifies 7 dogs in a scene containing 9 dogs and some cats. If 4 of the identifications are correct, but 3 are actually cats, the program's precision is 4/7 while its recall is 4/9.
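The arithmetic of the dog example can be checked with a few lines of code. The sketch below also computes the F1 score mentioned earlier, using its standard definition as the harmonic mean of precision and recall; the function name and parameter names are assumptions made for this example.

```python
from fractions import Fraction


def precision_recall_f1(true_positives, retrieved, relevant):
    """Compute precision, recall and F1 as exact fractions.

    true_positives: retrieved instances that are actually relevant
    retrieved:      all instances the program returned
    relevant:       all relevant instances that exist
    """
    precision = Fraction(true_positives, retrieved)
    recall = Fraction(true_positives, relevant)
    if precision + recall == 0:
        # No correct identifications at all: define F1 as zero.
        return precision, recall, Fraction(0)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Dog example: 7 identifications, 4 of them correct, 9 dogs in the scene.
p, r, f1 = precision_recall_f1(true_positives=4, retrieved=7, relevant=9)
print(p, r, f1)  # 4/7 4/9 1/2
```

For the dog example the F1 score comes out at exactly 1/2, noticeably lower than the 95% figure quoted above for signature extraction.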