30.11.2011 00:00 | Company blog

An introduction to advanced analytics – What is text mining?

Source: www.joobworld.com
Tags: jade software, text mining, unstructured data, social media

Join one of our lead Intelligence technologists, Karl Oaks, to learn how to ‘extract nuggets of value’ from text mines...
The purpose of this blog is to give a high-level introduction to a number of the advanced analytics technologies that are keeping us occupied here in the technology group at JOOB HQ.
Current research and development activities have us focused on a few key areas under the advanced analytics banner: machine learning, text mining, time series analysis and network analysis. The idea is to apply aspects of these to bring additional value to the range of interesting investigative applications we are developing.
With that said, in this blog post we will be diving (or at least getting our toes wet) into technologies on the text mining side of things. So what is text mining? In a way it's not too dissimilar to any other kind of mining, where you are typically focused on extracting nuggets of value from a mine; in our case the value is information and the mine is the text itself. The text might take the form of documents, emails, tweets, forum posts or even blogs like this.
The wider area of text mining is broken down into a number of sub-areas, but the ones we are specifically working in are entity extraction, concept extraction, document clustering and sentiment analysis.
Let's look a bit closer at each of these, starting with entity extraction. Entity extraction is the automated process of identifying and reporting on the “entities” within a block of text. These “entities” might be people, places, organisations, dates, phone numbers, addresses etc., and depending on the type of entity you are interested in there are several different techniques that can be applied to identify them more accurately.
For example, types such as email addresses and phone numbers have common patterns and as such can be accurately identified through regular expressions. Another technique is to provide knowledge in the form of gazetteers, taxonomies or unambiguous lists of words that represent entities such as people or countries. We have also started employing more sophisticated techniques that leverage statistical and machine learning models, such as Conditional Random Fields and Maximum Entropy models, which we might cover in another blog post.
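To make the first two techniques a little more concrete, here is a minimal Python sketch of pattern-based and list-based entity extraction. The regular expressions are deliberately loose, and the `COUNTRIES` list and the `extract_entities` function are made up for illustration; this is not the code we run in production.

```python
import re

# Loose illustrative patterns; real-world email and phone formats vary far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{6,}\d")

# A tiny hypothetical gazetteer: an unambiguous list of country names.
COUNTRIES = {"Australia", "France", "New Zealand"}


def extract_entities(text):
    """Return (type, value) pairs found via regexes and a gazetteer lookup."""
    entities = []
    for match in EMAIL_RE.finditer(text):
        entities.append(("EMAIL", match.group()))
    for match in PHONE_RE.finditer(text):
        entities.append(("PHONE", match.group()))
    for name in sorted(COUNTRIES):
        # Whole-word match so that a name is not found inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            entities.append(("COUNTRY", name))
    return entities
```

Calling `extract_entities("Contact karl@example.com or +64 3 555 0100 from New Zealand.")` picks out one email address, one phone number and one country. Names and organisations are much harder to pin down with patterns or lists alone, which is where the statistical models come in.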
Next up we have concept extraction. In a way this goes one level above entity extraction, by extracting higher-level concepts rather than specific words, or by making sense of a word within its sentence. To achieve this we have a knowledge base behind the technology, which it refers to for its higher-level concepts, as well as a mechanism to provide word sense disambiguation, or in other words to clarify the meaning of a word in the particular sentence. One such knowledge base we currently utilise is a comprehensive Wikipedia knowledge base, where concepts are organised and structured according to the relationships among them. What is even more beneficial is that the Wikipedia knowledge base is updated frequently, which means our tools are also always up to date. Talk about harvesting knowledge from the web!
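As a rough illustration of word sense disambiguation, the Python sketch below picks the concept whose context words overlap most with the sentence (a simplified, Lesk-style heuristic). The `KNOWLEDGE_BASE` dictionary is a tiny hand-made stand-in and not the Wikipedia knowledge base; the concept names, context words and function names are all assumptions made for the example.

```python
import re

# Hypothetical toy knowledge base: surface word -> candidate concepts,
# each described by a handful of context words.
KNOWLEDGE_BASE = {
    "apple": {
        "Apple (fruit)": {"fruit", "tree", "eat", "juice", "orchard"},
        "Apple Inc.": {"iphone", "company", "mac", "shares", "software"},
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(word, sentence):
    """Pick the candidate concept sharing the most context words with the sentence."""
    candidates = KNOWLEDGE_BASE.get(word.lower())
    if not candidates:
        return None  # word is not in the knowledge base
    context = set(tokenize(sentence)) - {word.lower()}
    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(candidates), key=lambda c: len(candidates[c] & context))
```

With this toy data, "Apple shares rose after the new iPhone launch" resolves to the company, while "She picked an apple from the tree in the orchard" resolves to the fruit. A real knowledge base offers far richer context than a few keywords per concept.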
Once we have extracted our entities and our concepts we have points of comparison within our set of documents. This is where document clustering comes in: we are able to construct clusters, or groupings, based on the concepts and entities extracted from the set of documents. This is possible because we can measure the relatedness of each of the concepts and entities. For example, an apple has a higher relatedness score to an orange than to an aeroplane. This might sound easy, but I can assure you that the algorithms and techniques used to achieve this are pretty sophisticated.
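One simple way to picture a relatedness score is as the overlap between the things two concepts are linked to. The Python sketch below uses Jaccard overlap on hand-made link sets; the `LINKS` data and the `relatedness` function are invented for illustration, and the measures we actually use are considerably more involved.

```python
# Hypothetical link sets: the other concepts each concept is connected to.
LINKS = {
    "apple": {"fruit", "tree", "food", "juice"},
    "orange": {"fruit", "tree", "food", "juice", "citrus"},
    "aeroplane": {"aircraft", "wing", "engine", "airport"},
}


def relatedness(a, b):
    """Jaccard overlap of two concepts' link sets, between 0.0 and 1.0."""
    links_a, links_b = LINKS[a], LINKS[b]
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)
```

Here apple and orange share four of their five distinct links, while apple and aeroplane share none, so the first pair scores higher.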
Document or text clustering can be useful as a larger-scope relatedness measure for your documents, meaning that documents in the same cluster largely share the same concepts and entities. With these clusters in place, if we are given a new document and cluster it against the existing clusters, it will naturally align with a particular set of documents based on its content. Neat, huh?
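Here is a small Python sketch of that last step, assuming each document has already been reduced to a bag of extracted concepts. It assigns a new document to the cluster whose centroid it is closest to by cosine similarity. The cluster contents and function names are hypothetical, and cosine similarity to a centroid is just one common choice rather than a description of our implementation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs):
    """Average concept counts over the documents in one cluster."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return {concept: count / len(docs) for concept, count in total.items()}


def assign(doc, clusters):
    """Return the name of the cluster whose centroid is most similar to doc."""
    return max(sorted(clusters), key=lambda name: cosine(doc, centroid(clusters[name])))


# Hypothetical existing clusters, each a list of bag-of-concepts documents.
CLUSTERS = {
    "fruit": [Counter(["apple", "orange"]), Counter(["orange", "juice"])],
    "aviation": [Counter(["aeroplane", "airport"]), Counter(["aeroplane", "engine"])],
}
```

A new document mentioning "apple" and "juice" lands in the "fruit" cluster, because it shares no concepts at all with the aviation documents.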
The last area of text mining we will touch on here is sentiment analysis. The purpose of sentiment analysis is to attempt to determine the attitude of the speaker, or the writer in the case of text mining, when discussing a particular topic. This can be approached in a number of different ways, but in our case we have built a machine learning model for classifying whether the person is speaking negatively, positively or neutrally on a given topic. The benefit is being able to detect automatically whether someone is talking about your product, person or place in a positive or negative way.
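To give a feel for what such a classifier can look like, here is a minimal Python sketch of a multinomial Naive Bayes model with add-one smoothing. Naive Bayes is chosen here only because it fits in a few lines; it is not necessarily the model we use. The class name and the six training sentences are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, far too small for real use.
TRAINING = [
    ("I love this product, it works great", "positive"),
    ("Great service and a fantastic experience", "positive"),
    ("I hate this product, it is terrible", "negative"),
    ("Awful service and a terrible experience", "negative"),
    ("The product arrived on Monday", "neutral"),
    ("The office is open on weekdays", "neutral"),
]
```

After calling `train(TRAINING)` on a fresh `NaiveBayesSentiment`, the sentence "what a fantastic, great product" is classified as positive and "terrible, awful experience" as negative. A usable model needs far more labelled data than this, and usually a way to tie the sentiment to the specific topic being discussed.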
