January 8, 2014

Flutura demystifies Machine Learning:

Machine learning is a term that gets thrown at you every day if you work anywhere near huge amounts of data, or are trying to make sense of it.
It can seem like a mystical science that only a few mighty minds can grasp. But once you start moving closer, it starts opening up to you.

So, what is machine learning? Put simply, it is about letting machines work out their own rules from data, instead of having every rule programmed by hand. Why would you want that? A simple example should put your questions to rest. Just open your mailbox and see how many spam mails you got today. Now imagine the process that is happening behind it. How does your mailbox know that the mail with the subject line "Buy! Buy! Buy!!" is spam?

There are two ways this could be done. With regular programming, we can write a number of IF-THEN conditions and keep filtering, but the list of conditions never ends, because every new kind of spam needs a new rule. That’s when a far more efficient and less redundant approach comes into the picture: machine learning.
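
To make the IF-THEN approach concrete, here is a minimal sketch of a hand-written filter. The function name and the specific phrases are made up for illustration; they are not taken from any real mail system.

```python
def is_spam_by_rules(subject: str) -> bool:
    """Hand-written IF-THEN spam rules (hypothetical phrases, for illustration only)."""
    text = subject.lower()
    if "buy! buy!" in text:
        return True
    if "free money" in text:
        return True
    if "winner" in text and "claim" in text:
        return True
    # Every new spam pattern needs yet another condition added here.
    return False


caught = is_spam_by_rules("Buy! Buy! Buy!!")   # matches the first rule
missed = is_spam_by_rules("B U Y now, B U Y!")  # a small variation slips through
```

The second call shows the weakness: a small change in spelling defeats the rules, so the list of conditions has to keep growing.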

What makes it more interesting is that it loosely mirrors the human thought process. Regular programming can take over major logical decisions, but the catch is that it only takes the current data into account and does not “learn” from past experience. Take the spam e-mail above: when it lands in the mailbox, anyone looking at it would immediately judge it to be spam. They can do so because of their past experience with that kind of e-mail. To give a machine the same advantage, “experience” is replicated using vast amounts of data.

In the e-mail example, the engine is fed a set of text files called the “training set”. The data here is in the form of character strings, which are difficult to manipulate, so each message is converted to a vector of numbers. A particular feature or characteristic of the message is then extracted, for example the occurrence of a word, say “buy”.
Based on this, the probability of a message being spam or not spam is calculated. But this feature by itself may not be relevant, or may not be sufficient to classify the e-mails. That is where enriching the training set comes into the picture. Each time the training set is enriched, the probability vectors and the coefficients get recalculated, and each recalculation refines the algorithm. Based on this “experience”, the filtering gets more and more accurate. However, it is highly improbable that you will ever get a perfect algorithm.
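
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any production filter: the four example messages are invented, the feature is simply whether a word appears in a message, and the probability is computed with Bayes' rule plus add-one smoothing, which is one common choice among many.

```python
import re
from collections import Counter


def tokenize(message):
    """Turn a raw string into lowercase word tokens."""
    return re.findall(r"[a-z']+", message.lower())


def train(training_set):
    """training_set is a list of (message, is_spam) pairs."""
    msg_counts = {True: 0, False: 0}
    word_counts = {True: Counter(), False: Counter()}
    for message, is_spam in training_set:
        msg_counts[is_spam] += 1
        # Feature: does the word occur in the message (counted once per message).
        word_counts[is_spam].update(set(tokenize(message)))
    return msg_counts, word_counts


def spam_probability(word, model):
    """P(spam | word occurs), via Bayes' rule with add-one smoothing."""
    msg_counts, word_counts = model
    p_word_spam = (word_counts[True][word] + 1) / (msg_counts[True] + 2)
    p_word_ham = (word_counts[False][word] + 1) / (msg_counts[False] + 2)
    total = msg_counts[True] + msg_counts[False]
    spam_score = p_word_spam * msg_counts[True] / total
    ham_score = p_word_ham * msg_counts[False] / total
    return spam_score / (spam_score + ham_score)


# Invented toy training set: (message, is_spam)
training_set = [
    ("Buy! Buy! Buy!!", True),
    ("Buy cheap watches now", True),
    ("Meeting moved to Monday", False),
    ("Lunch tomorrow?", False),
]
before = spam_probability("buy", train(training_set))

# Enrich the training set with a legitimate mail that also contains "buy".
training_set.append(("Where can I buy tickets for the concert?", False))
after = spam_probability("buy", train(training_set))
```

With this toy data, `before` comes out at 0.75 and `after` drops to roughly 0.56: adding one legitimate message containing “buy” changes the recalculated probability, which is the “experience” effect described above.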

If an algorithm is tuned to be more and more accurate on the training set, it can go wrong on new mail. As a simple example, if you filter out every message that contains the word ‘Buy’, you may lose legitimate mail, for instance replies to something you have advertised on an e-commerce website. Hence, there should always be a balance. When you try to make an algorithm very accurate for the training data, it may fail against new test data. This, in statistics, is called ‘over-fitting’.
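
A small sketch of that effect, using invented messages: the rule “contains ‘buy’” looks perfect on a tiny training set but does much worse on mail it has not seen. The data and function names here are hypothetical and chosen only to illustrate the point.

```python
def contains_buy(message):
    """A rule tuned to the training data: flag anything mentioning 'buy'."""
    return "buy" in message.lower()


def accuracy(rule, labelled):
    """Fraction of (message, is_spam) pairs the rule classifies correctly."""
    correct = sum(1 for message, is_spam in labelled if rule(message) == is_spam)
    return correct / len(labelled)


# Invented data: (message, is_spam)
training_data = [
    ("Buy! Buy! Buy!!", True),
    ("Buy cheap watches now", True),
    ("Meeting moved to Monday", False),
    ("Lunch tomorrow?", False),
]
test_data = [
    ("A customer wants to buy the item you listed", False),  # legitimate, wrongly flagged
    ("Claim your prize today", True),                        # spam, missed
    ("Buy one get one free!!!", True),
    ("Minutes from the review", False),
]

train_accuracy = accuracy(contains_buy, training_data)
test_accuracy = accuracy(contains_buy, test_data)
```

Here `train_accuracy` is 1.0 while `test_accuracy` is only 0.5, which is why a filter should be judged on data it was not tuned on.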

Primarily, machine learning is used when the solution is not fixed and keeps changing, as in the case of spam e-mails, or when you cannot explain how you solve a problem. Cycling is an example: it is very difficult to explain how we cycle.

At Flutura, we use various machine learning techniques to solve industry-centric problems using electrical sensor data. Predicting unplanned outages by correlating smart-grid and smart-meter data, improving electric device performance by looking into internal sensor and maintenance data, and analysing oil rig data to improve safety in oil and gas plants are just some of the interesting things we are currently involved in.


Read more about how we do it:
Flutura M2M in Oil and Gas

