BigML is a web service for analysing data. Their beta has been open to the public for a few months, but so far I have only completed their tutorial and not experimented any further.
I uploaded the 1% sample file to BigML. It took a little while to upload 35MB on my connection. If I want to use it on bigger / unsampled data, I may need to use an AWS instance to do it so that I’m not hampered by my home internet connection.
After uploading, the next step with BigML is to create a “dataset” from the “datasource”. A dataset seems to be a view on the data. BigML correctly parsed the data file and identified “numeric”, “text”, “category” etc, for most of the columns. It selected “text” for the date-formatted columns but there was no more suitable option, and I left the postId and OwnerUserId columns as numeric, but I’m not entirely sure they should be since they don’t really signify numeric values: They are simply IDs. Mousing over the “Create dataset” button showed a pricetag with a price of “30”. Luckily this turned out not to be priced in dollars, but “Credits” which cost 5 cents each. The real price would be 30*0.05 = $1.50, and the first 700 credits are free anyway, so I continued.
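The credit arithmetic above is easy to get wrong in your head, so here is a minimal sketch of it. The 5 cent price and the 700 free credits are the figures quoted above; the assumption that free credits are used up first, before anything is billed, is mine and may not match how BigML actually bills.

```python
CREDIT_PRICE_USD = 0.05  # price per credit, as quoted above
FREE_CREDITS = 700       # free allowance, as quoted above


def credits_to_usd(credits):
    """List price in dollars for a number of credits."""
    return round(credits * CREDIT_PRICE_USD, 2)


def billable_usd(credits_used_so_far, credits_needed):
    """Dollars owed for a task, assuming free credits are spent first.

    This billing rule is an assumption made for illustration only.
    """
    free_left = max(FREE_CREDITS - credits_used_so_far, 0)
    return credits_to_usd(max(credits_needed - free_left, 0))


if __name__ == "__main__":
    print(credits_to_usd(30))   # list price of the dataset
    print(billable_usd(0, 30))  # what a brand new account would owe
```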
After the dataset was created, BigML do a very good job of summarising your data. There are columns to show you the number of rows, the maximum and minimum in each column, and a histogram of how the values are distributed, with informative rollovers. It’s also very fast and snappy. If I’d tried to find the maximum value of a column in Excel, it would have taken a long time to complete. BigML created the dataset asynchronously in under 30 seconds and once complete, all the data is there. It even starts to show you the summary while it’s still processing some of the rows. I could instantly see that of the 33,704 rows in the sample, 33,003 were “Open”, 305 were “not a real question”, 168 were “off topic”, 161 were “not constructive” and 67 were “too localized”.
An odd result was that it seemed to completely ignore any of the columns that I specified as “text” or “category” (with the exception of the OpenStatus column). I guess this is because it doesn’t have any capabilities to process text, but I think it could probably have done something with the dates, even if it was just converting them to timestamps so they could be analysed by the model. I feel that it would be good to have an attempt at parsing the text columns into something useful (i.e. numerical metrics), but there are a lot of different ways to do this, and I guess BigML are leaving it up to the user to do this kind of data transformation before uploading the data source to them. This is the area where I think automated machine learning as a service will really start to take off over the next few years, but I’m getting ahead of myself.
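The kind of transformation I have in mind could be done offline before uploading. The following is only a sketch: the date format string, the choice of text metrics and the column names passed in are all assumptions for illustration, not anything BigML or the data file prescribes.

```python
from datetime import datetime, timezone

# Assumed date layout; adjust to whatever the real file uses.
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def date_to_timestamp(text, fmt=DATE_FORMAT):
    """Parse a date string (treated as UTC) into seconds since the Unix epoch."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def text_metrics(text):
    """Reduce free text to a few simple numeric features (an arbitrary choice)."""
    return {
        "char_count": len(text),
        "word_count": len(text.split()),
        "question_marks": text.count("?"),
    }


def transform_row(row, date_columns, text_columns):
    """Return a copy of row with date and text columns replaced by numbers."""
    out = {}
    for name, value in row.items():
        if name in date_columns:
            out[name + "_ts"] = date_to_timestamp(value)
        elif name in text_columns:
            for metric, number in text_metrics(value).items():
                out[name + "_" + metric] = number
        else:
            out[name] = value
    return out
```

Each row read with csv.DictReader could be passed through transform_row and written back out with csv.DictWriter, giving a file that contains only numeric columns plus the target.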
BigML gives a lot of non-descriptive error messages.
Once you have a dataset with the columns you want to include, and the target column set to “OpenStatus”, the next step is to train a model. I set the holdout option to 10% of the training set. I’m not really sure what this did: There doesn’t seem to be anywhere to test the resultant model on this holdout set after the model is generated. Their FAQ states that “Currently we don’t offer tools to assess the accuracy of your model. It is high on our to-do list!”. The other options were simple to choose, and I was informed it would cost 150 credits to create my model.
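Since there is nowhere to score the model on the holdout set, one workaround would be to do the split yourself before uploading and keep the held-out rows locally. This is a minimal sketch of that idea, not a description of what BigML's holdout option does internally; the function names are my own, and predict stands for any function that maps a row to a class label.

```python
import random


def holdout_split(rows, holdout_fraction=0.10, seed=0):
    """Shuffle a copy of rows and split it into (training, holdout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(predict, rows, target="OpenStatus"):
    """Fraction of rows for which predict(row) equals the row's target value."""
    if not rows:
        raise ValueError("cannot score an empty set of rows")
    hits = sum(1 for row in rows if predict(row) == row[target])
    return hits / len(rows)
```

The training list is what would be uploaded; the holdout list stays on disk and is only used to score whatever model comes back.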
I created a dataset from the same datasource containing only 3 fields, hoping that maybe the problem was with the text columns or strange formatting. Oddly, this restricted dataset still used up 35MB of my allowance and cost the same number of credits to create it.
I seemed to be charged for the failed models (hopefully this doesn’t happen once you’re outside the free tier!?), so I tested creating a model from only 1000 rows, as this uses fewer credits (~5 credits or $0.25, rather than 150 credits or $7.50), then scaled back up to the full dataset once I had a dataset I could successfully build a model on. As it happened, cutting the textual columns out of the dataset fixed the non-specific error and the model generated successfully.
BigML currently only creates “Tree” type models. They state that they intend to add more model types, but they seem to be focusing on making sure that the models generated are easily understandable and can be used by people who don’t have a background in Machine Learning or statistics. The tree model generated in the tutorial is shown below.
This model attempts to predict a person’s income (either < 50K or > 50K per year) based on other information about the person, eg. Their job, relationship status, etc. To use the model, you start at the top (root) of the tree and treat it similarly to a flow chart, answering the questions and following the lines until you end up at a node marked “>50K” or “<50K”, and that is your answer. This is quite a powerful model when the target can be predicted by asking a series of multiple-choice (often Yes / No) questions, and is certainly easy to understand & use by a novice.
Here is the model that BigML learned based on my 3-attribute dataset:
You can see that there are no questions to answer. The model predicts “open” for every question. This is actually a pretty good guess, since roughly 98% of the questions in my sample really are “open” (33,003 of 33,704). In the benchmark files provided by Stack Overflow, they also predict “open” as the most probable class for every question; only the % confidence changes between questions. I suspect this will be a common problem with other models too (the data really is in favour of predicting “open” all the time), and that I’ll need either some way of making the classes equal (by sampling) and then adding a bias towards the “open” class afterwards, or a model that provides confidences for each class, which can be used to form a prediction file.
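To make the imbalance concrete, here is a small sketch using the class counts from my 1% sample. It computes the share of each class (which doubles as the crudest possible set of "confidences") and shows one way of balancing the classes by undersampling. The undersample function and its parameters are my own illustration, not something BigML provides.

```python
import random
from collections import defaultdict

# Class counts from the 1% sample described earlier in the post.
COUNTS = {
    "open": 33003,
    "not a real question": 305,
    "off topic": 168,
    "not constructive": 161,
    "too localized": 67,
}


def class_priors(counts):
    """Share of each class; usable as a constant per-class confidence."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def undersample(rows, target="OpenStatus", per_class=None, seed=0):
    """Keep at most per_class rows of each class (default: smallest class size)."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    if not by_class:
        return []
    if per_class is None:
        per_class = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, min(per_class, len(group))))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    for label, share in class_priors(COUNTS).items():
        print(f"{label}: {share:.4f}")
```

Always guessing “open” scores the “open” share as its accuracy, which is the baseline any real model has to beat. A model trained on the balanced rows would then need its outputs shifted back towards these priors before being used on the real, unbalanced data.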
Unfortunately, BigML does not provide confidences. Their tree simply returns a predicted class (“open”), with no indication of how confident we should be that the question actually belongs to the “open” class.
My Thoughts on BigML
I am not convinced that BigML’s focus purely on a tree model is a good idea. Tree models alone cannot deal with all the tasks that Machine Learning is used to solve (aside from the Stack Overflow confidences problem I have, a tree also can’t predict numerical values like “income” without restricting them to multiple-choice, eg. “>50K” or “<50K”).
It’s also quite easy to generate a tree model offline using WEKA, RapidMiner or various other libraries, or even to write one yourself using nested if statements. BigML is simply a browser-based way of generating it. An offline method means that you don’t need to upload all of your training data over the internet, you have no limits on the number of predictions / models you can make, and you don’t require an internet connection to make predictions. BigML actually offer a downloadable version of the tree model as Python code. As they offer this, I don’t see why anyone would ever use their API to make predictions: cloud computing is hardly necessary to make a prediction by evaluating a series of simple “if” statements!
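To show what I mean by nested if statements, here is a toy tree for the income example from the tutorial. Every question, field name and split value in it is made up for illustration; it is not the tree BigML learned, nor the code BigML lets you download.

```python
def predict_income(person):
    """Toy decision tree; all questions and thresholds here are hypothetical."""
    if person["relationship"] == "married":
        if person["hours_per_week"] > 40:
            return ">50K"
        return "<50K"
    if person["occupation"] == "executive":
        return ">50K"
    return "<50K"


if __name__ == "__main__":
    example = {"relationship": "single", "hours_per_week": 38, "occupation": "clerk"}
    print(predict_income(example))
```

Each if is one question in the flow chart and each return is a leaf, so making a prediction costs a handful of comparisons and needs no network connection at all.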
On the positive side, the interface is very slick (except for the errors), and as it’s a cloud solution you don’t need a powerful computer to train a model. It coped with my 35MB file well, and the tutorial, tooltips and free tier make it very easy to get started. Training a simple model has the advantage that not many parameters need to be set, so it’s easier to use than some offline libraries that can require more configuration.
I think BigML are 90% of the way there. They have the infrastructure and a solid API and framework, now they just need to add more ML. I think that the strength of a web-based service like BigML should be that the user can send over the data, train a model, and then request predictions. The details of the model do not need to be understood or even known about by the user: they can just use it as a black-box to make predictions. By restricting themselves to tree-models, BigML are severely restricting the number of applications that their service can be used for. That said, they say that they plan to add more models (Naïve Bayes etc.) and perhaps with those models added and selected automatically, it will start to serve a useful purpose.
BigML is essentially a very shiny, easy to use “tree generator”. This screenshot shows the end result, an offline python copy of the trained tree model. Once you have this, you no longer need BigML.
This post was mostly written some time ago, before BigML released their “pruning” options. I have since been back and experimented with those options, which can create trees with more than one node. They also seem to have fixed some of the problems I was having with bugs and errors appearing. I guess it’s only in beta, and is improving by the day!