Machine Learning Interview Questions and Answers - 2
8. The data files used for Machine Learning algorithms can be very large at times. How would you handle them so that the algorithm does not crash or run out of memory?
Datasets in Machine Learning can get very large, which may cause the algorithm to crash or the system to run out of memory.
Following are some ways to deal with large data files:
i.) Increase your computer's memory - This is one of the easiest ways to deal with the problem. If your requirements are heavier, you can even consider renting compute time on a cloud service.
ii.) Re-configure your tools and libraries - Check the tool or library you are using and try to re-configure it to allocate more memory; some of them limit memory usage by default.
iii.) Decrease the dataset size - Sample it down to what you actually require. If the model doesn't need that much data, there is no reason to load it.
iv.) Use a memory-saving data format - Try converting your dataset to a format that loads faster, uses less memory, or doesn't require the complete dataset to be in memory at once. Progressive loading, where the data is read in chunks, can save memory tremendously (see the sketch after this list).
v.) Use an RDBMS - Store the data in a relational database and query it in pieces; this requires you to use algorithms that can work with data served from such databases.
vi.) Use Big Data platforms - If the dataset is really huge and nothing else gives you good performance, move to a platform built for data at that scale.
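As a concrete illustration of progressive loading (point iv), here is a minimal sketch using pandas; the file name large_dataset.csv and the column name value are assumptions for the example:

```python
import pandas as pd

total = 0.0
rows = 0
# chunksize makes read_csv return an iterator of DataFrames, so only
# one chunk of the file sits in memory at a time instead of the whole file.
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    total += chunk["value"].sum()  # "value" is an assumed numeric column
    rows += len(chunk)

print("mean of 'value':", total / rows)
```

The same idea scales to any aggregate, or to training models that support incremental updates, since no step ever needs the full dataset in memory.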
9. What is Classification Accuracy?
Classification Accuracy is the ratio of the correct predictions made by your model to the total number of predictions. It is usually presented as a percentage.
The complement of classification accuracy is the error rate: error rate = 1 - accuracy.
The main limitation of Classification Accuracy is that it sometimes doesn't give a clear picture of your model's performance, especially when your data contains 3 or more classes or when the classes are imbalanced.
In these cases, you cannot tell whether the model works equally well on all classes or whether it ignores some of them. For example, on a dataset where 95% of the samples belong to one class, a model that always predicts that class scores 95% accuracy while having learned nothing. So even when your accuracy percentage is high, you still cannot be sure about the performance of your model.
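To make the definition concrete, here is a minimal sketch computing accuracy and error rate over a small set of hypothetical labels:

```python
# Hypothetical true labels and model predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)   # 8 correct out of 10 -> 0.8
error_rate = 1 - accuracy          # 0.2

print(f"accuracy: {accuracy:.0%}, error rate: {error_rate:.0%}")
```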
10. What is Data Leakage? How would you prevent it?
Data Leakage occurs when the model, while being built, uses information from outside the training data. This usually happens when validation or test data leaks into the training data.
Following are certain things you can do to prevent Data Leakage:
i.) Split your dataset into train, validation and test sets, and keep everything other than the training data aside. Touch the held-out data only when you are fully done with training the model.
ii.) Avoid preparing the data on the full dataset - computing things like scaling or imputation statistics before splitting leaks information from the held-out data into training and can lead to overfitting (see the sketch after this list).
iii.) Apply a temporal cutoff - remove data recorded just prior to the event you are predicting, since such records may already encode information about the outcome.
iv.) If you suspect some variables of leaking information into the model, consider removing them.
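As a minimal sketch of point ii, assuming scikit-learn is available and using randomly generated stand-in data, the preprocessing statistics are fit on the training split only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # stand-in feature matrix
y = rng.integers(0, 2, size=100)  # stand-in binary labels

# Split first, so the test set never influences any fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std from train only
X_test_scaled = scaler.transform(X_test)        # test data is only transformed
```

Fitting the scaler before the split would fold test-set statistics into the training data, which is exactly the leakage described above.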
11. What do you know about Bagging?
Bagging, or Bootstrap Aggregation, is an ensemble method.
Ensemble methods combine the predictions of several machine learning models to make predictions that are expected to be more accurate than those of any single model.
Bagging is used to reduce the variance of high-variance algorithms like Decision Trees, and it can be used for both classification and regression problems. Each model in the ensemble is trained on a bootstrap sample of the training data (a random sample drawn with replacement), and the individual predictions are combined by voting (classification) or averaging (regression), as shown in the sketch below.
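Here is a minimal sketch of bagging decision trees with scikit-learn on a synthetic dataset; it assumes scikit-learn 1.2 or later, where the parameter is named estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 decision trees, each fit on a bootstrap sample of the training data;
# class predictions are combined by majority vote.
bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=0,
)

print("single tree:", cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```

Because each tree sees a different bootstrap sample, their individual errors are partly uncorrelated, and aggregating them reduces the variance of the combined prediction.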