Data Science python R R-Programming

Predictive Maintenance: Zero to Deployment in Manufacturing

Predictive maintenance is often seen as the holy grail of cost cutting in manufacturing. Even the feasibility study involves several steps: problem identification, sensor installation, signal processing, feature extraction and analysis, and finally modeling. Once a reliable and robust model is developed, it has to be deployed to the manufacturing environment.

Various tools are used for modeling and deployment, such as R, Python, Docker, Kubernetes, JSON and PostgreSQL. I will be discussing the process and deployment flow in this session.

I will be presenting “Predictive Maintenance: Zero to Deployment in Manufacturing” at the ODSC East virtual conference. Registration for the conference is still open.


Data Science python R R-Programming

AutoML Frameworks in R & Python

In the last few years, AutoML, or automated machine learning, has become widely popular in the data science community. Tech giants like Google, Amazon and Microsoft have started offering AutoML tools. Data scientists are still split on AutoML. Some fear that it is a threat to their jobs. Others believe the risk is bigger than a job: relying on it blindly might cost the company itself. Still others see it as a tool they could use for non-critical tasks or for presenting proofs of concept. Inarguably, it has made its mark on the data science community.

If you don’t know what AutoML is, a quick Google search will give you a good intro. According to Wikipedia, “Automated machine learning (AutoML) is the process of automating the process of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model.”

In this blog post, I will give my take on AutoML and introduce a few frameworks in R and Python.


  • Time saving: It’s a quick and dirty prototyping tool. If you are not working on a critical task, you can let AutoML do the job while you focus on more critical tasks.
  • Benchmarking: Building an ML/DL model is fun. But how do you know the model you have is the best? You either have to spend a lot of time building models iteratively or ask a colleague to build one and compare. The other option is to use AutoML to benchmark yours.


  • Most AI models that we come across are black boxes, and the same is true of these AutoML frameworks. If you don’t understand what you are doing, the results could be catastrophic.
  • Following from the previous point, AutoML is being marketed as a tool for non-data scientists. This is a bad move. Blindly using a model to make decisions without understanding how it works could be disastrous.

Personally, I do use AutoML frameworks for day-to-day tasks. It helps me save time and understand the techniques and tuning parameters behind these frameworks.
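To make the benchmarking idea concrete, here is a toy sketch in Python of what an AutoML framework does at its core: try several candidate models, score each on held-out data, and rank them on a leaderboard. This is only an illustration, not how any of the frameworks below are implemented. The data is synthetic, the candidates are k-nearest-neighbour regressors with different k, and the names knn_predict and validation_mae are my own.

```python
import random
import statistics


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def validation_mae(train, valid, k):
    """Mean absolute error of a k-NN model on the validation set."""
    return statistics.fmean([abs(y - knn_predict(train, x, k)) for x, y in valid])


rng = random.Random(0)
# synthetic data: a noisy parabola (for illustration only)
points = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(100)]
rng.shuffle(points)
train, valid = points[:70], points[70:]

# the "AutoML" part: score every candidate and sort by validation error
candidates = (1, 3, 5, 9, 15)
leaderboard = sorted((validation_mae(train, valid, k), k) for k in candidates)
for err, k in leaderboard:
    print(f"k={k:<2d} validation MAE={err:.3f}")

best_err, best_k = leaderboard[0]
print("leader: k =", best_k)
```

Real frameworks search over many model families and hyperparameters, but the leaderboard they report comes from the same kind of comparison.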

Now, let me introduce you to some of the top open source AutoML frameworks I have come across.



H2O definitely goes at the top of the list. The framework offers ML, deep learning and stacked ensemble models. Although it is written in Java, it offers connectors for R and Python through APIs. The best feature, which I have almost never seen elsewhere, is the “stopping time”: I can set how long I want to train my model. Below is the code for running it in R and Python on the Iris data set.


# Load libraries
library(h2o)
library(dplyr)

# start h2o cluster
h2o.init()

# train and test are assumed to be data frames holding a split of iris (split not shown)

# set label type
y = 'Species'
pred = setdiff(names(train), y)

# convert the label to a factor before moving the data to h2o
train[,y] = as.factor(train[,y])
test[,y] = as.factor(test[,y])

# convert data as h2o type
train_h = as.h2o(train)
test_h = as.h2o(test)

# Run AutoML for up to 20 base models or 20 seconds, whichever comes first
aml = h2o.automl(x = pred, y = y,
                  training_frame = train_h,
                  max_models = 20,
                  seed = 1,
                  max_runtime_secs = 20)

# AutoML Leaderboard
lb = aml@leaderboard

# prediction result on test data
prediction = h2o.predict(aml@leader, test_h[,-5]) %>%
  as.data.frame()

# create a confusion matrix
caret::confusionMatrix(test$Species, prediction$predict)

# close h2o connection
h2o.shutdown(prompt = F)


# load python libraries
import h2o
from h2o.automl import H2OAutoML
import pandas as pd

# start cluster
h2o.init()

# convert to h2o frame
traindf = h2o.H2OFrame(r.train)
testdf = h2o.H2OFrame(r.test)

y = "Species"
# use every column except the label as a predictor
x = list(traindf.columns)
x.remove(y)

# create df to factors
traindf[y] = traindf[y].asfactor()
testdf[y] = testdf[y].asfactor()

#run automl
aml = H2OAutoML(max_runtime_secs = 60)
aml.train(x = x, y = y, training_frame = traindf)

# view leader board
print(aml.leaderboard)

# do prediction and convert it to a data frame
predict = aml.predict(testdf)
p = predict.as_data_frame()

# convert to pandas dataframe
data = {'actual': r.test.Species, 'Ypredict': p['predict'].tolist()}

df = pd.DataFrame(data, columns = ['actual','Ypredict'])

# create a confusion matrix and print results
confusion_matrix = pd.crosstab(df['actual'], df['Ypredict'], rownames=['Actual'], colnames=['Predicted'])
print (confusion_matrix)

# close h2o connection
h2o.shutdown(prompt = False)
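The confusion matrix at the end is just a count of (actual, predicted) pairs. If you want to see what pd.crosstab is doing, here is a minimal sketch in plain Python. The labels are made up for illustration and are not output from the model above, and the helper names are mine.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of rows where the predicted label equals the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# hypothetical labels, for illustration only
actual = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "virginica", "virginica", "virginica"]

cm = confusion_counts(actual, predicted)
acc = accuracy(actual, predicted)

for (a, p), n in sorted(cm.items()):
    print(f"actual={a:<10} predicted={p:<10} n={n}")
print("accuracy:", acc)
```

Pairs where the two labels differ are the off-diagonal cells of the matrix, that is, the misclassifications.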

automl Package in R

The automl package is available on CRAN. It fits models ranging from simple regression to highly customizable deep neural networks, using either gradient descent or a metaheuristic, with automatic hyperparameter tuning and custom cost functions. It is a mix inspired by common tricks in deep learning and Particle Swarm Optimization. Below is sample code showing how to use it in R.


# load library
library(automl)

# train with manually set hyperparameters; Species is encoded as 1, 2, 3
amlmodel = automl_train_manual(Xref = subset(train, select = -c(Species)),
                               Yref = subset(train, select = c(Species))$Species
                               %>% as.numeric(),
                               hpar = list(learningrate = 0.01,
                                           minibatchsize = 2^2,
                                           numiterations = 60))

prediction = automl_predict(model = amlmodel, X = test[,1:4]) 

prediction = ifelse(prediction > 2.5, 3, ifelse(prediction > 1.5, 2, 1)) %>% as.factor()

caret::confusionMatrix(test$Species, prediction)
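Because this example encodes the three species as the numbers 1, 2 and 3 and fits a regression, the continuous predictions have to be mapped back to class labels. The nested ifelse above does this with cut-offs at 1.5 and 2.5. A minimal Python sketch of the same mapping (the function name to_class is mine, and the scores are made up):

```python
def to_class(score):
    """Map a continuous regression score to class 1, 2 or 3.

    Mirrors ifelse(p > 2.5, 3, ifelse(p > 1.5, 2, 1)) from the R code.
    """
    if score > 2.5:
        return 3
    if score > 1.5:
        return 2
    return 1


# hypothetical regression outputs, for illustration only
scores = [0.9, 1.4, 1.6, 2.5, 2.51, 3.2]
labels = [to_class(s) for s in scores]
print(labels)
```

Note that a score of exactly 2.5 falls into class 2, because the comparison is strict.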

Remix AutoML



Remix AutoML was developed by remyx institute. According to the developers, “This is a collection of functions that I have made to speed up machine learning and to ensure high quality modeling results and output are generated. They are great at establishing solid baselines that are extremely challenging to beat using alternative methods (if at all). They are intended to make the development cycle fast and robust, along with making operationalizing quick and easy, with low latency model scoring.” Below is sample code showing how to use it in R.

# encode the label as 1, 2, 3 so it can be modeled as a regression target
train$Species = train$Species %>% as.integer()
remixml = AutoCatBoostRegression(data = train %>% data.matrix()
                                 , TargetColumnName = "Species"
                                 , FeatureColNames = c(1:4)
                                 , MaxModelsInGrid = 1
                                 , ModelID = "ModelTest"
                                 , ReturnModelObjects = F
                                 , Trees = 150
                                 , task_type = "CPU"
                                 , GridTune = FALSE)
predictions = AutoCatBoostScoring(TargetType = 'regression'
                                  , ScoringData = test %>% data.table::data.table()
                                  , FeatureColumnNames = c(1:4)
                                  , ModelObject = remixml$Model)

prediction = ifelse(predictions$Predictions > 2.5, 3, ifelse(predictions$Predictions > 1.5, 2, 1)) %>% as.factor()

caret::confusionMatrix(test$Species, prediction)



The autoxgboost package aims to find an optimal xgboost model automatically, using the machine learning framework mlr and the Bayesian optimization framework mlrMBO. The development version of this package is available on GitHub. Below is sample code showing how to use it in R.

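# load libraries (the task and control helpers below come from mlr and mlrMBO)
library(autoxgboost)
library(mlr)
library(mlrMBO)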

# create a classification task
trainTask = makeClassifTask(data = train, target = "Species")

# create a control object for optimizer
ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 5L) 

# fit the model
res = autoxgboost(trainTask, control = ctrl, tune.threshold = FALSE)

# do prediction and print confusion matrix
prediction = predict(res, test[,1:4])
caret::confusionMatrix(test$Species, prediction$data$response)


Auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. According to the Auto-sklearn team, “auto-sklearn frees a machine learning user from algorithm selection and hyperparameter tuning. It leverages recent advantages in Bayesian optimization, meta-learning and ensemble construction. Learn more about the technology behind auto-sklearn by reading our paper published at NIPS 2015.” Also note that this framework is possibly the slowest of all the frameworks presented in this post. Below is sample code showing how to use it in Python.

import autosklearn.classification
import sklearn.model_selection
import sklearn.metrics
import pandas as pd

train = pd.DataFrame(r.train)
test = pd.DataFrame(r.test)

# the first four columns are the features (Python indexing starts at 0)
x_train = train.iloc[:,0:4]
y_train = train[['Species']]
x_test = test.iloc[:,0:4]
y_test = test[['Species']]

automl = autosklearn.classification.AutoSklearnClassifier()
print("fitting")
automl.fit(x_train, y_train)
y_hat = automl.predict(x_test)

# convert to pandas dataframe
data = {'actual': r.test.Species, 'Ypredict': y_hat.tolist()}

df = pd.DataFrame(data, columns = ['actual','Ypredict'])

# create a confusion matrix and print results
confusion_matrix = pd.crosstab(df['actual'], df['Ypredict'], rownames=['Actual'], colnames=['Predicted'])
print (confusion_matrix)



AutoGluon is the latest offering by AWS Labs. According to the developers, “AutoGluon enables easy-to-use and easy-to-extend AutoML with a focus on deep learning and real-world applications spanning image, text, or tabular data. Intended for both ML beginners and experts, AutoGluon enables you to:

  • Quickly prototype deep learning solutions for your data with few lines of code.
  • Leverage automatic hyperparameter tuning, model selection / architecture search, and data processing.
  • Automatically utilize state-of-the-art deep learning techniques without expert knowledge.
  • Easily improve existing bespoke models and data pipelines, or customize AutoGluon for your use-case.”

Below is sample code showing how to use it in Python.

#import autogluon as ag
from autogluon import TabularPrediction as task
import pandas as pd

train_data = task.Dataset(file_path = "TRAIN_DATA.csv")
test_data = task.Dataset(file_path = "TEST_DATA.csv")

label_column = 'Species'
print("Summary of class variable: \n", train_data[label_column].describe())

predictor = task.fit(train_data = train_data, label = label_column)

y_test = test_data[label_column]  # values to predict

y_pred = predictor.predict(test_data)
print("Predictions:  ", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

The frameworks above are just a few that scratch the surface. Honorable mentions include autokeras, deep learning studio, auto-weka and tpot. There are also paid tools from Dataiku, DataRobot, RapidMiner and others. As you can see, there are many open source tools that you can use today, and here is a list of open source AutoML projects being worked on right now.

Hope you enjoyed this post. Comment below to let me know if I missed any frameworks worth mentioning. Do subscribe to this blog and check out my other posts.

Data Science python R R-Programming

Top Data Science Blogs

As a data scientist, I am always looking to learn about new tools and techniques. Although research papers are a great resource, they are mostly either theoretical or lacking in hands-on explanation. Blogs are a great way to learn if you are like me. They are concise, application oriented and don’t require a lot of time. There are hundreds or maybe thousands of blogs out there. A few that I follow frequently are R-bloggers, python-bloggers, machine learning mastery, and the ODSC blog. I was curious to see if there were more like these that I could follow, and I came across this compilation from feedspot. It includes links to pages, posting frequency and ranking. Although it is not an exhaustive list, since feedspot relies on bloggers to submit their sites, it’s a great source for finding a few to follow.

Below is the list compiled by feedspot.

#   Page Title   DA   Posting Frequency   Location
1   Data Science Central   70   2/day   Los Angeles, California, United States
2   KDnuggets   69   21/week   Brookline, Massachusetts, United States
3   Data Science | Reddit   91   23/day   -
4   Analytics Vidhya   60   1/day   Gurgaon, Haryana, India
5   DataCamp Blog   61   -   -
6   Chris Albon – Data Science, Machine Learning, and Artificial Intelligence   41   -   Rural border town
7   Yhat: The Yhat (Pronounced Y-hat) Blog   49   -   Brooklyn, NY
8   Data Science 101   45   4/week   -
9   UC San Diego’s Data Science Blog – Center for Computational Biology & Bioinformatics   90   -   San Diego, CA
10   Kaggle | Data Science News – No Free Hunch   96   1/day   San Francisco
11   Dataquest Blog   55   4/month   Boston, MA
12   Codementor » Data Science Tutorials   60   2/week   -
13   Analytics Insight   48   5/day   Hyderabad, India
14   Data Science made in Switzerland   64   1/month   Winterthur, Zürich
15   Command Line Tips   23   2/week   -
16   DATAVERSITY » Data Science News, Articles, & Education   59   1/quarter   -
17   365 Data Science   32   5/week   -
18   Domino Data Science Blog   51   1/month   San Francisco, CA
19   Data Science Blog | AI, ML, big data analytics   42   3/month   -
20   The Data Incubator   49   1/month   New York, USA
21   Hevo Blog | Transformative ideas and real insights on all things Data   32   8/week   San Francisco, CA
22   Revolutions   59   1/week   Chicago, Illinois, United States
23   Insight Data   52   2/month   Palo Alto, CA
24   Little Miss Data   27   1/week   Texas, USA
25   Dataaspirant   41   -   India
26   Dimensionless   30   4/quarter   Navi Mumbai, India
27   Dimensionless Technologies Blog | Data Science & Business Analytics   30   4/quarter   Navi Mumbai, India
28   Data Science Blog (English only)   37   1/week   Berlin
29   Data Elixir   40   1/week   -
30   Win-Vector Blog   46   7/month   San Francisco, CA
31   DataMites Blog   17   2/month   Bengaluru, India
32   Data Plus Science   37   4/quarter   Cincinnati, Ohio, United States
33   Data Science Association blog   49   1/month   Denver, CO
34   b.telligent – Data Science Blog   31   -   Munich (DE), Zurich (CH)
35   Verra Mobility   47   1/week   -
36   Ujjwal Karn – The Data Science blog   38   -   -
37   mlwhiz   33   7/month   -
38   Data Science for Social Good   49   1/quarter   Chicago, Illinois, United States
39   InnoArchiTech   40   -   -
40   SV Data Science   48   -   Silicon Valley, CA
41   Yanir Seroussi | Data science and beyond   38   2/quarter   -
42   Data Skeptic   44   1/week   Los Angeles, CA
43   Backyard Data Science | Precision Guesswork While You Wait   27   6/quarter   -
44   Data Piques   32   2/quarter   New York, NY
45   Data Science Review | Learning Data Science Right   10   2/month   New York, USA
46   Mad (Data) Scientist   32   1/month   -
47   Blue Orange   24   2/week   New York City, New York, United States
48   Data Science, Database, Tools and QA Learning’s   9   1/week   Bangalore, Karnataka, India
49   Neural Market Trends   32   8/quarter   NJ, USA
50   Data Science Unicorn   8   1/week   -
51   Data Science Africa   15   -   Africa
52   Nina B. Zumel | Data Science   21   2/year   -
53   Oracle Data Science Blog   94   1/week   Los Angeles, CA
54   The Data Science Community   30   1/month   Brussels
55   Data Science Consulting LLC   5   -   -
56   Datalore Labs Blog   1   -   -
57   AnalytiXon   17   1/day   -
58   Perfect Price Blog   35   1/week   -
59   FOXY DATA SCIENCE   8   2/month   Greece
60   Data Science Diary   4   -   -
61   My Data Science and Big Data blog   13   -   -
62   Datapolitan   26   -   New York, NY
63   Appsilon Data Science Blog   33   -   Warsaw
64   ikigomu   19   2/year   -
65   DSI Analytics – Data Science Insights   24   -   Amsterdam, The Netherlands
66   ERDataDoc   7   -   Billings Montana
67   Carlo Carandang – Data Science   11   -   Canada
68   Hi! I am Nagdev   0   2/week   United States
69   Becoming A Data Scientist   43   4/year   Harrisonburg, VA
70   The Data Blogger   29   1/year   -
71   Data Science at NIH   95   30/year   Bethesda, MD
72   Datascience@Berkeley | Online Learning Blog   93   3/year   Berkeley, CA
73   Insight Extractor – Blog   23   8/year   United States
74   Socrates Data Science Blog   2   1/year   -
75   District Data Labs   50   -   Washington, DC
76   data science ish   45   -   Salt Lake City, UT
77   DataScience@SMU – Southern Methodist University   77   1/week   -
78   NYU Center for Data Science   91   -   New York, NY
79   DataRobot | Machine Learning Software   59   -   Boston, Massachusetts, United States

Let me know if there are any other blogs that I missed out here.

Follow my blog for other latest data science related posts.

R R-Programming

Simulating your Model Training for Robust ML Models

In this post, I will explain why you should run simulations on your model training process so that your models don’t fail in production. Traditionally, we train models at a fixed split ratio of, say, 70:30 or 80:20. The fundamental issue is that with a single split we never train the model on different parts of the data, as shown in the image below.


Hence, it becomes vital to train the model under various scenarios to make sure it is not biased. This also helps ensure that the model is reliable and robust enough to deploy into production.

Below we will go over an example on how this all comes together. The steps are as follows:

  1. load libraries
  2. load mtcars data set
  3. write a function to split data at different degrees
  4. run simulation in a loop to get error rates
  5. visualize the results

We will first begin with loading libraries and our data set as shown below

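# load libraries (inferred from the functions used below)
library(e1071)    # svm()
library(dplyr)    # %>% and select()
library(Metrics)  # mae()

# load data
data(mtcars)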

Next, we will write a function that includes

  • set a new seed value on every run. This is because we want to sample different data every time (Duh! that’s the whole point of this simulation)
  • split the data in to train and test at various ratios
  • build an SVM model using train data
  • do predictions on test data
  • calculate & return error value (MAE)
# function to run simulation
runsimulation = function(i, split){

  seedValue = i*rnorm(1)

  # change seed values
  set.seed(seedValue)

  # create samples
  samples = sample(1:nrow(mtcars), split*nrow(mtcars))

  # split data to test and train
  train = mtcars[samples, ]
  test = mtcars[-samples, ]

  # build a model
  model = svm(mpg ~ ., data  = train, scale = F, kernel = "radial")

  # do predictions
  prediction = predict(model, test %>% select(-mpg))

  # calculate error
  error = mae(actual = test$mpg, predicted = prediction)

  # return error values
  return(error)
}


We will create a sequence of split ratios and then run them in a loop. For each split ratio, we will do 300 runs.

# create split ratios
split = seq(from = 0.5, to = 0.95, by = 0.05) %>% rep(300) %>% sort(decreasing = FALSE)

# get the length of i for seed values
i = 1:length(split)

# get errors
errors = mapply(runsimulation, i = i, split = split)

# put data into a data frame
simResults = data.frame(split, errors)
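If you prefer Python, the same simulation loop can be sketched with the standard library alone. Note the assumptions: a synthetic data set stands in for mtcars, a straight-line least squares fit stands in for the SVM, and only three split ratios are tried. The function names are mine.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def run_simulation(data, split, rng):
    """One random train/test split; returns the MAE on the test part."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = int(split * len(data))
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    a, b = fit_line([x for x, _ in train], [y for _, y in train])
    return statistics.fmean([abs(y - (a + b * x)) for x, y in test])


rng = random.Random(1)
# synthetic stand-in for mtcars: 32 noisy points on a line
data = [(x, 3.0 + 2.0 * x + rng.gauss(0, 1)) for x in range(32)]

results = {}
for split in (0.5, 0.7, 0.9):
    # 300 different random splits per ratio
    errors = [run_simulation(data, split, rng) for _ in range(300)]
    results[split] = (statistics.fmean(errors), statistics.stdev(errors))
    print(split, results[split])
```

The point is the same as in the R version: each ratio is judged on a distribution of errors over many random splits, not on a single lucky or unlucky split.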

Finally, we visualize the data and look at the results. In the box plot below, we can see that the median error decreases as the split ratio increases. This is expected, as we are feeding more data to the model. We also notice that the minimum error decreases as we add more training data, while the maximum error increases. The quartiles show a similar pattern.


Next, we will look at the mean and standard deviation of the error for each split ratio. We notice that the lowest average error comes with the 95% split, which also has the highest SD, and vice versa.

# plot results
boxplot(errors ~ split,
data = simResults,
main = "Error Simulation Results",
xlab = "Train Split",
ylab = "Mean Absolute Error",
col = "light blue")
grid (NULL,NULL, lty = 6, col = "cornsilk2") 

simResults %>%
  group_by(split) %>%
  summarise(mean = mean(errors), sd = sd(errors)) %>%
  as.data.frame()
#     split   mean      sd
# 1   0.50 4.826838 0.7090876
# 2   0.55 4.701303 0.8178482
# 3   0.60 4.674690 0.8442144
# 4   0.65 4.645363 0.8727532
# 5   0.70 4.652534 1.0769249
# 6   0.75 4.555186 1.1046217
# 7   0.80 4.588761 1.3002216
# 8   0.85 4.572775 1.6021275
# 9   0.90 4.519118 1.7865828
# 10  0.95 4.443357 2.4188333

At this point, it’s up to the decision maker to decide which model to go for: can they afford significant variation in error rates, or do they want to control the variance of the error rate? If I were the decision maker, I would go with either the 65% or the 70% split and control that variance in error.

In conclusion, machine learning is hard. It’s not as simple as fitting a model to data. You need to run simulations like the one above to analyze your models. The above is the simplest case you could come across. Once you get to hyperparameters, it gets even more complicated. There is no one set of tools or flows that works for everything. You sometimes need to get creative and come up with your own flows.

Hope you enjoyed this quick tutorial. Feel free to like, share and subscribe to this blog. 

R R-Programming Thoughts

COVID-19 Data and Prediction for Michigan

Every country is facing a global pandemic caused by COVID-19, and it’s quite scary for everyone. Unlike any pandemic we have faced before, COVID-19 is producing plenty of quality data in near real time. Making this data available to the general public has helped citizen data scientists share their reports, forecast trends and build real-time dashboards.

Like everyone else, I am curious: “How long will all this last?” So, I decided to pull up some data for my state and see if I could build a prediction model.

Getting all the Data Needed

CDC and your state gov websites should be publishing data every day. I got my data from and click on Detroit. Here is the link to compiled data on my GitHub.

Visualize Data


From the above plot, we can clearly see that total cases are increasing exponentially, and total deaths seem to follow a similar trend.


The correlation between the variables is shown below. We will use only Day and Cases for model building, because we want to be able to extrapolate to visualize future trends.

             Day     Cases     Daily  Previous    Deaths
Day      1.0000000 0.8699299 0.8990702 0.8715494 0.7617497
Cases    0.8699299 1.0000000 0.9614424 0.9570949 0.9597218
Daily    0.8990702 0.9614424 1.0000000 0.9350738 0.8990124
Previous 0.8715494 0.9570949 0.9350738 1.0000000 0.9004541
Deaths   0.7617497 0.9597218 0.8990124 0.9004541 1.0000000

Build a Model for Total Cases

To build the model, we will first split the data into train and test sets. The split ratio is set at 80%. Next, we build an exponential regression model by using the simple lm function on log(Cases). Finally, we can view the summary of the model.

 # create samples from the data
samples = sample(1:16, size = 16*0.8)

# build an exponential regression model
model = lm(log(Cases) ~ Day + I(Day^2) , data = data[samples,])

# look at the summary of the model
summary(model)

In the summary below, we can see that Day is highly significant for our prediction, while Day^2 is only marginally significant (p = 0.067). We will still keep it. The adjusted R-squared is 0.97, indicating a close fit to the training data, and the p-value of the overall F-test is well below 0.05, indicating that the model is significant.

Note: Don’t bash me about the number of samples. I agree that this is not a good number of samples and that I might be overfitting.

Call:
lm(formula = log(Cases) ~ Day + I(Day^2), data = data[samples, ])

Residuals:
     Min       1Q   Median       3Q      Max
-0.58417 -0.13007  0.07647  0.17218  0.56305 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.091554   0.347073  -0.264   0.7979
Day          0.711025   0.104040   6.834 7.61e-05 ***
I(Day^2)    -0.013296   0.006391  -2.080   0.0672 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3772 on 9 degrees of freedom
Multiple R-squared:  0.9806,	Adjusted R-squared:  0.9763
F-statistic:   228 on 2 and 9 DF,  p-value: 1.951e-08
Prediction for New Data

Prediction Time

Now that we have a model, we can do predictions on the test data. In all honesty, I did not intend to make the prediction call this complicated, but here it is. From the predictions, we calculate the Mean Absolute Error (MAE). It indicates that our predictions are off by about 114 cases on average, either over or under the actual count.

“Seems like Overfitting!!”

results = data.frame(actual = data[-samples,]$Cases,
           Prediction = exp(predict(model, data.frame(Day = data$Day[-samples]))))
# view test results
results

#     actual Prediction
# 1     25   12.67729
# 2     53   40.28360
# 3    110  186.92442
# 4   2294 2646.77897

# calculate mae
Metrics::mae(results$actual, results$Prediction)
# [1] 113.6856
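To see where these numbers come from, here is a small Python sketch that redoes the arithmetic by hand. The coefficients are copied from the model summary and the actual and predicted values from the results table. Because the model was fit on log(Cases), a prediction is the exponential of the fitted polynomial. The day value 4 is my own assumption, picked because it happens to reproduce the first predicted value; the function names are mine.

```python
import math

# coefficients copied from the model summary above
B0, B1, B2 = -0.091554, 0.711025, -0.013296


def predict_cases(day):
    """Back-transform the log-scale fit: exp(b0 + b1*day + b2*day^2)."""
    return math.exp(B0 + B1 * day + B2 * day ** 2)


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# test-set values copied from the results table above
actual = [25, 53, 110, 2294]
predicted = [12.67729, 40.28360, 186.92442, 2646.77897]

error = mae(actual, predicted)
print(f"MAE: {error:.3f}")

# day 4 is an assumed input, used only to illustrate the back-transform
print(f"prediction for day 4: {predict_cases(4):.2f}")
```

Almost all of the MAE comes from the last row, where the model is off by roughly 350 cases. The three smaller rows barely move the average, even though they are far off in relative terms.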

Visualize the Predictions

Let’s plot the model results over the entire data set, train and test, to see how close we are. The plot seems to show that our predictions are very accurate. This might be because of the scaling of the axis.


Now, let’s try a log scale, as shown below. Here we can see that our model was overestimating the total cases. This is also a valuable lesson in how two different charts of the same results can lead to different interpretations.
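The numbers help explain why the two charts look so different. On a linear axis that runs into the thousands, being off by 12 cases is invisible. On a log axis, what shows up is the ratio of predicted to actual. A quick Python sketch using the four test rows from the results table earlier in the post:

```python
import math

# test-set values copied from the results table in the prediction section
actual = [25, 53, 110, 2294]
predicted = [12.67729, 40.28360, 186.92442, 2646.77897]

# difference on the linear scale and on the log scale, row by row
abs_errors = [p - a for a, p in zip(actual, predicted)]
log_ratios = [math.log(p / a) for a, p in zip(actual, predicted)]

for a, p, e, r in zip(actual, predicted, abs_errors, log_ratios):
    print(f"actual={a:5d} predicted={p:9.2f} "
          f"difference={e:8.2f} log(predicted/actual)={r:6.3f}")
```

The first row has the smallest difference in cases but the largest gap on the log scale, since the prediction is about half of the actual value. The last row is the reverse: the biggest difference in cases, but only about 15% off.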



From the above analysis and model building, we saw how we can predict the number of pandemic cases in Michigan. On further analysis, we found that the model was too good to be true, or overfitting. For now, I don’t have a lot of data to work with. I will give this model another try in a week to see how it performs when fed more data. This would be a good experiment.

Let me know what you think of this, and comment on what I should have done differently.

Industry R R-Programming

Auto Encoders for Anomaly Detection in Predictive Maintenance

An autoencoder is an unsupervised neural network used for data encoding. The technique is mainly used to learn a representation of the data that can be used for dimensionality reduction, by training the network to ignore noise. Autoencoders play an important role in unsupervised learning and in deep architectures, mainly for transfer learning (Pierre. B, 2012). When autoencoders are decoded, they are simple linear circuits that transform inputs to outputs with the least distortion. Autoencoders were first introduced in the 1980s to address the issue of backpropagation without a teacher, by using the input itself as the teacher (Rumelhart et al., 1986). Since then, autoencoders have gone through a phase change to the form of Restricted Boltzmann Machines. Today, autoencoders are used in various applications such as predicting sentiment distributions in Natural Language Processing (NLP) (Socher et al., 2011a) (Socher et al., 2011b), feature extraction (Masci et al., 2011), anomaly detection (Sakurada et al., 2014), facial recognition (Gao et al., 2015), clustering (Dilokthanakul et al., 2016), image classification (Geng et al., 2015) and many other applications.


Image: Simple auto encoder representation

In today’s tutorial, I will go over how to use auto encoders for anomaly detection in predictive maintenance.

Load Libraries

You will need only two libraries for this analysis.


# load libraries
library(dplyr)
library(h2o)

Load data

Here we are using data from a drill press, which can be downloaded from my GitHub repo. This is experimental data that I generated in a lab for my PhD dissertation. The machine was run in four different states, and the data are split into four different csv files. We need to load the data first. In the data, time represents the time between samples, ax is the acceleration on the x axis, ay is the acceleration on the y axis, az is the acceleration on the z axis and aT is the total acceleration in G’s. The data was collected at a sample rate of 100 Hz.

Four different states of the machine were collected

1. Nothing attached to drill press

2. Wooden base attached to drill press

3. Imbalance created by adding weight to one end of wooden base

4. Imbalance created by adding weight to two ends of wooden base.

#read csv files
file1 = read.csv("dry run.csv", sep=",", header =T)
file2 = read.csv("base.csv", sep=",", header =T)
file3 = read.csv("imbalance 1.csv", sep=",", header =T)
file4 = read.csv("imbalance 2.csv", sep=",", header =T)
time ax ay az aT
<dbl> <dbl> <dbl> <dbl> <dbl>
0.002 -0.3246 0.2748 0.1502 0.451
0.009 0.6020 -0.1900 -0.3227 0.709
0.019 0.9787 0.3258 0.0124 1.032
0.027 0.6141 -0.4179 0.0471 0.744
0.038 -0.3218 -0.6389 -0.4259 0.833
0.047 -0.3607 0.1332 -0.1291 0.406

We can look at the summary of each file using the summary function in R. Below, we can observe that 66 seconds of data are available. We also have the min, max and mean for each of the variables.

# summary of each file
      time               ax                  ay                  az          
 Min.   :  0.004   Min.   :-1.402700   Min.   :-1.693300   Min.   :-3.18930  
 1st Qu.: 27.005   1st Qu.:-0.311100   1st Qu.:-0.429600   1st Qu.:-0.57337  
 Median : 54.142   Median : 0.015100   Median :-0.010700   Median :-0.11835  
 Mean   : 54.086   Mean   : 0.005385   Mean   :-0.002534   Mean   :-0.09105  
 3rd Qu.: 81.146   3rd Qu.: 0.314800   3rd Qu.: 0.419475   3rd Qu.: 0.34815  
 Max.   :108.127   Max.   : 1.771900   Max.   : 1.515600   Max.   : 5.04610  
       aT        
 Min.   :0.0360  
 1st Qu.:0.6270  
 Median :0.8670  
 Mean   :0.9261  
 3rd Qu.:1.1550  
 Max.   :5.2950

Data Aggregation and feature extraction

Here, the data is aggregated into 1-second windows (time is rounded to the nearest second) and features are extracted from each window. Extracting features reduces the size of the data by storing only a summary representation of it.

file1$group = as.factor(round(file1$time))
file2$group = as.factor(round(file2$time))
file3$group = as.factor(round(file3$time))
file4$group = as.factor(round(file4$time))

#list of all files
files = list(file1, file2, file3, file4)

#loop through all files and combine
features = NULL
for (i in 1:4){
res = files[[i]] %>%
    group_by(group) %>%
    summarize(ax_mean = mean(ax),
              ax_sd = sd(ax),
              ax_min = min(ax),
              ax_max = max(ax),
              ax_median = median(ax),
              ay_mean = mean(ay),
              ay_sd = sd(ay),
              ay_min = min(ay),
              ay_may = max(ay),
              ay_median = median(ay),
              az_mean = mean(az),
              az_sd = sd(az),
              az_min = min(az),
              az_maz = max(az),
              az_median = median(az),
              aT_mean = mean(aT),
              aT_sd = sd(aT),
              aT_min = min(aT),
              aT_maT = max(aT),
              aT_median = median(aT)
    )
    features = rbind(features, res)
}

#view all features
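For readers more at home in Python, the aggregation step can be sketched with the standard library. Samples are grouped by their time rounded to the nearest second, and each group is reduced to mean, sd, min, max and median. The first six (time, ax) pairs are copied from the data preview earlier in the post; the last two are made up so that a second window exists. The function name window_features is mine.

```python
import statistics
from collections import defaultdict


def window_features(times, values, name):
    """Summarise values per one-second window (time rounded to nearest second)."""
    groups = defaultdict(list)
    for t, v in zip(times, values):
        groups[round(t)].append(v)
    rows = []
    for g in sorted(groups):
        vals = groups[g]
        rows.append({
            "group": g,
            f"{name}_mean": statistics.fmean(vals),
            # sd needs at least two samples in the window
            f"{name}_sd": statistics.stdev(vals) if len(vals) > 1 else float("nan"),
            f"{name}_min": min(vals),
            f"{name}_max": max(vals),
            f"{name}_median": statistics.median(vals),
        })
    return rows


# six samples from the data preview, plus two hypothetical ones near t = 1 s
times = [0.002, 0.009, 0.019, 0.027, 0.038, 0.047, 0.98, 1.02]
ax = [-0.3246, 0.6020, 0.9787, 0.6141, -0.3218, -0.3607, 0.10, 0.20]

rows = window_features(times, ax, "ax")
for row in rows:
    print(row)
```

At 100 Hz, each real window holds about 100 samples, so every second of raw signal is reduced to a handful of numbers per axis.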

Create Train and Test Set

To build an anomaly detection model, a train and a test set are required. Here, the data from the normal condition is used for training and the remainder is used for testing.

# create train and test set
train = features[1:67,2:ncol(features)]
test = features[68:nrow(features),2:ncol(features)]

Auto Encoders

Auto Encoders using H2O package

Use the h2o.init() method to initialize H2O. In most cases, simply calling h2o.init() is all that a user is required to do.

# initialize h2o cluster
h2o.init()

The R object to be converted to an H2O object should be named so that it can be used in subsequent analysis. Also note that the R object is converted to a parsed H2O data object, and will be treated as a data frame by H2O in subsequent analysis.

# convert train and test to h2o object
train_h2o = as.h2o(train)
test_h2o = as.h2o(test)

The h2o.deeplearning function fits H2O’s Deep Learning models from within R. While H2O Deep Learning has many parameters, it was designed to be just as easy to use as the other supervised training methods in H2O. Early stopping, automatic data standardization and handling of categorical variables and missing values and adaptive learning rates (per weight) reduce the amount of parameters the user has to specify. Often, it’s just the number and sizes of hidden layers, the number of epochs and the activation function and maybe some regularization techniques.

# build auto encoder model with 5 hidden layers (10-unit bottleneck)
model_unsup = h2o.deeplearning(x = 2:ncol(features)
                 , training_frame = train_h2o
                 , model_id = "Test01"
                 , autoencoder = TRUE
                 , reproducible = TRUE
                 , ignore_const_cols = FALSE
                 , seed = 42
                 , hidden = c(50,10,50,100,100)
                 , epochs = 100
                 , activation ="Tanh")

# view the model
model_unsup
Model Details:

H2OAutoEncoderModel: deeplearning
Model ID:  Test01 
Status of Neuron Layers: auto-encoder, gaussian distribution, Quadratic loss, 19,179 weights/biases, 236.0 KB, 2,546 training samples, mini-batch size 1
  layer units  type dropout       l1       l2 mean_rate rate_rms momentum
1     1    19 Input  0.00 %       NA       NA        NA       NA       NA
2     2    50  Tanh  0.00 % 0.000000 0.000000  0.029104 0.007101 0.000000
3     3    10  Tanh  0.00 % 0.000000 0.000000  0.021010 0.006320 0.000000
4     4    50  Tanh  0.00 % 0.000000 0.000000  0.024570 0.006848 0.000000
5     5   100  Tanh  0.00 % 0.000000 0.000000  0.052482 0.018357 0.000000
6     6   100  Tanh  0.00 % 0.000000 0.000000  0.052677 0.021417 0.000000
7     7    19  Tanh      NA 0.000000 0.000000  0.025557 0.009494 0.000000
  mean_weight weight_rms mean_bias bias_rms
1          NA         NA        NA       NA
2    0.000069   0.180678  0.001542 0.017311
3    0.000008   0.187546 -0.000435 0.011542
4    0.011644   0.184633  0.000371 0.006443
5    0.000063   0.113350 -0.000964 0.008983
6    0.000581   0.100150  0.001003 0.013848
7   -0.001349   0.121616  0.006549 0.012720

H2OAutoEncoderMetrics: deeplearning
** Reported on training data. **

Training Set Metrics: 

MSE: (Extract with `h2o.mse`) 0.005829827
RMSE: (Extract with `h2o.rmse`) 0.0763533

Detect anomalies in an H2O data set using an H2O deep learning model with auto-encoding trained previously.

# now we need to calculate MSE or anomaly score
anmlt = h2o.anomaly(model_unsup
                      , train_h2o
                      , per_feature = FALSE) %>%
  as.data.frame()

# create a label for healthy data
anmlt$y = 0

# view top data
head(anmlt)
Reconstruction.MSE y
<dbl> <dbl>
0.001953387 0
0.004875430 0
0.002195593 0
0.006722837 0
0.001670331 0
0.005859846 0

Calculate the threshold value for the train anomaly scores. Various methods can be used, such as quantiles, max, median, or min; it all depends on the use case. Here we will use the 99.9% quantile.

# calculate thresholds from train data
threshold = quantile(anmlt$Reconstruction.MSE, probs = 0.999)
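To get a feel for how the choice of threshold changes the outcome, here is a minimal sketch in base R. The reconstruction errors are simulated (an assumption, standing in for the h2o.anomaly() output), and the three candidate rules are only illustrations of the options mentioned above.

```r
# Simulated reconstruction errors (assumption: stand-ins for h2o.anomaly() output)
set.seed(42)
train_mse = rexp(67, rate = 200)
test_mse = rexp(30, rate = 200) + seq(0, 0.05, length.out = 30)

# Three candidate thresholds computed from the healthy (train) scores only
thresholds = c(
  q999     = unname(quantile(train_mse, probs = 0.999)),
  max      = max(train_mse),
  mean_3sd = mean(train_mse) + 3 * sd(train_mse)
)

# Count how many test observations each threshold would flag
flag_counts = sapply(thresholds, function(th) sum(test_mse > th))
print(thresholds)
print(flag_counts)
```

With only 67 training rows, the 99.9% quantile is interpolated between the two largest training scores, so it sits just below the maximum and exactly one training point ends up above it.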

Now that we have the anomaly scores for the train data and their threshold, we can calculate anomaly scores for the test data and plot them to see how they differ from the train data.

# calculate anomaly scores for test data
test_anmlt = h2o.anomaly(model_unsup
                      , test_h2o
                      , per_feature = FALSE) %>%
  as.data.frame()

# create a label for test data
test_anmlt$y = 1

# combine the train and test anomaly scores for visualization
results = data.frame(rbind(anmlt,test_anmlt), threshold)
head(results)
Reconstruction.MSE y threshold
<dbl> <dbl> <dbl>
0.001953387 0 0.01705935
0.004875430 0 0.01705935
0.002195593 0 0.01705935
0.006722837 0 0.01705935
0.001670331 0 0.01705935
0.005859846 0 0.01705935

The results are plotted below. The x axis shows the observations and the y axis shows the anomaly score. Points below the threshold are drawn in green and points above it in red; the train data come first, followed by the test data. Note that all of the training data except one point lay below the anomaly limit. It's also interesting to note the increasing trend in the anomaly scores for the other states of the machine.

# Adjust plot sizes
options(repr.plot.width = 15, repr.plot.height = 6)
plot(results$Reconstruction.MSE, type = 'n', xlab='observations', ylab='Reconstruction.MSE', main = "Anomaly Detection Results")
points(results$Reconstruction.MSE, pch=19, col=ifelse(results$Reconstruction.MSE < threshold, "green", "red"))
abline(h=threshold, col='red', lwd=2)
Figure: auto encoder anomaly detection results


Auto encoders are a very powerful tool and very fun to play with. They have been used in image analysis, image reconstruction and image colorization. In this tutorial you have seen how to perform anomaly detection on simple signal data with a few lines of code. The possibilities of using this are many. Let me know what you think about auto encoders in the comments below.


Follow my work

Github, Researchgate, and LinkedIn 

Session info

Below is the session info for the packages and their versions used in this analysis.

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] h2o_3.26.0.2 dplyr_0.8.3 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2          magrittr_1.5        tidyselect_0.2.5   
 [4] uuid_0.1-2          R6_2.4.0            rlang_0.4.0        
 [7] tools_3.3.3         htmltools_0.3.6     assertthat_0.2.1   
[10] digest_0.6.20       tibble_2.1.3        crayon_1.3.4       
[13] IRdisplay_0.7.0     purrr_0.3.2         repr_1.0.1         
[16] base64enc_0.1-3     vctrs_0.2.0         bitops_1.0-6       
[19] RCurl_1.95-4.12     IRkernel_1.0.2.9000 zeallot_0.1.0      
[22] glue_1.3.1          evaluate_0.14       pbdZMQ_0.3-3       
[25] pillar_1.4.2        backports_1.1.4     jsonlite_1.6       
[28] pkgconfig_2.0.2
Computer related R R-Programming

Convolutional Neural Network under the Hood

Neural networks have really taken over image recognition and high-sample-rate data problems in the last couple of years. In all honesty, I promise I won’t be teaching you what neural networks or CNNs are. There are hundreds of resources published every day explaining them. I’ll post a few links below.

I am a serious R user and very new to the deep learning domain. As I started coming across new image classification projects, I started to lean towards CNNs. I went over a few tutorials on image classification using CNNs and read a few books. After a few, I started to see the same old pattern in every blog post:

  1. Download data set
  2. Split them three ways (train/test/validation)
  3. Create a model (in most cases pre-trained models)
  4. Set up generators
  5. Compile the model
  6. Predict
  7. End

I understood the concepts of filters, filter size and activation functions. But I was curious about what the network was actually seeing through the filter. I did a lot of digging and found a Stack Overflow post linking to RStudio’s Keras FAQ. It was literally 3 lines of code to visualize what was happening at each layer. Meanwhile, in Python it was over two dozen lines of code. (Irony!) I thought there might be quite a few people out there who would be interested in knowing how to do this in R, just like me. So, I decided to write this blog post. It would be very useful when you are explaining this to your boss or a work colleague.

Let’s get started!

Initial Setup

Downloading Data Set

For this example, I will be using cats and dogs data set from Kaggle. You can follow the link and download the data. You might have to create an account to download it.

If you have your own data then don’t worry about this step. Skip it.

Load Keras library

library(keras)


Split the data into train and test

The code below is courtesy of the RStudio blog.

original_dataset_dir = "/home/rstudio/train"
base_dir = "/home/rstudio/data"

train_dir = file.path(base_dir, "train")
test_dir = file.path(base_dir, "test")
train_cats_dir = file.path(train_dir, "cats")
train_dogs_dir = file.path(train_dir, "dogs")
test_cats_dir = file.path(test_dir, "cats")
test_dogs_dir = file.path(test_dir, "dogs")

# create the class directories before copying files into them
for (d in c(train_cats_dir, train_dogs_dir, test_cats_dir, test_dogs_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# cats: images 1 to 2000 for training, 2001 to 3000 for testing
fnames = paste0("cat.", 1:2000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(train_cats_dir))

fnames = paste0("cat.", 2001:3000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(test_cats_dir))

# dogs: same split
fnames = paste0("dog.", 1:2000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(train_dogs_dir))

fnames = paste0("dog.", 2001:3000, ".jpg")
file.copy(file.path(original_dataset_dir, fnames),
          file.path(test_dogs_dir))

Set initial parameters

I am creating a few variables and assigning them values here. The main reason is that it’s easy to tweak them and retrain the model.

# set path
path = "/home/rstudio/data/"

# set initial parameters
img_width = 150
img_height = 150
channels = 3
output_n = 2
train_samples = length(list.files(paste0(path,"train/cats"))) + length(list.files(paste0(path,"train/dogs")))
test_samples = length(list.files(paste0(path,"test/cats"))) + length(list.files(paste0(path,"test/dogs")))
batch_size = 50

# set dataset directory
train_dir = paste0(path,"train")
test_dir = paste0(path,"test")

Create a custom model

I could use a pre-trained model such as VGG16 or VGG19. But what’s the fun in that? Let me build my own. Don’t judge me about bad layering. I am still learning.

# CNN model
  model = keras_model_sequential() %>% 
  layer_conv_2d(filters = 8, kernel_size = c(3,3), activation = "relu", input_shape = c(img_width,img_height,channels)) %>% 
  layer_conv_2d(filters = 16, kernel_size = c(3,3), activation = "relu") %>% 
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = "relu") %>% 
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = "relu") %>% 
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 
  layer_conv_2d(filters = 16, kernel_size = c(3,3), activation = "relu") %>% 
  layer_flatten() %>% 
  layer_dense(units = 64, activation = "relu") %>% 
  layer_dense(units = 64, activation = "relu") %>% 
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 128, activation = "relu") %>% 
  layer_dense(units = 128, activation = "relu") %>% 
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 256, activation = "relu") %>% 
  layer_dense(units = 256, activation = "relu") %>% 
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 64, activation = "relu") %>% 
  layer_dense(units = 64, activation = "relu") %>% 
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 32, activation = "relu") %>% 
  layer_dense(units = output_n, activation = "softmax")

# summary of the overall model
summary(model)

Image processing

Setup image augmentation

When your data set is small, augmentation helps to enlarge it. Here we set a few parameters such as rotation, shift and zoom that are applied at random to your current train set to increase your effective train size.

# Train data image preprocessing
datagen = image_data_generator(
                               rotation_range = 40,
                               width_shift_range = 0.2,
                               height_shift_range = 0.2,
                               shear_range = 0.2,
                               zoom_range = 0.2,
                               horizontal_flip = TRUE,
                               fill_mode = "nearest",
                               samplewise_std_normalization = TRUE
)

Setup image generators

Flow from image directory really eases the pre-processing. In the previous step we put our images into separate directories based on classes. Now this function reads the images class by class from those directories. No need to create any metadata.

# get all the train set
train_generator = flow_images_from_directory(
                                              directory = train_dir,
                                              generator = datagen,
                                              color_mode = "rgb",
                                              target_size = c(img_width, img_height), 
                                              batch_size = batch_size,
                                              class_mode = "categorical", 
                                              shuffle = TRUE
)

# Get test data set
# (assumption: the same generator settings are reused for the test images)
test_generator = flow_images_from_directory(
                                            directory = test_dir,
                                            generator = datagen,
                                            color_mode = "rgb",
                                            target_size =  c(img_width, img_height),
                                            batch_size = batch_size,
                                            class_mode = "categorical",
                                            shuffle = TRUE
)

Compile and fit the model

Now that we have a model and generators, we can compile the model and fit it with the generator. I ran the model for 100 epochs a couple of times and achieved an average accuracy of 80%. Not too bad for this test model!

# compile the model
model %>% compile(
                  loss = "binary_crossentropy",
                  optimizer = optimizer_adamax(lr = 0.001, decay = 0),
                  metrics = c("accuracy")
)

# fit the model using the train generator
history = model %>% fit_generator(
                                  train_generator,
                                  steps_per_epoch = as.integer(train_samples/batch_size),
                                  epochs = 100,
                                  validation_data = test_generator,
                                  validation_steps = 10,
                                  initial_epoch = 1
)


Visualizing a test image through layers

Now that we have a model, we can choose a test image, pass it through the different layers and visualize what happens to our image at each layer. This tells us which parts of the object our network is learning from. Below we load a test image to pass through the model.

# load image
x = image_load(paste0(path,"test/cats/cat.2001.jpg"),target_size =  c(img_width, img_height)) %>%
  image_to_array()
data = x %>% array_reshape(c(-1,img_width, img_height, channels))

image = jpeg::readJPEG(paste0(path,"test/cats/cat.2511.jpg"))
Next, we will capture an intermediate layer, save that layer as a model, and predict our image with that intermediate layer. We will get a multidimensional array as output. In the results below we have an image size of 33 x 33 and 64 filters. You can tweak them to plot the results.
Note: index is the number of the layer that we want to look at.
# what layer do we want to look at?
index = 6

# choose that layer as model
intermediate_layer_model = keras_model(inputs = model$input,
                                        outputs = get_layer(model, index = index)$output)

# predict on that layer
intermediate_output = predict(intermediate_layer_model, data)

# dimensions of prediction
dim(intermediate_output)
[1]  1 33 33 64
Finally, we can plot the matrix data from each of our filters in a grid using the image function, as shown below.
Note: the images below are rotated. You can rotate them using a matrix rotate function.
par(mfrow = c(3,3))
for(i in 1:9){
  image(intermediate_output[1,,,i])
}
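For the rotation mentioned in the note, here is a minimal base R sketch. The helper name rotate and the tiny matrix are my own illustrations (assumptions); in practice the matrix would be one filter slice of the intermediate output.

```r
# Rotate a matrix 90 degrees clockwise (hypothetical helper)
rotate = function(m) t(apply(m, 2, rev))

# Small made-up matrix standing in for one filter's activation map
m = matrix(1:6, nrow = 2)
print(m)
print(rotate(m))

# image() puts matrix rows along the x axis, so rotating first
# draws the matrix in the same orientation as it prints
image(rotate(m), axes = FALSE)
```

Applying rotate() four times returns the original matrix, which is a handy sanity check.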

Layer 2


Layer 3


Layer 6



From the above you can see how the CNN filters are narrowing the point of interest to the cat. This not only helps explain how your model is working, but is also a way to confirm that your model is working as intended. It was quite a journey for me to go through the inner webs to find a way to visualize my layers. Hope y’all can use this for your projects.

Links to tutorials

  1. Shirin’s playgRound
  2. Rstudio Blog
  3. Rstudio Tensorflow
  4. R-Bloggers

Industry python R R-Programming

Data Science in Manufacturing: An Overview

Original article published in

In the last couple of years, data science has seen an immense influx into industrial applications across the board. Today, we can see data science applied in health care, customer service, government, cyber security, mechanical, aerospace, and other industrial applications. Among these, manufacturing has gained more prominence in pursuit of a simple goal: Just-in-Time (JIT). Since the 18th century, manufacturing has gone through four major industrial revolutions. In the first Industrial Revolution, we saw steam energy harnessed as mechanical energy. In the second, we saw the transition from batch production to assembly lines, which made things more affordable (Ford’s Model T was a major outcome), and in the third, we saw significant use of computers and robotics. Between the third and fourth, there was a wave of lean manufacturing that is still being embraced by a lot of manufacturers. Currently, we are going through the fourth Industrial Revolution, where data from machines, environment, and products are being harvested to get closer to that simple goal of Just-in-Time: “Making the right products in the right quantities at the right time.” One might ask why JIT is so important in manufacturing. The simple answer is that it reduces manufacturing cost and makes products more affordable for everyone.

In this article, I will try to answer some of the most frequently asked questions on data science in manufacturing.

How is manufacturing using data science and its impact?

Data science has many applications in manufacturing. To name a few: predictive maintenance, predictive quality, safety analytics, warranty analytics, plant facilities monitoring, computer vision, sales forecasting, KPI forecasting, and many more [1], as shown in Figure 1 [2].

Figure 1: Data science opportunities in manufacturing [2]

Predictive Maintenance: Machine breakdown in manufacturing is very expensive. Unplanned downtime is the single largest contributor to manufacturing overhead costs. Unplanned downtime has cost businesses an average of $2 million over the last three years. In 2014, the average downtime cost per hour was $164,000. By 2016, that statistic had exploded by 59% to $260,000 per hour [3]. This has led to embracing technologies like condition-based monitoring and predictive maintenance. Sensor data from machines are monitored continuously to detect anomalies (using models such as PCA-T2, one-class SVM, auto encoders, and logistic regression), diagnose failure modes (using classification models such as SVM, random forest, decision trees, and neural networks), predict the time to failure (TTF) (using a combination of techniques such as survival analysis, lagging, curve fitting and regression models) and predict the optimal maintenance time (using operations research techniques) [4] [5].

Computer Vision: Traditional computer vision systems measure parts against tolerances to determine whether they are acceptable. Detecting defects such as scuff marks, scratches, and dents is equally important. Traditionally, humans inspected for such defects. Today, AI technologies such as CNN, R-CNN, and Fast R-CNN have proven to be more accurate than their human counterparts and take much less time to inspect, significantly reducing the cost of the products [6].

Sales forecasting: Predicting future trends has always helped in optimizing the resources for profitability. This has been true in various industries, such as manufacturing, airlines, and tourism. In manufacturing, knowing the manufacturing volumes ahead of time helps in optimizing the resources such as supply chain, machine-product balancing, and workforce. Techniques ranging from linear regression models, ARIMA, lagging to more complicated models such as LSTM are being used today to optimize the resources.
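As a rough illustration of the "lagging" idea, here is a minimal base R sketch that regresses monthly volume on its own lagged values. The data are simulated and the choice of lags 1 and 12 is my assumption, not a recommendation for any particular plant.

```r
# Simulated monthly production volumes with trend and yearly seasonality
set.seed(7)
months = 1:36
units = 500 + 5 * months + 40 * sin(2 * pi * months / 12) + rnorm(36, sd = 10)

# embed() builds the lag matrix: column 1 is y_t, column k + 1 is y_(t-k)
lagged = embed(units, 13)
df = data.frame(y = lagged[, 1], lag1 = lagged[, 2], lag12 = lagged[, 13])

# Linear regression on last month's and last year's volume
fit = lm(y ~ lag1 + lag12, data = df)

# One-step-ahead forecast for month 37
next_month = predict(fit, newdata = data.frame(lag1 = units[36], lag12 = units[25]))
print(next_month)
```

The same lagged data frame could be fed to more complex models; the regression is just the simplest end of the range described above.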

Predicting quality: The quality of the products coming out of the machines is predictable. Statistical process control techniques are the most common tools on the manufacturing floor; they tell us whether the process is in control or out of control, as shown in Figure 2. Using statistical techniques such as linear regression of product quality on time would yield a reasonable trend line. This line is then extrapolated to answer questions such as “How long do we have before we start to make bad parts?”
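The trend-line extrapolation can be sketched in a few lines of base R. The measurements, the drift rate and the upper specification limit below are all made up for illustration (assumptions).

```r
# Simulated part diameter drifting upward over 40 hours of production
set.seed(1)
hours = 1:40
diameter = 10 + 0.002 * hours + rnorm(40, sd = 0.005)
upper_spec_limit = 10.15  # hypothetical specification limit in mm

# Fit the trend line: diameter as a linear function of time
fit = lm(diameter ~ hours)
b = coef(fit)

# Solve intercept + slope * t = limit for t, then subtract elapsed time
hours_to_limit = (upper_spec_limit - b[["(Intercept)"]]) / b[["hours"]]
remaining_hours = hours_to_limit - max(hours)
print(remaining_hours)
```

A straight line is only a reasonable choice while the drift stays roughly linear; the prediction interval from predict() would give a more cautious answer.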

The above are just some of the most common and popular applications. There are still various applications that are hidden and yet to be discovered.

Figure 2: An example of X-bar chart from R’s qcc package
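For readers who want to see what sits behind an X-bar chart, here is a minimal base R sketch of the control limits. The measurements are simulated (an assumption); A2 = 0.577 is the standard control chart constant for subgroups of size 5. This is a hand calculation, not the qcc implementation.

```r
# 20 subgroups of 5 simulated part measurements each
set.seed(11)
samples = matrix(rnorm(100, mean = 10, sd = 0.05), nrow = 20, ncol = 5)

# Subgroup means and ranges
xbar = rowMeans(samples)
ranges = apply(samples, 1, function(r) max(r) - min(r))

# X-bar chart limits: grand mean plus or minus A2 times the average range
A2 = 0.577
center = mean(xbar)
ucl = center + A2 * mean(ranges)
lcl = center - A2 * mean(ranges)

# Subgroups whose mean falls outside the control limits
out_of_control = which(xbar > ucl | xbar < lcl)
print(c(lcl = lcl, center = center, ucl = ucl))
print(out_of_control)
```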

How big is data science in manufacturing?

According to one estimate for the US, “The Big Data Analytics in Manufacturing Industry Market was valued at USD 904.65 million in 2019 and is expected to reach USD 4.55 billion by 2025, at a CAGR of 30.9% over the forecast period 2020 – 2025. [7]” In another estimation, “TrendForce forecasts that the size of the global market for smart manufacturing solutions will surpass US$320 billion by 2020. [8]” In another report it was stated that “The global smart manufacturing market size is estimated to reach USD 395.24 billion by 2025, registering a CAGR of 10.7% according to a new study by Grand View Research, Inc. [9]”
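The first estimate is easy to sanity-check with compound growth arithmetic. The short R sketch below uses only the figures quoted above; the variable names are mine.

```r
# Figures quoted in the first estimate [7]
start_value = 904.65      # USD million in 2019
cagr = 0.309              # 30.9% per year
years = 2025 - 2019

# Compound the starting value forward at the quoted CAGR
projected = start_value * (1 + cagr)^years
print(projected)          # in USD million; compare with the quoted USD 4.55 billion

# Or back out the growth rate implied by the two end points
implied_cagr = (4550 / start_value)^(1 / years) - 1
print(round(100 * implied_cagr, 1))
```

Both directions agree with the quoted numbers, so the estimate is at least internally consistent.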

What are the challenges of data science in manufacturing?

There are various challenges in applying data science in manufacturing. Some of the most common ones that I have come across are as follows:

Lack of subject matter expertise: Data science is a very new field. Every application of data science requires its own core set of skills. Likewise, in manufacturing, knowing the manufacturing and process terminology, rules and regulations, business context, components of the supply chain and industrial engineering is vital. Lack of subject matter expertise (SME) leads to tackling the wrong set of problems, eventually leading to failed projects and, more importantly, lost trust. When someone asks me what a manufacturing data scientist is, I show them this nice image in Figure 3.

Figure 3: Who is a manufacturing data scientist?

Reinventing the wheel: Every problem in a manufacturing environment is new, and the stakeholders are different. Deploying a standard solution is risky and, more importantly, at some point it's bound to fail. Every new problem has a part of the solution that is readily available, and the remainder has to be engineered. Engineering involves developing new ML model workflows and/or writing new ML packages in the simplest case, and developing a new sensor or hardware in the most complex ones. In my experience over the last couple of years, I have been at both extremes, and I have enjoyed it.

What tools do data scientists who work in manufacturing use?

A data scientist in manufacturing uses a combination of tools at every stage of the project lifecycle. For example:

1. Feasibility study: Notebooks (R markdown & Jupyter), GIT and PowerPoint

“Yes! You read it right. PowerPoint is still very much necessary in any organization. BI tools are trying hard to take them over. In my experience with half a dozen BI tools, PowerPoint still stands in first place in terms of storytelling.”

2. Proof of concept: R, Python, SQL, PostgreSQL, MinIO, and GIT

3. Scale-up: Kubernetes, Docker, and GIT pipelines


Currently, applying data science in manufacturing is very new. New applications are being discovered every day, and new solutions are invented constantly. In many manufacturing projects (capital investments), ROI is realized over several years (5 – 7 years). Most successfully deployed data science projects realize their ROI in less than a year. This makes them very attractive. Data science is just one of many tools that manufacturing industries are currently using to achieve their JIT goal. As a manufacturing data scientist, some of my recommendations are to spend enough time understanding the problem statement, target the low-hanging fruit, get those early wins, and build trust in the organization.

I will be at ODSC East 2020, presenting “Predictive Maintenance: Zero to Deployment in Manufacturing.” Do stop by to learn more about our journey in deploying predictive maintenance in the production environment.


[1] ActiveWizards, “Top 8 Data Science Use Cases in Manufacturing,” [Online]. Available:

[2] IIoT World, “,” [Online]. Available: [Accessed 02 10 2020].

[3] Swift Systems, “Swift Systems,” [Online]. Available:

[4] N. Amruthnath and T. Gupta, “Fault class prediction in unsupervised learning using model-based clustering approach,” in 2018 International Conference on Information and Computer Technologies (ICICT), Chicago, 2018.

[5] N. Amruthnath and T. Gupta, “A research study on unsupervised machine learning algorithms for early fault detection in predictive maintenance,” in 2018 5th International Conference on Industrial Engineering and Applications (ICIEA), 2018.

[6] T. Wang, Y. Chen, M. Qiao and H. Snoussi, “A fast and robust convolutional neural network-based defect detection model in product quality control,” The International Journal of Advanced Manufacturing Technology, vol. 94, no. 9-12, pp. 3465-3471, 2018.

[7] “Big Data Analytics in Manufacturing Industry Market – Growth, Trends, and Forecast (2020 – 2025),” Mordor Intelligence, 2020.

[8] Trendforce, “TrendForce Forecasts Size of Global Market for Smart Manufacturing Solutions to Top US$320 Billion by 2020; Product Development Favors Integrated Solutions,” 2017.

[9] Grand View Research. Inc, “Smart Manufacturing Market Size Worth $395.24 Billion By 2025,” 2019.



MinIO for Machine Learning Model Storage using Python

R R-Programming

EnsembleML: An R package for Parallel Ensemble Modeling in R

Ensembles have been used in machine learning for a while. An ensemble is the concept of training multiple machine learning models and combining their predictions, either by voting or by feeding the prediction results to a different machine learning model. You could also build an ensemble of ensembles. So, this is pretty cool! Why do we need ensembles at all? Most real-world data is not as clean as what we learnt in school. It doesn’t follow one single distribution. With real-world data, some models perform well on data with certain distributions and others on different ones. So, we might end up needing various ML models to get a better result. Ensemble models have been among the top models in various competitions. You can read more about ensemble models and their use in competitions in this article by William Vorhies here.
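To make the voting idea concrete, here is a minimal base R sketch of a majority vote. The three prediction columns are hand-written, hypothetical model outputs (assumptions), not results from any package.

```r
# Hypothetical class predictions from three models for five observations
preds = data.frame(
  model_a = c("setosa", "versicolor", "virginica", "virginica", "setosa"),
  model_b = c("setosa", "virginica", "virginica", "versicolor", "setosa"),
  model_c = c("setosa", "versicolor", "versicolor", "virginica", "setosa"),
  stringsAsFactors = FALSE
)

# Majority vote: the most frequent label in each row wins
majority_vote = function(row) names(which.max(table(row)))
ensemble = unname(apply(preds, 1, majority_vote))
print(ensemble)
```

With an odd number of models and no three-way split there are no ties here; a real implementation would need a tie-breaking rule.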

Image Courtesy:

R Packages for Ensemble Modeling

There are various packages for ensemble modeling, such as SuperLearner, randomForest, knn, glmnet, caretEnsemble, etc. All these packages are great, but I am a caret fanboy who basically uses caret for building machine learning models. I wanted to use the same standardized and flexible performance tuning approach for ensemble modeling. This and a free weekend led to building a package on the caret architecture called EnsembleML.


EnsembleML is an R package for creating features in the time and frequency domains, building multiple regression and classification models, and combining those models into an ensemble. You can save and read models created using this package and also deploy them as an API using the same package.

Installation of the package

The package is currently only available on GitHub and won’t be on CRAN anytime soon. Use devtools to install it from GitHub as follows.

# install packages

# load the library
library(EnsembleML)

Feature creation

Features can be created in both the time domain and the frequency domain using this package. Use featureCreationTS() for time series and featureCreationF() for the frequency domain. The standard features include mean, sd, median, trimmed, mad, min, max, range, skew, kurtosis, se, iqr, nZero, nUnique, lowerBound, upperBound, and quantiles.

# Create a sample data set
data = rnorm(50)

# create some features

#     TS_mean    TS_sd TS_median TS_trimmed    TS_mad    TS_min   TS_max TS_range   TS_skew TS_kurtosis
#X1 0.2107398 1.025315 0.1822303  0.2097342 0.8026293 -2.194434 3.161181 5.355616 0.1251634   0.6376311
#       TS_se   TS_iqr TS_nZero TS_nUnique TS_lowerBound TS_upperBound    TS_X1.    TS_X5.   TS_X25.
#X1 0.1450015 1.041068        0         50     -1.850106      2.314167 -2.116863 -1.388573 -0.288504
#     TS_X50.   TS_X75.  TS_X95.  TS_X99.
#X1 0.1822303 0.7525642 1.661972 2.814687
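If you want to see what a few of these features amount to, here is a rough base R sketch. The function basic_features() is my own illustration (an assumption) and is not the package's implementation; the package's exact definitions may differ.

```r
# Hypothetical helper computing a subset of the standard features in base R
basic_features = function(x) {
  c(
    mean    = mean(x),
    sd      = sd(x),
    median  = median(x),
    mad     = mad(x),
    min     = min(x),
    max     = max(x),
    range   = max(x) - min(x),
    se      = sd(x) / sqrt(length(x)),
    iqr     = IQR(x),
    nZero   = sum(x == 0),
    nUnique = length(unique(x))
  )
}

# Same kind of sample data as above
set.seed(123)
data = rnorm(50)
print(round(basic_features(data), 3))
```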

Summary of the data

The numSummary() function can be used to generate a numerical summary of the entire data set. An example for the iris data set is shown below. The rest of the documentation uses the iris data set.

# load iris data set
data(iris)

# get numerical summary of the data
numSummary(iris)

#                n mean    sd max min range nunique nzeros  iqr lowerbound upperbound noutlier kurtosis
# Sepal.Length 150 5.84 0.828 7.9 4.3   3.6      35      0 1.30       3.15       8.35        0   -0.606
# Sepal.Width  150 3.06 0.436 4.4 2.0   2.4      23      0 0.50       2.05       4.05        4    0.139
# Petal.Length 150 3.76 1.765 6.9 1.0   5.9      43      0 3.55      -3.72      10.42        0   -1.417
# Petal.Width  150 1.20 0.762 2.5 0.1   2.4      22      0 1.50      -1.95       4.05        0   -1.358
#              skewness mode miss miss%   1%   5% 25%  50% 75%  95%  99%
# Sepal.Length    0.309  5.0    0     0 4.40 4.60 5.1 5.80 6.4 7.25 7.70
# Sepal.Width     0.313  3.0    0     0 2.20 2.34 2.8 3.00 3.3 3.80 4.15
# Petal.Length   -0.269  1.4    0     0 1.15 1.30 1.6 4.35 5.1 6.10 6.70
# Petal.Width    -0.101  0.2    0     0 0.10 0.20 0.3 1.30 1.8 2.30 2.50

Training multiple models

For most prototyping we end up training multiple models manually. This is not only time consuming but also inefficient. The multipleModels() function can be used to train multiple models at once, as shown below. All the models use caret model functions. You can read more about it here.

# train multiple machine learning models
mm = multipleModels(train = iris, test = iris, y = "Species", models = c("C5.0", "parRF"))

# $summary
#       Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue
# C5.0     0.960  0.94         0.915         0.985        0.333       2.53e-60           NaN
# parRF    0.973  0.96         0.933         0.993        0.333       8.88e-64           NaN

The benchmark for training multiple models on the iris data set is as follows:

# benchmark the training results
microbenchmark::microbenchmark(multipleModels(train = iris, test = iris, y = "Species", models = c("C5.0", "parRF")), times = 5)

# Unit: seconds
#                                                                                        expr  min   lq mean 
#  multipleModels(train = iris, test = iris, y = "Species", models = c("C5.0",      "parRF")) 22.6 22.6 22.9
#  median   uq  max neval
#    22.7 22.7 23.8     5

Training an ensemble

Ensemble training is the concept of joining results from multiple models and feeding them to a different model. You can use the ensembleTrain() function to achieve this. We take the results from the multiple models, mm, and feed them to this function as follows:

em = ensembleTrain(mm, train = iris, test = iris, y = "Species", emsembleModelTrain = "C5.0")

# $summary
# Confusion Matrix and Statistics
#             Reference
# Prediction   setosa versicolor virginica
#   setosa         50          0         0
#   versicolor      0         47         1
#   virginica       0          3        49
# Overall Statistics
#                Accuracy : 0.973         
#                  95% CI : (0.933, 0.993)
#     No Information Rate : 0.333         
#     P-Value [Acc > NIR] : <2e-16        
#                   Kappa : 0.96          
#  Mcnemar's Test P-Value : NA            
# Statistics by Class:
#                      Class: setosa Class: versicolor Class: virginica
# Sensitivity                  1.000             0.940            0.980
# Specificity                  1.000             0.990            0.970
# Pos Pred Value               1.000             0.979            0.942
# Neg Pred Value               1.000             0.971            0.990
# Prevalence                   0.333             0.333            0.333
# Detection Rate               0.333             0.313            0.327
# Detection Prevalence         0.333             0.320            0.347
# Balanced Accuracy            1.000             0.965            0.975

Predicting from ensemble

The predictEnsemble() function is used to predict from the ensemble model.

predictEnsemble(em, iris)

#     prediction
# 1       setosa
# 2       setosa
# 3       setosa
# 4       setosa
# 5       setosa
# 6       setosa
# 7       setosa
# 8       setosa
#           .
#           .
#           .

Saving and reading the model

Ensemble models can be saved to disk and read back into memory as follows

saveRDS(em, "/home/savedEnsembleModel.RDS")
em = readRDS("/home/savedEnsembleModel.RDS")

Deploying models as API

The trained models can be deployed as an API using the same package. First save the models, then start the API as follows

createAPI(host = '', port = 8890)
# Serving the jug at
# [1] "Model was successfully loaded"
# HTTP | /predict - POST - 200 

Let's curl and see what we get

curl -X POST \ \
  -H 'Host:' \
  -H 'content-type: multipart/form-data' \
  -F 'jsondata={"model":["/home/EnsembleML/savedEnsembleModel.RDS"],"test":[{"Sepal.Length":5.1,"Sepal.Width":3.5,"Petal.Length":1.4,"Petal.Width":0.2,"Species":"setosa"}]}'

Issues and Tracking

If you have any issues related to the project, please post an issue and I will try to address it.




Internet of Things R R-Programming

minio.s3: A MinIO connector package for R

MinIO is a high-performance, distributed object storage system. It is software-defined, runs on industry-standard hardware and is 100% open source under the Apache V2 license [1]. Today, MinIO is deployed globally, with over 272.5M Docker pulls and 18K Git commits. MinIO is written in Go, so expect fast response times. You can read more about this here.

MinIO and Data Science

MinIO has played a pivotal role in data science deployment. Today, we deal with data in various formats such as images, videos, audio clips and other proprietary-format objects. Storing this kind of data in traditional databases is quite challenging, and they are not responsive enough for high-frequency applications.

The other application of MinIO in data science is storing trained models. Deep learning models usually have larger file sizes than classical machine learning models, which usually range from a few KB to a few MB.

MinIO officially supports integration with Python, Go, and other languages. For a heavy R user, it can be quite challenging to use MinIO through R. The first solution that came to my mind was to use reticulate to access MinIO through Python. This is fine for testing, but not feasible for deployment into production.

Installing MinIO


MinIO can be installed in a few lines of code and is well documented here. It can be deployed on Linux, macOS, Windows, and Kubernetes (K8s). For prototyping, I would recommend running a stateful Docker container. Instructions for running on Docker can be found here.

R Package minio.s3


MinIO is compatible with the Amazon S3 cloud service, so we can technically use S3-compatible APIs to access MinIO storage. You might be wondering: don’t we already have a package for accessing Amazon Web Services (AWS) through R? You are right. R does have a package called aws.s3, developed by the cloudyR team, that we could use to access AWS. I tried using that package, but it was quite clunky for accessing MinIO and not all functions were compatible.

So, my solution was to take their package and tweak it quite a lot so that it could be used for accessing MinIO. The end product was the minio.s3 package.

I would like to thank the cloudyR team for their initial contributions to this package.


This package is not yet on CRAN. To install the latest development version, you can install it from GitHub:



By default, all packages for AWS/MinIO services allow the use of credentials specified in a number of ways, beginning with:

  1. User-supplied values passed directly to functions.
  2. Environment variables, which can alternatively be set on the command line prior to starting R or via an Renviron.site or .Renviron file, which are used to set environment variables in R during startup (see ? Startup). Or they can be set within R:
    Sys.setenv("AWS_ACCESS_KEY_ID" = "test", # enter your credentials
           "AWS_SECRET_ACCESS_KEY" = "test123", # enter your credentials
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_S3_ENDPOINT" = "")    # change it to your specific IP and port

For more information on aws usage, refer to aws.s3 package.
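
Before calling any bucket function, it can help to confirm that these settings are actually visible to the R session. The sketch below uses only base R; missing_credentials() is a hypothetical helper of mine, not part of minio.s3 or aws.s3.

```r
# Variables that the Sys.setenv() call is expected to set
required_vars <- c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                   "AWS_DEFAULT_REGION", "AWS_S3_ENDPOINT")

# Hypothetical helper: returns the names of variables that are unset or empty
missing_credentials <- function(vars = required_vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

not_set <- missing_credentials()
if (length(not_set) > 0) {
  message("Missing settings: ", paste(not_set, collapse = ", "))
} else {
  message("All connection settings are present.")
}
```

This only checks that the variables exist; whether the credentials are valid is decided by the server when the first request is made.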

Code Examples

The package can be used to examine publicly accessible S3 buckets and publicly accessible S3 objects.

bucketlist(add_region = FALSE)

If your credentials are incorrect, this function will return an error. Otherwise, it will return a list of information about the buckets you have access to.


Create a bucket

To create a new bucket, simply call

put_bucket('my-bucket', acl = "public-read-write", use_https=F)

If successful, it should return TRUE

List Bucket Contents

To get a listing of all objects in a public bucket, simply call

get_bucket(bucket = 'my-bucket', use_https = F)

Delete Bucket

To delete a bucket, simply call

delete_bucket(bucket = 'my-bucket', use_https = F)


There are eight main functions that will be useful for working with objects in S3:

  1. s3read_using() provides a generic interface for reading from S3 objects using a user-defined function
  2. s3write_using() provides a generic interface for writing to S3 objects using a user-defined function
  3. get_object() returns a raw vector representation of an S3 object. This might then be parsed in a number of ways, such as rawToChar(), xml2::read_xml(), jsonlite::fromJSON(), and so forth depending on the file format of the object
  4. save_object() saves an S3 object to a specified local file
  5. put_object() stores a local file into an S3 bucket
  6. s3save() saves one or more in-memory R objects to an .Rdata file in S3 (analogously to save()). s3saveRDS() is an analogue for saveRDS()
  7. s3load() loads one or more objects into memory from an .Rdata file stored in S3 (analogously to load()). s3readRDS() is an analogue for readRDS()
  8. s3source() sources an R script directly from S3

They behave as you would probably expect:

# save an in-memory R object into S3
s3save(mtcars, bucket = "my_bucket", object = "mtcars.Rdata", use_https = F)

# `load()` R objects from the file
s3load("mtcars.Rdata", bucket = "my_bucket", use_https = F)

# get file as raw vector
get_object("mtcars.Rdata", bucket = "my_bucket", use_https = F)
# alternative 'S3 URI' syntax:
get_object("s3://my_bucket/mtcars.Rdata", use_https = F)

# save file locally
save_object("mtcars.Rdata", file = "mtcars.Rdata", bucket = "my_bucket", use_https = F)

# put local file into S3
put_object(file = "mtcars.Rdata", object = "mtcars2.Rdata", bucket = "my_bucket", use_https = F)
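
Since get_object() hands back a raw vector, the parsing step is up to you. Here is a small base-R sketch of that step; the raw vector is built locally with charToRaw() as a stand-in for a downloaded object, and the CSV content is an assumption for illustration.

```r
# Stand-in for the raw vector that get_object() would return
# (assumption: the stored object is a small CSV file)
raw_obj <- charToRaw("mpg,cyl\n21.0,6\n22.8,4\n")

# Convert the raw bytes to text, then parse the text as CSV
csv_text <- rawToChar(raw_obj)
df <- read.csv(text = csv_text)

print(df)
```

For other formats, the same raw vector could be passed on to a format-specific parser instead of read.csv().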


Please feel free to email me if you have any questions or comments. If you have any issues in the package, please create an issue on github. Also, check out my github page for other R packages, tutorials and other projects.

Computer related Industry R R-Programming

How to test the integrity of your clusters?

Machine learning (ML) and AI have become the new buzzwords in town. As a result, there is a lot of demand for data scientists and machine learning engineers across various industries including IT, telecom, automotive, manufacturing and many more. Today, there are hundreds to thousands of online machine learning courses that teach people from different domains to use machine learning in their day-to-day activities. Most of these courses focus entirely on applying the algorithms while ignoring whether the selected model suits the application [1]. Many forget that most machine learning models are based on statistical theory, and focus narrowly on the accuracy metrics of the models. This has led to a ripple effect where models perform extremely well in the lab but fail to perform in the real world.

To give you an example, one of the simplest data mining and clustering algorithms is k-means clustering. It is also one of the most popular algorithms in the ML domain, with over 4000 posts on Stack Overflow [2]. There are at least 10 variations of the k-means model available today. k-means is used in various applications such as predictive maintenance, big data analytics, image processing, risk assessment and many more [3]. The k-means model makes assumptions that have to be met before we can use it to cluster data, such as

  • The algorithm assumes that the variance of the distribution of each variable is spherical.
  • All the features have the same variance (hence, data is scaled before clustering).
  • The probability of each cluster is the same. This means that each cluster has a roughly equal number of observations.
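
Two of these assumptions are easy to eyeball in base R. The sketch below uses the built-in iris data (my choice for illustration, not data from this post) to compare feature variances before and after scaling and to look at the resulting cluster sizes.

```r
set.seed(42)                         # k-means starts from random centers
features <- iris[, 1:4]

# Unscaled features have very different variances
print(round(sapply(features, var), 2))

# After scaling, every feature has variance 1
scaled <- scale(features)
print(round(apply(scaled, 2, var), 2))

fit <- kmeans(scaled, centers = 3, nstart = 25)

# Cluster sizes: strongly unequal sizes hint that the equal-probability
# assumption may not hold
print(table(fit$cluster))
```

This is only a quick visual check, not a formal test of the assumptions.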

If any of these assumptions is unmet in your data, you might end up with a bad model or result. In a few extreme cases, the results might look accurate but the model might still fail in production. This is just one model; there are various other clustering models such as c-means, Gaussian mixture models, spectral clustering, DBSCAN, etc. Unlike k-means, a few of these models state no assumptions at all, and it becomes challenging to judge the integrity of the model’s results. Luckily, we can fall back on statistical theory to test the model and its results and be far more confident before we push it to production (and skip the embarrassment).


I have developed an R package called clusterTesting to do just that. The package uses the data and the cluster information to determine a sample size for analysis. Then a normality test is performed to see if the data is normally distributed. If it is, a parametric approach is used to test the validity of the results. If not, a non-parametric test is used to test the integrity of the clustering results [3].
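
The flow described above can be sketched with base R alone. This is not the clusterTesting implementation; the choice of Shapiro-Wilk, one-way ANOVA and Kruskal-Wallis, the 0.05 threshold and the iris data are all my assumptions for illustration.

```r
set.seed(42)
scaled <- scale(iris[, 1:4])
clusters <- factor(kmeans(scaled, centers = 3, nstart = 25)$cluster)
feature <- scaled[, "Petal.Length"]

# Step 1: Shapiro-Wilk normality test of the feature within each cluster
normality_p <- tapply(feature, clusters, function(v) shapiro.test(v)$p.value)
is_normal <- all(normality_p > 0.05)

# Step 2: parametric (one-way ANOVA) or non-parametric (Kruskal-Wallis) test
# of whether the feature differs between clusters
if (is_normal) {
  test_used <- "ANOVA"
  p_value <- summary(aov(feature ~ clusters))[[1]][["Pr(>F)"]][1]
} else {
  test_used <- "Kruskal-Wallis"
  p_value <- kruskal.test(feature ~ clusters)$p.value
}
cat(test_used, "p-value:", format.pval(p_value), "\n")
```

A small p-value suggests that the clusters really differ on that feature; repeating the check for each feature gives a fuller picture.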

All the installation instructions and examples are provided in my repository. Feel free to use it, and do let me know if you have questions or recommendations.


Follow me on Github:


Loops! Loops! Loops in R. A Microbenchmark

Loops are the holy grail in data science. You use one when you want to repeat a task or a function, or build a model, say, “n” times. There are quite a few types of loops, and the most common ones are for and while. The main difference is that a while loop runs until a condition is met, like “run until you find an apple in a basket”, whereas a for loop runs a fixed number of times, say n. Like I said earlier, for loops are a life saver. Interestingly, R has several types of loops similar to for.

  1. for: the regular for loop, which runs n times.
  2. foreach: a parallel-computing for loop where multiple iterations run in parallel. This is useful when you have computationally intensive loops such as training a model.
  3. apply: a special type of loop in R. This family of functions can be applied to arrays, lists, matrices and data frames, over both columns and rows.

With various types of looping functions, it’s often difficult to decide which to use and when. I created a scenario with a dummy function and tried various loops at different data sizes. The microbenchmark results are shown in the Jupyter notebook below (opens mybinder for interactive execution). It was interesting to note that as the size of the data increased, the time to execute with for and foreach increased, while for apply there was hardly any change.

Based on these results, I have started to embrace apply loops in R for many of my projects and have seen similar improvements in performance.
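
If you want to try this kind of comparison yourself without the notebook, here is a minimal base-R sketch. The dummy function square_plus_one() and the data size are assumptions of mine, and system.time() is used instead of the microbenchmark package to keep it dependency-free; foreach is left out because it needs an extra package and a parallel backend.

```r
# Hypothetical dummy function applied to every element
square_plus_one <- function(v) v^2 + 1

n <- 10000
x <- seq_len(n)

# Regular for loop, filling a pre-allocated result vector
res_for <- numeric(n)
t_for <- system.time(
  for (i in x) res_for[i] <- square_plus_one(x[i])
)

# apply-family equivalent
t_apply <- system.time(
  res_apply <- sapply(x, square_plus_one)
)

# Elapsed times in seconds; they vary from machine to machine
print(c(for_loop = t_for[["elapsed"]], sapply = t_apply[["elapsed"]]))
```

Both versions compute the same values, so any difference you see is purely in execution time.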

Let me know your comments below.

Computer related Internet of Things python R

Top 8 Docker Images for Data Science

Dockerizing Data Science: Introduction

PreReqs: Docker, images, and containers

Dockerizing data science packages has become more relevant these days, mainly because you can isolate your data science projects without breaking anything. Dockerizing data science projects also makes most of your projects portable and shareable, without worrying about installing the right dependencies (you Python fans know about it). One of the greatest challenges that threatens data science projects is the ability to deploy them. Docker makes it easy to deploy them as APIs (using plumber or Flask), deploy data science applications (Shiny) or schedule runs (cron or taskscheduleR). To put this on steroids, you can orchestrate these through Kubernetes (k8s) or Docker Swarm, which can do a lot more, such as maintaining the state of your containers, managing multiple containers and load balancing. Most data science platforms (DSPs) are built on this architecture and have been leveraging it for a very long time. When these tools are out of reach for you or your enterprise (due to exorbitant prices), you can always leverage the open source tools and, with a few extra steps, achieve the same. So, here are some of the best Docker images out there for you to start exploring data in an isolated environment in minutes.

So here are the top 8 Docker images out there

  1. Jupyter

Jupyter is one of the favorite tools for many data scientists and analysts today, mainly because of its notebook-style approach to data analysis. Jupyter also supports various kernels such as Python, R, Julia and many more, and hence it has gained a huge fan base. Most DSPs come with this notebook by default, which has added to its popularity. There are Docker images on Docker Hub for both the single-user version and the hub. This makes it easy to pull an image, do all the analysis, commit it and share it with anyone you want. Hence, it comes at the top of the list.

  2. Jupyter Lab

Jupyter Lab is an extension of Jupyter. It provides a more extensive interface and more options for your notebooks, such as having notebooks side by side, viewing all kernels on the same page, browsing through folders and many more. If you have already used Jupyter, this is something that you have to try. This is my preferred notebook; the only reason it’s not in first place is its lack of add-ons.

Note: there are a couple of dozen variants of the Jupyter Lab and Jupyter images with add-ons such as Spark, Hadoop, etc. I won’t be covering them here.

  3. RStudio

R has been my go-to language for a very long time because of its extensive statistics libraries, which Python lacks, and because it is very easy to use. No frills at all. If you look at the standard R IDE, it is really boring and sometimes frustrating. That’s where RStudio makes an entrance. It’s probably the best R IDE out there and, thanks to Rocker, it’s available in Docker as well. The only thing it lacks is Jupyter-style notebook writing. If RStudio can make that happen, I would never cheat on RStudio.

  4. Python Base

Once you have built all your models and it is time for deployment, you might not need an interactive IDE, mainly because it consumes a lot of space. In those instances, you can use the base Python images, install the required libraries and deploy your models. In the end, you can have a containerized project under 500MB running in an isolated environment.

  5. R-base

Technically, Python and R-base go hand in hand, and they both deserve the same place. As in the Python case, the RStudio image consumes close to 1GB. So, if you want to deploy containers, say for a plumber app, use R-base, which is much leaner.

  6. Dataiku

If you are not a coder, Dataiku (a DSP) offers its platform as a Docker image. You can pull the image and get it up and running in 5 minutes. Dataiku is one of the best enterprise-ready DSPs out there. It supports both coding and a click-based interface (Dataiku calls it “code or click”). Of the close to 30 or 40 DSPs out there, this is probably the only one to support this. They also have a free version with limited features that you can use for quick analysis. It is also worth mentioning that they have AutoML built into the tool.

  7. KNIME

KNIME is another open source DSP. It is available for both Windows and Linux. If you are not good at programming, then this is probably the best tool out there for data science. Thanks to the contributors to KNIME, a Docker image is available on Docker Hub.

  8. H2O Flow

H2O Flow is another open source tool. It is an interactive tool where you can load, clean and process data, build models and analyze results interactively in the browser. It also has its own version of AutoML, where you can build multiple models without needing to code explicitly. I usually use this to benchmark my ML models, and it has saved me a lot of time. H2O is also available in both R and Python if you are interested in coding.

So, these are my thoughts on the top 8 Docker images out there for data science projects. Let me know what you think. If you think there is a particular image that deserves to be on this list, comment below.


Industry R R-Programming

Statistical Process Control (SPC) in R


Statistical Process Control (SPC) is a quality control technique that uses statistical methods to monitor and control process and product quality. Although it is an age-old technique, it is widely used in various applications such as manufacturing, health care, banking and other service-related industries. In this blog post, I will not dive deep into SPC but will show you how easy it is to do process monitoring in R. Here, I have used a generic data set, but if you have a system that collects the data automatically, then this can be automated.
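
As a taste of what process monitoring looks like in code, here is a minimal base-R sketch of an individuals control chart. The measurements are made-up values for illustration (not the data set used in the notebook); sigma is estimated from the average moving range divided by the standard constant d2 = 1.128 for moving ranges of size 2.

```r
# Hypothetical measurements; observation 15 is an obvious shift
measurements <- c(10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 10.0, 9.9,
                  10.2, 9.8, 10.1, 10.0, 12.5, 9.9, 10.1, 10.0, 9.8, 10.2)

# Moving ranges between consecutive observations
moving_range <- abs(diff(measurements))

# Center line and 3-sigma control limits
center    <- mean(measurements)
sigma_hat <- mean(moving_range) / 1.128
ucl <- center + 3 * sigma_hat
lcl <- center - 3 * sigma_hat

# Indices of points outside the control limits
out_of_control <- which(measurements > ucl | measurements < lcl)

plot(measurements, type = "b", ylim = range(c(measurements, ucl, lcl)),
     xlab = "Observation", ylab = "Measurement", main = "Individuals chart")
abline(h = c(lcl, center, ucl), lty = c(2, 1, 2))
points(out_of_control, measurements[out_of_control], col = "red", pch = 19)
```

If the data arrive automatically, the same few lines can be run on a schedule and the flagged indices used to raise an alert.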

Jupyter Notebook is also available here


Computer related Industry python R

Using Cassandra Through R

In the last couple of years, there has been a lot of buzz around the open source community. Almost every day, more tools are being open sourced. With a ton of open source tools in the market, don’t expect to have drivers built for every platform. I am a big fan of open source, and the main reason is the huge community behind it.

I came across Cassandra, a NoSQL database, a while ago and was very impressed. Since it was open source, I did not wait a moment to get my hands on it. Being primarily an R user, I was happy to see an R package to connect to Cassandra. That’s where the problems began. For some reason, I could not connect to the database. After hours and hours of research on Stack Overflow, I eventually connected to it. The next problem was that the data I queried came back in a very weird format. Guess what, I turned to Stack Overflow again. After a few hours, I gave up on it and didn’t bother for a few weeks.

One day, it hit me: let me give it a try in Python, my second favorite language. It did the job I wanted. So, now the question was how to replicate this in R. The answer was simple: just write Python code in an R script. Voila! It solved my problem for now, and hopefully someone (or I) can come up with a solution to rewrite the package for Cassandra.

#Suppress warnings
options(warn = -1)

#load reticulate library to use python scripts
library(reticulate, quietly = TRUE)

#call the table in cassandra using a Python script
py = py_run_string('import requests;
from cassandra.cluster import Cluster;
from datetime import datetime;
import pandas as pd;

cluster = Cluster(["","",""]);
session = cluster.connect("test");

query="select * from sample_table; ";

# run the query and collect the rows into a pandas dataframe
rows = session.execute(query);
df = pd.DataFrame(list(rows));
')

#move the pandas dataframe to R-dataframe
data = py$df

So, what the above code does is run a Python script through the reticulate package to access Cassandra, get the results and insert them into a pandas data frame. Next, the pandas data frame is moved to an R data frame.

More tutorials on the reticulate package are available here.

Hope this helps.









Computer related Internet of Things R

Sound Analytics in R for Animal Sound Classification using Vector Machine

I am a regular user of the Shazam app. If you are not aware of Shazam, the app recognizes the music, tells you who the artist is and where you can get the music, all in real time. All you have to do is start the app when you are listening to music on the radio or elsewhere. I was always fascinated by this app and how it works. Today, I finally decided to try it myself. I built a small script that is capable of recognizing sounds from three different categories: birds, farm animals and wild animals. I am not using any fancy signal processing, only basic statistical features in the time and frequency domains. I was able to download a few sounds from here.

Support vector machines are widely used in various applications such as image classification, non-linear prediction, regression, etc. Because of their robustness, support vector machines have become a go-to option in machine and deep learning. If you would like to learn more about support vector machines, Jason has a great article on his blog.

This is what my data processing flow looks like


In my data processing, I used 75% of the data as the training set and 25% as the testing set, with a sample size of only 16. Of these samples, 5 were birds, 5 were farm animals and 6 were wild animals. One could definitely argue that the sample size is very small for modeling but, like I mentioned earlier, I coded this just for fun. I agree that the sample size is small; if you want to extend this into an application, you can increase the sample size to suit your needs. Here is the link on how to calculate the sample size.

In the end, when I computed the accuracy of the model, I was surprised to see an accuracy of 75% and, especially, a Kappa value of 0.6364. Although these are not the best results, they were much higher than I anticipated. So what would be a good accuracy and Kappa value in classification? I would say somewhere over 85% for accuracy and over 0.8 for Kappa is something that I would consider good or acceptable.
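
For readers who have not met Kappa before, here is how accuracy and Cohen's Kappa are computed from a confusion matrix in base R. The labels below are hypothetical values for a four-sample test set with three of four predictions correct; they are not my actual test results.

```r
# Hypothetical true and predicted classes (1 = bird, 2 = farm, 3 = wild)
actual    <- factor(c(1, 2, 2, 3), levels = 1:3)
predicted <- factor(c(1, 2, 3, 3), levels = 1:3)

cm <- table(Predicted = predicted, Actual = actual)
n  <- sum(cm)

# Observed agreement (accuracy)
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, scaled to a maximum of 1
kappa_hat <- (accuracy - expected) / (1 - expected)

print(cm)
print(round(c(accuracy = accuracy, kappa = kappa_hat), 4))
```

Kappa is lower than accuracy because it discounts the agreement that random guessing with the same class frequencies would achieve.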


In case of multi-class Area under the curve, the results were as follows

Multi-class area under the curve: 0.9167

Finally, here is the source code. I will try to create this as a project and will upload it to my GitHub repo.


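#load the libraries used in this script
library(tuneR)   #readWave
library(e1071)   #svm
library(pROC)    #multiclass.roc, plot.roc, lines.roc

#extract statistical features for birds voices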
bird1<-readWave("C:/Users/Desktop/sounds/birds/Bird chirps animals140.wav")
bird2<-readWave("C:/Users/Desktop/sounds/birds/Bird-chirp (Red Lories) animals119.wav")
bird3<-readWave("C:/Users/Desktop/sounds/birds/Crow animals010.wav")
bird4<-readWave("C:/Users/Desktop/sounds/birds/owl animals074.wav")
bird5<-readWave("C:/Users/Desktop/sounds/birds/Vulture animals008.wav")


b1_fft <- fft(bird1_data)
amplitude <- Mod(b1_fft[1:round(length(b1_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = TRUE, append = FALSE )

b2_fft <- fft(bird2_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b3_fft <- fft(bird3_data)
amplitude <- Mod(b3_fft[1:round(length(b3_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b4_fft <- fft(bird4_data)
amplitude <- Mod(b4_fft[1:round(length(b4_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b5_fft <- fft(bird5_data)
amplitude <- Mod(b5_fft[1:round(length(b5_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

#extract statistical features for farm animal voices
farm1<-readWave("C:/Users/Desktop/sounds/farm/Cat meow animals020.wav")
farm2<-readWave("C:/Users/Desktop/sounds/farm/Cow animals055.wav")
farm3<-readWave("C:/Users/Desktop/sounds/farm/Dog animals080.wav")
farm4<-readWave("C:/Users/Desktop/sounds/farm/Goat animals115.wav")
farm5<-readWave("C:/Users/Desktop/sounds/farm/Sheep - ewe animals112.wav")


b2_fft <- fft(farm1_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(farm2_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(farm3_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(farm4_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(farm5_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

wild1<-readWave("C:/Users/Desktop/sounds/wild/Elephant-angry animals035.wav")
wild2<-readWave("C:/Users/Desktop/sounds/wild/Leopard growl animals089.wav")
wild3<-readWave("C:/Users/Desktop/sounds/wild/Lion growl and snarl animals098.wav")
wild4<-readWave("C:/Users/Desktop/sounds/wild/Lion roar animals103.wav")
wild5<-readWave("C:/Users/Desktop/sounds/wild/Rhinoceros animals134.wav")
wild6<-readWave("C:/Users/Desktop/sounds/wild/Tiger growl animals026.wav")


#extract statistical features for wild animal voices
b2_fft <- fft(wild1_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(wild2_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(wild3_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(wild4_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(wild5_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

b2_fft <- fft(wild6_data)
amplitude <- Mod(b2_fft[1:round(length(b2_fft)/2,0)])
write.table(featuredata,"C:/Users/Desktop/sounds/featuredata.csv",quote = FALSE, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE )

#import feature data table
ft <- read.table("C:/Users/Desktop/sounds/featuredata.csv", header = TRUE, sep = ",")
## 75% of the sample size
smp_size <- floor(0.75 * nrow(ft))

## set the seed to make your partition reproducible
set.seed(123)   #any fixed seed works
train_ind <- sample(seq_len(nrow(ft)), size = smp_size)

train <- ft[train_ind,1:22 ]
test <- ft[-train_ind,1:22 ]

trainlabel<-ft[train_ind,23 ]
testlabel<-ft[-train_ind,23 ]

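#Support Vector Machine for classification
#(x/y interface: features as a matrix, numeric class labels as the response)
model_svm <- svm(x = as.matrix(train), y = trainlabel)

#Use the predictions on the data
pred <- round(predict(model_svm, as.matrix(test)),0)

#ROC and AUC curves and their plots
roc.multi<-multiclass.roc(testlabel, pred[1:4],levels=c(1, 2, 3))
rs <- roc.multi[['rocs']]
plot.roc(rs[[1]])   #draw the first curve so that lines.roc() has a plot to add to
sapply(2:length(rs),function(i) lines.roc(rs[[i]],col=i))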
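
The listing does not show how the statistical features are computed from each signal. Below is a minimal base-R sketch of what time- and frequency-domain features could look like. The function extract_features(), its six features and the synthetic 440 Hz tone are assumptions of mine for illustration; they are not the 22 features used in the script above.

```r
# Hypothetical helper: a few time- and frequency-domain statistics of a signal
extract_features <- function(signal, sample_rate) {
  spectrum  <- fft(signal)
  half      <- seq_len(floor(length(spectrum) / 2))   # keep positive frequencies
  amplitude <- Mod(spectrum[half])
  freqs     <- (half - 1) * sample_rate / length(signal)

  data.frame(
    mean           = mean(signal),                    # time domain
    sd             = sd(signal),
    rms            = sqrt(mean(signal^2)),
    peak           = max(abs(signal)),
    dominant_freq  = freqs[which.max(amplitude)],     # frequency domain
    mean_amplitude = mean(amplitude)
  )
}

# Synthetic one-second 440 Hz tone standing in for a recorded sound
sample_rate <- 8000
time_s <- (0:(sample_rate - 1)) / sample_rate
tone   <- sin(2 * pi * 440 * time_s)

features <- extract_features(tone, sample_rate)
print(features)
```

For a real recording, the signal would come from the Wave object returned by readWave() instead of the synthetic tone.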



R-Package for ROC/ MROC Algorithm

It’s been over a year, and I finally got some time this weekend to put together the algorithm that I proposed in 2016, called Modified Rank Order Clustering (MROC), as an R package. Interestingly, I did not plan on building this, but to take my mind off ALICE (Autonomous Learning Integrated Computational Ensemble), which I’ve been working on for the past few weeks, I decided to build it.

So, what is ROC/ MROC?

In 2016, I presented a paper at IFAC in Reims, France, showcasing an enhanced approach to King’s Rank Order Clustering (ROC) algorithm (1980). King initially proposed the ROC algorithm for machine and component part routing in cellular manufacturing by clustering using binary weights. I proposed an enhanced approach that embeds manufacturing data into the algorithm. When I was working on this research, I used Excel macros, which had their own challenges (robustness, duh!). Since I started to embrace R programming, I searched for an existing package for ROC and, interestingly, my search came up empty. So, I thought, why not build my own package for ROC and MROC? A couple of hours later, I finally had a package. If anyone is interested in pursuing research on this topic, the package is available for download on GitHub along with its source code.
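
For anyone new to the idea, here is a minimal base-R sketch of the classic ROC procedure: read each row as a binary number and sort the rows in decreasing order, do the same for the columns, and repeat until nothing moves. It illustrates King's basic algorithm only, not the code in my package and not MROC; the function names and the small machine-part matrix are made up for the example.

```r
# Binary weight of a 0/1 vector, with the first element as the most significant bit
binary_weight <- function(v) sum(v * 2^((length(v) - 1):0))

# Hypothetical helper implementing the basic ROC iteration
rank_order_cluster <- function(m, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    row_order <- order(apply(m, 1, binary_weight), decreasing = TRUE)
    m <- m[row_order, , drop = FALSE]
    col_order <- order(apply(m, 2, binary_weight), decreasing = TRUE)
    m <- m[, col_order, drop = FALSE]
    # Stop once neither the rows nor the columns changed position
    if (!is.unsorted(row_order) && !is.unsorted(col_order)) break
  }
  m
}

# Made-up machine (rows) by part (columns) incidence matrix
incidence <- matrix(c(1, 0, 1, 0, 0,
                      0, 1, 0, 1, 1,
                      1, 0, 1, 0, 0,
                      0, 1, 0, 1, 0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("M", 1:4), paste0("P", 1:5)))

result <- rank_order_cluster(incidence)
print(result)
```

After reordering, the ones gather into blocks along the diagonal, and each block corresponds to a candidate manufacturing cell.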


Source Code

To install this package in R, extract the zip file and save it under R\win-library\3.4 in My Documents.

Documentation is included in the package and for source file documentation, find it on GitHub.


King, J. R. (1980). Machine-component grouping in production flow analysis: an approach using a rank order clustering algorithm. International Journal of Production Research, 18(2).

Amruthnath, N., & Gupta, T. (2016). Modified Rank Order Clustering Algorithm Approach by Including Manufacturing Data. IFAC-PapersOnLine, 49(5), 138-142.

Computer related R

Why I love R- programming?

R is one of the underdogs among programming languages. It’s highly unlikely that one would use R unless one is a statistician or a data scientist. I have been using R for over a year and it’s been one of my best journeys in programming. Although I have a computer science background, I was never into programming until I came across R. It is such a powerful tool that most BI tools, like Power BI and Sisense, have R integration by default. Moreover, R is free for commercial use. To start using R, one doesn’t need strong programming skills.

About a year ago, during a conversation, a buddy mentioned learning R and went on about how great the tool was. Initially, I was a little skeptical, but I thought to myself, why not try it? For the first few days, it just bounced off my head: vectors, matrices, objects, functions. As I started grasping deeper concepts, I was astonished by its capabilities. Today, working as a data scientist and a researcher, my day doesn’t go by without using R at least once. I recently started to code machine learning algorithms for my Ph.D. dissertation, and I have never felt as strongly about any other programming language.

I can say this with confidence and without any references: “You can do anything with R”, and I mean it. Once you hop onto the R terminal, there is no need to get off for any reason. You can import data from the cloud, a SQL database or you name it, perform data cleaning, explore the data set, fit statistical and machine learning models, perform predictions, export data to Git, email it to your colleagues, and upload it to Facebook and LinkedIn. When you are on the R terminal, believe me, you don’t even need a mouse. Jared Lander (author of R for Everyone) talked about this at the 2016 New York R Conference. Here is the link to the video.

Just to give you an example, I decided to write code that imports data from a cloud drive, does statistical analysis and emails the result to your friend, all without leaving the R terminal.

Note: Source code file is at the end of the page. I won’t be explaining each module as this post is meant for showcasing R-programming capability

Let’s install the packages and load the libraries


The Motor Trend cars (mtcars) data is one of the basic data sets used while learning R, and it is available as a built-in dataset as well. Another popular data set is the Iris data, which is also one of my favorites. Instead of calling the built-in data set, I’ll download the data from my Google Drive and assign it to an object as follows

#Use head() to print the first 6 rows, just to make sure the data is imported
#summary() is used to view the summary statistics of the data

I won't be going over the summary of the data. With basic statistics alone, it's hard to interpret the results. So, let's draw some correlation plots and fit a linear regression model.

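#keeping only the numeric columns of my data
my_num_data <- mydata[, sapply(mydata, is.numeric)]
res <- cor(my_num_data)
round(res, 2)

#plotting correlation plots
#corrplot() comes from the corrplot package; chart.Correlation() from PerformanceAnalytics
corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
chart.Correlation(my_num_data, histogram = TRUE, pch = 19)

#developing a linear regression model
#(the object name was lost from the original listing; model.lm is a stand-in)
model.lm <- lm(cyl ~ hp, data = my_num_data)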
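If you want to try the correlation and regression step without downloading anything, here is a minimal sketch that assumes the built-in mtcars data is a fair stand-in for the file I loaded from my drive. It uses only base R, and the names num_data, cor_mat and fit are picked just for this illustration.

```r
# Sketch using only base R and the built-in mtcars data
# (an assumption: the post loads similar data from a cloud drive instead).
num_data <- mtcars[, sapply(mtcars, is.numeric)]

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(num_data), 2)

# Simple linear regression of cylinders on horsepower
fit <- lm(cyl ~ hp, data = num_data)

# Intercept and slope, followed by the R-squared of the fit
print(coef(fit))
print(summary(fit)$r.squared)
```

For a regression with a single predictor, the R-squared is just the squared correlation between cyl and hp, so the two views of the data should agree with each other.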


Figure 1: Correlation plot for the mtcars dataset

Now we know how the variables are correlated with each other. In less than 30 lines of code, let’s try some PCA (Principal Component Analysis), K-Nearest Neighbor, a decision tree, and K-means clustering.

#--------------------Principal Component Analysis----------------------------------
responseY <- my_num_data$cyl
predictorX <- my_num_data[,3:11]

pca <- princomp(predictorX, cor=TRUE)
pc.comp <- pca$scores
pc.comp1 <- -1*pc.comp[,1] # define PC1
pc.comp2 <- -1*pc.comp[,2] # define PC2
plot(pc.comp1, pc.comp2, main="Principal Components", xlab="PC1", ylab="PC2", type="p")
points(pc.comp1[responseY==4], pc.comp2[responseY==4], cex=0.5, col="blue")
points(pc.comp1[responseY==6], pc.comp2[responseY==6], cex=0.5, col="red")
points(pc.comp1[responseY==8], pc.comp2[responseY==8], cex=0.5, col="green")

#-------------------------K-Nearest Neighbor algorithm-----------------------------
X.train <- cbind(pc.comp1, pc.comp2)
train_dat <- my_num_data[1:22,3:11]
test_dat <- my_num_data[23:32,3:11]

train_label <- my_num_data[1:22,2]
test_label <- my_num_data[23:32,2]
model.knn <- knn(train=train_dat, test=test_dat, train_label, k=19, prob=T)
CTab<-CrossTable(x = test_label, y = model.knn ,prop.chisq=FALSE)
# In the cross table, 9 out of 10 test labels were predicted
# correctly, so we could say the prediction was 90% accurate

#How about a decision tree now?
#-------------------------Decision Tree-------------------------------------------
# note: the "." in the formula also picks up cyl itself as a predictor
tree.modeltree <- tree(log(responseY) ~ ., data=my_num_data)
plot(tree.modeltree)   # draw the tree first so text() has a plot to label
text(tree.modeltree, cex=.75)

# let's cluster some data and see what it tells us
#--------------------------Kmeans clustering algorithm--------------------------
X <- cbind(pc.comp1, pc.comp2)
cl <- kmeans(X,13)
plot(pc.comp1, pc.comp2,col=cl$cluster)
points(cl$centers, pch=16)
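As a quick sanity check on the PCA and clustering steps, here is a small sketch that again assumes the built-in mtcars data. It swaps in prcomp() for princomp() (both live in the stats package), and it uses 3 cluster centers instead of 13, an assumption chosen only so the clusters can be compared against the three cylinder classes.

```r
# Sketch with base R only, on the built-in mtcars data.
predictors <- mtcars[, 3:11]          # same columns as predictorX in the post

# prcomp() is used here instead of princomp(); both are in the stats package
pca <- prcomp(predictors, scale. = TRUE)

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 2))

# k-means on the first two components; 3 centers is an assumption
# chosen to match the three cylinder classes (4, 6, 8)
set.seed(42)                          # k-means starts from random centers
scores <- pca$x[, 1:2]
cl <- kmeans(scores, centers = 3, nstart = 10)

# Cross-tabulate clusters against the true cylinder counts
print(table(cluster = cl$cluster, cyl = mtcars$cyl))
```

The variance proportions tell you how much information the PC1/PC2 scatter plot actually keeps, and the cross table shows how well the unsupervised clusters line up with the cylinder labels.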


Figure 2: Principal Components using PCA



Figure 3: K-Nearest Neighbor Results


Figure 4: Decision Tree



Figure 5: K-Means Algorithm

Wasn’t that great? Four algorithms in less than 30 lines of code. Trust me, I still have one more trick up my sleeve. This is all fine, but say I want to send this code to my friend “David” and I’m too lazy to open my web browser or mail app.

What should I do?….

You know what? Let me code it in R! Less than 12 lines of code?

sender <- ""
recipients <- c("")
send.mail(from = sender,
          to = recipients,
          subject = "here you go man!",
          body = "Vola!",
          smtp = list(host.name = "", port = 465,
                      user.name = "",
                      passwd = "*********", ssl = TRUE),
          authenticate = TRUE,
          send = TRUE,
          attach.files = c("/R program/blog_mtcars.R"))

Like I mentioned above, this is one of the reasons I fell in love with R programming. You can do most of your work within the R console, and in 86 lines of code I was able to import data, run a couple of statistical tests, build models, and email the result.

Let me know your views on this and your experience with R!

Here is the source code.

Disclaimer: I did not write every single line of code. I did reuse some parts of code from various authors across the web.
Industry Thoughts

The Beginning of the End!…. The United Airlines Case

Having worked in the manufacturing industry for a few years now, I have learned a few things here and there about workplace etiquette. One lesson that sticks out is “the customer always comes first”. Every day that I walk onto the production floor, everyone has one mission, and that is to keep the customer happy. They achieve this by making quality products, delivered at the right time and in the right quantities. This is not unique to manufacturing. It can also be observed in successful restaurants, fast food chains, automotive service, and many more.

Image: Toyota Production System

If I were given the option to choose a fast-food restaurant, I would always choose McDonald’s for one reason: they believe customers come first. However long the drive-through queue might be, I have never spent more than 10 minutes getting my food, and my order has never been wrong. Are they the best? Absolutely not! All I can say from experience is that they are better than their competition. If you are in a service-based industry, the highest-priority item in your business model should be the customer. It is not enough to have a customer-focused model; you must also ensure that customers are happy with the service provided.

In the last few weeks, there have been two instances where United Airlines appeared in the news for bad customer service. I was not surprised by this news. Today, their stock dropped 5% and they lost at least $1 billion in market value. Is refusing to let people board because they did not follow a dress code really the answer? And when the airline has overbooked a plane, is the solution really to escort customers out? Onlookers have had mixed responses to these decisions. If you support the airline’s stance, it is highly likely that you have not traveled outside the USA to experience what quality service can be.

Image: Major US Airlines

I have traveled several times on United, and each time the service was not pleasant. This unpleasantness is not limited to United Airlines; it is the same with Delta and American Airlines. Spirit Airlines is a low-cost carrier in the USA, and I feel it has superior service compared to the big three. In industry, the customer can choose between cost, quality, and speed. Over time, the airline industry has shifted toward cost savings for customers, but this has resulted in a trend of poorer customer service. These airlines are sacrificing quality to meet cost and time objectives in order to please both the customer and the bottom line. Is this appropriate? Is there a better balance that would improve the quality of flights? These bad PR situations have certainly brought these questions into the spotlight.

Dress code

Returning to the two instances concerning United: is it wrong for United Airlines to have a dress code? I do not think it is wrong for an airline to expect its customers to dress nicely. However, ask yourself, does the situation warrant such a stringent dress code? The airplane is essentially a big sardine can where the airline keeps adding more seats, reducing seat pitch, and reducing legroom. On top of that comes the expectation that passengers will respect the policy and stay dressed to code in cramped positions, possibly for many hours. Is this expectation reasonable? Of course not, especially for the average passenger.

The situation arose because two women traveling on company discount passes were dressed outside the company dress code. United has since clarified this and has reiterated that regular passengers are not subject to these expectations.

Overbooking and dragging paid customers out of the plane

Was the situation ethical? According to their rules guide, yes! Is it good for business? No! United appears to have completely forgotten that it is in a service-based industry. “Customers first” has become “it was never about customers to begin with”. Overbooking is practiced by most airlines, there is strong mathematical justification for it, and I support the tactic. However, kicking people out so you can accommodate additional crew is a mistake. If you have overbooked a plane and no passengers are willing to give up their tickets, have a proper strategy to handle the situation. This is neither a lottery to pick random people nor a simulation you are running. It is an appropriate time to hire operations research scientists to build a model that predicts at what price customers are willing to give up their tickets, or how to convince them to do so. This way, you can have satisfied customers without shedding blood, and no more negative publicity in the headlines.
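To give a feel for the mathematical justification behind overbooking, here is a rough sketch in R. Every number in it is hypothetical (a 100-seat aircraft and a 92% show-up rate), and it assumes passengers show up independently of each other, which real airline models certainly refine. The function names p_oversold and expected_bumped are made up for this illustration.

```r
# All numbers below are hypothetical, chosen only for illustration.
capacity <- 100    # seats on the aircraft
p_show   <- 0.92   # assumed probability that a ticketed passenger shows up

# Probability that more passengers show up than there are seats,
# if show-ups are independent: P(X > capacity), X ~ Binomial(sold, p_show)
p_oversold <- function(sold) {
  pbinom(capacity, size = sold, prob = p_show, lower.tail = FALSE)
}

# Expected number of passengers who cannot be seated
expected_bumped <- function(sold) {
  k <- 0:sold
  sum(pmax(k - capacity, 0) * dbinom(k, size = sold, prob = p_show))
}

# Compare selling exactly the capacity with selling up to 10 extra tickets
tickets <- 100:110
risk <- data.frame(
  sold       = tickets,
  p_oversold = sapply(tickets, p_oversold),
  bumped     = sapply(tickets, expected_bumped)
)
print(round(risk, 4))
```

A table like this is the starting point for the trade-off: each extra ticket brings in revenue, while the expected number of bumped passengers, multiplied by whatever compensation convinces someone to give up a seat, is the cost.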

Video: Why Airline Overbook (Courtesy: Wendover Productions)

The other option is to place employees on a later flight and still meet business needs. Hiring industrial and operations engineers to find other ways to cut operating costs without compromising the quality of service is another option moving forward.

It is time for the US airline industry to learn a few things from its European and Middle Eastern counterparts about what quality of service is and what it means to their business. If this culture does not change, customers will certainly take their business away from these carriers.



The Day I Stopped Using Mainstream Social Media..


This is one of those issues people do not talk about, and one of the most neglected issues today. I wanted to put myself out there and tell my story to the world. I have always been a social media buff; I am not arguing about that. One night, when I had stayed up until midnight, I realized that I was addicted to social media: scrolling through pages, looking at people and seeing what they were up to. I was consumed by mainstream social media. Every chance I got, I would pull out my phone, open the app, and start scrolling, liking posts, and commenting. The hours I wasted on social media are irretrievable.


Being a millennial, it started out as creating an online presence, and it was only a matter of time before I was consumed by it. Over the years, I started to realize that my online presence was a mistake. What started out as a platform for freedom of speech and freedom of views turned into one where anything could be offensive to someone. This was the point where I stopped commenting on posts, then stopped liking posts, and became a shadow user. Then someone really close to me, whom I dearly respect, said, “You gotta be careful about what you post on social media. It’s only a matter of time before people start judging you”. At that moment, I felt every word of it was true. I could tell the difference in the people I see or talk to every day.

First, I lost my privacy. Next in line, other people’s opinions and political views were getting into my head. There were so many instances where, if my opinion differed from that of most people on my friends list, I had two options: go all in like a soldier and mostly get bullied, or keep my opinions to myself. I chose the second. Logically, I was spending my precious time ruining my own happy mood. I was born and live in a society where everything you do, from waking up in the morning to going to sleep at the end of the day, is monitored 24×7 by the people around you and close to you, who then discuss it like court-appointed jurors and hand down a final verdict that stays with you for the rest of your life. In such a society, it was hard to keep pretending on social media to be someone I am not.


I Quit!

This was probably the toughest part: I said to myself “enough is enough” and uninstalled all the mainstream social media apps from my phone and other personal devices. It has been over three months since I uninstalled the apps, and to this day I have no regrets. I don’t spend hours on social media; rather, I spend my time reading actual news from around the world. I don’t care what people are up to or about all their whining and crying on social media. Now I use my precious time and thoughts for something constructive. As of today, I don’t access social media unless I am in front of a computer, and I do not spend more than 2 minutes on it. Once an addict, I have now lost interest. I feel liberated, and my mind is at peace. I realized my addiction and dealt with it before it was too late, and today I’m a happy man.


Disclaimer: This post is not meant to advise anyone or to campaign against social media, but rather to tell my story.





Internet of Things

Gradtalks in 3 Minutes : Nagdev Amruthnath

Some of you have been asking me about my research. Thanks to the GSA of Western Michigan University for making a 3-minute video on my research and sharing it with the world. I am proud to talk about my vision and my work toward creating hundreds of jobs, making current jobs a lot easier, and saving millions of dollars in manufacturing costs.

Video Courtesy: Graduate Student Association, Western Michigan University

Computer related

Conditional formatting for Today’s cells


I have been a heavy Microsoft Excel user. I use it to create my project schedules, write reports, analyze data, and run simulations. One of the most neglected aspects in any industry is data visualization. It can be as simple as a bar chart or as complex as a 2-D scatter plot. Recently, I was preparing a schedule for one of my projects and thought it would be nice to highlight all the cells corresponding to today’s date, so I could see the actual status without scrolling through the days. Moreover, it looks presentable. Firstly, I did not want to highlight cells manually every time I opened the file; I wanted the current day’s cells to highlight automatically. Secondly, I did not want to write any macros, although that was my first option.

I did some research on the internet and found only vague information. So I decided to go with conditional formatting and work it out myself by trial and error. After a few tries, voilà! I managed to do it, and it was pretty simple indeed.

Alright folks! Here it goes.

1. First, I selected cell F12, which corresponds to my first date cell.

2. Then, I selected Conditional Formatting > New Rule > Use a formula.

3. I entered the formula =F$12=TODAY(), selected yellow as the cell background under Format, and then clicked OK.


4. The conditional formatting was added to only one cell. So, I extended the rule to all the date cells by selecting Manage Rules under Conditional Formatting and then selecting all the cells under Applies to. Here, that range is $F$12:$W$12.


This is how my final schedule looked in the end. Pretty neat!


And that was it. It was really simple. Now, I can visually see the status of my projects with reference to today’s date.



Image courtesy: Kevin Doncaster

Computer related

Not this again Mac!!!

Working in IT for quite some time now, I have deployed quite a number of Macs. The MacBook Pro, MacBook Air, and iMac are among the most overpriced and best-designed products on the market. With all my experience over the years, if I could ever afford a Mac, I would definitely* buy one. That asterisk was definitely not a typo, and the discussion of that issue is probably for another time.

Over the past couple of weeks, I have come across two specific issues with the Macs I deployed. Although the resolution is a couple of clicks away, it gets on my nerves when I get calls about this every other day. So, here are the two main issues with deploying Macs in a domain.

Keychain Password for printer

At my workplace, we have to change our passwords at a set interval, usually every 6 months. Every time a customer of mine changes their password, it needs to be updated in the keychain. I get the whole concept of the keychain and its “so-called benefits”. But Apple, you cannot expect every user to be an IT tech.

The resolution is simple. Print a document, and an error window will pop up in the bottom menu tray. Click it and you will see a dialog box with the error “Hold for authentication”. Next to the error message, you will see a refresh symbol. Click it, and another dialog box should pop up asking for the administrator username and password. Enter your new password and click OK; that should resolve the issue.


Wasn’t that simple? But my customers don’t understand it. Apple, take note of this issue and work on it.

Network connection not available

Well! Well! Well! This is yet another issue that I often come across. The first time I ran into it, I spent an entire day trying to resolve it. The first thing I would suggest you check if you have this issue is “TIME”. Computers on a domain tend to use network time. If your clock is off by more than a few minutes, your Mac will not let you log in with domain credentials. The first step is to log into a local account, open “Terminal”, and enter the following command.

#sudo ntpdate -u

Add your domain server after -u and execute the command a couple of times. This works for me all the time. Like a charm!

If you are reading this, you probably have one of these issues. Hope this solution helps!

Internet of Things

root@kali:~# Sudo apt–get install Security-and-Privacy


As an Information Technology (IT) engineer working as an IT tech, I frequently hear about network security and privacy. Day to day, I handle gigabytes of data backups and see the vulnerabilities that appear when data is connected to a network. This data can be as simple as my client’s grocery list. Compromise is not an option when it comes to an individual’s or a company’s privacy. When security fails, it is an invasion of privacy, and that is unacceptable.

In recent years, we have been hearing a lot about the Internet of Things (IoT). The IoT connects physical and electronic devices, enabling them to communicate with each other fluidly. Although it may seem like a new buzzword to most of us, the concept has been around since 1985, when the term was first coined by Peter T. Lewis. In reality, most of us are inside the IoT bubble without knowing it. Devices such as cellphones, tablets, laptops, smart watches, thermostats, smart TVs, and more are all connected and able to communicate with each other.

The IoT consists of 3 components: sensors, machines, and cloud applications. Sensors perceive physical entities, machines enable inter-device communication, and the cloud architecture brings all the machines under the same roof for some tea, cookies, and conversation. In 2014, Wired published an article on how the IoT is far larger than anyone realizes. Wired also wrote about how the IoT can be exploited to get the best applications out of it. In the same year, Wired published another article on how insecure the IoT is and how machines can be exploited for the worst. Cisco predicts there will be 500 billion internet-connected devices by 2030. This would not be surprising considering that many people already have at least 3 devices connected to the internet.

Several articles published over the past few years have raised concerns about privacy and IoT security. People generally value their privacy greatly, and compromising it is not an option. Likewise, nobody wants to see their coffee machine hijacked by some guy living in his mom’s basement. Amazon recently released the Amazon Echo, a hands-free speaker and assistant you control with your voice. I have read a few reviews of this product. One review says, “It can recognize my voice even when music is played loudly”. The very first thing that popped into my head was, “Does it listen to my every conversation? It is an invasion of my privacy!”. I started looking into its architecture and found that the device uses key phrases to activate the response procedure. The voice is then converted to audio recordings, encrypted, and sent to the real brains miles away. Sounds good, right? … No! Not at all. These recordings are then stored on Amazon servers until they are deleted manually using an app. This is not just the Amazon Echo. It applies to other voice-prompted systems such as Apple’s Siri, Google Home, and Microsoft’s Cortana. If someone gets hold of your login password, they can review all the conversations you may have forgotten to delete. The only way to protect your everyday privacy is to turn off the mic using the app.

Apart from privacy, the next big issue is the security of IoT-enabled infrastructure. It would make great dinner conversation when you invite family or friends over and talk about how your heating, appliances, lighting, security codes, and other machines are connected through IoT infrastructure. You may even let them play with the master remote or app. A couple of students from the University of Michigan in Ann Arbor were able to hack into Samsung’s smart home automation system (one of the top-selling IoT platforms for consumers) and get a PIN code for the front door. There was another incident a couple of years ago in which a Jeep SUV was hacked remotely, giving the attackers control of many of the car’s functions. I have personally set up a cloud shared drive at home to sync all my devices and cloud drives such as Dropbox and Google Drive. As I worked through the setup, I discovered first-hand how vulnerable my data is and how easily it could be compromised. These days, all it takes is a Linux distribution and a couple of lines of Python code, which can easily be downloaded from repository sites, to update the appropriate security settings. In one study, HP found that as many as 70 percent of IoT devices are vulnerable to attack. Toptal put together a few basic points on what can be done to improve security: emphasize security from day one, consider the device lifecycle, review access control and device authentication, and prepare for security breaches.

This technology makes up a large sector of consumer goods and has grown much faster than we could have predicted 20 years ago. Given this accelerated industrial and technological growth, we need modern rules that govern and respect consumers’ privacy and data security. These rules need to be updated at a pace that matches the industry. With stricter laws, companies would be required to keep security and privacy as cornerstones of their products.


Nagdev Amruthnath



What’s the Perfect Blend?

In this competitive world, most people want to achieve a lot in a short span of time. Within that time frame, only a handful see some measure of success; the rest either want to give up or want to keep trying. Success, in this article, is not only about the money makers but also about those who have made a positive impact on themselves and the people around them. Every day I see a lot of posts on my wall saying that trying and never giving up is the only way to achieve success.

That was until recently, when I had a conversation with a fine gentleman from Japan who has seen a good amount of success. In the middle of our healthy, productive argument, he said, “Go beyond your book of rules if you want to see that real success”. This sentence changed my way of thinking and made me realize that I had been tackling every problem within constraints. The conversation kept me thinking for quite a few days. I started researching and experimenting on myself. Small projects turned into full-fledged projects; simple designs turned into products and accessories. In every meeting and every conversation, I started asking: why? Why not this? It was a great learning experience. Sometimes it worked, and mostly it failed. I failed not because I did not follow the rule book, but because of my lack of knowledge of that particular concept.

After all, I believe all human beings are handicapped! Yes! If we weren’t, we wouldn’t need a new car every year or new phones with new add-ons every month. We would have had a car with 100 percent efficiency or beyond. No more concepts; everything would be in production. We would have been a super life form in this universe, and maybe across others. Over the past couple of months, my lack of knowledge pushed me to learn more and to start designing, prototyping, and testing. Even so, there are still human constraints that need to be addressed.

But with all this, I understood that to create that success or positive impact, we need the perfect blend of mind, belief, and soul, just like the coffee we have every morning.

Data Science python R R-Programming

Free coding education in the time of Covid-19

In this time of COVID-19, tons of resources are being offered for free to update your skills. If you are an ML enthusiast and want to learn how to program, check out both sections below, on R and Python.


Harvard has always offered free training in R programming through edX. Taking the course is free. If you care about a certificate of completion, you have to pay extra.


Harvard also offers a ton of other data science related learning. Check out the link below.




Stanford has decided to offer a free Python course to anyone who is interested. I have completed a few sections of this course, and it seems very intuitive and easy to understand for anyone with no programming experience. Check out the image below for details.




Some other resources for learning Python are W3Schools and a huge list by Hacker Noon.

If you are more interested in just learning machine learning, check out W3Schools.

If you know of any other free training available, please comment below and I will add that to this post.