Data Science News

Data Science News – handpicked articles, news, and stories from the Data Science world – June.



Experts Predict When Artificial Intelligence Will Exceed Human Performance – Artificial intelligence is changing the world and doing it at breakneck speed. The promise is that intelligent machines will be able to do every task better and more cheaply than humans. Rightly or wrongly, one industry after another is falling under its spell, even though few have benefited significantly so far.


This app uses artificial intelligence to turn design mockups into source code – While traditionally it has been the task of front-end developers to transform the work of designers from raw graphical user interface mockups to the actual source code, this trend might soon be a thing of the past – courtesy of artificial intelligence.


Applying deep learning to real-world problems – The rise of artificial intelligence in recent years is grounded in the success of deep learning. Three major drivers caused the breakthrough of (deep) neural networks: the availability of huge amounts of training data, powerful computational infrastructure, and advances in academia.


How AI Can Keep Accelerating After Moore’s Law – Google CEO Sundar Pichai was obviously excited when he spoke to developers about a blockbuster result from his machine-learning lab earlier this month. Researchers had figured out how to automate some of the work of crafting machine-learning software, something that could make it much easier to deploy the technology in new situations and industries.


Bayesian GAN – Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data which are hard to model with an explicit likelihood. We present a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. Within this framework, we use stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks.


The $1700 great Deep Learning box. – After years of using a thin client in the form of increasingly thinner MacBooks, I had gotten used to it. So when I got into Deep Learning (DL), I went straight for the brand new at the time Amazon P2 cloud servers. No upfront cost, the ability to train many models simultaneously and the general coolness of having a machine learning model out there slowly teaching itself.


Exploring LSTMs – The first time I learned about LSTMs, my eyes glazed over. Not in a good, jelly donut kind of way. It turns out LSTMs are a fairly simple extension to neural networks, and they’re behind a lot of the amazing achievements deep learning has made in the past few years. So I’ll try to present them as intuitively as possible – in such a way that you could have discovered them yourself.


An Algorithm Summarizes Lengthy Text Surprisingly Well – Who has time to read every article they see shared on Twitter or Facebook, or every document that’s relevant to their job? As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents. And it may become routine to rely on a machine to analyze and paraphrase articles, research papers, and other text for you.


Divide and Conquer: How Microsoft researchers used AI to master Ms. Pac-Man – Microsoft researchers have created an artificial intelligence-based system that learned how to get the maximum score on the addictive 1980s video game Ms. Pac-Man, using a divide-and-conquer method that could have broad implications for teaching AI agents to do complex tasks that augment human capabilities.


Open Source Datasets – A large-scale, high-quality dataset of URL links to approximately 300,000 video clips that cover 400 human action classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400 video clips. Each clip is human annotated with a single action class and lasts around 10s.


One Model To Learn Them All – Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains. 


THE POWER OF GTC – GTC is the largest and most important event of the year for GPU developers. GTC and the global GTC event series offer valuable training and a showcase of the most vital work in the computing industry today – including artificial intelligence and deep learning, healthcare, virtual reality, accelerated analytics, and self-driving cars.


Artificial intelligence can now predict suicide with remarkable accuracy – When someone commits suicide, their family and friends can be left with the heartbreaking and answerless question of what they could have done differently. Colin Walsh, data scientist at Vanderbilt University Medical Center, hopes his work in predicting suicide risk will give people the opportunity to ask “what can I do?” while there’s still a chance to intervene.


Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor

The Sentient Machine: The Coming Age of Artificial Intelligence

Out of Remote Control (The DATA Set Book 7)



Learning Path: TensorFlow: The Road to TensorFlow – Discover deep learning and machine learning with Python and TensorFlow

Python for Data Structures, Algorithms, and Interviews! – Get a kick start on your career and ace your coding interviews!

Python for Data Science by UC San DiegoX – Learn to use powerful, open-source, Python tools, including Pandas, Git and Matplotlib, to manipulate, analyze, and visualize complex datasets.

High-Dimensional Data Analysis by HarvardX  – A focus on several techniques that are widely used in the analysis of high-dimensional data.

A developer’s guide to the Internet of Things (IoT) – The Internet of Things (IoT) is an area of rapid growth and opportunity. Technical innovations in networks, sensors and applications, coupled with the advent of ‘smart machines’ have resulted in a huge diversity of devices generating all kinds of structured and unstructured data that needs to be processed somewhere.

Neural Networks for Machine Learning –  Learn about artificial neural networks and how they’re being used for machine learning, as applied to speech and object recognition, image segmentation, modeling language and human motion, etc. We’ll emphasize both the basic algorithms and the practical tricks needed to get them to work well.

If you have found the above useful, please don’t forget to share it with others on social media.

How would you validate-test a machine learning model?


Why evaluate/test model at all?

Evaluating performance is one of the most important stages in predictive modeling: it indicates how well the model has done on the dataset, lets you tune parameters, and in the end lets you test the tuned model against a fresh cut of data.

Below we will look at the most common validation metrics used for predictive modeling. The choice of metric influences how you weigh the importance of different characteristics in the results, and ultimately which machine learning algorithm you choose. Before we move on to the metrics themselves, let’s get the basics right.

Golden rules for validating-testing a model.

Rule #1

Never use the same data for training and testing!!!

Rule #2

Look at Rule #1

What it means is: always set aside a cut of the data that is not included in training, and test your model against it after fitting/tuning is finished.

There is a very simple way to set data aside, using Scikit-Learn:


from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print('Full dataset, features:', len(X))
print('Full dataset, labels:', len(y))
print('Train dataset, features:', len(X_train))
print('Test dataset, features:', len(X_test))
print('Train dataset, labels:', len(y_train))
print('Test dataset, labels:', len(y_test))

In some cases it may feel like ‘losing’ part of the training set, especially when the data sample is not large. There are ways around this as well; one of the most popular is ‘Cross Validation’.

Cross Validation

It is a simple ‘trick’ that splits the data into n equal parts, then successively holds out each part and fits the model using the rest. This gives n estimates of model performance that can be combined into an overall measure. Although computationally ‘heavy’, it is a very efficient and widely used method to avoid overfitting and improve the ‘out of sample’ performance of the model.

Here is an example of Cross Validation using Scikit-Learn:


from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print(scores.mean(), scores.std())

Two main types of predictive models.

Talking about predictive modeling, there are two main types of problems to be solved:

  • Regression problems are those where you try to predict or explain one thing (the dependent variable) using other things (the independent variables), with continuous output, e.g. the exact price of a stock the next day.
  • Classification problems try to determine group membership by deriving probabilities, e.g. will the stock price go up, go down, or stay unchanged the next day. Algorithms like SVM and KNN create a class output. Algorithms like Logistic Regression, Random Forest, Gradient Boosting, AdaBoost etc. give probability outputs. Converting probability outputs to a class output is just a matter of choosing a threshold probability.

In regression problems, we do not have such inconsistencies in output. The output is always continuous in nature and requires no further treatment.
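The probability-to-class conversion mentioned above can be sketched with scikit-learn. This is a minimal illustration, assuming a binary subset of the iris data so that `predict_proba` yields a single positive-class column; the 0.5 threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary subset of iris (classes 0 and 1) for illustration
X, y = load_iris(return_X_y=True)
mask = y < 2
X, y = X[mask], y[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of class 1

threshold = 0.5  # hypothetical cut-off; move it to trade precision against recall
classes = (probs >= threshold).astype(int)
print(classes)
```

Raising the threshold makes the classifier more conservative about predicting the positive class; lowering it does the opposite.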

Techniques/metrics for model validation.

Having divided the dataset and fitted the model, the question is what kind of quantifiable validation metric to use. There are a few very basic quick-and-dirty ways to check performance. One of them is the value range: if model outputs fall far outside the range of the response variable, that immediately indicates poor estimation or model inaccuracy.

Most often there is a need to use something more sophisticated and ‘scientific’.


Accuracy

Accuracy is a classification metric: the number of correct predictions made as a ratio of all predictions. It is probably the most common evaluation metric for classification problems.

Below is an example of calculating classification accuracy using Scikit-Learn.


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(accuracy_score(y_test, preds))


Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all observations labeled as positive, how many are actually positive? High precision relates to a low false positive rate.


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(precision_score(y_test, preds, average=None))

Sensitivity or Recall

Recall (Sensitivity) is the ratio of correctly predicted positive observations to all observations in the actual class.


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(recall_score(y_test, preds, average=None))

F1 score

F1 Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(f1_score(y_test, preds, average=None))

Confusion Matrix

It is a matrix of dimension N x N, where N is the number of classes being predicted. The confusion matrix presents the accuracy of a model with two or more classes, with predictions on one axis and actual outcomes on the other.

There are four possible options:

  • True positives (TP), which are the instances that are positives and are classified as positives.
  • False positives (FP), which are the instances that are negatives and are classified as positives.
  • False negatives (FN), which are the instances that are positives and are classified as negatives.
  • True negatives (TN), which are the instances that are negatives and are classified as negatives.

Below is example code to compute a confusion matrix using Scikit-Learn.


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(confusion_matrix(y_test, preds))

Gain and Lift Chart.

Many times the measure of the overall effectiveness of the model is not enough. It may be important to know if the model does increasingly better with more data. Is there any marginal improvement in the model’s predictive ability if for example, we consider 70% of the data versus only 50%?

The lift chart represents the actual lift for each percentage of the population, defined as the ratio between the percentage of positive instances found by using the model and without using it.

These types of charts are common in business analytics of Direct Marketing where the problem is to identify if a particular prospect was worth calling.
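The lift at each population depth can be sketched in a few lines of NumPy. The labels and scores below are made-up illustrative values, not output from any real model:

```python
import numpy as np

# Hypothetical true labels and model scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

order = np.argsort(-scores)                 # sort by score, best first
sorted_labels = y_true[order]

cum_positives = np.cumsum(sorted_labels)    # positives found so far
pct_population = np.arange(1, len(y_true) + 1) / len(y_true)
pct_positives = cum_positives / y_true.sum()  # % of all positives captured

lift = pct_positives / pct_population       # lift at each depth
print(lift)
```

By construction the lift at 100% of the population is always 1.0; values above 1.0 at shallower depths indicate the model concentrates positives near the top of its ranking.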

Kolmogorov-Smirnov Chart.

This non-parametric statistical test is used to compare two distributions, to assess how close they are to each other. In this context, one of the distributions is the theoretical distribution that the observations are supposed to follow (usually a continuous distribution with one or two parameters, such as Gaussian), while the other distribution is the actual, empirical, parameter-free, discrete distribution computed on the observations.

KS is the maximum difference between the cumulative distributions of Goods (events) and Bads (non-events) across score/probability bands. A gains table typically shows cumulative % Goods and cumulative % Bads across 10 or 20 score bands, and from that table we can find the KS for the model.

KS is a point estimate: it is a single value, indicating the score/probability band where the separation between Goods and Bads is at its maximum.
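A minimal sketch of computing the KS statistic from two sets of scores follows; the score arrays are made-up illustrative values (for real work, `scipy.stats.ks_2samp` offers a ready-made version):

```python
import numpy as np

# Hypothetical scores for actual Goods (events) and Bads (non-events)
good_scores = np.array([0.9, 0.8, 0.7, 0.65, 0.6])
bad_scores = np.array([0.55, 0.5, 0.4, 0.35, 0.2])

thresholds = np.sort(np.concatenate([good_scores, bad_scores]))

# Cumulative fraction of each group at or below every threshold
cum_goods = np.array([(good_scores <= t).mean() for t in thresholds])
cum_bads = np.array([(bad_scores <= t).mean() for t in thresholds])

ks = np.max(np.abs(cum_goods - cum_bads))
print("KS statistic:", ks)
```

In this toy example the two groups are perfectly separated, so KS comes out at its maximum of 1.0; overlapping score distributions would give a smaller value.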

Area Under the ROC curve (AUC – ROC)

The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. It is one of the most popular metrics used in the industry.

The biggest advantage of the ROC curve is that it is almost independent of the response rate, i.e. of changes in the proportion of responders.

The area under the ROC curve (AUC) is a performance metric for binary classification problems. The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly; an area of 0.5 represents a model no better than random.

The example below provides a demonstration of calculating AUC.


import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_scores))

Gini Coefficient

The Gini coefficient is used in classification problems. It can be derived directly from the AUC ROC number: Gini = 2 * AUC − 1.


from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split

def gini(list_of_values):
    sorted_list = sorted(list(list_of_values))
    height, area = 0, 0
    for value in sorted_list:
        height += value
        area += height - value / 2.
    fair_area = height * len(list_of_values) / 2
    return (fair_area - area) / fair_area

def normalized_gini(y_pred, y):
    return gini(y_pred) / gini(y)

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
preds = model.predict(X_test)
print(normalized_gini(preds, y_test))
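Alternatively, for a binary problem the Gini coefficient can be computed directly from the AUC relation (Gini = 2 * AUC − 1); the toy labels and scores below are the same illustrative values used in the AUC example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_scores)
gini_coefficient = 2 * auc - 1
print("AUC:", auc, "Gini:", gini_coefficient)
```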

Logarithmic Loss

Log Loss quantifies the accuracy of a classifier by penalizing false classifications. Minimizing the Log Loss broadly corresponds to maximizing the accuracy of the classifier, while also rewarding well-calibrated probability estimates.

In order to calculate Log Loss, the classifier must assign a probability to each class rather than simply give the most likely class.  A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss.
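A minimal example of computing Log Loss with Scikit-Learn; the labels and predicted probabilities below are toy illustrative values:

```python
from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
# Predicted probabilities of the positive class
y_probs = [0.1, 0.2, 0.8, 0.9]

print(log_loss(y_true, y_probs))
```

A classifier that assigned probability 1.0 to every correct class would score exactly zero; the more confidently wrong the probabilities, the larger the loss.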

Mean Absolute Error

The Mean Absolute Error (MAE) measures how close forecasts or predictions are to the eventual outcomes: it is the average of the absolute differences between predictions and actual values. The measure gives an idea of the magnitude of the error, but no idea of its direction.

The example below demonstrates a simple case:


from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_absolute_error(y_true, y_pred))

Mean Squared Error

One of the most common measures used to quantify model performance, the Mean Squared Error (MSE) is the average of the squared differences between the actual observations and those predicted. Squaring the errors tends to heavily weight outliers, which can dominate the result.

Below is a basic example of the calculation using Scikit-Learn.


from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(y_true, y_pred))

Root Mean Squared Error (RMSE)

RMSE is the most popular evaluation metric used in regression problems. It rests on the assumption that errors are unbiased and follow a normal distribution.

The RMSE metric is given by:

RMSE = sqrt( (1/N) * Σ (predicted_i − actual_i)² )

where N is the total number of observations.
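The formula above can be computed as the square root of the mean squared error, reusing the toy values from the MSE example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)
```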

R^2 Metric

The R^2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination. The value typically lies between 0 and 1, for no fit and perfect fit respectively (it can be negative for models that fit worse than predicting the mean).

Below is a basic example of the calculation using Scikit-Learn.


from sklearn.metrics import r2_score 
y_true = [3, -0.5, 2, 7] 
y_pred = [2.5, 0.0, 2, 8] 
print(r2_score(y_true, y_pred))

If you think the above is useful, please share it with others via social media.

The topic of model validation and testing is much more complicated than the examples above; if you are interested in it, please check the reading list below or look at the available courses.



TensorFlow Tutorial

TensorFlow Tutorial.

What is TensorFlow?

The shortest definition would be, TensorFlow is a general-purpose library for graph-based computation.

But there is a variety of other ways to define TensorFlow. For example, Rodolfo Bonnin, in his book Building Machine Learning Projects with TensorFlow, offers a definition like this:

“TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) passed between them.”

To quote the TensorFlow website, TensorFlow is an “open source software library for numerical computation using data flow graphs”. The name TensorFlow derives from the operations which neural networks perform on multidimensional data arrays, often referred to as ‘tensors’. It uses data flow graphs and is capable of building and training a variety of different machine learning algorithms, including deep neural networks; at the same time, it is general enough to be applicable in a wide variety of other domains as well. Its flexible architecture allows deploying computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.

TensorFlow is Google Brain’s second generation machine learning system, released as open source software in 2015. It is available on 64-bit Linux, macOS, and mobile computing platforms including Android and iOS, and it provides a Python API as well as C++, Haskell, Java, and Go APIs. Google’s machine learning framework has lately become the ‘hottest’ in the data science world; it is particularly useful for building deep learning systems for predictive models involving natural language processing, audio, and images.


What is ‘Graph’ or ‘Data Flow Graph’? What is TensorFlow Session?


Trying to define what TensorFlow is, it is hard to avoid the words ‘graph’ or ‘data flow graph’, so what are they? The shortest definition: a TensorFlow graph is a description of computations. Deep learning (neural networks with many layers) uses mostly very simple mathematical operations – just many of them, on high-dimensional data structures (tensors). Neural networks can have thousands or even millions of weights; computing them by interpreting every step in Python would take forever.

That’s why we create a graph made up of defined tensors and mathematical operations and even initial values for variables. Only after we’ve created this ‘recipe’ we can pass it to what TensorFlow calls a session. To compute anything, a graph must be launched in a Session. The session runs the graph using very efficient and optimized code. Not only that, but many of the operations, such as matrix multiplication, are ones that can be parallelised by supported GPU (Graphics Processing Unit) and the session will do that for you. Also, TensorFlow is built to be able to distribute the processing across multiple machines and/or GPUs.

TensorFlow programs are usually divided into a construction phase, that assembles a graph, and an execution phase that uses a session to execute operations in the graph. To do machine learning in TensorFlow, you want to create tensors, adding operations (that output other tensors), and then executing the computation (running the computational graph). In particular, it’s important to realize that when you add an operation on tensors, it doesn’t execute immediately. TensorFlow waits for you to define all the operations you want to perform and then optimizes the computation graph, ‘deciding’ how to execute the computation, before generating the data. Because of this, tensors in TensorFlow are not so much holding the data as a placeholder for holding the data, waiting for the data to arrive when a computation is executed.





Before we move on to create our first model in TensorFlow, we’ll need to get the basics right and talk a bit about the structure of a simple neural network.

A simple neural network has some input units where the input goes. It also has hidden units, so-called because from a user’s perspective they’re hidden. And there are output units, from which we get the results. Off to the side are also bias units, which are there to help control the values emitted from the hidden and output units. Connecting all of these units are a bunch of weights, which are just numbers, each of which is associated with two units. The way we train neural network is to assign values to all those weights. That’s what training a neural network does, find suitable values for those weights. One step in “running” the neural network is to multiply the value of each weight by the value of its input unit, and then to store the result in the associated unit.
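The multiply-and-accumulate step described above can be sketched in a few lines of NumPy. The weights, inputs, and bias values here are made-up illustrative numbers, not a trained network:

```python
import numpy as np

inputs = np.array([0.5, -0.2, 0.1])       # values of the input units

# Hypothetical weights: one row per hidden unit, one column per input unit
weights = np.array([[0.4, 0.3, -0.5],
                    [0.2, -0.6, 0.1]])
bias = np.array([0.1, -0.1])              # contribution of the bias units

# Each hidden unit sums weight * input, plus its bias
hidden = weights @ inputs + bias
print(hidden)
```

Training a network means searching for weight values that make outputs computed this way match the desired targets.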

There are plenty of resources available online to get more background on neural network architectures, a few examples below:



Deep learning uses very simple mathematical operations, so it is recommended to get or refresh at least the basics of them. I recommend starting from one of the following:



It would be advisable to have basic Python programming skills before moving forward; a few available resources:


Let’s do it… TensorFlow first example code.


To keep things simple, let’s start with a ‘Hello World’ example.

Importing TensorFlow:


import tensorflow as tf

Declaring constants/variables, TensorFlow constants can be declared using the tf.constant function, and variables with the tf.Variable function.  The first element in both is the value to be assigned the constant/variable when it is initialised.  TensorFlow will infer the type of the constant/variable initialised value, but it can also be set explicitly using the optional dtype argument. It’s important to note that, as the Python code runs through these commands, the variables haven’t actually been declared as they would have been if you just had a standard Python declaration.

x = tf.constant(2.0) 
y = tf.Variable(3.0)

Let’s make our code compute something – a simple multiplication.

z = y * x

Now comes the time when we would like to see the outcome, except nothing has been computed yet… welcome to TensorFlow. To make use of TensorFlow variables and perform calculations, a Session must be created and all variables must be initialized. We can do that with the following statements.

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

We have Session and even all constants/variables in place. Let’s see the outcome.

print("z = y * x = ", sess.run(z))


If you see something like this:
‘z = y * x = 6.0’
Congratulations, you have just coded you first TensorFlow ‘model’.

Below whole code in one piece:

import tensorflow as tf

x = tf.constant(2.0)
y = tf.Variable(3.0)
z = y * x

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print("z = y * x = ", sess.run(z))

This tutorial, of course, does not end here and will be continued soon… in the next part, we will code our first neural network in TensorFlow.

If you liked this post, please share it on your social media; if you have any questions or comments, please use the contact form.

Recommended reading list below:

Data Science News

Data Science News Digest – handpicked articles, news, and stories from Data Science world.




  • CUDA 9 Features Revealed  – At the GPU Technology Conference, NVIDIA announced CUDA 9, the latest version of CUDA’s powerful parallel computing platform and programming model.




  • AlphaGo’s next move – Chinese Go Grandmaster and world number one Ke Jie departed from his typical style of play and opened with a “3:3 point” strategy – a highly unusual approach aimed at quickly claiming corner territory at the start of the game.


  • Integrate Your Amazon Lex Bot with Any Messaging Service – Is your Amazon Lex chatbot ready to talk to the world? When it is, chances are that you’ll want it to be able to interact with as many users as possible. Amazon Lex offers built-in integration with Facebook, Slack and Twilio. But what if you want to connect to a messaging service that isn’t supported? Well, there’s an API for that–the Amazon Lex API.


  • How Our Company Learned to Make Better Predictions About Everything – In Silicon Valley, everyone makes bets. Founders bet years of their lives on finding product-market fit, investors bet billions on the future value of ambitious startups, and executives bet that their strategies will increase a company’s prospects. Here, predicting the future is not a theoretical superpower, it’s part of the job.


  • Are Pop Lyrics Getting More Repetitive? – In 1977, the great computer scientist Donald Knuth published a paper called The Complexity of Songs, which is basically one long joke about the repetitive lyrics of newfangled music (example quote: “the advent of modern drugs has led to demands for still less memory, and the ultimate improvement of Theorem 1 has consequently just been announced”).


  • Home advantages and wanderlust – When Burnley got beat 3-1 by Everton at Goodison Park on the 15th April, 33 games into their Premier League season, they’d gained only 4 points out of a possible 51 in their away fixtures. But during this time they’d also managed to accrue 32 points out of a possible 48 at Turf Moor; if the league table were based upon only home fixtures, they’d be in a highly impressive 6th place.




  • The Simple, Economic Value of Artificial Intelligence – How does this framing now apply to our emerging AI revolution?  After decades of promise and hype, AI seems to have finally arrived, – driven by the explosive growth of big data,  inexpensive computing power and storage, and advanced algorithms like machine learning that enable us to analyze and extract insights from all that data. 


Immortal Life: A Soon To Be True Story Kindle Edition by Stanley Bing

Neural Network Programming with Python Kindle Edition by Fabio. M. Soares, Rodrigo Nunes


If you have found the above useful, please don’t forget to share it with others on social media.

“MUST KNOW” Python-Pandas for Data Science

Top 10 “MUST KNOW” from Python-Pandas for Data Science.

Pandas is a very popular Python library for data analysis, manipulation, and visualization. I would like to share my personal view on the list of most often used functions/snippets for data analysis.

1. Import Pandas to Python

import pandas as pd

2. Import data from CSV/Excel file

df=pd.read_csv('C:/Folder/mlhype.csv')   #imports whole csv to pd dataframe
df = pd.read_csv('C:/Folder/mlhype.csv', usecols=['abv', 'ibu'])  #imports selected columns
df = pd.read_excel('C:/Folder/mlhype.xlsx')  #imports excel file

3. Save data to CSV/Excel

df.to_csv('C:/Folder/mlhype.csv') #saves data frame to csv
df.to_excel('C:/Folder/mlhype.xlsx') #saves data frame to excel

4. Exploring data

df.head(5) #returns top 5 rows of data
df.tail(5) #returns bottom 5 rows of data
df.sample(5) #returns random 5 rows of data
df.shape #returns number of rows and columns
df.info() #returns index, data types, memory information
df.describe() #returns basic statistical summary of columns
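On a toy frame (three made-up rows) these exploration calls look like this:

```python
import pandas as pd

toy = pd.DataFrame({'abv': [0.05, 0.07, 0.09], 'ibu': [40, 60, 90]})
print(toy.shape)    # (3, 2) -> 3 rows, 2 columns
print(toy.head(2))  # first two rows

summary = toy.describe()  # count/mean/std/min/quartiles/max per column
print(summary.loc['mean', 'abv'])  # mean of the 'abv' column (~0.07)
```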

5. Basic statistical functions

df.mean() #returns mean of columns
df.corr() #returns correlation table
df.count() #returns count of non-null values in each column
df.max() #returns max value in each column
df.min() #returns min value in each column
df.median() #returns median of each column
df.std() #returns standard deviation of each column

6. Selecting subsets

df['ColumnName'] #returns column 'ColumnName'
df[['ColumnName1','ColumnName2']] #returns multiple columns from the list
df.iloc[2,:] #selection by position - whole third row (positions are zero-based)
df.iloc[:,2] #selection by position - whole third column
df.loc[:10,'ColumnName'] #returns first 11 rows of column (label-based, end-inclusive)
df.ix[2,'ColumnName'] #deprecated - use .loc or .iloc instead
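The easiest way to remember the difference: .iloc is zero-based positional (and excludes the end of a slice), while .loc is label-based (and includes the end label). A quick sketch on an invented frame:

```python
import pandas as pd

d = pd.DataFrame({'a': [10, 20, 30], 'b': [1, 2, 3]})

print(d.iloc[2])                # position 2 = the THIRD row: a=30, b=3
print(d.iloc[:, 1].tolist())    # second column 'b': [1, 2, 3]
print(d.loc[:1, 'a'].tolist())  # label-based and end-inclusive: rows 0..1 -> [10, 20]
```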

7. Data cleansing

df.columns = ['a','b','c','d','e','f','g','h'] #rename column names
df.dropna() #drops all rows that contain missing values
df.fillna(0) #replaces missing values with 0 (or any other passed value)
df.fillna(df.mean()) #replaces missing values with mean(or any other passed function)
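Filling with a per-column statistic is a common trick; in this small made-up example the single missing value becomes the column mean:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# df.mean() returns one value per column; fillna matches them by column name
filled = frame.fillna(frame.mean())
print(filled['x'].tolist())  # [1.0, 2.0, 3.0] - the NaN became the mean of 1.0 and 3.0
```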

8. Filtering and sorting

df[df['ColumnName'] > 0.08] #returns rows with meeting criterion 
df[(df['ColumnName1']>2004) & (df['ColumnName2']==9)] #returns rows meeting multiple criteria
df.sort_values('ColumnName') #sorts by column in ascending order
df.sort_values('ColumnName',ascending=False) #sort by column in descending order
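Boolean filters can be combined with & (and) and | (or), with each condition in parentheses; a small sketch with invented data:

```python
import pandas as pd

beers = pd.DataFrame({'abv': [0.05, 0.09, 0.07],
                      'year': [2003, 2010, 2006]})

# keep rows where BOTH conditions hold
strong = beers[(beers['abv'] > 0.06) & (beers['year'] > 2004)]
print(strong['abv'].tolist())  # [0.09, 0.07]

# sort ascending by a column
print(beers.sort_values('abv')['abv'].tolist())  # [0.05, 0.07, 0.09]
```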

9. Data frames concatenation

pd.concat([DataFrame1, DataFrame2],axis=0) #stacks rows vertically (one frame on top of the other)
pd.concat([DataFrame1, DataFrame2],axis=1) #concatenates columns horizontally (side by side)
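The axis argument is easiest to see from the resulting shapes; with two small 2x1 frames:

```python
import pandas as pd

top = pd.DataFrame({'a': [1, 2]})
bottom = pd.DataFrame({'a': [3, 4]})

stacked = pd.concat([top, bottom], axis=0)  # rows stacked vertically
side = pd.concat([top, bottom], axis=1)     # columns placed side by side

print(stacked.shape)  # (4, 1)
print(side.shape)     # (2, 2)
```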

10. Adding new columns

df['NewColumn'] = 50 #creates new column with value 50 in each row
df['NewColumn3'] = df['NewColumn1']+df['NewColumn2'] #new column with value equal to sum of other columns
del df['NewColumn'] #deletes column
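Column arithmetic is element-wise, so a derived column is one line; a minimal sketch with invented column names:

```python
import pandas as pd

df2 = pd.DataFrame({'price': [10, 20], 'qty': [3, 4]})

df2['total'] = df2['price'] * df2['qty']  # element-wise product of two columns
print(df2['total'].tolist())  # [30, 80]

del df2['total']             # and remove it again
print(df2.columns.tolist())  # ['price', 'qty']
```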

I hope you will find the above useful. If you need more information on pandas, I recommend going to the Pandas documentation or picking up a dedicated book on the topic.

What is Hadoop YARN?

Hadoop YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science, and batch processing to handle data stored on a single platform, unlocking an entirely new approach to analytics. YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost-effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop.

As its architectural center, YARN enhances a Hadoop compute cluster in four main ways: multitenancy, cluster utilization, scalability, and compatibility. Multi-tenant data processing improves an enterprise's return on Hadoop investments. YARN's dynamic allocation of cluster resources improves utilization over the more static MapReduce rules. YARN's resource manager focuses exclusively on scheduling, so it keeps pace as clusters expand to thousands of nodes. And existing MapReduce applications developed for Hadoop 1 can run on YARN without any disruption to the processes that already work.

What is Hadoop Flume?

Hadoop Flume was created as an Apache incubator project to let you flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation (and in Flume, as in other paradigms that use this term, the sink of one operation can be the source for the next downstream operation). A decorator is an operation on the stream that can transform it in some manner, for example by compressing or decompressing data, or by modifying it through adding or removing pieces of information. Flume supports a number of different configurations and topologies, allowing you to choose the right setup for your application. Flume is a distributed system that runs across multiple machines. It can collect large volumes of data from many applications and systems, it includes mechanisms for load balancing and failover, and it can be extended and customized in many ways. In short, Flume is a scalable, reliable, configurable, and extensible system for managing the movement of large volumes of data.
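As a rough illustration of how these pieces are wired together, a minimal Flume agent is defined in a properties file that names a source, a channel, and a sink and connects them. All the names, paths, and capacity values below are invented for the sketch:

```properties
# Name the components of agent "a1" (component names are arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a log file via the exec source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```

An agent like this is started with the flume-ng launcher, pointing it at the config file and the agent name (a1 here).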

What is Apache Kafka?

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log,” making it highly valuable for enterprise infrastructures that process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library. The design is heavily influenced by transaction logs. Apache Kafka was originally developed by LinkedIn and was subsequently open sourced in early 2011; it graduated from the Apache Incubator on 23 October 2012. Due to its widespread integration into enterprise-level infrastructures, monitoring Kafka performance at scale has become an increasingly important issue. Monitoring end-to-end performance requires tracking metrics from brokers, consumers, and producers, in addition to monitoring ZooKeeper, which Kafka uses for coordination among consumers. There are currently several monitoring platforms that track Kafka performance, both open-source, like LinkedIn’s Burrow, and paid, like Datadog. In addition to these platforms, Kafka data can also be collected using tools commonly bundled with Java, including JConsole.

What is Hadoop Zookeeper?

Hadoop ZooKeeper is an open-source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster. ZooKeeper maintains common objects needed in large cluster environments, such as configuration information and a hierarchical naming space. Applications can leverage these services to coordinate distributed processing across large clusters. Name services, group services, synchronization services, configuration management, and more are available in ZooKeeper, which means that projects can embed ZooKeeper without having to build synchronization services from scratch. Interaction with ZooKeeper occurs via Java or C interfaces. Within ZooKeeper, an application can create what is called a znode (a file that persists in memory on the ZooKeeper servers). The znode can be updated by any node in the cluster, and any node in the cluster can register to be informed of changes to that znode (in ZooKeeper parlance, a server can be set up to “watch” a specific znode). Using this znode infrastructure, applications can synchronize their tasks across the distributed cluster by updating their status in a ZooKeeper znode. This cluster-wide status centralization service is essential for management and serialization tasks across a large distributed set of servers.
