What The Fuck Is Data Science?

If you are a contemporary technologist, then no doubt you have heard the term ‘Data Science’ (DS). Its close relative ‘Machine Learning’ is never far away in any conversation that utters those magical words. ‘Big Data’, ‘Data Analytics’ and ‘Predictive Analytics’ are often used as near-synonyms, while its older version ‘Data Mining’ has gone the way of Windows 95.

Insert cheesy infomercial voice: ‘With the power of machine learning, you can take three cameras and synthesize them into a single higher resolution shot’. Of course, machine learning can just as well combine shots from five cameras on something that is not an iPhone, e.g. a Nokia.

In other words, Data Science & Machine Learning come with a bit of self-interested marketing blitz, so that they are associated with all things novel and technologically innovative. This makes the definition ever more difficult to pin down.

Sycophantically, Data Science is often attached to industry goals. It tells management what consumers are looking for, what they thought existed in a product catalog, or what products they may use under certain conditions. At its core, it is the application of statistical methods to large quantities of machine-generated or curated data.

Data Science Tasks

A Data Scientist can take data from some domain and make predictions about potential behavior that corresponds to that domain using some programming techniques covered in a UC Davis Neuroscience course. Still not clear? I personally don’t think we’ll get a more sophisticated definition than ‘one part large data, one part domain expertise & two scoops of computer science with a side of statistics’.

Universities around the world are rushing to complement subject matter expertise with a stochastic methods component (think UC Davis offering Computational Linguistics every year since the late 2000s). There are also efforts to expand the data science curriculum into more generalist undergraduate courses. Currently, there is enough material to distill into several Data Science courses for an undergraduate degree at UNAM (National Autonomous University of Mexico), much of it drawn from faculty research.

All of this is to say that Data Science will continue to evolve and generate interest because of its uniquely tangible results: predictions that are better than nothing in domains that the IT industry cares about.

For many of the core technical points, we find that the best resource is the Stanford University course on Data Mining.

— Ricardo Lezama

ImageAI: Python Library For Recognizing Images

Ricardo Lezama: ImageAI is an excellent, easy-to-use Machine Learning wrapper that lets a Python script identify the dominant concept in an image. The developers are a group from a Facebook-backed outfit based in Nigeria. One of the principal developers is Moses Olafenwa, a founder of DeepQuest AI. Aside from this excellent Python library, Olafenwa’s group develops AI servers for business applications.

Code Summary: ImageAI Predictions

In this summary, we will review the code examples here: https://github.com/OlafenwaMoses/ImageAI/tree/master/imageai/Prediction

Model Dependencies: ResNet

Aside from the libraries called through import statements, the more important dependencies for our test script using ImageAI’s Python module are the pretrained models that an image can be run against. In this particular example, we reference the ResNet model trained on the ImageNet dataset, which covers 1,000 object categories. There is an annual competition in which various neural net models are compared against one another using ImageNet as a frame of reference.

ResNet is a model that uses ‘residual learning’ to make much deeper networks trainable.

According to the authors, ResNet “explicitly reformulate[s] the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.”
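To make the idea of a ‘residual function with reference to the layer inputs’ a bit more concrete, here is a toy sketch in plain Python. It is not the ResNet implementation and not part of ImageAI; the helper names (relu, matvec, residual_block) and the tiny 3x3 weights are assumptions made up for illustration. The block computes F(x) with two small layers and then adds the input x back in through a skip connection.

```python
# Toy residual block in plain Python. Illustrative only: these helpers are
# hypothetical and do not come from ResNet's code or from ImageAI.

def relu(v):
    # Element-wise max(0, x)
    return [max(0.0, x) for x in v]

def matvec(w, v):
    # Multiply matrix w (list of rows) by vector v
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

def residual_block(x, w1, w2):
    # F(x): the residual function that the two layers learn
    fx = matvec(w2, relu(matvec(w1, x)))
    # Skip connection: add the input back before the final activation
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) is 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```

The point of the demo is that a block which has learned nothing (all-zero weights) still passes its input through unchanged, which is one intuition for why stacking many such blocks does not have to hurt.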

Easy Ways To Interface

Instead of modifying the hardcoded line referencing an image, we modified the sample script to accept a simple command line argument. The script (posted originally here) has been modified to use the sys library, so the image path is simply passed in as a command line argument.

Name the file “prediction.py”, then run the script (copy/paste) from the directory that holds both your image file and the model file resnet50_weights_tf_dim_ordering_tf_kernels.h5, passing the image name as the only argument. The underlying ResNet model was developed at Microsoft Research by Kaiming He et al.

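from imageai.Prediction import ImagePrediction
import os
import sys

# The model file is expected in the directory the script is run from.
model_path = os.path.join(os.getcwd(), "resnet50_weights_tf_dim_ordering_tf_kernels.h5")

prediction = ImagePrediction()
prediction.setModelTypeAsResNet()   # select the ResNet architecture
prediction.setModelPath(model_path)
prediction.loadModel()              # must be called before predictImage

# sys.argv[1] is the image path passed on the command line.
predictions, percentage_probabilities = prediction.predictImage(sys.argv[1], result_count=5)
for label, probability in zip(predictions, percentage_probabilities):
    print(label, " : ", probability)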
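
The script trusts that sys.argv[1] exists and points to a real file. A slightly more defensive way to read the argument might look like the sketch below. The helper name image_path_from_args is hypothetical (it is not part of ImageAI), and the usage message assumes the script is saved as prediction.py.

```python
import os
import sys

def image_path_from_args(argv):
    """Return the image path from an argv-style list, or raise ValueError.

    Hypothetical helper for illustration; not part of ImageAI.
    """
    # argv[0] is the script name, argv[1] should be the image path
    if len(argv) != 2:
        raise ValueError("usage: python prediction.py <image-file>")
    path = argv[1]
    if not os.path.isfile(path):
        raise ValueError("no such image file: " + path)
    return path

# Demo with a fake argument list that is missing the image path.
try:
    image_path_from_args(["prediction.py"])
except ValueError as err:
    print(err)
```

In the script itself, one would call image_path_from_args(sys.argv) and hand the result to predictImage in place of sys.argv[1], so that a missing or mistyped file name fails with a readable message before the model is loaded.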