What is it like to be a Data Scientist?

In the following YouTube video, Josh Wills, Senior Director of Data Science at Cloudera, talks about what it is like to be a Data Scientist. The term data scientist means different things to different people. Josh Wills himself has an excellent definition of a Data Scientist that fits in fewer than 140 characters:

“a person who is better at statistics than any software engineer and better at software engineering than any statistician”

In this 33-minute video, Josh Wills takes you through the key components of becoming, working as, and thinking like a Data Scientist.

Here is a short summary of what he touches on during the video:

  • What data scientists do.
  • The evolution of data storage.
  • Extracting business value from data – Data economics.
  • Big Data economics.
  • Hadoop, Map/Reduce.
  • Thinking like a data scientist.
  • Solving data intensive problems and finding insights.
  • Data abundance vs data scarcity.
  • Creating a data science team.
  • Designing machine learning models to optimize a business problem.
  • Measuring impact.
  • Finally, an introduction to the Data Science courses offered by Cloudera.

Cloudera: Training a New Generation of Data Scientists

Why should you become a Data Scientist?

The following YouTube video by EMC outlines some of the reasons why you should aim for a career in Data Science. Here is a rough guide to the difference between a statistician and a data scientist:

[Image: rough guide to the difference between a statistician and a data scientist]

TL;DW (too long; didn’t watch)

  • Data Science is a blend of math, statistics, and computer science that is used to solve quantitative problems creatively.
  • Lots of companies are trying to hire people with these skill sets.
  • Demand for these roles is far outstripping the supply.
  • People with Data Science skills can apply their analytical skills to many verticals, such as economics, healthcare, and many more industries.
  • Websites such as Kaggle provide a platform where you can practice your Data Science skills on real-world datasets provided by many companies from a range of domains. Some competitions also offer prize money, which is another bonus.
  • Data Scientists can also be thought of as consultants who work with data.
  • They need to work with people to understand business problems and represent those problems in a quantitative way on top of which they can run different algorithms.
  • They need to be able to work across a range of verticals, visualize data, build models, etc.


Here is the YouTube video (6:46 minutes):

Why should I enroll for Data Science and Big Data Analytics course?



Why should you join the Graduate Data Science Initiative?


The Graduate Data Science Initiative (GDSI) aims to create an environment where students, both undergraduate and graduate, can learn about data science at an introductory, intermediate, or advanced level. The initiative’s main objective is to equip students with the right skills to enter the growing market of data science.

As a member of the GDSI you will learn about the different tools, methods, and technologies used within the data science community, and you will be exposed to real-world use cases of data science in industry and academia. Among many other things, you will learn how:

  • machine learning is used to predict stock market trends;
  • natural language processing is used to determine the sentiment on a particular topic;
  • programming languages like R, Clojure, and Python are leading the way in performing data analysis;
  • NoSQL databases are providing a new paradigm for storing unstructured data retrieved from social networks, sensors, log files, etc.;
  • technologies like Hadoop are revolutionising data processing and parallel computing;
  • recommendation systems work inside Amazon, Netflix, and Spotify;
  • DNA sequencing is being facilitated by machine learning algorithms;
  • data mining is helping us better understand human-environment interactions and socio-economic dynamics;
  • predictive analytics is helping today’s business leaders make key decisions and transform their businesses into huge success stories.

They say “Data is the new oil”. The amount of data in the world is increasing at an incredible rate. Over 90% of the world’s data was generated in the last two years alone. This explosion of data is expected to continue over the next five to ten years as more and more data sources become available and the digital world becomes more deeply embedded in our daily lives. As the volume of data grows, so does the difficulty of managing huge datasets, which are usually unstructured in nature. This includes data from social networks, log files, sensors, etc. Unstructured data poses a new storage challenge for current database technologies that adhere to the relational model. There are also new challenges in processing huge amounts of data in a timely fashion. A single computer can only process a limited amount of data at any one time. Therefore, the work must be distributed across a number of computing devices to make data processing more efficient. Finally, once data has been stored and processed, it should yield meaningful, actionable insights that benefit the end user in one way or another.

The whole idea of Data Science is ultimately to “turn data into insights into action”. The term Data Science is often used interchangeably with the term Big Data. The whole spectrum of Data Science ‘processes’ can be highly complicated to implement and put together. However, the benefits far outweigh the cost and time spent. As such, companies from all sectors and industries have started a hiring spree to recruit the best in the field to provide Data Science solutions to their clients. The most interesting aspect of Data Science is that it is applicable to almost every domain or industry where data is a factor. Financial services use Data Science to detect and prevent fraud, saving tens of millions of dollars. The games industry uses Data Science to estimate the value of customers coming through different marketing channels, improve game levels by analysing gamers’ behaviour, and encourage users to upgrade to paid versions. Online movie distribution networks such as Netflix gather data about their viewers, such as movies watched, likes, and user preferences, to generate meaningful recommendations. Cyber security uses Data Science to analyse network logs to detect and predict network intrusions. Data Science has also helped hospitals offer better treatments to their patients, among many other applications.

Companies across a range of industries are investing heavily in Data Science, and there is an increasing demand for data scientists. The following graph illustrates this demand over a period of five years, with notable growth between 2011 and 2013. Demand for data scientists increased by 15,000% between the summers of 2011 and 2012 alone.


Fig 1. (source: indeed.com)

However, despite the huge demand, there is a severe skills shortage in this area. It is estimated that the United States alone could face a shortage of 140,000 to 190,000 people by 2018. Part of the reason for this shortage is the lack of university courses that equip students with the right skills to enter this market. The good news is that you don’t necessarily need a university degree to become a data scientist, nor do you necessarily need any programming experience. Data scientists generally come from a number of different disciplines, such as biostatistics, econometrics, engineering, computer science, physics, applied mathematics, statistics, and other interrelated disciplines. Here is a rough illustrative guide on how to become a data scientist:


Fig 2. (source: Swami Chandrasekaran)

As you can see from the illustrative guide above, Data Science spans a spectrum of dimensions, from data warehousing and data integration to statistics, machine learning, data mining, and visualization. One great way to kick-start your education and career in Data Science is to join the Graduate Data Science Initiative (GDSI). The GDSI’s main aim is to encourage and help students learn data science from experts in the field who are already involved in Big Data projects, some of whom even lead Data Science companies. You will have the opportunity to learn by listening to technical presentations and industry use cases and by taking part in workshops, hackathons, and many other activities. Best of all, you’ll be doing all of this in a social environment where you’ll have the chance to follow up with questions and have fun along the way.

We at GDSI are really passionate about Data Science. We hope you are too. We can’t wait for you to join us and be part of our community.

Join us at the following link: http://www.meetup.com/Graduate-Data-Science-Initiative/