The World of a Data Scientist

Over the last two to three years we’ve seen a proliferation of new data sources in the marketplace, enabling people to generate more data in a couple of years than in all the years since the advent of the internet. This data explosion, if you will, has introduced new challenges and opportunities for companies that want to remain competitive and relevant in the information age. The term Big Data has been used to describe this phenomenon, which results from a combination of data generated externally from sources such as social networks, sensor readings, network traffic, and mobile devices, and data generated internally from company systems.

However, this data in its raw form does not provide any value if it is not analyzed properly and if the business does not take advantage of it. To do that successfully, we need the right combination of technology, people, business processes, and culture. At the centre of it sits a new field (some would argue that it is not new): Data Science. It is through the adoption of a scientific process and a fact-based decision-making mindset that companies will start to see tangible results in terms of increased overall business value. This scientific process is what we nowadays call Data Science. It is Data Science that helps turn big data into advantage, value, and impact.

Now you might ask what Data Science actually is. The fact is that there is no standard definition, and people from different backgrounds usually have different opinions about it. Take, for instance, Jean-Paul Isson, who understands Data Science to be about understanding the business challenge and creating actionable insights which are then communicated back to the business. Hilary Mason and Chris Wiggins go a bit deeper into some of the technical areas involved in Data Science, such as statistics, machine learning, mathematics, and domain expertise. DJ Patil, on the other hand, who is credited with coining the phrase “Data Scientist” back in 2008 at LinkedIn, hints at the fact that Data Science is about creating immediate and massive impact on the business through data applications, which come from combining data and science.

Data Science has led companies to undertake business transformation initiatives in order to adopt a data-driven culture within the organisation. At the centre of this is the role of the Data Scientist, which a 2012 article in the Harvard Business Review called the sexiest job of the 21st century. But what exactly is a Data Scientist, what do Data Scientists do, where do they fit within the organization, and how do they differ from, say, a Business Analyst?

Let’s start with the last point.

business_analyst_vs_data_scientist

As you can see from this illustration, Data Scientists, when compared to Business Analysts and Data Engineers, sit at the far right of the analytics spectrum, where predictive and prescriptive analytics are concerned. Data Engineers or Data Architects usually work on problems around building the infrastructure (hardware and software), representing raw data in a computable format, and moving data from system to system. Business Analysts tend to focus more on reporting, summarization, and interpretation; in other words, they focus on historical analysis of data. In contrast, a Data Scientist is more concerned with generating insights by applying a scientific process to data analysis in order to understand the present and predict the future. Let’s dive a bit deeper into the differences and common features of a Business Analyst and a Data Scientist.

ba_vs_ds

We can see that they share many common attributes, such as SQL skills, the use of data mining tools, and the ability to communicate with the business. However, when it comes to more advanced analytics, where statistics, maths, engineering, and big data are involved, we lean more towards the Data Scientist role than the Business Analyst. It is important to understand that both share the same purpose, which is to turn data into insight into action. The methodology and tooling they use are different, but the purpose remains the same. To expand on the methodology: we tend to think that a Business Analyst is more concerned with giving answers to known questions, using clean historical data that allows them to analyse past activity. A Data Scientist’s methodology is more experimental in nature. It is more about working with messy data and searching for patterns and new insights. A Data Scientist explores the data and finds answers to questions that the business never thought to ask in the first place.

vertical_horizontal_ds

Now the world of a Data Scientist is not black and white. There are differences within the Data Scientist role itself. One particular difference is based on knowledge and experience. We tend to think of two groups in that regard: vertical and horizontal data scientists. The vertical data scientist is a specialist focused on a specific area: for instance, one might be an expert in Hadoop and R, another might be an expert in Machine Learning and NoSQL, and so on. The horizontal data scientist, in contrast, has cross-discipline knowledge and might be good at machine learning and statistics, programming, visualization, domain expertise, and storytelling. However, these candidates are hard to find. As a result, many companies have resorted to creating data science teams consisting of individuals with specialist skills in different areas (business, domain expertise, machine learning, databases, etc.). This also enables closer collaboration with the business, and as such data science teams are usually embedded in other teams within the organization.

exploratory_vs_operational

On the other hand, when comparing Data Scientists on the basis of what they do in their day-to-day activities, we can differentiate between exploratory and operational data scientists. An Exploratory Data Scientist is typically found doing investigative analysis or experimentation using interactive statistical environments like R and SPSS and programming languages like Python. An Operational Data Scientist, on the other hand, is usually building systems that use scalable machine learning libraries, are production ready, and can be consumed by the business immediately. The two groups use different and sometimes overlapping tools and architectures.

Generally speaking we see Data Scientists use the following tools, with SQL, R, Python, Excel, and Hadoop leading the pack. Although we’re seeing an increasing amount of unstructured data being ingested and used by companies – SQL still remains strong in the Data Scientist’s world. The need to utilize existing resources and skills is what’s driving SQL’s dominance in this list. This is even more emphasized by the recent technological innovations from large companies around bringing SQL on top of Hadoop. Here’s a survey conducted by O’Reilly during the 2013 Strata Conference which shows some of the most commonly used tools by attendees of the conference working in Data and Non-Data roles.

tools

In the following example you can clearly see the tools used by a real-world data scientist and how they are categorized under experimentation and production. He uses these tools to develop algorithms, to find, clean, and transform data, and ultimately to build production systems and extract value from data.

expe_prod_tools

So where do Data Scientists fit in within an organization? We see organisations positioning data science teams in two different ways:

On some occasions a data science team is kept separate from other teams and communication is limited to meetings and planning sessions.

separate

On other occasions we’re seeing Data Science teams becoming an integral part of the development team. That effectively increases the influence of the data science team in driving product innovation and the strategic positioning of the organization in the market.

integrated

A Data Scientist will typically follow a common process to generate insight. That process may differ from project to project, but the following phases are quite common in the industry: Acquire, Transform, Model, Learn, Develop Data Product, and Deliver Insight. Naturally you would split this process into two groups: the data preparation and wrangling phase, and the insight generation phase.

datascience_proces
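
To make the phases listed above a little more concrete, here is a minimal sketch in Python of how a few of them might chain together. Everything in it is an assumption made up for illustration: the records, the field names, and the helper functions `acquire`, `transform`, `model`, and `deliver_insight`. The "model" is just an average, standing in for whatever a real project would use.

```python
from statistics import mean

def acquire():
    # Hypothetical raw records; in practice these would come from logs, APIs, or databases.
    return [
        {"customer": "a", "spend": "120.5"},
        {"customer": "b", "spend": ""},      # missing value
        {"customer": "c", "spend": "80"},
        {"customer": "d", "spend": "n/a"},   # malformed value
    ]

def transform(raw):
    # Data wrangling: keep only rows whose spend can be parsed as a number.
    clean = []
    for row in raw:
        try:
            clean.append({"customer": row["customer"], "spend": float(row["spend"])})
        except ValueError:
            continue
    return clean

def model(clean):
    # A deliberately trivial "model": average spend used as a baseline.
    return mean(row["spend"] for row in clean)

def deliver_insight(clean, baseline):
    # Report the customers who spend more than the baseline.
    return [row["customer"] for row in clean if row["spend"] > baseline]

raw = acquire()
clean = transform(raw)
baseline = model(clean)
above = deliver_insight(clean, baseline)
print(f"kept {len(clean)} of {len(raw)} rows, baseline spend {baseline:.2f}")
print("above baseline:", above)
```

Even in this toy version, half of the raw rows are discarded before any modelling happens, which hints at why the preparation phase tends to dominate the work.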

It is broadly accepted that data exploration, wrangling, and modelling are the most time-consuming tasks in this process. Usually, we find a Data Scientist spending almost 80% of the time on typical data wrangling tasks. This is exactly where we’re seeing many companies innovating: bringing that 80% figure down and shifting the focus from data wrangling to actual insight and business value generation.

In summary, Data Scientists are increasingly becoming influencers, affecting key decisions and infusing a fact-based decision-making mindset within organizations. The world of a Data Scientist is not black and white: we’ve seen that the role can be viewed from many different perspectives, and that people approach the role and the field itself from different angles. What’s important to note is that Data Science is all about converting data into insight into action. This can be achieved through embracing a data-driven culture, investing in people and skills, and adopting new technologies and tools, preferably in that order. :)


What is it like to be a Data Scientist?

In the following YouTube video Josh Wills, Senior Director of Data Science at Cloudera, talks about what it is like to be a Data Scientist. The term data scientist means different things to different people. Josh Wills himself has an excellent less-than-140-character definition of a Data Scientist, which goes like this:

“a person who is better at statistics than any software engineer and better at software engineering than any statistician”

In this 33-minute video Josh Wills takes you through the key components of what it takes to become, work as, and think like a Data Scientist.

Here is a short summary of what he touches on during the video:

  • What data scientists do.
  • The evolution of data storage.
  • Extracting business value from data: data economics.
  • Big Data economics.
  • Hadoop, Map/Reduce.
  • Thinking like a data scientist.
  • Solving data-intensive problems and finding insights.
  • Data abundance vs data scarcity.
  • Creating a data science team.
  • Designing machine learning models to optimize a business problem.
  • Measuring impact.
  • An introduction to the Data Science courses offered by Cloudera.

Cloudera: Training a New Generation of Data Scientists

Why should you become a Data Scientist?

The following YouTube video by EMC outlines some of the reasons why you should aim for a career in Data Science. Here is a rough guide to the difference between a statistician and a data scientist:

Screen Shot 2013-10-09 at 00.08.39

TL;DW (too long; didn’t watch)

  • Data Science is a blend of math, statistics, and computer science that is used to solve quantitative problems creatively.
  • Lots of companies are trying to hire people with these skill sets.
  • Demand for these roles is far outstripping the supply.
  • People with Data Science skills can apply their analytical skills to many verticals, such as economics, healthcare, and many more industries.
  • Websites such as Kaggle provide a platform where you can practice your Data Science skills on real-world datasets provided by companies from a range of domains. Some competitions also pay prize money to the top entries, which is another bonus.
  • Data Scientists can also be thought of as consultants who work with data.
  • They need to work with people to understand business problems and represent those problems in a quantitative way, on top of which they can run different algorithms.
  • They need to be able to work across a range of verticals, visualize data, build models, etc.

 

Here is the YouTube video (6:46 minutes):

Why should i enroll for Data Science and Big Data Analytics course?

 

Why should you join the Graduate Data Science Initiative?

 

The Graduate Data Science Initiative (GDSI) aims to create an environment where students, both undergraduates and graduates, can learn about data science at an introductory, intermediate, or advanced level. The initiative’s main objective is to equip students with the right skills to enter the growing market of data science.

As a member of the GDSI you will be able to learn about the different tools, methods, and technologies being used within the data science community. You will be exposed to real-world use cases of data science in industry and academia. You will learn:

  • how machine learning is used to predict stock market trends;
  • how natural language processing is used to determine the sentiment on a particular topic;
  • how programming languages like R, Clojure, and Python are leading the way in performing data analysis;
  • how NoSQL databases are providing a new paradigm for storing unstructured data retrieved from social networks, sensors, log files, etc.;
  • how technologies like Hadoop are revolutionising data processing and parallel computing;
  • how recommendation systems work inside Amazon, Netflix, and Spotify;
  • how DNA sequencing is being facilitated by machine learning algorithms;
  • how data mining is helping us better understand human-environment interactions and socio-economic dynamics;
  • how predictive analytics is helping today’s business leaders make key decisions and transform their businesses into huge success stories;
  • and much more.

They say “Data is the new oil”. The amount of data in the world is increasing at incredible rates. Over 90% of the world’s data was generated in the last two years alone. This explosion of data is expected to continue over the next five to ten years as more and more data sources become available and the digital world is infused deeper into our daily lives. With the increase in data comes an increase in the difficulty of managing huge datasets, which are usually unstructured in nature. This includes data from social networks, log files, sensors, etc. This unstructured data poses a new challenge in terms of storage for current database technologies that adhere to the relational model. Furthermore, there are new challenges in processing huge amounts of data in a timely fashion. A single computer has a limited capacity as to the amount of data it can process at any one time. Therefore, computing power must be distributed across a number of computing devices, making the processing of data more efficient. Moreover, once data has been stored and processed, meaningful, actionable insights should be generated so that the end user benefits in one way or another.
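
The idea of spreading work across several machines and then combining the results can be sketched in a few lines of Python. This is only an illustration of the map/reduce pattern running on a single machine, not Hadoop's actual API; the log lines and the function names `map_chunk` and `reduce_counts` are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    # "Map" step: each worker counts the words in its own chunk of the data.
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    # "Reduce" step: merge two partial results into one.
    return left + right

# Hypothetical log lines, split into chunks as if they lived on different machines.
chunks = [
    ["error disk full", "login ok"],
    ["login ok", "error timeout"],
    ["login failed"],
]

# On a real cluster each chunk would be processed in parallel on a separate node;
# here the chunks are simply processed one after another.
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

Because each chunk is counted independently, no single worker ever needs to see the whole dataset; only the much smaller partial counts have to be brought together at the end.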

The whole idea of Data Science is ultimately to “turn data into insights into action”. The term Data Science is often used interchangeably with the term Big Data. The whole spectrum of Data Science processes can be highly complicated to implement and put together. However, the benefits far outweigh the cost and time spent. As such, companies from all sectors and industries have started a hiring spree to recruit the best in the field to provide Data Science solutions to their clients. The most interesting aspect of Data Science is that it is applicable to almost every domain or industry where data is a factor. Financial services use Data Science to detect and prevent fraud, saving tens of millions of dollars. The games industry uses Data Science to estimate the value of customers coming through different marketing channels, improve game levels by analysing gamers’ behaviour, and encourage users to upgrade to paid versions. In online movie distribution networks such as Netflix, data about viewers, such as movies watched, likes, and user preferences, is gathered to generate meaningful recommendations. Cyber security uses Data Science to analyse network logs in order to detect and predict network intrusions. Data Science has also helped hospitals offer better treatments to their patients, and there are many more examples.

Companies across a range of industries are investing heavily in Data Science, and there is an increasing demand for data scientists. The following graph illustrates this demand over a period of five years, with notable growth between 2011 and 2013. Between the summers of 2011 and 2012 alone, demand for data scientists increased by 15,000%.

Screen Shot 2013-10-07 at 18.40.31

Fig 1. (source: indeed.com)

However, despite the huge demand, there is a severe skills shortage in this area. It is estimated that the United States alone could face a shortage of 140,000 to 190,000 people by 2018. Part of the reason for this shortage is the lack of university courses that equip students with the right skills to enter this market. The good news is that you don’t necessarily need a university degree to become a data scientist. In addition, you don’t necessarily need any programming experience to get started. In fact, data scientists generally come from a number of different disciplines, such as biostatistics, econometrics, engineering, computer science, physics, applied mathematics, statistics, and other related fields. Here is a rough illustrative guide on how to become a data scientist:

RoadToDataScientist

Fig 2. (source: Swami Chandrasekaran)

As you can see from the illustrative guide above, Data Science spans a spectrum of dimensions, from data warehousing and data integration to statistics, machine learning and data mining, visualization, and more. One great way to kick-start your education and career in Data Science is to join the Graduate Data Science Initiative (GDSI). The GDSI’s main aim is to encourage and help students learn data science from experts in the field who are already involved in Big Data projects, some of whom even lead Data Science companies. You will have the opportunity to learn by listening to technical presentations and industry use cases, and by participating in workshops, hackathons, and many more activities. The best part is that you’ll be doing all of this in a social environment where you’ll have the chance to follow up with questions and have fun in the meantime.

We at GDSI are really passionate about Data Science. We hope you are too. We can’t wait for you to join us and be part of our community.

Join us at the following link: http://www.meetup.com/Graduate-Data-Science-Initiative/