Kiriti Amaravadi
7 min read · Oct 13, 2023

Introduction to Data Science

DATA SCIENCE & MACHINE LEARNING SERIES

There are three ‘W’s that we need to focus on before we attempt to understand Data Science.

- WHAT | WHY | WHERE -

Let us start with the very basic question.

WHAT is Data Science?

Data Science! Let us Google* it first. You don’t have to take the trouble; I did that, and Google says, “Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data.”

Now, let us approach it with common sense. I hope everyone has some, lol. Data Science is nothing but Data and Science. What is Science? Observing, analyzing, and testing anything. Doing this on any data is called Data Science. Data Science is like purchasing fruit. You get a lot of it, sort it, and clean it before processing. You cut it into structured pieces, process it, and later give feedback.

Another important question.

WHY do we need Data Science?

First, it uses something like a divide-and-conquer strategy to understand the data better. It brings out the background detail of the data so that we can clearly see what is and what is not.

- Data Science is not just about numbers; rather, it is about foreseeing the future. Considering the statistics, i.e., the past, we make use of the present to get to know the future well.

- We get to draw several insights about whatever we are working on, using various data science techniques and methodologies. More clearly speaking, we can understand what is going on behind the scenes of any data.

- Any field in the world can make use of Data Science to improve itself. Patterns can be recognized among a large number of features, and people’s opinions can be understood and in turn used in other domains where they are helpful.

In this way, there can be many more reasons why we need data science. But we can stop here for now.

WHERE is it mostly used?

It is omnipresent, just like air. Wherever air is needed, data science is needed too, lol. Coming to the point, let us divide the question into sub-parts so that it becomes easier to understand.

It can be used in many places, but the important question is which areas need it the most.

Health Care — Patients can be diagnosed and treated more accurately with the help of data science techniques. Patients’ data can be documented, and since machines avoid some of the slips that humans occasionally make, the results can be accurate enough, provided the underlying data is good. Various other factors can also be examined individually while using such techniques.

In the same way, it can be used in e-Commerce, Finance, Manufacturing, and many more sectors.

- COMPONENTS OF DATA SCIENCE -

Data Science is such a big term that it covers several important processes. Each of them contributes to the result, and more importantly, none of them can be skipped; otherwise the whole process needs to be restarted.

Starting with data collection,

Data Collection — Anything can be done, or at least attempted, on the data, but before that we need to have the data with us. Right?

So, that data should be collected.

From where?

There are plenty of sources to collect data from. One can collect data manually from people, websites, archived databases, medical databases, and what not. Anything that you think might help the process can be collected, either manually or automatically, using various techniques.

Data Collection is such an important task that there are websites widely used for hosting and sharing data from various domains. Some of them are Kaggle, GitHub, UCI Machine Learning Repository, Reddit Datasets, etc. These are some of the most used websites, aka repositories, for storing data, and much of it is publicly available so that anyone can access it.
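Once a dataset has been downloaded from one of these repositories, it usually arrives as a file such as a CSV. Here is a minimal sketch of reading one with Python's built-in `csv` module. The data in `RAW_CSV` and the helper name `load_rows` are made up for illustration; a real file would be opened from disk instead.

```python
import csv
import io

# Hypothetical CSV content standing in for a file downloaded from a repository.
# Note that the cherry row has a missing weight, as real data often does.
RAW_CSV = """fruit,weight_g,price
apple,182,0.50
banana,118,0.25
cherry,,0.10
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(RAW_CSV)
print(len(rows), "rows collected")
print(rows[0])
```

Every value comes back as a string, and the missing weight comes back as an empty string, which is exactly the kind of thing the next step has to deal with.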

Data Cleaning — Data Cleaning is a process in which we make the data look neat. This might sound silly, but understand that the data we collect is not always structured. Any type of data, be it structured or unstructured, needs to be observed and analyzed to give proper insights. Even if the data is structured, there are plenty of things to be done to make it clean. Clean here means handling missing values, removing null values, resolving inconsistencies, removing irrelevant elements, normalization, standardization, etc.

Data that has not yet been cleaned can also be called raw data. As I already said, having data is like getting some fruit. You get it or collect it first, clean it, and then go through the further processes.
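To make a few of these cleaning steps concrete, here is a small sketch using only Python's standard library. It fills missing values with the mean, then applies min-max normalization and standardization. The numbers and function names are made up for illustration, and filling with the mean is just one common choice, not the only way to handle missing values.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value.
raw_weights = [182.0, None, 118.0, 150.0, None, 130.0]

def fill_missing(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1].
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to zero mean and scale to unit standard deviation.
    Assumes the values are not all identical."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

clean = fill_missing(raw_weights)
print(clean)
print(normalize(clean))
print(standardize(clean))
```

Normalization squeezes everything between 0 and 1, while standardization centers the data around 0. Which one to use depends on what the later analysis expects.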

Now we move on to the next process, which is Data Analysis.

Data Analysis — This is the most important of all the components of the Data Science process. Several other domains come together here to make the analysis even more interesting.

Data needs to be analyzed, okay. Here, Statistics can be of great help, since it gives us well-established tools that we can rely on. Another important domain is Machine Learning. Data that is yet to be analyzed can be a best friend to Machine Learning. They always go hand in hand and try not to make mistakes. The extent to which a model gets things right, i.e., the fraction of its predictions that are correct, is termed ‘Accuracy’.
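Accuracy is usually computed as the number of correct predictions divided by the total number of predictions. A minimal sketch, with labels made up purely for illustration:

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, made up for illustration.
actual = ["ripe", "ripe", "unripe", "ripe", "unripe"]
predicted = ["ripe", "unripe", "unripe", "ripe", "unripe"]

print(accuracy(predicted, actual))  # 4 of 5 correct -> 0.8
```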

Now again, do you remember our approach when we were trying to learn what Data Science was? Yes, we divided the word itself into parts and then tried to understand each of them more clearly. That is what we are going to do now as well.

Google* says ‘Machine learning is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines “discover” their “own” algorithms, without needing to be explicitly told what to do by any human-developed algorithms.’

Now, let us do some reverse engineering. Machine Learning = Machine + Learning.

What is a Machine? Anything that takes an input, processes it, and gives an output with a certain accuracy is called a machine. More clearly speaking, it reduces a lot of human effort during the process.

In the olden days, with any physical machine such as a motor, human intervention used to be one of the most important factors in determining the output. But now that technology has improved so much, algorithms are the new machines that we will be talking about, rather than motors, regulators, and stuff.

An algorithm is a structured set of rules that solves a problem. So, if the problem belongs to the domain of machine learning, then it is a machine learning algorithm.

There are several types of machine learning algorithms, which I’ll be discussing in the next blog.

So, a Machine Learning algorithm takes data as input, processes it, and accomplishes the required task. Once we have the neatly cleaned data with us, we make use of machine learning algorithms by feeding our data to them and getting back the results that we desired.
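As a taste of what "feeding data to an algorithm" looks like, here is a tiny sketch of one of the simplest machine learning algorithms, a 1-nearest-neighbour classifier. It labels a new item with the label of the most similar example it has already seen. This algorithm is my choice for illustration, and the fruit measurements are made up. It uses `math.dist`, which needs Python 3.8 or newer.

```python
import math

# Hypothetical training data: (weight in grams, diameter in cm) -> fruit label.
training_data = [
    ((182.0, 7.5), "apple"),
    ((170.0, 7.0), "apple"),
    ((118.0, 3.5), "banana"),
    ((125.0, 3.8), "banana"),
]

def predict(features, examples):
    """Return the label of the training example closest to `features`."""
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

print(predict((175.0, 7.2), training_data))  # closest to an apple example
print(predict((120.0, 3.6), training_data))  # closest to a banana example
```

The data goes in, the algorithm processes it, and a result comes out. That is the whole input, process, output idea in a few lines.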

Once we get proper results, we then move on to Data Visualization.

Data Visualization — This is the last of the generic components of Data Science.

Visualizing anything is far more important than just understanding it. Suppose you have a completely tangled 40-foot wire in a bag. If you want to make it straight and clear, you cannot do that just by understanding how to do it, right? You need to take it out, put it on a table, and then slowly go through each part of it so that you can straighten it.

In the whole process above, the important thing was putting it on the table. That let us clearly see what had happened and how the wire was tangled. In the same way, data, after being analyzed, needs to go through a visualization process that helps viewers understand various things from it. Patterns can be drawn, segregations can be made, and much more.
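In practice, charts are usually drawn with a plotting library such as Matplotlib. To keep things dependency-free, here is a rough sketch that "puts the data on the table" as a text bar chart using only the standard library. The data and the function name `text_bar_chart` are made up for illustration.

```python
from collections import Counter

# Hypothetical cleaned observations.
fruits = ["apple", "banana", "apple", "cherry", "apple", "banana"]

def text_bar_chart(items):
    """Return one line per category, with a bar of '#' showing its count."""
    counts = Counter(items)
    width = max(len(name) for name in counts)
    lines = []
    for name, count in counts.most_common():
        lines.append(f"{name.ljust(width)} | {'#' * count} {count}")
    return lines

for line in text_bar_chart(fruits):
    print(line)
```

Even this crude chart makes it obvious at a glance which fruit dominates, which is much harder to see in the raw list.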

Life Cycle of Data Science

The life cycle is the whole process of Data Science, starting with the birth of a new problem and ending with implementing its solution in the real world.

It can be seen as a summary of all the processes mentioned above.

Formulation of Problem — A solution can only be found for a problem. So, for a problem to be solved, it needs to be formulated first. More clearly speaking, we need to understand what the problem is first, so that we can later work out everything that needs to be done to find a solution to it.

Data Collection — As mentioned above, the data needs to be collected for sure. But here the data needs to be specific to that problem. Data that is irrelevant to the problem is of no use.

Data preprocessing — Data Cleaning and all the other steps that the data goes through to make it ready for analysis are together called preprocessing. The word “pre” means before.

Analysis and visualization — As we discussed in the components of Data Science, the preprocessed data is analyzed to understand the trends, and it is then visualized to get a proper picture of what is happening in the background.

Model or Algorithm building — This part links Data Science with Machine Learning: a Machine Learning algorithm is developed and fed with the well-analyzed data to give the desired results.

Evaluation — The results from the algorithm are then evaluated, so that we know whether they are good enough to be used when implementing the solution to the above-mentioned problem in the real world.

Implementation of solution into the real world — The solution obtained from the machine learning algorithm is then put into practice in the real world.
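The model building and evaluation steps can be sketched together in a few lines. The idea is to hold back part of the data, build the model on the rest, and then check how it does on the part it has never seen. Everything here is an assumption for illustration: the dataset is made up, and the "model" is just a weight threshold halfway between the two class averages.

```python
import random

# Hypothetical dataset: (weight in grams, label). Values are made up.
data = [(180, "apple"), (175, "apple"), (190, "apple"), (168, "apple"),
        (120, "banana"), (115, "banana"), (130, "banana"), (125, "banana")]

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def fit_threshold(train):
    """Model building: the midpoint between the two class mean weights."""
    apples = [w for w, label in train if label == "apple"]
    bananas = [w for w, label in train if label == "banana"]
    return (sum(apples) / len(apples) + sum(bananas) / len(bananas)) / 2

def predict(weight, threshold):
    """Heavier than the threshold means apple, otherwise banana."""
    return "apple" if weight > threshold else "banana"

train, test = split(data)
threshold = fit_threshold(train)

# Evaluation: accuracy on the held-out rows only.
correct = sum(predict(w, threshold) == label for w, label in test)
print(f"threshold={threshold:.1f}, test accuracy={correct / len(test):.2f}")
```

Evaluating on held-out data matters because a model can look perfect on the data it was built from and still fail on new data in the real world.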

- NEEDS AND WANTS -

Every process mentioned above can be implemented using the “Python” programming language. I’ve posted a blog that explains how to install Python and how to use it.

If you have Python and all the necessary libraries installed, then you are good to go.
