INTRODUCTION TO DATA SCIENCE
Data Science
What is data science?
Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.
Data science combines math and statistics, specialized programming, advanced analytics, artificial intelligence (AI), and machine learning with specific subject matter expertise to uncover actionable insights hidden in an organization’s data. These insights can be used to guide decision making and strategic planning.
The accelerating volume of data sources, and consequently of data, has made data science one of the fastest-growing fields across every industry. As a result, it is no surprise that the role of the data scientist was dubbed the “sexiest job of the 21st century” by Harvard Business Review. Organizations are increasingly reliant on data scientists to interpret data and provide actionable recommendations to improve business outcomes.
The data science lifecycle involves various roles, tools, and processes, which enable analysts to glean actionable insights. Typically, a data science project undergoes the following stages (a minimal end-to-end sketch in Python follows the list):
- Data ingestion: The lifecycle begins with data collection: gathering both raw structured and unstructured data from all relevant sources using a variety of methods. These methods can include manual entry, web scraping, and real-time streaming data from systems and devices. Data sources can include structured data, such as customer data, along with unstructured data like log files, video, audio, pictures, the Internet of Things (IoT), social media, and more.
- Data storage and data processing: Since data can have different formats and structures, companies need to consider different storage systems based on the type of data that needs to be captured. Data management teams help to set standards around data storage and structure, which facilitate workflows around analytics, machine learning, and deep learning models. This stage includes cleaning, deduplicating, transforming, and combining the data using ETL (extract, transform, load) jobs or other data integration technologies. This data preparation is essential for promoting data quality before loading into a data warehouse, data lake, or other repository.
- Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases, patterns, ranges, and distributions of values within the data. This exploration drives hypothesis generation for A/B testing. It also allows analysts to determine the data’s relevance for use within modeling efforts for predictive analytics, machine learning, and/or deep learning. If a model proves accurate, organizations can rely on these insights for business decision making and apply them at greater scale.
- Communicate: Finally, insights are presented as reports and other data visualizations that make the insights—and their impact on business—easier for business analysts and other decision-makers to understand. A data science programming language such as R or Python includes components for generating visualizations; alternatively, data scientists can use dedicated visualization tools.
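As a minimal end-to-end sketch of these stages in Python with pandas: the file bookings.csv and the booking_date and destination columns are hypothetical names used only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Ingestion: load raw booking records (file name and columns are hypothetical)
df = pd.read_csv("bookings.csv", parse_dates=["booking_date"])

# Storage and processing: ETL-style cleaning before loading to a repository
df = df.drop_duplicates()
df = df.dropna(subset=["booking_date", "destination"])
df["destination"] = df["destination"].str.strip().str.title()

# Analysis: explore how bookings are distributed over months
monthly = df.groupby(df["booking_date"].dt.to_period("M")).size()
print(monthly.describe())

# Communication: a simple visualization of the monthly trend
monthly.plot(kind="bar", title="Bookings per month")
plt.tight_layout()
plt.show()
```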
Concepts of Data Science
Data science involves the use of scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. The goal of data science is to use data to answer questions and solve problems, and to provide actionable insights that can inform decision-making. Data science is a multidisciplinary field that combines elements of computer science, statistics, and domain expertise to analyze and interpret data.
Some key concepts in data science include:
Data cleansing and preparation: This involves cleaning and organizing data to prepare it for analysis. This can include tasks such as identifying and correcting errors, filling in missing values, and removing duplicates.
Data exploration and visualization: This involves using visual tools to explore and understand patterns and trends in the data. Data visualization can help to identify relationships and patterns that may not be immediately apparent in the raw data.
Data modelling and machine learning: This involves building predictive models using algorithms and techniques such as linear regression, logistic regression, and decision trees. These models can be used to make predictions or classify data based on patterns and trends identified in the data (a small scikit-learn sketch follows this list).
Data communication and presentation: This involves presenting the results of data analysis in a clear and effective way, so that the insights can be understood and used by others. This can include creating reports, dashboards, and visualizations to communicate the findings.
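As a sketch of the modelling step, the snippet below fits a logistic regression classifier with scikit-learn; the data is synthetic and generated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: two numeric features, one binary label
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a test set to estimate how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The same pattern applies with a decision tree or linear regression: fit on the training split, then evaluate on held-out data.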
What is data science used for?
Data science is used to study data in four main ways:
1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is happening in the data environment. It is characterized by data visualizations such as pie charts, bar charts, line graphs, tables, or generated narratives. For example, a flight booking service may record data like the number of tickets booked each day. Descriptive analysis will reveal booking spikes, booking slumps, and high-performing months for this service.
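A small pandas sketch of descriptive analysis for the flight-booking example; the daily ticket counts are randomly generated for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily ticket counts for a flight booking service
rng = np.random.default_rng(0)
bookings = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=365, freq="D"),
    "tickets": rng.poisson(100, size=365),
})

# Descriptive analysis: what happened? Overall spread and monthly totals
print(bookings["tickets"].describe())
monthly = bookings.groupby(bookings["date"].dt.month)["tickets"].sum()
print(monthly.sort_values(ascending=False).head(3))  # high-performing months
```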
2. Diagnostic analysis
Diagnostic analysis is a deep-dive or detailed data examination to understand why something happened. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations. Multiple data operations and transformations may be performed on a given data set to discover unique patterns in each of these techniques. For example, the flight service might drill down on a particularly high-performing month to better understand the booking spike. This may lead to the discovery that many customers visit a particular city to attend a monthly sporting event.
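Continuing the example, a diagnostic drill-down might filter to the spike month and break bookings down by destination. The sketch below builds an invented bookings table; the column names and cities are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical booking records: date, destination, tickets sold
rng = np.random.default_rng(1)
bookings = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=365, freq="D").repeat(3),
    "destination": ["Austin", "Boston", "Chicago"] * 365,
    "tickets": rng.poisson(30, size=365 * 3),
})

# Drill down into a high-performing month to ask why bookings spiked
july = bookings[bookings["date"].dt.month == 7]
by_city = july.groupby("destination")["tickets"].sum().sort_values(ascending=False)
print(by_city)  # one city dominating may point to a recurring local event
```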
3. Predictive analysis
Predictive analysis uses historical data to make informed forecasts about data patterns that may occur in the future. It is characterized by techniques such as machine learning, forecasting, pattern matching, and predictive modeling. In each of these techniques, computers are trained to identify patterns in historical data and project them forward. For example, the flight service team might use data science to predict flight booking patterns for the coming year at the start of each year. The computer program or algorithm may look at past data and predict booking spikes for certain destinations in May. Having anticipated their customers’ future travel requirements, the company could start targeted advertising for those cities from February.
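A minimal forecasting sketch under these assumptions: fit a linear trend to two years of invented monthly booking totals with numpy, then project the next three months.

```python
import numpy as np

# Hypothetical monthly booking totals for two years (24 values)
months = np.arange(24)
bookings = 1000 + 20 * months + np.random.default_rng(2).normal(0, 50, 24)

# Fit a linear trend (degree-1 polynomial) to the historical totals
slope, intercept = np.polyfit(months, bookings, 1)

# Forecast the next three months from the fitted trend
future = np.arange(24, 27)
forecast = slope * future + intercept
print(forecast.round())
```

Real forecasting would account for seasonality and uncertainty, but the structure is the same: learn a pattern from history, then extrapolate.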
4. Prescriptive analysis
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely to happen but also suggests an optimum response to that outcome. It can analyze the potential implications of different choices and recommend the best course of action. It uses graph analysis, simulation, complex event processing, neural networks, and recommendation engines from machine learning.
Back to the flight booking example, prescriptive analysis could look at historical marketing campaigns to maximize the advantage of the upcoming booking spike. A data scientist could project booking outcomes for different levels of marketing spend on various marketing channels. These data forecasts would give the flight booking company greater confidence in their marketing decisions.
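A toy prescriptive sketch: assuming a hypothetical diminishing-returns model of bookings gained per marketing channel, evaluate coarse spend allocations and recommend the best one. The response curves and budget are invented for illustration.

```python
import numpy as np
from itertools import product

BUDGET = 100  # total marketing budget in thousands; hypothetical

# Hypothetical response curves: projected extra bookings per channel,
# with diminishing returns modelled as the square root of spend
def projected_bookings(social, search, email):
    return 40 * np.sqrt(social) + 30 * np.sqrt(search) + 15 * np.sqrt(email)

# Exhaustively evaluate coarse allocations and recommend the best action
best = max(
    (
        (s, q, BUDGET - s - q)
        for s, q in product(range(0, BUDGET + 1, 5), repeat=2)
        if s + q <= BUDGET
    ),
    key=lambda alloc: projected_bookings(*alloc),
)
print("Recommended spend (social, search, email):", best)
```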
What is big data, and what are its traits?
There are several traits that are often used to describe big data:
Volume: Big data is characterized by its large size, with datasets often being measured in petabytes or even exabytes.
Variety: Big data can come from a wide range of sources, including social media, sensors, and transactional data, and can be structured, unstructured, or semi-structured.
Velocity: Big data often needs to be processed and analyzed quickly, in real time or near real time.
Veracity: The quality and accuracy of big data can vary, and data scientists must often deal with issues such as missing or incorrect data.
Value: The goal of working with big data is to extract value and insights that can be used to inform decision-making or solve problems.
Big data can be challenging to work with due to its size and complexity, but the insights it can provide can be invaluable for organizations.
What is web scraping in data science?
Web scraping is the process of automatically extracting data from websites using software or scripts. In data science, web scraping can be used to gather large amounts of data from the internet for analysis and modelling. This data can then be used to train machine learning models, discover insights, and make predictions. Web scraping can be used to gather data from a wide variety of sources, such as online news articles, social media posts, and product reviews.
There are several ways to perform web scraping, including:
Using a web scraping tool or software: These tools are specifically designed for web scraping and can make the process much easier and more efficient. Some popular web scraping tools include Scrapy, Beautiful Soup, and Selenium.
Writing your own code: You can use programming languages such as Python or R to write your own code for web scraping. This method requires a bit more technical knowledge but gives you more control over the scraping process (see the sketch after this list).
Using an API: Some websites provide APIs (application programming interfaces) that allow you to access their data in a structured format. This is the preferred method, as it is less likely to violate the website’s terms of service.
Using a browser extension: Browser extensions such as Web Scraper, Scraper, and Data Miner can be used to scrape data from a website by selecting the data directly on the webpage.
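For instance, writing your own code might look like the following minimal sketch using requests and Beautiful Soup; the URL is a placeholder, and the assumption that headlines live in <h2> tags is hypothetical.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/news"  # placeholder page of news articles

# Fetch the page and raise an error for failed HTTP responses
response = requests.get(URL, timeout=10)
response.raise_for_status()

# Parse the HTML and extract headline text (assumes <h2> holds the titles)
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(headlines)
```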
In all cases, it is important to be aware of and comply with the website’s terms of service and privacy policy before scraping any data.
Analysis vs reporting in data science
Analysis and reporting are two different stages in the data science process, each with its own distinct purpose and focus.
Analysis: Data analysis is the process of examining, cleaning, transforming, and modelling data to extract insights and knowledge. This stage typically includes tasks such as data exploration, visualization, and statistical modelling. The goal of data analysis is to discover patterns, trends, and relationships in the data, and to use this information to make predictions or inform business decisions.
Reporting: Data reporting is the process of communicating the results of the analysis to stakeholders and decision-makers. This stage typically includes tasks such as summarizing the findings, creating visualizations, and preparing presentations or reports. The goal of data reporting is to make the insights and findings from the analysis accessible and understandable to non-technical audiences, and to use this information to drive action.
In summary, analysis is focused on finding insights from data, and reporting is focused on communicating those insights to others. While both are important steps in the data science process, the skills, tools, and techniques used in each stage are quite different.
Differences between analysis and reporting in terms of purpose, approach, and outcome
Purpose: The main purpose of data analysis is to extract insights and knowledge from the data: to understand it, identify patterns, and make predictions. The main purpose of reporting is to communicate those insights to a specific audience in a clear and concise manner.
Approach: Data analysis typically involves a more in-depth, iterative, and exploratory process, including tasks such as data cleaning, feature engineering, model selection, and hypothesis testing. Reporting, in contrast, focuses on presenting the findings of the analysis in a clear, easy-to-understand format, such as through charts, tables, and narratives.
Outcome: The outcome of data analysis is typically a set of insights, findings, and recommendations that inform decision-making. The outcome of reporting is typically a document, presentation, or other deliverable that communicates those insights to a specific audience, such as management or stakeholders.