About Jack Zheng
Faculty of IT at Kennesaw.edu
Data
http www
Data Engineer
Data science
process
structure
The World of Data
A general introduction to data and information processing
Jack G. Zheng
Fall 2022
http://zheng.kennesaw.edu/teaching/it3703
http://zheng.kennesaw.edu/teaching/it7123
IT 3703 Intro to Analytics and Technology
IT 7123 Business Intelligence

Overview
• Fundamental data-related concepts and terms
  – What data is, and which data concerns us in business processing
  – Data, information, knowledge
• The data and data technology in today’s world
  – Data sources
  – Value and uses
  – Data characteristics
  – Data challenges
  – Data technology and capabilities
• Data knowledge areas
  – Data management, data engineering, data analytics, and data science
  – DAMA data knowledge areas
• Practical job roles and career paths
  – Focus on data management, data engineering, data analytics, business intelligence, data science, and information/knowledge management
  – Data-related jobs and careers, corresponding to knowledge areas
This lecture focuses on fundamental, high-level concepts and a philosophical view of data (instead of technical details), and on a general view of the industry and technology. It sets the context for our main topic of data analytics and technology.

Basic Data Terms and Concepts
• DIKW
• Data type
• Data format
• Data model
• Data structure
• Data processing

DIKW
• The DIKW hierarchy depicts relationships between data, information, knowledge (and wisdom).
  – Data: raw value elements or facts
  – Information: the result of collecting and organizing data so that it provides context and meaning
  – Knowledge: an understanding of information that provides insight into it, making it useful and actionable
  – Wisdom: the understanding of interactions and an integrated view, and the understanding of implications and indirect results beyond a target domain
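The first three DIKW levels can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration: the temperature readings, the room name, the 25 °C limit, and the helper `summarize`. Wisdom is left out because it goes beyond what a short script can show.

```python
from statistics import mean

# Data: raw value elements with no context (hypothetical sensor readings).
raw_values = [21.5, 22.0, 27.5, 28.0]

# Information: the same values organized with context (where, when, what unit).
readings = [
    {"room": "server-room", "hour": hour, "temp_c": value}
    for hour, value in zip([9, 10, 11, 12], raw_values)
]


def summarize(readings, limit_c=25.0):
    """Knowledge: an actionable understanding derived from the information."""
    average = mean(r["temp_c"] for r in readings)
    hours_over = [r["hour"] for r in readings if r["temp_c"] > limit_c]
    return {
        "average_c": average,
        "hours_over_limit": hours_over,
        "action_needed": bool(hours_over),
    }


knowledge = summarize(readings)
print(knowledge)
```

The raw list alone says little. Adding room, hour, and unit turns it into information, and the summary is what someone could act on.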
Extended readings on DIKW
• https://en.wikipedia.org/wiki/DIKW_pyramid
• https://www.i-scoop.eu/big-data-action-value-context/dikw-model/
• https://towardsdatascience.com/rootstrap-dikw-model-32cef9ae6dfb
• https://www.youtube.com/watch?v=K4i2FK52698
• https://www.youtube.com/watch?v=jSWC23mHXJM
Image from https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/

Core readings on DIKW
• https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
• DIKW Pyramid https://www.youtube.com/watch?v=u9DoQ9gY4z4
• DIKW Pyramid with Sample Data https://www.youtube.com/watch?v=MFUyQsJyKgg

Data Related Terms
• Raw data
  – Directly generated from an event, preserved in its original format
• Processed data
  – Transformed or cleaned using any technique for the purpose of storage, analysis, or presentation
• Facts
  – Data that represents reality or things that actually happened
• Simulated/forecast/estimated data
  – Data generated based on mathematical or statistical models
• Measures/metrics
  – A piece of data that is used for measuring an activity or a phenomenon
  – Calculated results from underlying factors/data; a kind of processed data

Data Type
• Data type is an attribute of data that tells a computer system how to interpret and process its value, and how a human can use its value.
• Data types may include
  – Simple (primitive) numeric types: int, decimal, etc.
  – Qualitative (textual) and composite types: string, text, etc.
  – Extended (app-specific) data types: date/time, Boolean, money, geo, etc.
  – Abstract data types: object, array, etc.
  – Digital multimedia forms such as sound, image, and video
• The exact definition and application of data types depend on the system.
Extended reading: https://en.wikipedia.org/wiki/Data_type

Data Model
• A data model is about the representation of data, with a set of business concepts and rules.
  – A data model conceptualizes data elements and standardizes how the data elements relate to one another
  – Answers questions like: How is data grouped and associated? What is this data about? How are they related?
• Data models depict and enable an organization to understand its data assets through core building blocks such as entities, relationships, and attributes. These represent the core concepts of the business such as customer, product, employee, and more.
• Typical examples: flat model, relational model, network model, dimensional model
• Extended reading: https://en.wikipedia.org/wiki/Data_model
Data models will be covered in more detail in IT 3703 module 3 and IT 7123 module 3.
Example: relational model

Data Format or File Format
• Data format defines how data and information are structured and recorded in a computer file, particularly a flat file.
• Flat files are machine-readable, meaning data in flat files is formatted so that it can be automatically read and processed by a computer program. Machine-readable data must have some structure, even if it is implicit and may be defined outside the file.
  – http://opendatahandbook.org/glossary/en/terms/machine-readable/
• Three major types of data formats are used in flat files in today’s analytics
  – Comma Separated Values (CSV) https://en.wikipedia.org/wiki/Comma-separated_values
  – JavaScript Object Notation (JSON)
  – eXtensible Markup Language (XML)
• They are constantly used in data download/export, transfer, and storage.
  – Example: https://schoolgrades.georgia.gov/dataset
Data formats will be covered in more detail in IT 3703 module 5 and IT 7123 module 5.

Data Structure
• In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification.
  More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data.
  – https://en.wikipedia.org/wiki/Data_structure
• Data structure is a more technical and lower-level term. It emphasizes a structure or a model that targets the computer system (vs. the human) for optimal processing.
• For example, an array is a data structure, and so is a dictionary. The classes that form your data model are data structures too: any representation of a specific data object has to be in the form of a data structure.
  – https://stackoverflow.com/questions/24228038/difference-between-datastructure-and-datamodel-with-example

Structured, Semi-structured, and Unstructured
• Structured data
  – Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today it is the most commonly processed kind of data and the simplest to manage. Example: relational data.
• Semi-structured data
  – Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save space. Example: XML data.
• Unstructured data
  – Unstructured data is data that is not organized in a predefined manner or does not have a predefined data model, so it is not a good fit for a mainstream relational database.
    Alternative platforms therefore exist for storing and managing unstructured data. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
• Key readings
  – https://www.geeksforgeeks.org/difference-between-structured-semi-structured-and-unstructured-data/
  – https://www.g2.com/articles/structured-vs-unstructured-data

Data Processing
• Data processing is, generally, “the collection and manipulation of items of data to produce meaningful information” (particularly using computing technologies and systems).
  – https://en.wikipedia.org/wiki/Data_processing
• Data processing can be divided into two broad categories
  – Transactional processing
  – Analytical processing

Types of Data/Information Processing
Transactional Processing
• Focuses on data item processing (storage, insertion, modification, deletion), transmission, and even some non-analytical queries
• Examples:
  – Change product price.
  – Increase customer credit limit.
  – Import data from another source.
Analytical Processing
• Focuses on queries, calculation, reporting, analysis, and decision support
• Examples:
  – What are the top 10 most profitable products?
  – Is there a significant increase in operational cost?
For a more detailed comparison of OLTP and OLAP:
https://techdifferences.com/difference-between-oltp-and-olap.html
https://www.ibm.com/cloud/blog/olap-vs-oltp
Notice the difference between these terms as general concepts vs. as particular technologies/systems.
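The contrast between the two kinds of processing can be sketched with Python's built-in `sqlite3` module. The `product` and `sale` tables and their values are hypothetical and only for illustration. A real analytical workload would usually run on a separate system such as a data warehouse rather than on the transactional database itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL, cost REAL);
    CREATE TABLE sale (product_id INTEGER, quantity INTEGER);
    INSERT INTO product VALUES
        (1, 'Pen', 2.0, 0.5), (2, 'Notebook', 5.0, 3.0), (3, 'Bag', 30.0, 20.0);
    INSERT INTO sale VALUES (1, 100), (2, 40), (3, 10), (1, 50);
""")

# Transactional processing: modify a single data item ("change product price").
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE product SET price = ? WHERE id = ?", (2.5, 1))

# Analytical processing: aggregate over many rows to support a decision
# ("what are the most profitable products?").
top_products = conn.execute("""
    SELECT p.name, SUM(s.quantity * (p.price - p.cost)) AS profit
    FROM product AS p
    JOIN sale AS s ON s.product_id = p.id
    GROUP BY p.id, p.name
    ORDER BY profit DESC
    LIMIT 2
""").fetchall()
print(top_products)
```

The update touches one row and must be reliable and fast. The query reads across both tables and returns a summary rather than individual records.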

DIKW and Data Processing
• The DIKW model can be loosely related to the levels of transactional processing and analytical processing
(Figure: DIKW levels annotated with Transactional Processing and Analytical Processing)
For more extensive reading: http://en.wikipedia.org/wiki/DIKW_Pyramid
Different opinion: https://hbr.org/2010/02/data-is-to-info-as-info-is-not

Data in Today’s World
• Data is the new oil (of the digital economy)
  – https://towardsdatascience.com/is-data-really-the-new-oil-in-the-21st-century-17d014811b88
  – Considered one of the important resources, just like financial and human resources
• Big data
  – Four V’s: Volume (Scale), Variety, Velocity (Speed), Veracity (Uncertainty)
• Data technology and capabilities advancement
  – Computing power and storage capacity increases
  – New processing techniques such as parallel and distributed processing, AI, etc.
• Evolving user needs
  – Analytical needs
  – Communication needs

Big Data
• Big Data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.
• Basic 4Vs (Gartner)
• “Big Data is not a system; it is simply a way to say that you have a lot of data.”
  – https://www.linkedin.com/pulse/big-data-silver-bullet-tomas-kratky
• Volume (Scale)
  – Data volume is increasing exponentially, not linearly.
  – Even large amounts of small data can result in Big Data.
• Variety (Complexity)
  – Various formats, types, and structures. Big data covers unstructured data and various data formats including text, blob, multimedia, etc.
  – Information and knowledge management is the management of both structured data (15% of information) and unstructured data (85% of information), according to the Butler Group.
  – 80 percent of business is conducted on unstructured information (Gartner Group).
• Velocity (Speed)
  – Data is being generated fast and needs to be processed fast.
• Veracity (Uncertainty)
  – Uncertainty due to inconsistency, incompleteness, latency, ambiguities, or approximations.
Figures from https://www.ksi.mff.cuni.cz/~svoboda/courses/192-MDK/lectures/MDK-Lecture-01-Introduction.pdf

Developments in Data Technology Capabilities
• Computing power and capacity (hardware)
  – Computing power for data processing has increased: https://towardsdatascience.com/the-future-of-computation-for-machine-learning-and-data-science-fad7062bc27d
  – Data storage capacity has increased: https://www.frontierinternet.com/gateway/data-storage-timeline/
  – Data generation methods proliferate: automated data collection devices and sensors, such as IoT devices
• Advances in algorithms and techniques such as parallel processing, AI, and machine learning
• Self-service tools expand the user base.
• Cloud-based systems lower the cost of ownership.
https://cacm.acm.org/magazines/2011/8/114953-an-overview-of-business-intelligence-technology/fulltext

Evolving User Needs
• Data democratization
  – Consumption of data reaches the general public
  – https://www.techtarget.com/whatis/definition/data-democratization
• Prevalence and wide expectation of data visualizations
• Evolving analytical needs
  – Need for real-time and most recent data
  – Business-user driven, agile, instant
  – Exploratory and interactive

Common Data Use Challenges
• Information overload
  – Too much data and information with varied formats and structures
  – Difficulty organizing data for effective access and retrieval
  – Difficulty finding useful information (knowledge) in it
  – Multiple copies of data exist, sometimes with conflicts
• Data everywhere
  – Data in separate systems and different sources, internal and external
  – The spreadmart problem http://en.wikipedia.org/wiki/Spreadmart
  – Over 43 percent of organizations have more than six content stores (Forrester Research).
• Difficulty of access
  – We may have the data but cannot access it (or it is difficult to get) because of technical or administrative issues.
• Don’t have that data
  – The data is simply not available.
  – Collecting the data may need an additional process and is costly.
• The organizational data problem: https://www.youtube.com/watch?v=y5-3Pjbk8Zk

Data Knowledge Areas
• The Data Management Association (DAMA) is a non-profit and vendor-independent association of business and technical professionals that is dedicated to the advancement of data resource management (DRM) and information resource management (IRM). https://www.dama.org
• DAMA publishes a guidebook called "The DAMA Guide to the Data Management Body of Knowledge" (DAMA-DMBOK). It defines 11 data management knowledge areas.
  – Overview of DMBOK https://www.dama-dk.org/onewebmedia/DAMA%20DMBOK2_PDF.pdf

11 Knowledge Areas by DAMA
1. Data Governance – planning, oversight, and control over management of data and the use of data and data-related resources. While we understand that governance covers ‘processes’, not ‘things’, the common term for Data Management Governance is Data Governance, and so we will use this term.
2. Data Architecture – the overall structure of data and data-related resources as an integral part of the enterprise architecture
3. Data Modeling & Design – analysis, design, building, testing, and maintenance (was Data Development in the DAMA-DMBOK 1st edition)
4. Data Storage & Operations – structured physical data assets storage deployment and management (was Data Operations in the DAMA-DMBOK 1st edition)
5. Data Security – ensuring privacy, confidentiality, and appropriate access
6. Data Integration & Interoperability – acquisition, extraction, transformation, movement, delivery, replication, federation, virtualization, and operational support (a Knowledge Area new in DMBOK2)
7. Documents & Content – storing, protecting, indexing, and enabling access to data found in unstructured sources (electronic files and physical records), and making this data available for integration and interoperability with structured (database) data
8. Reference & Master Data – managing shared data to reduce redundancy and ensure better data quality through standardized definition and use of data values
9. Data Warehousing & Business Intelligence – managing analytical data processing and enabling access to decision support data for reporting and analysis
10. Metadata – collecting, categorizing, maintaining, integrating, controlling, managing, and delivering metadata
11. Data Quality – defining, monitoring, maintaining data integrity, and improving data quality

Four Categories of Job Roles
• Management role
  – Focuses on the management of data and information assets, setting policies and processes; usually less technical
• Administration role
  – Technical administration of data systems, including data storage (database, data warehouse, etc.), BI systems, reporting systems, and other analytics and application systems
  – Maintains and monitors the security, integrity, reliability, and performance of systems
  – Usually system specific
• Development role
  – Data engineering
    • Design data models, systems, architectures
    • Build data pipelines and move data
  – Application development
    • Build reports, dashboards, and other applications that help analysts, managers, and customers
    • Consume data APIs and services
• Analysis role
  – Focuses on analysis and reporting; building analytical models and presenting results
  – Involves varying degrees of math, statistics, and computing algorithms

Three Careers of Focus
Data Analyst | Data Engineer/Developer* | Data Scientist
Data Query | Data Warehousing & ETL | Statistical & Analytical skills
Business Domain Knowledge | Business intelligence, reporting | Data Mining
Programming knowledge | Data Analytics | Machine Learning & Deep learning principles
Scripting & Statistical skills | In-depth knowledge of SQL/database | In-depth programming knowledge (SAS/R/Python coding)
Reporting & data visualization | Data architecture & pipelining | Hadoop-based analytics
SQL/database knowledge | Machine learning concept knowledge | Data optimization
Spreadsheet knowledge | Scripting, reporting & data visualization | Decision making and soft skills
* BSIT/MSIT focus
Table adapted from https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/

Data Engineer
• The role of the data engineer is mostly to ensure the quality and availability of the data.
  This includes the following most important tasks:
  – Build and maintain data pipeline systems
  – Clean and wrangle data into a usable state
  – Design/build data storage systems, architectures, and infrastructures
• Data engineer skills
  – Programming
  – Knowledge of tools and systems
  – Data model, structure, format, architecture
  – Relational and non-relational database design
  – Data storage system design
  – Data/information flow
  – SQL, query execution and optimization
Reference reading: https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
Extended reading: https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/

Data Science
• Data Science is multidisciplinary
  – Computer Scientists
  – Information Technologists
  – Statisticians/Mathematicians
  – Domain Experts
• Data in Data Science
  – Much the same as “data” in data analytics
• Science in Data Science
  – Implies scientific methods
  – More exploratory

Another view of jobs and careers
• https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others
Data developer: also includes BI analyst or BI developer.

Data Education at KSU
• BSIT - the new concentration on “data analytics and technology”
  – https://www.edocr.com/v/0jmn189y/jgzheng/ksu-bsit-data-concentration-overview
• MSIT/BSIT - Graduate Certificate in Data Analytics and Intelligent Technology
  – https://www.edocr.com/v/z1dwxbpy/jgzheng/ksu-msit-data-certificate
  – https://msit.kennesaw.edu/future-students/program-requirements.php
• Other departments
  – Ph.D. in Analytics and Data Science https://datascience.kennesaw.edu
  – ACS 8310 Data Warehousing
  – IS 8935 Business Intelligence - Traditional and Big Data Analytics
  – Certificate in High Performance Cluster Computing http://ccse.kennesaw.edu/cs/programs/cert-hpcc.php
• For more information
  – http://zheng.kennesaw.edu/advising
  – Lecture notes on BI and Data Visualization https://www.edocr.com/user/jgzheng

Industry Certifications
• Certiport - Marketing Resource Library (filecamp.com)

Core Readings
• DIKW pyramid: https://www.ontotext.com/knowledgehub/fundamentals/dikw-pyramid/
• Short video lectures on DIKW
  – DIKW Pyramid https://www.youtube.com/watch?v=u9DoQ9gY4z4
  – DIKW Pyramid with Sample Data https://www.youtube.com/watch?v=MFUyQsJyKgg
• Overview of DMBOK V2 – 11 knowledge areas: https://www.dama-dk.org/onewebmedia/DAMA%20DMBOK2_PDF.pdf
• Data engineering: https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition/
• Data Analyst vs Data Engineer vs Data Scientist: Skills, Responsibilities, Salary https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/ - from some job and career perspectives.

Additional Good Resource
• https://techcrunch.com/2021/05/02/data-was-the-new-oil-until-the-oil-caught-fire/
• Data governance https://www.youtube.com/watch?v=sHPY8zIhy60&list=RDCMUCrR22MmDd5-cKP2jTVKpBcQ&index=13
• Data science
  – Why Data Science Matters and How It Powers Business Value https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
  – Programming Languages for Data Scientists https://towardsdatascience.com/programming-languages-for-data-scientists-afde2eaf5cc5
  – Data Science Tutorial for Beginners https://www.guru99.com/data-science-tutorial.html
  – https://www.dataversity.net/ten-myths-about-data-science/
  – https://www.dataversity.net/data-science-trends-in-2020/
  – https://www.mastersindatascience.org/careers/data-scientist/
  – https://towardsdatascience.com/learn-the-art-of-data-science-programming-languages-of-the-decade-a2850830ab76
  – https://www.simplilearn.com/tutorials/data-science-tutorial/what-is-data-science
  – https://www.simplilearn.com/data-analyst-vs-data-scientist-article
• https://ischoolonline.berkeley.edu/data-science/what-is-data-science/
• More about jobs and careers
  – https://www.discoverdatascience.org/career-information/
  – Data engineer: https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-pipeline-data-warehouse-and-data-engineer-role/
  – https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others
  – https://dzone.com/articles/five-data-tasks-that-keep-data-engineers-awake-at
  – Data analyst: https://www.investopedia.com/articles/professionals/121515/data-analyst-career-path-qualifications.asp
  – https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html