Chapter 6 Big Data and Analytics – Flashcards

question
Define Big Data. Where does Big Data come from?
answer
Data that exceeds the reach of commonly used hardware environments and/or the capabilities of software tools to capture, manage, and process it within a tolerable time span. Big Data can come from anywhere and everywhere: Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, detail call records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, scientific research, military surveillance, multimedia archives, and more.
question
What are the three "V"s of Big Data?
answer
Volume - refers to the amount of data being produced. Variety - data today comes in all types of formats. Velocity - how fast data is being produced and how fast it must be processed. Some of the leading Big Data solution providers add other Vs, such as veracity (IBM), variability (SAS), and value proposition.
question
What does "big data" mean to Luxottica?
answer
Big data is data captured from their websites and retail chain stores in the form of transactions, click streams, product reviews, and social media postings.
question
What were Luxottica's main challenges?
answer
The data captured from interactions constitutes a massive source of business intelligence for potential product, marketing, and sales opportunities. However, Luxottica outsourced both data storage and promotional campaign development and management. The outsourcing model hampered access to current, actionable data, limiting its marketing value.
question
What were the proposed solution and the obtained results (Luxottica)?
answer
Luxottica implemented an advanced big data analytics solution: the Customer Intelligence Appliance (CIA) from IBM Business Partner Aginity LLC. This tool allowed Luxottica to integrate data from its multiple internal and external application sources and gain visibility into its customers. The benefits are as follows: *Anticipates a 10 percent improvement in marketing effectiveness *Identifies the highest-value customers out of nearly 100 million *Targets individual customers based on unique preferences and histories
question
Explain the value proposition of Big Data and the potential challenges you might encounter
answer
Big Data by itself, regardless of the size, type, or speed, is worthless unless business users do something with it that delivers value to their organizations. Big Data + Big Analytics = Value. With the value proposition, Big Data also brought about big challenges: *Effectively and efficiently capturing, storing, and analyzing Big Data *A new breed of technologies needed to be developed (or purchased, hired, or outsourced...) If any of the following are true, you might want to consider embarking on a Big Data journey: *You can't process the amount of data that you want to because of the limitations of your current platform. *You can't include new/contemporary data sources (e.g., social media, RFID, sensory, Web, GPS, textual data) because they do not comply with the data storage schema. *You need to (or want to) integrate data as quickly as possible to be current in your analysis. *You want to work with a schema-on-demand data storage paradigm because of the variety of data types involved. *The data is arriving so fast at your organization's doorstep that your traditional analytics platform cannot handle it.
question
What are the most critical success factors for Big Data analytics?
answer
*A clear business need (alignment with the vision and the strategy) *Strong, committed sponsorship (executive champion) *Alignment between the business and IT strategy *A fact-based decision-making culture *A strong data infrastructure *The right analytics tools *Right people with right skills
question
What is high-performance computing?
answer
In order to keep up with the computational needs of Big Data, a number of new and innovative computational techniques and platforms have been developed. These techniques are collectively called high-performance computing, which includes the following: *In-memory analytics: storing and processing the complete data set in RAM *In-database analytics: placing analytic procedures close to where data is stored *Grid computing & MPP: use of many machines and processors in parallel (MPP = massively parallel processing) *Appliances: combining hardware, software, and storage in a single unit for performance and scalability
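To make the in-memory analytics idea concrete, here is a minimal Python sketch (the file name and column names are hypothetical, not from the text): the first approach holds the complete data set in RAM, the second processes it in chunks, trading speed for memory; in-memory platforms aim to make the first approach feasible at scale.

```python
import pandas as pd

# In-memory analytics: load the complete (hypothetical) data set into RAM
# once, then run every aggregation directly against it.
sales = pd.read_csv("sales.csv")                      # entire file in memory
in_memory_totals = sales.groupby("region")["amount"].sum()

# Out-of-memory alternative: stream the same file in chunks when it does
# not fit in RAM -- slower, which is the gap in-memory platforms close.
chunked_totals = {}
for chunk in pd.read_csv("sales.csv", chunksize=1_000_000):
    for region, amount in chunk.groupby("region")["amount"].sum().items():
        chunked_totals[region] = chunked_totals.get(region, 0) + amount
```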
question
What are some of the most common challenges that have a significant impact on successful implementation of Big Data analytics?
answer
*Data volume: the ability to capture, store, and process the huge volume of data in a timely manner *Data integration: the ability to combine data quickly and at a reasonable cost *Processing capabilities: the ability to process the data quickly, as it is captured (i.e., stream analytics) *Data governance (...security, privacy, access) *Skill availability (...data scientists) *Solution cost (ROI)
question
What are some business problems addressed by big data analytics?
answer
*Process efficiency and cost reduction *Brand management *Revenue maximization, cross-selling/up-selling *Enhanced customer experience *Churn identification, customer recruiting *Improved customer service *Identifying new products and market opportunities *Risk management *Regulatory compliance *Enhanced security capabilities
question
How can Big Data benefit large-scale trading banks?
answer
Big data analytics helped this bank respond to growing business needs and requirements. With its significant derivatives exposure, the bank's management recognized the importance of having a real-time global view of its positions. Prior to adopting MarkLogic, it was difficult to identify financial exposure across many systems (separate copies of the derivatives trade store); afterward, it was possible to analyze all contracts in a single database (MarkLogic Server eliminates the need for 20 database copies).
question
How did MarkLogic infrastructure help ease the leveraging of Big Data? (Top 5 Investment Bank)
answer
The Bank built a derivatives trade store based on MarkLogic Server, replacing the incumbent technologies. Replacing the 20 disparate batch-processing servers with a single operational trade store enabled the Bank to know its market and credit counterparty positions in real time, providing the ability to act quickly to mitigate risk. The accuracy and completeness of the data allowed the Bank and its regulators to confidently rely on the metrics and stress-test results in reports.
question
What were the challenges, the proposed solution, and the obtained results (Top 5 Investment Bank)?
answer
Challenge *The existing system, based on a relational database, comprised multiple installations around the world. Due to gradual expansions to accommodate increasing data volumes and varieties, the legacy system was not fast enough to respond to growing business needs and requirements. Solution *The Bank built a derivatives trade store based on MarkLogic Server, replacing the 20 disparate batch-processing servers with a single operational trade store. Results *Trade data is now aggregated accurately across the Bank's entire derivatives portfolio, allowing risk management stakeholders to know the true enterprise risk profile, to conduct predictive analyses using accurate data, and to adopt a forward-looking approach.
question
List some of the Big Data technologies out there
answer
*MapReduce... *Hadoop... *Hive *Pig *HBase *Flume *Oozie *Ambari *Avro *Mahout *Sqoop *HCatalog
question
What is MapReduce?
answer
MapReduce is a technique popularized by Google that distributes the processing of very large multi-structured data files across a large cluster of machines. Goal: achieving high performance with "simple" computers. It is good at processing and analyzing large volumes of multi-structured data in a timely manner. Examples: indexing the Web for search, graph analysis, text analysis, machine learning, ... A minimal single-machine sketch of the pattern follows.
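Below is a minimal, single-machine Python sketch of the map/shuffle/reduce pattern using the canonical word-count example (the input documents are made up); a real cluster runs many mappers and reducers in parallel on different machines, with the framework handling the shuffle.

```python
from collections import defaultdict
from itertools import chain

documents = ["big data big analytics", "big value"]   # toy input

# Map phase: emit a (key, value) pair -- here (word, 1) -- per input record.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle phase: group all emitted values by key; on a cluster the
# framework moves pairs between machines so each reducer sees one key group.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: collapse each key's values into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 1, 'analytics': 1, 'value': 1}
```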
question
What is Hadoop?
answer
Hadoop is an open-source framework for processing, storing, and analyzing massive amounts of distributed, unstructured data. Originally created by Doug Cutting at Yahoo!, Hadoop was inspired by MapReduce. Hadoop clusters run on inexpensive commodity hardware so projects can scale out inexpensively. *Hadoop is now part of the Apache Software Foundation *Open source - hundreds of contributors continuously improve the core technology *MapReduce + Hadoop = Big Data core technology
question
How does Hadoop work?
answer
A client accesses unstructured and semi-structured data from sources including log files, social media feeds, and internal data stores. The data is broken up into "parts," which are then loaded into a file system made up of multiple nodes running on commodity hardware using HDFS. Each "part" is replicated multiple times and loaded into the file system for replication and failsafe processing. One node acts as the Facilitator (Name Node) and another as the Job Tracker. Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce. A hedged example of this division of labor appears below.
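One way to see this flow in code is Hadoop Streaming, which lets plain scripts act as the mapper and reducer while Hadoop handles the splitting, replication, and shuffling. The sketch below is a word-count example assuming a standard Hadoop Streaming setup; the scripts themselves are illustrative, not from the text.

```python
# mapper.py -- Hadoop Streaming feeds each node its input split on stdin;
# the mapper emits tab-separated (key, value) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop sorts mapper output by key before the reduce phase,
# so all pairs for a given word arrive consecutively; sum them per key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is typically launched with the hadoop-streaming JAR, e.g. `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input /in -output /out` (paths illustrative).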
question
A Hadoop "Stack" is made up of a number of technical components, which include what?
answer
*Hadoop Distributed File System (HDFS) *Name Node (primary facilitator) *Secondary Node (backup to the Name Node) *Job Tracker *Slave Nodes (the grunts of any Hadoop cluster) *Additionally, the Hadoop ecosystem is made up of a number of complementary subprojects: NoSQL (Cassandra, HBase), DW (Hive), ... *NoSQL = not only SQL
question
Although Hadoop and related technologies have been around for more than 5 years now, most people still have several misconceptions about Hadoop and related technologies. What are some facts that clarify what Hadoop is and does relative to BI?
answer
*Hadoop consists of multiple products *Hadoop is open source but available from vendors, too *Hadoop is an ecosystem, not a single product *HDFS is a file system, not a DBMS *Hive resembles SQL but is not standard SQL *Hadoop and MapReduce are related but not the same *MapReduce provides control for analytics, not analytics *Hadoop is about data diversity, not just data volume. *Hadoop complements a DW; it's rarely a replacement. *Hadoop enables many types of analytics, not just Web analytics
question
What are some of the skills that define a data scientist?
answer
1. Domain expertise, problem definition, and decision making 2. Data access and management (both traditional and new data systems) 3. Programming, scripting, and hacking 4. Internet and social media/social networking technologies 5. Curiosity and creativity 6. Communication and interpersonal skills
question
What is the role of analytics and Big Data in modern day politics?
answer
Analytics is used to process census data, election databases, market research, and social media data. Through data mining, Web mining, text mining, and multimedia mining, campaigns can predict outcomes and trends, identify associations between events and outcomes, assess and measure sentiment, and profile groups with similar behavioral patterns.
question
Do you think Big Data analytics could change the outcome of an election?
answer
Yes. Candidates can use big data to inform their decisions on how to run their campaigns. Information gleaned from big data analytics can lead a candidate to make decisions about a number of things, such as whom to target, for what reason, and with what message, on a continuous basis. In the 2012 election, Obama had a data advantage, and the depth and breadth of the campaign's digital operation, from political and demographic data mining to voter sentiment and behavioral analysis, reached beyond anything politics had ever seen.
question
Will Hadoop replace data warehousing/RDBMS?
answer
No; such a claim has no basis in reality. The following are some use cases for Hadoop and RDBMS: Use Cases for Hadoop *Hadoop as the repository and refinery *Hadoop as the active archive Use Cases for Data Warehousing *Data warehouse performance *Integrating data that provides business value *Interactive BI tools
question
Name some situations where you would use a data warehouse over Hadoop, and vice versa.
answer
Data Warehouse *Low latency, interactive reports, and OLAP *Analysis of provisional data *CPU-intensive analysis *Online archive as an alternative to tape Hadoop *Preprocessing or exploration of raw unstructured data *Discovering unknown relationships in the data *Many flexible programming languages running in parallel *System, user, and data governance *Unrestricted, ungoverned sandbox explorations Both *ANSI 2003 SQL compliance is required *High-quality, cleansed, and consistent data *100s to 1,000s of concurrent users *Parallel complex process logic *Extensive security and regulatory compliance
question
How can Hadoop and DW coexist?
answer
1. Use Hadoop for storing and archiving multi-structured data 2. Use Hadoop for filtering, transforming, and/or consolidating multi-structured data 3. Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results 4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform 5. Use a front-end query tool to access and analyze data
question
What is Stream analytics?
answer
Stream analytics is a term commonly used for the analytic process of extracting actionable information from continuously flowing/streaming data. A stream can be defined as a continuous sequence of data elements. A minimal sketch of the idea follows.
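As a minimal illustration (the sensor readings and threshold are hypothetical), the Python sketch below extracts actionable information -- a rolling average and a threshold alert -- as elements arrive, without ever storing the full stream.

```python
from collections import deque

def rolling_alerts(stream, window_size=3, threshold=90.0):
    """Yield an alert whenever the rolling average crosses the threshold."""
    window = deque(maxlen=window_size)   # oldest element drops off automatically
    for value in stream:
        window.append(value)
        avg = sum(window) / len(window)
        if avg > threshold:              # act on the data as it flows by
            yield f"ALERT: rolling average {avg:.1f} exceeds {threshold}"

readings = [70, 85, 95, 99, 98, 97, 60]  # hypothetical sensor stream
for alert in rolling_alerts(readings):
    print(alert)                         # fires three times, mid-stream
```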
question
What is critical event processing?
answer
Critical event processing is a method of capturing, tracking, and analyzing streams of data to detect events of certain types that are worthy of the effort. An illustrative sketch follows.
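As an illustration, here is a hedged Python sketch of one common event pattern -- three failed logins by the same user within 60 seconds; the event stream, field layout, and rule are hypothetical, not from the text.

```python
from collections import defaultdict, deque

def failed_login_bursts(events, max_failures=3, window_seconds=60):
    """Yield (time, user) when a user accumulates too many recent failures."""
    recent = defaultdict(deque)           # user -> timestamps of failures
    for timestamp, user, outcome in events:
        if outcome != "failure":
            continue
        q = recent[user]
        q.append(timestamp)
        while q and timestamp - q[0] > window_seconds:
            q.popleft()                   # discard failures outside the window
        if len(q) >= max_failures:
            yield (timestamp, user)       # an event worthy of effort

events = [(0, "ann", "failure"), (20, "ann", "failure"),
          (45, "ann", "failure"), (200, "bob", "failure")]
print(list(failed_login_bursts(events)))  # [(45, 'ann')]
```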
question
Compare and contrast Streaming and perpetual analytics
answer
Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to the observations take into account previous observations as long as they occurred in the prescribed window; these windows have some arbitrary size. Perpetual analytics, on the other hand, evaluates every incoming observation against all prior observations, where there is no window size. The sketch below contrasts the two.
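The Python sketch below makes the contrast concrete with a hypothetical anomaly rule ("the observation is 50% above the average of its history"): the streaming version consults only a fixed window of recent observations, while the perpetual version consults every prior observation -- so a spike that has fallen out of the window changes their answers.

```python
from collections import deque

def is_anomaly(value, history):
    """Hypothetical rule: value is 50% above the average of the history."""
    return bool(history) and value > 1.5 * (sum(history) / len(history))

observations = [100, 10, 10, 10, 30]

window = deque(maxlen=3)   # streaming analytics: arbitrary, fixed-size window
everything = []            # perpetual analytics: all prior observations
for x in observations:
    print(x, "streaming:", is_anomaly(x, window),
             "perpetual:", is_anomaly(x, everything))
    window.append(x)
    everything.append(x)

# On the final observation (30) the streaming rule fires (the early spike of
# 100 has left the window), while the perpetual rule, which still remembers
# the spike, does not.
```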
question
What is big about the data mentioned in the cases and what are the Vs that you notice? (Application 6.5) Dublin City Council
answer
*Large data sets from different sources *1.2 million citizens *1,000 buses transmitting data every 20 seconds *Real-time, always-changing data *Structured & unstructured data The V's: Volume: 1,000 buses, data constantly updated Variety: cameras across Dublin, different sensors, rain gauges, closed-circuit television, under-road sensors (pulling data from many different sources) Variability: weather statistics, rush hour, accidents (many different variables) Value: 66 million euros
question
What big data technologies are used in the cases? (Application 6.5) Dublin City Council
answer
*Central Geographic Information System *InfoSphere Streams & mapping software *Digital maps *Real-time maps of city buses *GPS Tracking
question
What challenges are faced in the cases? (Application 6.5) Dublin City Council
answer
*Inefficient tracking *Could only track one bus route at a time *Did not see the big picture *Only tracking bus routes - not tracking private traffic data or meteorological data *Vast amount of data to analyze
question
What solutions are proposed in the cases? What other solutions can you think of? (Application 6.5) Dublin City Council
answer
*Research partnership with IBM *Integrated geospatial data *InfoSphere Streams and mapping software >>Broader view of all bus routes in real time *Ultimately helping to visualize data through dashboards *Allowed analysis of transport data
question
What results are observed in the cases? (Application 6.5) Dublin City Council
answer
*Better monitoring and report generation *Predicting and preventing traffic congestion *Adding additional bus lanes and bus-only traffic signals *Road-infrastructure improvements *Beginning to focus on future data sources >>Under-road sensors, meteorological data, and the city's free bicycle docks
question
What is big about the data mentioned in the cases and what are the Vs that you notice? (Application 6.3) eBay
answer
Big Data *Over 112 million active users *400+ million items listed for sale *Processes 50TB of data per day, accessed by 7,000 analysts *14PB data warehouse The V's *High volume due to many transactions (40 billion pieces of data) *High velocity due to transaction processing speed *High variety due to differences in product listings *High value as consumers sell directly to one another
question
What big data technologies are used in the cases? (Application 6.3) eBay
answer
Technologies *NoSQL Technologies >>Apache Cassandra >>DataStax Enterprise >>Hadoop Analytics *Scalable architecture allowing for deployment of clusters across multiple data centers *Able to utilize commodity hardware, reducing cost of processing
question
What challenges are faced in the cases? (Application 6.3) eBay
answer
Challenges *Billions of reads and writes daily, increasing demand to process data at high speeds *Need to balance application availability and uptime *Need for a solution without typical bottlenecks, scalability issues, and transactional constraints *Need for rapid analysis on demand of structured and unstructured data
question
What were the solutions proposed in this case? (Application 6.3) eBay
answer
*Integrated real-time data and analytics using the NoSQL solutions/technologies previously mentioned *Utilize DataStax Enterprise for many different use cases >>Read/write traffic >>Quantifying social data >>Load balancing and application availability >>Hunch "taste graph" >>Mobile notification logging and tracking *Common uptime requirement *Utilize AWS (S3) *Look at other third-party hosted database solutions *MongoDB for unstructured data *Integrate the S3 database with Amazon QuickSight, an online analytics tool that utilizes in-memory processing
question
What results are observed in the cases? (Application 6.3) eBay
answer
*eBay was able to successfully implement the NoSQL solutions *The solutions provide a stable, scalable, secure, and cost-effective integrated real-time data processing system *Allows eBay to efficiently process transactions, as well as store and analyze large amounts of data quickly
question
What's big about the data in this case, and what are the different V's? (Discovery Health)
answer
Big Data *Volume of data - 2.7 million members >>Generate a million rows of new claims data each day >>Analyze three years of historical data The V's *Volume >>2.7 million members >>Three years of historical data: 1.095 billion rows of data being processed *Velocity >>Analyze three years' worth of data in a matter of minutes *Value >>Constantly developing new analytical applications >>Predictive modeling of members' medical needs, fraud detection
question
What were the Big Data technologies used in this case? (Discovery Health)
answer
*IBM SPSS >>Forms the core of the data-mining and predictive analysis capability *IBM PureData - powered by Netezza >>Transforms the performance of the predictive models >>Embeds analytics into day-to-day operational decision making *BITanium >>Provided support & training >>Product evaluation, software license management, technical support, etc.
question
What were the challenges faced in this case (Discovery Health)?
answer
*Volume of data >>Data warehouse so large that some queries would take 18 hours or more to process and often crashed >>Identifying key insights that the business & members could rely on >>Needed infrastructure with the power to deliver results quickly
question
What was the solution in this case (Discovery Health)?
answer
*Project to build a new accelerated analytics landscape >>Combined IBM SPSS software with IBM PureData >>BITanium fine-tuned the application
question
What were the results in this case (Discovery Health)?
answer
*Increased speed & efficiency *Discovery's plan contributions are up to 15% lower than any others *Predicting & preventing health risks *Identifying & eliminating fraud *Tweak a model and re-run the analysis in a matter of minutes >>More development cycles, completed faster, with new analyses released to the business in days rather than weeks
question
What's big about the data in this case, and what are the different V's? (Top 5 Investment Bank)
answer
Big Data *1/3 of the world's total derivative trades The V's *Volume >>1/3 of the world's derivative trades *Variety >>Credit, interest rate, and equity derivatives *Velocity >>Faster development
question
What's big about the data in this case, and what are the different V's? (Turning Machine-Generated Streaming Data into Valuable Business Insights)
answer
Big Data *24 million customers *A billion daily events >>Running on a distributed hardware/software infrastructure supporting millions of cable, online, and interactive media customers The V's *Volume >>24 million customers >>A billion daily events *Variety >>Offers digital voice, high-speed Internet, and cable services *Velocity >>A billion daily events *Veracity >>Error-prone traditional search methods; overwhelming to just gather and view
question
What technologies are used in the case? (Turning Machine-Generated Streaming Data into Valuable Business Insights).
answer
*Splunk *Data streaming *Real time
question
What challenges are faced in this case (Turning Machine-Generated Streaming Data into Valuable Business Insights)?
answer
Challenges *Gathering and viewing data in one place *Performing diagnostics *Real-time intelligence *Using time-consuming and error-prone traditional search methods *Problems in application troubleshooting, operations, compliance, and security
question
What were the solutions proposed in this case (Turning Machine-Generated Streaming Data into Valuable Business Insights)?
answer
*Splunk >>Application troubleshooting >>Operations >>Compliance >>Security
question
What were the results proposed in this case (Turning Machine-Generated Streaming Data into Valuable Business Insights)?
answer
*Volume of data no longer overwhelming *Able to seek out historical data >>Identify trends and unique patterns *Growing knowledge base and awareness >>Increased ability to deliver continuous, quality customer experiences *Predictive capability >>Helps see what will happen and make the right decisions