
🥼 Data Science

Updated at 2019-09-24 16:33

Data science is a field that analyzes data. Proper analysis is never done in a vacuum; it always has context. Understanding the organization's values and business is a very important aspect of data science.

  • Data Engineers build the data storage and transfer logic.
  • Data Scientists work with data and code to create insights.
  • Data Strategists understand the business and provide actions based on insights.

Strong knowledge of statistics is essential for data science. What can be extracted from your data? Is the data good enough? Is it unbiased enough, and is there enough of it?

Data science requires intuition. Asking the right question is more an art than a science; to ask it, you must understand both the business and the data.

Data science is important for companies of all sizes. Data-driven companies perform better than others; data patterns help steer decision making.

Data science is currently a well-paid skill. Many companies collect data but lack the means to harness its power.

Data science frequently has ethical and privacy considerations. Keep in mind what you are working with and notice when you are potentially crossing the line.

  • Data collection should ALWAYS be opt-in when an individual is identifiable, which can be an issue as you want to store everything possible.
  • You should always clearly state what data you collect and what you use it for. Google, for example, is well known for collecting everything about an individual's web presence.

But data collection is also frequently helpful in many contexts:
- Google is so good at ranking web results because of user behavior analysis.
- Amazon can frequently suggest items you didn't even remember you wanted.
- LinkedIn gives good business connection suggestions.
- Facebook shows actually accurate friend suggestions.
- Spotify finds music you will like through Discover Weekly.

The most popular language for data science is Python. Python is a lot cleaner than alternatives like R and MATLAB. R and MATLAB are more scientific but are dropping in popularity, which means that new data science libraries are usually written for Python first.
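
For example, a few lines of pandas are enough for a first look at a dataset; the file name and column names here are hypothetical:

    import pandas as pd

    # Hypothetical dataset; 'visits.csv' and its columns are made up for illustration.
    df = pd.read_csv("visits.csv")
    print(df.describe())                 # summary statistics per numeric column
    print(df.groupby("country").size())  # visit counts per country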

The Process

The data science process has six major steps (a skeleton sketch follows the list):

  1. Data collection: getting the data
  2. Data management: storing the data
  3. Data cleaning: making the data more usable
  4. Data modeling: adding context to the data
  5. Data analysis: finding patterns in the data and relations
  6. Data visualization: presenting your findings to others
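
A minimal, runnable skeleton of the six steps; the function bodies and the toy data are illustrative assumptions, not a real pipeline:

    def collect():           # 1. data collection
        return [{"age": "34"}, {"age": "NA"}, {"age": "29"}]

    def store(records):      # 2. data management; in practice a database or data lake
        return list(records)

    def clean(records):      # 3. data cleaning: drop records with invalid ages
        return [r for r in records if r["age"].isdigit()]

    def model(records):      # 4. data modeling: give the values proper types
        return [{"age": int(r["age"])} for r in records]

    def analyze(records):    # 5. data analysis: find the average age
        ages = [r["age"] for r in records]
        return sum(ages) / len(ages)

    def visualize(result):   # 6. data visualization
        print(f"average age: {result:.1f}")

    visualize(analyze(model(clean(store(collect())))))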

Data management is very important in data science:

  • Turning sheer volume into a resource.
  • Dividing data into analyzable chunks.
  • Reducing data: summarizing the relevant parts for analysis.
  • Creating a process for using big data.
  • Decentralizing all information.

Data cleaning means removing corrupt or inaccurate records (see the sketch after the list):

  • Check for improbable values, e.g. a website visitor aged 6.
  • Check for impossible values, e.g. a person aged 1034 years.
  • Check data point types, e.g. '19' as a person's name.
  • Check for missing values, usually NULL or ''.
  • Check for placeholder values, e.g. NA, 0, -, .
  • Check for outliers, values that are clearly out of range from the others.
  • Check for contradictory values, e.g. duplicate records of a person, or 'New York', 'NY' and 'New Yorl' all referring to the same city.
  • Finally, remove noise: all pieces of the data that are irrelevant for the current analysis.
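
A minimal sketch of some of these checks with pandas; the file name and the columns are hypothetical:

    import pandas as pd

    # Hypothetical input: 'people.csv' with name, age and city columns.
    df = pd.read_csv("people.csv")

    # Turn placeholder values into proper missing values, then drop them.
    df = df.replace(["NA", "-", "."], pd.NA).dropna(subset=["name", "age"])

    # Check data point types, e.g. a number stored as a person's name.
    df = df[~df["name"].astype(str).str.fullmatch(r"\d+")]

    # Remove impossible and improbable values.
    df = df[pd.to_numeric(df["age"], errors="coerce").between(13, 120)]

    # Unify contradictory spellings of the same value, then deduplicate.
    df["city"] = df["city"].replace({"NY": "New York", "New Yorl": "New York"})
    df = df.drop_duplicates(subset=["name"])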

Data modeling is exploring data-oriented structures. Don't theorize before you have the data and a chance to look at it. Data modeling includes identifying the relations between the objects that are to be mapped, and it provides context for the information so it can be used for decision making.

Avoid analyzing individual points; search for patterns. Keep thinking about the bigger picture and avoid getting trapped too close to the data. Note that adding data models might also hinder the analysis process.

Patterns can be very unexpected, for example:
- You can analyze Twitter messages to find out which films are booming.
- The global mood of Twitter users predicts the DOW index.

Let the data speak for itself. A schemaless or otherwise mutable representation is usually the best, though some variants of data might still need data modeling.

Stream processing is real-time analytic processing: finding patterns in data streams.

Stream processing is required for near-future forecasting: using past patterns and outcomes together with an active stream of data to find patterns that predict the future.
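
A minimal sketch of stream processing: a generator that maintains a moving average over incoming values; the readings are made up to stand in for a real-time stream:

    from collections import deque

    def moving_average(stream, window=3):
        # Keep only the last `window` values and emit the running mean.
        buffer = deque(maxlen=window)
        for value in stream:
            buffer.append(value)
            yield sum(buffer) / len(buffer)

    readings = [10, 12, 11, 40, 13, 12]  # 40 stands out as an anomaly
    for average in moving_average(readings):
        print(round(average, 1))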

Data visualization is the first outcome of data research. It is useful for presenting information to decision makers and for data exploration. It is important to get the right information into the right hands, as data scientists rarely make business decisions themselves. Analysis is commonly split by the number of variables involved (see the sketch after the list):

  • Univariate analysis: one variable at a time, e.g. a histogram.
  • Bivariate analysis: the relation between two variables, e.g. a scatter plot.
  • Multivariate analysis: relations between three or more variables.
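
A minimal matplotlib sketch of the univariate and bivariate cases; the data is randomly generated for illustration:

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(42)
    ages = rng.normal(35, 10, 500)             # one hypothetical variable
    spend = ages * 2 + rng.normal(0, 15, 500)  # a second, related variable

    fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
    left.hist(ages, bins=20)         # univariate: distribution of one variable
    left.set_title("Univariate: age")
    right.scatter(ages, spend, s=5)  # bivariate: relation between two variables
    right.set_title("Bivariate: age vs spend")
    plt.tight_layout()
    plt.show()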

Big Data

Big Data: Data so vast or variable that it cannot be contained and analyzed with everyday database tools.

Big data has four potential characteristics:

  • High Volume: enormous amounts of data, in the tera- and petabyte range.
  • High Variety: data extends beyond normal structured data and can contain mixed text, audio, video and images.
  • High Veracity: how do you remove totally useless data and noise?
  • High Velocity: data might be time-sensitive and must be used at the same time as it is collected.

Big data is frequently utter chaos. A data scientist takes the chaos of messy source data and turns it into actionable business information.

Big data is useless without a human element. Whoever queries the data must have domain expertise on the subject; you must understand the business to be an efficient data scientist.

Big data can reveal previously hidden knowledge. Peer influence between customers is a good example: buying behavior influences those around you.

Big data analytics doesn't have to be expensive. You can schedule a 4-hour big data analytics run that costs less than $3. Continuously running analytics can be quite expensive, though.

Big data analytics utilizes a wide array of tools. The most used tool is Hadoop, but the rest of the stack varies quite a lot. Here is most of the common big data tool jargon, with a small Spark example after the list:

Hadoop: like an operating system for big data that runs services below
  YARN: resource manager that distributes work across a cluster
  HDFS: Hadoop distributed file system, abstraction of a shared file system
  Avro: data serialization framework
  Parquet: columnar storage format for your data
  Hive: data warehouse infrastructure, built on top of Hadoop
  HiveQL: SQL-like syntax for analytics queries
  Pig: high-level data flow language and execution framework
    UDF: Pig extension functions defined by user e.g. in Python or Ruby
  Spark: in-memory data processing jobs
  Spark Streaming: take streaming data and manipulate it
  Spark SQL: use HiveQL to create Spark jobs
  MapReduce: similar to Spark but older and inferior
  Sqoop: transfers data between Hadoop and relational databases
  Mahout: scalable machine learning and data mining library, usually with Hadoop

Databases:
  Cassandra: distributed database, good for real-time big data
  HBase: distributed database that runs on top of HDFS
    Phoenix: provides SQL syntax for HBase
  AWS:
    EMR: launches, sets up and manages Hadoop clusters
         that allow processing S3 data
    Redshift: database for vast data volumes, fork from PostgreSQL 8,
              can take inputs from numerous sources
    S3: good storage for vast volumes of unprocessed data
    Glacier: place for data backups or archiving
    Kinesis: streaming endpoint that can write to S3 and Redshift

Presto: distributed query engine, used with Hive, Cassandra, Redshift
        and can combine multiple data sources
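
To make the jargon concrete, a minimal PySpark sketch that counts words in a text file; 'logs.txt' is a hypothetical input:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # Read lines, split them into words and count each word across the cluster.
    lines = spark.read.text("logs.txt").rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
        print(word, count)

    spark.stop()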

Data Science Work Environments

Data Science in Academia:

  • Your target is to write a paper. This means that your approach should be reproducible, but you will lack productization.
  • Data scientists use notebooks more frequently, as a notebook is similar to a scientific manuscript; it is the end product.
  • You rarely have the luxury of a machine learning or data engineer, which can hurt code quality and maintainability.

Data Science Consulting:

  • Your target is to fulfill a contract, so you must be more pragmatic; you focus on doing exactly what was agreed upon.
  • Reporting your findings is more about highlighting important aspects than giving full details.
  • Automated data cleaning and model building tools are much more in use; you try to find quick wins and the details are less important.
  • You will hand off your proof-of-concept to the engineering team if it looks promising.

In-house Data Scientist:

  • In general, you have to wear many additional (engineering) hats.
  • You put more emphasis on good software engineering practices.
  • Everything you write will be in code version control.
  • You don't simply hand off your proof-of-concept but work with the engineering team to productize it.
