Wednesday, 22 June 2022

MathWithNaziaa : Popular Applications of Mathematics in Data Science : Natural Language Processing (NLP)

Popular Applications of Mathematics in Data Science

Businesses across all industries need data scientists to help them function and be successful on a daily basis. Understanding how you can use math in practical scenarios can help you understand why businesses need data scientists and how mathematics comes into play. 

Let’s look at some practical uses of mathematics in popular data science and machine learning applications and technologies being utilized by leading organizations today:

Natural Language Processing (NLP)

Linear algebra is used in NLP for word embeddings, for unsupervised learning techniques such as topic modeling, and for predictive analytics. Examples of NLP applications include chatbots, language translation, speech recognition, and sentiment analysis.

The volume of data is growing rapidly, and a large portion of the data available today is in the form of text. Natural Language Processing is a popular branch of AI that helps Data Science extract insights from textual data. As a result, industry experts predict strong demand for Natural Language Processing professionals in the near future. In this tutorial, we will discuss some of the important NLP techniques used in the field of Data Science.

Natural Language Processing, or NLP, is a branch of AI that focuses on teaching computers to read and interpret text the way humans do. It is a field that develops methods for bridging the gap between Data Science and human languages.

Everything we say or write carries information that can be useful in making decisions. But extracting this information is not easy, because humans use many languages, words, tones, and so on. The data we generate through our conversations, tweets, etc. is highly unstructured, and traditional techniques are not capable of extracting insights from it. Advanced technologies like machine learning and NLP have changed this and brought a revolution in the field of Data Science.

Many areas such as healthcare, finance, media, and human resources use NLP to make use of data available in the form of text and speech. Many text and speech recognition applications are built using NLP, for example personal voice assistants like Siri, Cortana, and Alexa.

 

NLP techniques in Data Science

Let us see some of the most widely used NLP techniques in Data Science.

1. Bag of Words

This model counts how many times each word occurs in a piece of text. It works by generating an occurrence matrix for the sentences, with one row per sentence and one column per word. The underlying grammar and the order of words are not considered while generating the matrix.

These occurrences or counts are then fed into a classifier as features.

For example, consider these two sentences:

A compound sentence is a sentence formed with two or more main clauses joined by a coordinating conjunction.
A simple sentence is a sentence formed with one main clause.

Now let’s generate the occurrence matrix for these sentences:

[Figure: NLP bag of words]
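
The occurrence matrix for the two sentences above can also be built in a few lines of Python. This is only a minimal sketch: the tokenize helper and its rule (lowercase, letters only) are assumptions made for this example, and real projects usually rely on a library tokenizer instead.

```python
import re
from collections import Counter

# The two example sentences from the text.
sentences = [
    "A compound sentence is a sentence formed with two or more main clauses "
    "joined by a coordinating conjunction.",
    "A simple sentence is a sentence formed with one main clause.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


# One Counter of word counts per sentence.
counts = [Counter(tokenize(s)) for s in sentences]

# The vocabulary is every distinct word, sorted so the column order is stable.
vocabulary = sorted(set().union(*counts))

# Occurrence matrix: one row per sentence, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

# Print one line per word with its count in sentence 1 and sentence 2.
for word, column in zip(vocabulary, zip(*matrix)):
    print(f"{word:<13}{column[0]:>3}{column[1]:>3}")
```

Each row of matrix is the feature vector for one sentence. Note that word order is lost: any reordering of a sentence's words produces exactly the same row.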

This approach is very simple to understand, but it has several drawbacks. The model gives no idea about the semantics or the context in which the words are used. Also, words like “a” or “the”, which appear frequently but are not that important, may create noise during analysis. Another problem is that in the above example the word “a” holds more weight than the word “compound”, i.e. words are not weighted according to their importance.

To overcome these issues, we use an approach called Term Frequency-Inverse Document Frequency (TF-IDF).

2. Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency, or TF-IDF, addresses the weighting problem of Bag of Words by using a weighting factor. It uses statistics to calculate the importance of a word in a document. Let us understand the statistics of TF-IDF.

TF or Term Frequency: It measures the frequency of a word in a document. This is calculated by counting the total number of occurrences of the word and dividing it by the total number of words in the document.

IDF or Inverse Document Frequency: It measures how rare a word is across the whole collection of documents. Words such as “is”, “a”, and “of” occur in almost every document, so they say little about any one of them. IDF therefore assigns each word a weight according to how informative it is. It is calculated by taking the log of the total number of documents in the dataset divided by the number of documents containing that particular word (1 is also added to the denominator so that it is never 0).

TF-IDF: Finally, the importance of a word is calculated by multiplying the TF and IDF terms, i.e. TF*IDF.

Thus, words that are more important are assigned higher weights by these statistics. This technique is widely used by search engines for scoring and ranking the relevance of a document for the given input keywords.
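
The three statistics described above can be sketched directly in Python. This is a minimal illustration of the formulas exactly as stated in this section (including the 1 added to the denominator). The function names, the tokenize helper, and the three-document corpus are hypothetical and chosen only for this example; libraries often use slightly different smoothing.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequency(word, tokens):
    """Occurrences of the word divided by the number of words in the document."""
    return tokens.count(word) / len(tokens)


def inverse_document_frequency(word, corpus):
    """log(total documents / (1 + documents containing the word))."""
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / (1 + containing))


def tf_idf(word, tokens, corpus):
    """Importance of the word in one document: TF * IDF."""
    return term_frequency(word, tokens) * inverse_document_frequency(word, corpus)


# Hypothetical corpus, used only to exercise the functions.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang a song",
]
corpus = [tokenize(d) for d in documents]

# Score a few words of the first document.
for word in ("the", "cat", "mat"):
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

In this small corpus, “mat” appears in only one document and gets the highest score, while “the” appears in every document and scores lowest. With the 1 added to the denominator, a word found in every document has a negative IDF, which is one reason libraries tend to use other smoothing variants.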

 

Computer Vision

Linear algebra is also used in computer vision, for example in image representation and image processing. When people think about computer vision, companies like Tesla come to mind for their self-driving cars. Computer vision is also frequently used in industries like agriculture to improve yields, or healthcare to classify illnesses and improve diagnoses.
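
To make the idea of image representation concrete: an image can be stored as a matrix of pixels, and many processing steps are plain linear algebra on that matrix. The sketch below is only an illustration. The tiny 2x2 image is made up, and the weights are the ITU-R BT.601 luma coefficients, one common convention for converting RGB to grayscale; other weightings exist.

```python
# Luma weights for red, green, and blue (ITU-R BT.601, one common convention).
WEIGHTS = (0.299, 0.587, 0.114)


def to_grayscale(image):
    """Convert an RGB image to grayscale.

    Each pixel is an (R, G, B) vector, and its gray value is the dot
    product of that vector with the weight vector.
    """
    return [
        [sum(w * channel for w, channel in zip(WEIGHTS, pixel)) for pixel in row]
        for row in image
    ]


# Hypothetical 2x2 RGB image: red, green / blue, white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

gray = to_grayscale(image)
for row in gray:
    print([round(value, 1) for value in row])
```

The result is a 2x2 matrix of brightness values. Libraries such as NumPy express the same operation as a single matrix product over the whole image.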

Marketing and Sales

Statistics is useful for testing the effectiveness of marketing campaigns, for example through hypothesis testing. It’s also used to understand consumer behavior, such as why consumers are purchasing from a specific brand, through techniques like causal effect analysis or survey design, and to personalize recommendations via predictive modeling or clustering.
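
As one illustration of hypothesis testing for a campaign, the sketch below compares the conversion rates of two campaign variants with a two-proportion z-test. The choice of this particular test, the function name, and the visitor and conversion numbers are all assumptions made for this example, not something prescribed by the article.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a, conv_b: number of conversions in each variant
    n_a, n_b: number of visitors in each variant
    Returns the z statistic and the two-sided p-value.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical campaign data: 1000 visitors per variant.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

A small p-value (commonly below 0.05) suggests the difference between the two campaigns is unlikely to be due to chance alone. The z-test is an approximation that works best when each variant has a reasonably large number of visitors.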

Pursue Your Math and Data Science Education

Math is a core educational pillar for data scientists, regardless of your future industry career path. It ensures you can help an organization solve problems and innovate more quickly, optimize model performance, and effectively apply complex data towards business challenges.

Ensure that you’re building the right skill sets and mathematical capabilities through a leading online bootcamp provider like Simplilearn. They offer Data Science Certification Courses that guide you through everything you need to know in pursuit of your data science career—including courses dedicated to mathematics.

 

 
