To detect slang words from a text file in Python, you generally don't need any external modules; it can be achieved with Python's built-in functions and standard library modules.
Here's the list of built-in modules you may need:
- os: For handling file paths.
- re: For regular expressions, if you need more advanced text processing (see the short sketch after this list).
- string: For string manipulation (though not always necessary).
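For instance, re can match slang terms on word boundaries, which also covers multi-word phrases. Here is a minimal sketch; the slang list is a small placeholder and 'sample.txt' is a hypothetical file name:

import re

# Placeholder slang list for illustration; a regex approach also handles
# multi-word phrases such as 'clap back'.
slang = ['yolo', 'lit', 'clap back']

# 'sample.txt' is a hypothetical file name; use your own path.
with open('sample.txt', 'r') as f:
    text = f.read().lower()

# One alternation pattern with word boundaries, so 'lit' does not match inside 'split'.
pattern = r'\b(' + '|'.join(re.escape(s) for s in slang) + r')\b'
found = set(re.findall(pattern, text))
print('Detected slang words:', found)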
Here's the list of additional libraries (mostly third-party) that can help:
- NLTK (Natural Language Toolkit): For advanced text processing and tokenization.
- SpaCy: Another powerful library for NLP (Natural Language Processing).
- Pandas: For handling and processing data, although it’s more useful for structured data.
- collections: Specifically, Counter from the collections module can be useful for counting occurrences of words (see the sketch after this list).
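As an example of the last point, Counter can tally how often each slang word appears. A minimal sketch, with a placeholder slang set and an inline sample text instead of a file:

from collections import Counter

# Placeholder slang set and sample text; in practice the words would come from your file.
slang_words = {'bruh', 'lit', 'yolo'}
words = 'bruh that party was lit lit yolo'.lower().split()

# Count only the tokens that appear in the slang set
slang_counts = Counter(word for word in words if word in slang_words)
print(slang_counts)  # e.g. Counter({'lit': 2, 'bruh': 1, 'yolo': 1})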
Detect slang words using built-in functions in Python
First, define a list of slang words. The script below reads the file, splits its contents into individual words, and detects any slang words among them.
# List of slang words to look for (single tokens match directly;
# multi-word phrases like 'throwing shade' will not match a single word)
slang_words = [
    'bruh', 'lit', 'fam', 'dope', 'bae', 'yolo', 'gucci', 'savage',
    'salty', 'thirsty', 'ghost', 'throwing shade', 'woke', 'fomo',
    'stan', 'slay', 'goat', 'sus', 'flex', 'tea', 'clap back', 'basic'
]

def read_file(file_path):
    # Read the whole file and lowercase it so matching is case-insensitive
    with open(file_path, 'r') as file:
        contents = file.read().lower()
    return contents

def split_into_words(text):
    # Split on whitespace into individual words
    words = text.split()
    return words

def detect_slang_words(words, slang_words):
    # Collect every word that appears in the slang list
    detected_slang = set()
    for word in words:
        if word in slang_words:
            detected_slang.add(word)
    return detected_slang

def detect_slang(file_path):
    contents = read_file(file_path)
    words = split_into_words(contents)
    detected_slang = detect_slang_words(words, slang_words)
    return detected_slang
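To run it, call detect_slang with the path to your text file and print the result; the file name below is only a placeholder:

# 'sample.txt' is a placeholder; point this at your own text file.
detected = detect_slang('sample.txt')
print('Detected slang words:', detected)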
Output: Detected slang words: {'lit', 'yolo', 'woke'}
Detect slang words using NLTK
To use the module, I first installed the package on my local system from the Command Prompt on Windows (Terminal for macOS/Linux users):
pip install nltk
Once installed, import the tokenizer into your code:
from nltk.tokenize import word_tokenize
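For a quick check that the tokenizer works, word_tokenize splits raw text into individual tokens, punctuation included:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models, downloaded once
print(word_tokenize("Bruh, that was lit!"))
# ['Bruh', ',', 'that', 'was', 'lit', '!']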
I have created a script to read the file, tokenize the contents, and detect slang words using NLTK.
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the Punkt tokenizer models (only needed once)

# Set of slang words to look for
slang_words = {
    'bruh', 'lit', 'fam', 'dope', 'bae', 'yolo', 'gucci', 'savage',
    'salty', 'thirsty', 'ghost', 'throwing shade', 'woke', 'fomo',
    'stan', 'slay', 'goat', 'sus', 'flex', 'tea', 'clap back', 'basic'
}

def detect_slang(file_path):
    # Read the file and lowercase it so matching is case-insensitive
    with open(file_path, 'r') as file:
        contents = file.read().lower()
    # Tokenize, then keep only alphanumeric tokens (drops punctuation)
    tokens = word_tokenize(contents)
    words = [word for word in tokens if word.isalnum()]
    # Collect every token that appears in the slang set
    detected_slang = set()
    for word in words:
        if word in slang_words:
            detected_slang.add(word)
    return detected_slang
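As before, call the function with the path to your file (the name below is a placeholder) and print the result:

# 'sample.txt' is a placeholder; replace it with your own file path.
detected = detect_slang('sample.txt')
print('Detected slang words:', detected)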
Output: Detected slang words: {'lit', 'fam', 'dope'}