To detect slang words in a text file with Python, you generally don’t need any external modules; Python’s built-in functions and standard library are enough.
Here’s a list of useful standard library modules:
- os: For handling file paths.
- re: For regular expressions, if you need more advanced text processing.
- string: For string manipulation (though not always necessary).
Here’s a list of libraries that help with more advanced processing. NLTK, spaCy, and pandas are third-party packages that must be installed separately; collections is part of the standard library:
- NLTK (Natural Language Toolkit): For advanced text processing and tokenization.
- spaCy: Another powerful library for NLP (Natural Language Processing).
- pandas: For handling and processing data, although it’s more useful for structured data.
- collections: Specifically, Counter from collections can be useful for counting occurrences of words.
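As a rough illustration of the Counter idea, the sketch below counts how often each slang word occurs in a piece of text. The function name count_slang, the sample sentence, and the small slang set are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical helper: count occurrences of known slang words in a string.
def count_slang(text, slang_words):
    # Lowercase and split on whitespace, then keep only the slang words.
    words = text.lower().split()
    return Counter(word for word in words if word in slang_words)

sample = "that party was lit fam totally lit"
counts = count_slang(sample, {"lit", "fam", "dope"})
print(counts.most_common())  # [('lit', 2), ('fam', 1)]
```

A Counter behaves like a dictionary, so counts["lit"] gives the number of times a single word appeared.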
Detect slang words using built-in functions in Python
Here we define a list of slang words. The script reads the file, splits its contents into words, and checks each word against the list.
slang_words = [
    'bruh', 'lit', 'fam', 'dope', 'bae', 'yolo', 'gucci', 'savage', 'salty',
    'thirsty', 'ghost', 'throwing shade', 'woke', 'fomo', 'stan', 'slay',
    'goat', 'sus', 'flex', 'tea', 'clap back', 'basic'
]

def read_file(file_path):
    # Read the whole file and lowercase it so matching is case-insensitive.
    with open(file_path, 'r') as file:
        contents = file.read().lower()
    return contents

def split_into_words(text):
    # Split on whitespace; punctuation stays attached to words (e.g. 'lit!').
    words = text.split()
    return words

def detect_slang_words(words, slang_words):
    # Collect each slang word once. Multi-word entries such as
    # 'throwing shade' never match here because each word is a single token.
    detected_slang = set()
    for word in words:
        if word in slang_words:
            detected_slang.add(word)
    return detected_slang

def detect_slang(file_path):
    contents = read_file(file_path)
    words = split_into_words(contents)
    detected_slang = detect_slang_words(words, slang_words)
    return detected_slang
Output : Detected slang words: {'lit', 'yolo', 'woke'}
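The whitespace split used above leaves punctuation attached to words and cannot match two-word phrases such as 'throwing shade'. One possible way around both limits, sketched here with the standard re module, is to search the text for each slang entry with word boundaries. The function name find_slang, the shortened slang list, and the sample sentence are assumptions for illustration.

```python
import re

slang_words = ['lit', 'fam', 'throwing shade', 'clap back', 'sus']

def find_slang(text, slang_words):
    """Return the slang entries that appear in text as whole words or phrases."""
    text = text.lower()
    found = set()
    for entry in slang_words:
        # The \b anchors keep 'sus' from matching inside 'suspect';
        # re.escape guards against special characters in an entry.
        pattern = r'\b' + re.escape(entry) + r'\b'
        if re.search(pattern, text):
            found.add(entry)
    return found

sample = "That was lit! She kept throwing shade at the suspect."
print(sorted(find_slang(sample, slang_words)))  # ['lit', 'throwing shade']
```

This version scans the text once per slang entry, which is fine for a short list but may be slow for a very large one.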
Detect slang words using NLTK
To use NLTK, first install the package from the Command Prompt on Windows (or the Terminal on macOS/Linux):
pip install nltk
Once installed, import the tokenizer into your code:
from nltk.tokenize import word_tokenize
I have created a script to read the file, tokenize the contents, and detect slang words using NLTK.
import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data used by word_tokenize.
# Note: newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')

slang_words = {
    'bruh', 'lit', 'fam', 'dope', 'bae', 'yolo', 'gucci', 'savage', 'salty',
    'thirsty', 'ghost', 'throwing shade', 'woke', 'fomo', 'stan', 'slay',
    'goat', 'sus', 'flex', 'tea', 'clap back', 'basic'
}

def detect_slang(file_path):
    with open(file_path, 'r') as file:
        contents = file.read().lower()
    # Tokenize, then keep only alphanumeric tokens (this drops punctuation).
    tokens = word_tokenize(contents)
    words = [word for word in tokens if word.isalnum()]
    # Multi-word entries such as 'clap back' cannot match single tokens.
    detected_slang = set()
    for word in words:
        if word in slang_words:
            detected_slang.add(word)
    return detected_slang
Output : Detected slang words: {'lit', 'fam', 'dope'}