Slang Word Detection in Python
Method 1: Detect Slang Words Using Built-in Functions
A predefined list of slang words:
slang_words = [
    'bruh', 'lit', 'fam', 'dope', 'bae', 'yolo', 'gucci', 'savage',
    'salty', 'thirsty', 'ghost', 'throwing shade', 'woke', 'fomo',
    'stan', 'slay', 'goat', 'sus', 'flex', 'tea', 'clap back', 'basic'
]
The below function opens the file specified by file_path, reads its contents, converts them to lowercase (to ensure case insensitivity), and returns the contents as a string.
def read_file(file_path):
    with open(file_path, 'r') as file:
        contents = file.read().lower()
    return contents
The below function splits the given text into individual words and returns them as a list.
def split_into_words(text):
    words = text.split()
    return words
The below function takes a list of words and a list of slang words, checks if any word from the list is in the slang words list, and adds the detected slang words to a set. It returns the set of detected slang words.
def detect_slang_words(words, slang_words):
    detected_slang = set()
    for word in words:
        if word in slang_words:
            detected_slang.add(word)
    # Note: multi-word entries such as 'throwing shade' or 'clap back'
    # can never match here, because each comparison is one token at a time.
    return detected_slang
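As a quick sanity check, the function can be called directly on a hand-made token list (the sample sentence below is invented for illustration). Note that since slang_words is a list, each membership test is a linear scan; converting it to a set first would make the lookups constant-time.

words = ['that', 'party', 'was', 'lit', 'bruh']
print(detect_slang_words(words, slang_words))
# prints a set containing 'lit' and 'bruh' (set order may vary)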
The below function combines the previous functions to read the file, split the contents into words, and detect slang words. It returns the detected slang words.
def detect_slang(file_path):
    contents = read_file(file_path)
    words = split_into_words(contents)
    detected_slang = detect_slang_words(words, slang_words)
    return detected_slang
Assume a text file file.txt that consists of a paragraph in which we want to detect slang words.
The below piece of code specifies the path to the file, detects slang words in the file, and prints the detected slang words.
file_path = 'file.txt'
detected_slang = detect_slang(file_path)
print("Detected slang words:", detected_slang)
The output is the set of slang words detected in the file; its exact contents depend on the paragraph in file.txt.
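To illustrate what a run might look like, the snippet below writes an assumed one-line paragraph to file.txt (the sentence is invented for this example) and runs the detector on it:

# Write an assumed sample paragraph for illustration.
with open('file.txt', 'w') as f:
    f.write('bruh that concert was lit and the whole fam showed up')

print("Detected slang words:", detect_slang('file.txt'))
# Detected slang words: {'bruh', 'lit', 'fam'}  (set order may vary)

Note that because split() only breaks on whitespace, a token like 'lit!' would keep its punctuation and fail to match 'lit'; Method 2 addresses this with proper tokenization.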
Method 2: Detect Slang Words Using NLTK
The following piece of code imports the NLTK library, the word_tokenize function for tokenization, and the string module (not used here, but often useful for text processing), and downloads the tokenizer models required by word_tokenize.
import nltk
from nltk.tokenize import word_tokenize
import string

nltk.download('punkt')
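To see why tokenization matters, compare word_tokenize with plain str.split() on a punctuated sentence (the sentence is our own example):

text = "that movie was lit!"
print(text.split())         # ['that', 'movie', 'was', 'lit!']
print(word_tokenize(text))  # ['that', 'movie', 'was', 'lit', '!']

word_tokenize separates the trailing '!' into its own token, so 'lit' can match the slang list, whereas the plain split leaves 'lit!' intact.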
A predefined list of slang words:
slang_words = [
    'bruh', 'lit', 'fam', 'dope', 'bae', 'yolo', 'gucci', 'savage',
    'salty', 'thirsty', 'ghost', 'throwing shade', 'woke', 'fomo',
    'stan', 'slay', 'goat', 'sus', 'flex', 'tea', 'clap back', 'basic'
]
The below function opens the file specified by file_path, reads its contents, converts them to lowercase, tokenizes the contents into words using NLTK’s word_tokenize, filters out non-alphanumeric tokens, and detects slang words by checking each token against the slang words list. It returns the detected slang words.
def detect_slang(file_path):
    with open(file_path, 'r') as file:
        contents = file.read().lower()
    tokens = word_tokenize(contents)
    # Keep only alphanumeric tokens, dropping punctuation tokens like '!' or ','.
    words = [word for word in tokens if word.isalnum()]
    detected_slang = set()
    for word in words:
        if word in slang_words:
            detected_slang.add(word)
    return detected_slang
Assume a text file file.txt that consists of a paragraph in which we want to detect slang words.
The below piece of code specifies the path to the file, detects slang words in the file, and prints the detected slang words.
file_path = 'file.txt'
detected_slang = detect_slang(file_path)
print("Detected slang words:", detected_slang)
The output is again the set of slang words detected in the file, depending on the contents of file.txt.
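As with Method 1, a self-contained run with assumed file contents (invented for this example) shows the difference tokenization makes; here each slang word is followed by punctuation:

# Write an assumed sample paragraph for illustration.
with open('file.txt', 'w') as f:
    f.write('That plot twist was so sus! Nobody expected it, bruh.')

print("Detected slang words:", detect_slang('file.txt'))
# Detected slang words: {'sus', 'bruh'}  (set order may vary)

Method 1 would have produced the tokens 'sus!' and 'bruh.' and detected nothing.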
Summary
- Method 1 uses built-in functions for file reading, text splitting, and slang detection.
- Method 2 uses NLTK for more advanced text processing, specifically tokenization.
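One limitation shared by both methods: multi-word entries in the list, such as 'throwing shade' and 'clap back', can never be detected, because every comparison is made one token at a time. A minimal sketch of one way to handle two-word phrases, assuming a token list produced by either method, is to check adjacent token pairs (bigrams):

def detect_slang_phrases(words, slang_words):
    # Collect the multi-word entries from the slang list.
    phrases = {p for p in slang_words if ' ' in p}
    detected = set()
    # Join each pair of adjacent tokens and test it against the phrases.
    for first, second in zip(words, words[1:]):
        candidate = first + ' ' + second
        if candidate in phrases:
            detected.add(candidate)
    return detected

This sketch only covers two-word phrases; longer phrases would need correspondingly longer n-grams.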