Extraction of New Features
DOI: https://doi.org/10.29304/jqcsm.2025.17.11960

Keywords: Feature Extraction

Abstract
Feature extraction is a key part of machine learning: it transforms raw data into features that are more representative and more effective for models. The process involves selecting or constructing features that make classification or prediction tasks easier. By reducing the dimensionality of the data while retaining its essential information, feature extraction lowers computational complexity, improves prediction accuracy, and makes models faster and more efficient. Common approaches include statistical techniques, such as computing means, standard deviations, and variances; principal component analysis (PCA), which reduces dimensionality while preserving as much of the data's variance as possible; independent component analysis (ICA), which separates mixed signals into statistically independent features; and more advanced techniques such as linear discriminant analysis (LDA) and autoencoders. In addition, clustering techniques such as K-Means play an important role in uncovering hidden patterns by grouping the data into clusters and then using the properties of these clusters as features.
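To make these methods concrete, the following is a minimal, hypothetical sketch (not taken from the paper) that extracts statistical, PCA, and ICA features from synthetic data. It assumes NumPy and scikit-learn are available; all variable names and parameter choices are illustrative.

```python
# A minimal, illustrative sketch (not from the paper): per-sample statistical
# features, PCA, and ICA on synthetic data, assuming NumPy and scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # synthetic data: 200 samples, 10 raw variables

# 1) Statistical features: mean, standard deviation, and variance of each sample.
stat_features = np.column_stack([X.mean(axis=1), X.std(axis=1), X.var(axis=1)])

# 2) PCA: project onto the directions that preserve the most variance.
X_pca = PCA(n_components=3).fit_transform(X)                        # shape (200, 3)

# 3) ICA: recover statistically independent components from mixed signals.
X_ica = FastICA(n_components=3, random_state=0).fit_transform(X)    # shape (200, 3)

# Concatenate the extracted features for a downstream classifier or predictor.
features = np.hstack([stat_features, X_pca, X_ica])
print(features.shape)                 # (200, 9)
```

In practice, a feature matrix like this would replace the raw variables as the input to a classification or prediction model.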
Feature extraction is thus essential to model performance and is indispensable in a wide range of analytical applications, such as medical diagnosis, image analysis, fraud detection, voice recognition, and text analysis.
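The clustering-based features mentioned in the abstract can be sketched in the same spirit. The example below is a hypothetical illustration, again assuming scikit-learn: each sample's distances to the learned K-Means centres become a compact new feature vector, and the hard cluster label can be kept as an additional categorical feature.

```python
# Illustrative sketch of K-Means-derived features (hypothetical, assumes scikit-learn):
# fit K-Means, then use each sample's distance to every cluster centre as a new
# low-dimensional feature vector.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))                  # synthetic data: 300 samples, 8 variables

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

cluster_features = kmeans.transform(X)         # distances to the 4 centres, shape (300, 4)
cluster_labels = kmeans.labels_                # hard assignment per sample, shape (300,)
print(cluster_features.shape, cluster_labels.shape)
```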
License
Copyright (c) 2025 Nidhal Hasan Hasaan, Lamia Abed Noor Muhammed

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.