Search Options

Results per page
Sort
Preferred Languages
Advance

Results 1 - 10 of 56 for Extraction (0.06 sec)

  1. fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/PdfExtractor.java

     *
     * <p>The extractor runs text extraction in a separate thread with a configurable timeout
     * to prevent hanging on problematic PDF files. It also extracts metadata from the PDF
     * document and includes it in the extraction result.
     *
     * <p>Features:
     * <ul>
     *   <li>Text extraction from PDF pages</li>
     *   <li>Embedded document extraction</li>
     *   <li>Annotation extraction (file attachments)</li>
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Sun Nov 23 12:19:14 UTC 2025
    - 12.8K bytes
    - Viewed (0)
  2. fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/FilenameExtractorEnhancedTest.java

        }
    
        /**
         * Test extraction with null parameters map.
         */
        public void test_getText_withNullParams() {
            final InputStream in = new ByteArrayInputStream(new byte[0]);
    
            final ExtractData result = filenameExtractor.getText(in, null);
    
            assertNotNull(result);
            assertEquals("", result.getContent());
        }
    
        /**
         * Test extraction with empty parameters map.
         */
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Mon Nov 24 03:59:47 UTC 2025
    - 7K bytes
    - Viewed (0)
  3. fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/TextExtractorEnhancedTest.java

                assertTrue("Error message should indicate extraction failure", e.getMessage().contains("Failed to extract"));
            } finally {
                // Reset to default encoding
                textExtractor.setEncoding("UTF-8");
            }
        }
    
        /**
         * Test extraction with empty input stream.
         */
        public void test_getText_emptyInputStream_returnsEmptyContent() {
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Mon Nov 24 03:59:47 UTC 2025
    - 8.9K bytes
    - Viewed (0)
  4. fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/JsonExtractor.java

     * This extractor provides better structured data extraction compared to Tika's generic text extraction.
     *
     * <p>Features:
     * <ul>
     *   <li>Structured text extraction with key-value pairs</li>
     *   <li>Top-level field extraction as metadata</li>
     *   <li>Nested structure flattening with configurable depth</li>
     *   <li>Array element extraction</li>
     *   <li>Configurable field separator and array formatting</li>
     * </ul>
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Sun Nov 23 03:46:53 UTC 2025
    - 9.7K bytes
    - Viewed (0)
  5. fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/ExtractorBuilder.java

     * The builder allows setting parameters such as MIME type, filename, extractor name, maximum content length,
     * and cache file size to optimize the extraction process.
     *
     * <p>
     * The main purpose of this class is to simplify the extraction process by providing a fluent interface
     * for configuring the extraction parameters and handling the underlying complexities of content processing,
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Sun Jul 06 02:13:03 UTC 2025
    - 10.1K bytes
    - Viewed (0)
  6. fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/MarkdownExtractor.java

     * This extractor provides better structured data extraction compared to Tika's generic text extraction.
     *
     * <p>Features:
     * <ul>
     *   <li>YAML front matter metadata extraction</li>
     *   <li>Heading structure extraction</li>
     *   <li>Link URL extraction</li>
     *   <li>Code block content extraction</li>
     *   <li>Clean text conversion from Markdown</li>
     *   <li>Configurable encoding</li>
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Sun Nov 23 03:46:53 UTC 2025
    - 8.2K bytes
    - Viewed (0)
  7. fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/ExtractorResourceManagementTest.java

                    .singleton("textExtractor", TextExtractor.class);
        }
    
        /**
         * Test that MsWordExtractor properly closes resources on successful extraction.
         */
        public void test_MsWordExtractor_closesResourcesOnSuccess() throws IOException {
            final MsWordExtractor extractor = container.getComponent("msWordExtractor");
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Mon Nov 24 03:59:47 UTC 2025
    - 10.4K bytes
    - Viewed (0)
  8. fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/LhaExtractor.java

         *
         * @param in the input stream containing the LHA archive
         * @param params extraction parameters
         * @return the extracted text data
         * @throws CrawlerSystemException if the input stream is null
         * @throws ExtractException if an error occurs during extraction
         * @throws MaxLengthExceededException if the extracted content size exceeds the maximum limit
         */
        @Override
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Sun Nov 23 12:19:14 UTC 2025
    - 5.9K bytes
    - Viewed (0)
  9. fess-crawler/src/main/java/org/codelibs/fess/crawler/extractor/impl/AbstractXmlExtractor.java

        /**
         * Default character encoding for content extraction.
         */
        protected String encoding = Constants.UTF_8;
    
        /**
         * The preload size for charset detection.
         */
        protected int preloadSizeForCharset = 2048;
    
        /**
         * Indicates whether comment tags should be ignored during extraction.
         */
        protected boolean ignoreCommentTag = false;
    
        /**
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Sun Nov 23 12:19:14 UTC 2025
    - 8.6K bytes
    - Viewed (0)
  10. fess-crawler/src/test/java/org/codelibs/fess/crawler/extractor/impl/EXTRACTOR_TESTS_README.md

    **Key Test Areas**:
    - Resource closure on successful extraction (MS Office extractors)
    - Resource closure on failed extraction
    - Improved error messages with context
    - Input validation using `validateInputStream()`
    
    **Covered Extractors**:
    - MsWordExtractor
    - MsExcelExtractor
    - MsPowerPointExtractor
    - TextExtractor
    
    **Test Count**: 8 tests
    
    **Key Scenarios**:
    - ✅ Successful extraction closes resources properly
    Registered: Sat Dec 20 11:21:39 UTC 2025
    - Last Modified: Wed Nov 19 08:55:01 UTC 2025
    - 5.7K bytes
    - Viewed (0)
Back to top