FSCrawler 2.7 🌈

@release-drafter release-drafter released this 05 Aug 11:10

The FSCrawler team is pleased to announce the FSCrawler 2.7 release!

FSCrawler

FSCrawler offers a simple way to index binary files into Elasticsearch.

Usage

Download FSCrawler 2.7:

wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7/fscrawler-es7-2.7.zip

Start FSCrawler with:

bin/fscrawler job_name

FSCrawler reads its job settings from a local file (by default ~/.fscrawler/{job_name}/_settings.json).
If the file does not exist, FSCrawler offers to create your first job.

$ bin/fscrawler job_name
18:28:58,174 WARN  [f.p.e.c.f.FsCrawler] job [job_name] does not exist
18:28:58,177 INFO  [f.p.e.c.f.FsCrawler] Do you want to create it (Y/N)?
y
18:29:05,711 INFO  [f.p.e.c.f.FsCrawler] Settings have been created in [~/.fscrawler/job_name/_settings.json]. Please review and edit before relaunch
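
To review the generated settings before relaunching, it helps to know where the file lives. The snippet below is a minimal sketch that builds the default path described above; the helper name settings_path is hypothetical and not part of FSCrawler, and the path assumes the default ~/.fscrawler location has not been overridden.

```shell
# Hypothetical helper (not part of FSCrawler): print the default settings
# path for a job, following the ~/.fscrawler/{job_name}/_settings.json layout.
settings_path() {
  printf '%s/.fscrawler/%s/_settings.json\n' "$HOME" "$1"
}

SETTINGS_FILE="$(settings_path job_name)"

# Report whether the job has been created yet.
if [ -f "$SETTINGS_FILE" ]; then
  echo "Settings found: $SETTINGS_FILE"
else
  echo "No settings yet at $SETTINGS_FILE"
fi
```

If the file exists, open it in any editor and check the settings before starting the job again.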

Create a directory named /tmp/es (or c:\tmp\es on Windows), add some files you want to index, and start FSCrawler again:

$ bin/fscrawler job_name
18:30:34,330 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:30:34,332 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
18:30:34,682 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
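
The directory crawled in the run above could be prepared along these lines. This is a sketch for a Unix-like shell only; the CRAWL_DIR variable and the hello.txt sample file are illustrative assumptions, and any files you want indexed can be used instead.

```shell
# Directory to crawl; /tmp/es matches the path shown in the log output above.
CRAWL_DIR="${CRAWL_DIR:-/tmp/es}"
mkdir -p "$CRAWL_DIR"

# hello.txt is a made-up sample file, used only for illustration.
printf 'Hello from FSCrawler\n' > "$CRAWL_DIR/hello.txt"

# List what the crawler will see on its next run.
ls -l "$CRAWL_DIR"
```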

More details in the documentation.

New features

  • #991: Add Workplace Search connector.
  • #1203: Add FTP crawler. By helsonxiao.
  • #1211: Add file.content_type field on folders.
  • #1210: Add file.filename field on folders.
  • #1179: Automatically create Custom Sources.
  • #1037: Split console logs and actual logs and add a banner :).
  • #1036: Make SSL verification configurable. By TommyLike.
  • #1035: Log index errors in documents.log.
  • #1031: Add an external Log4J2 configuration file.
  • #907: Add path_prefix option.
  • #820: Generate FSCrawler docker images. By toto1310.
  • #776: Report HEAP size at startup.
  • #752: Add option to ignore symlinks. By budachst.
  • #715: Allow custom index name in the REST API. By kikkauz.
  • #698: Add Cross-Origin Resource Sharing (CORS) headers to RestServer. By isaac-ipl.
  • #692: Allow running OCR but not on PDF files.
  • #673: Add support for YAML configuration.
  • #663: Add Patterns table to includes and excludes. By wrathagom.

Fixed Bugs

  • #1224: Fix NPE in Console when running with Docker.
  • #1217: Check if date is null when formatting it to RFC3339.
  • #1204: Split build and deploy phases for Docker images.
  • #1201: 2.7 - Docker image broken. By agrantdeakin.
  • #1194: Elasticsearch node settings should not be null by default.
  • #1193: Corrupt PDF can lead to a StackOverflow.
  • #1137: Ignore errors when parsing a 0 byte file.
  • #1085: fscrawler.bat added a CD to move to the appropriate directory. By CircuitGuy.
  • #1084: InputStream must have > 0 bytes. By yuanzhian.
  • #1066: Start fscrawler instead of internal services.
  • #1041: Fixed an issue that caused an error when running in a windows environment. By muraken720.
  • #1006: Running fscrawler with no argument now lists existing jobs. By janhoy.
  • #1005: Fix ENTRYPOINT in Dockerfile to allow variable substitution. By Maijin.
  • #994: Using cloud id gives "invalid IPv6 Address". By tdaroly.
  • #973: Fix SSH crawling from Windows machine.
  • #899: FSCrawler can't index .doc or .docx elements. By LaaKii.
  • #895: java.lang.NoSuchMethodError: parsing some Word files. By mwaltersbmc.
  • #860: Bug Syntax error in fscrawler file, to init fscrawler. By CarlosRCDev.
  • #847: sun.jnu.encoding=UTF-8 added in .bat and .sh both. By shahariaazam.
  • #834: FS Crawler freezes when crawling a 0 byte TXT file. By dansfelix.
  • #819: Fix Percentage computation.
  • #760: Allow passing test parameters to Maven CLI.
  • #714: fix release-drafter. By jetersen.
  • #701: Change log level and display logs only if filters on content.
  • #691: OCR without pdf_ocr. By Newmski.
  • #686: Wait for healthy index when creating the index.
  • #681: SSH dirs should be seen as dirs and not files.
  • #680: trying to index remote files with ssh - files seen as folder. By sblanc0054.
  • #660: Fix authentication when sending announcement email.

Main changes

  • #1218: Isolate WorkplaceSearchClient and ElasticsearchClient.
  • #1213: Switch back to Java 11.
  • #1049: Update Dockerfile to use JDK14. By mario-89.
  • #1212: Let's use JsonPath.
  • #1207: Generate only 2 docker images.
  • #1206: Detect when fscrawler runs in foreground and adapt logs.
  • #1205: Add logs to the console when running a Docker instance.
  • #1172: Move CI from Travis to GitHub actions.
  • #872: Add more information to the _simulate API.
  • #700: Add dependency convergence checks.
  • #695: Exclude the PDFParser from the DefaultParser.
  • #694: Display full names when catching parsing errors.
  • #693: Move fs.pdf_ocr setting to fs.ocr.pdf_strategy.
  • #675: Warn in case of Tika error.
  • #1219: Update to Elasticsearch 7.14.0 and 6.8.18.
  • #1180: Bump tika.version from 1.26 to 1.27.

Removed

  • #978: files lost. By bluebell1990.

Have fun!
-FSCrawler team